Removing HTML and CSS styles from a PDF created with iText

Our application uses iText to generate PDFs dynamically. The content of each PDF is entered by users in the web application, through a screen with a rich text editor (RTE).

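The steps in detail:

The user goes to the "add PDF content" page.
The add page has a rich text editor in which the user can enter the PDF content.
Sometimes the user copies and pastes content from an existing Word document into the RTE.
Once the content is submitted, the PDF is created.

We use the RTE because other pages need to show the content with bold, italics, and so on. However, we do not want any of this RTE markup to end up in the PDF.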

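Before generating the PDF, we strip the RTE markup from the content with a Java utility.

This works in the normal case, but when content is copied from a Word document, the HTML and CSS styles that Word applies are not removed by the Java utility we use.

How can I generate the PDF without any HTML or CSS in it?

Here is the code: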

Paragraph paragraph = new Paragraph(Util.removeHTML(content), font);

and the removeHTML method is:

public static String removeHTML(String htmlString) {
    if (htmlString == null)
        return "";
    htmlString.replace("\"", "'");
    htmlString = htmlString.replaceAll("\\<.*?>", "");
    htmlString = htmlString.replaceAll("&nbsp;", "");
    return htmlString;
}

The extra content below is what shows up in the PDF when I copy and paste from a Word document:

<w:LsdException Locked="false" Priority="10" SemiHidden="false 
UnhideWhenUsed="false" QFormat="true" Name="Title" /> 
<w:LsdException Locked="false" Priority="11" SemiHidden="false" 
UnhideWhenUsed="false" QFormat="true" Name="Subtitle" /> 
<w:LsdException Locked="false" Priority="22" SemiHidden="false"

Please help!

Thanks.

Source

2011-04-20 ashishjmeshram

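Our application is similar: we have a rich text editor (TinyMCE), and the output is a PDF generated with iText. We want the HTML to be as clean as possible, ideally using only the HTML tags that iText's HTMLWorker supports. TinyMCE can do some of this, but there are still situations where an end user submits HTML that is badly messed up and can break iText's ability to generate a PDF.

We use a combination of jSoup and jTidy + CSSParser to filter out unwanted CSS styles entered in HTML "style" attributes. HTML entered into TinyMCE is scrubbed by this service, which cleans up any pasted Word markup (in case the user did not use TinyMCE's Paste from Word button) and gives us HTML that converts well with iText's HTMLWorker.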
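
To make the style filtering idea concrete, here is a minimal Java sketch that uses only the standard library. It is not the CSSParser-based code from the service: the names StyleFilter and filter are hypothetical, and splitting on ';' and ':' is a simplification that would mis-handle values containing those characters (a url(...) value, for instance). The allowed properties are the five that the service code further down keeps.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: keeps only allowlisted CSS properties from a style attribute.
class StyleFilter {
    // The five properties the Groovy service in this answer keeps.
    static final Set<String> ALLOWED = new LinkedHashSet<>(Arrays.asList(
            "padding-left", "text-decoration", "text-align", "font-weight", "font-style"));

    static String filter(String style) {
        StringBuilder kept = new StringBuilder();
        // Naive parsing (assumption): declarations are separated by ';', name and value by ':'.
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // not a "name: value" pair, skip it
            }
            String name = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = declaration.substring(colon + 1).trim();
            if (ALLOWED.contains(name) && !value.isEmpty()) {
                kept.append(name).append(": ").append(value).append(";");
            }
        }
        return kept.toString();
    }
}
```

For example, StyleFilter.filter("mso-list: l0; font-weight: bold; color: red") keeps only the font-weight declaration. A real CSS parser is the safer choice for untrusted input, which is presumably why the service relies on CSSParser.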

I also found an issue with table widths in iText's HTMLWorker parser (5.0.6): if the table width is given in the style attribute, HTMLWorker ignores it and sets the table width to 0, so the code below includes some logic to fix that. We use the following libs:

com.itextpdf:itextpdf:5.0.6     // used to generate PDFs
org.jsoup:jsoup:1.5.2      // used for cleaning HTML, primary cleaner
net.sf.jtidy:jtidy:r938      // used for cleaning HTML, secondary cleaner
net.sourceforge.cssparser:cssparser:0.9.5 // used to parse out unwanted HTML "style" attribute values

Here is some code from a Groovy service we built to scrub the HTML and keep only the tags and style attributes that iText supports, plus the fix for the table issue. The code makes a few assumptions that are specific to our application. This is currently working really well for us.

import com.steadystate.css.parser.CSSOMParser
import org.jsoup.Jsoup 
import org.jsoup.nodes.Document 
import org.jsoup.safety.Cleaner 
import org.jsoup.safety.Whitelist 
import org.jsoup.select.Elements 
import org.w3c.css.sac.InputSource 
import org.w3c.dom.css.CSSRule 
import org.w3c.dom.css.CSSRuleList 
import org.w3c.dom.css.CSSStyleDeclaration 
import org.w3c.dom.css.CSSStyleSheet 
import org.w3c.tidy.Tidy 

class HtmlCleanerService { 

    static transactional = true 

    def cleanHTML(def html) { 

     // clean with JSoup which should filter out most unwanted things and 
     // ensure good html syntax 
     html = soupClean(html); 

     // run through JTidy to remove repeated nested tags, clean anything JSoup left out 
     html = tidyClean(html); 

     return html; 
    } 

    def tidyClean(def html) { 
     Tidy tidy = new Tidy() 
     tidy.setAsciiChars(true) 
     tidy.setDropEmptyParas(true) 
     tidy.setDropProprietaryAttributes(true) 
     tidy.setPrintBodyOnly(true) 

     tidy.setEncloseText(true) 
     tidy.setJoinStyles(true) 
     tidy.setLogicalEmphasis(true) 
     tidy.setQuoteMarks(true) 
     tidy.setHideComments(true) 
     tidy.setWraplen(120) 

     // (makeClean || dropFontTags) = replaces presentational markup by style rules 
     tidy.setMakeClean(true)  // remove presentational clutter. 
     tidy.setDropFontTags(true) 

     // word2000 = drop style & class attributes and empty p, span elements 
     // draconian cleaning for Word2000 
     tidy.setWord2000(true)  
     tidy.setMakeBare(true)  // remove Microsoft cruft. 
     tidy.setRepeatedAttributes(org.w3c.tidy.Configuration.KEEP_FIRST) // keep first or last duplicate attribute 

     // TODO ? tidy.setForceOutput(true) 

     def reader = new StringReader(html); 
     def writer = new StringWriter(); 

     // hide output from stderr 
     tidy.setShowWarnings(false) 
     tidy.setErrout(new PrintWriter(new StringWriter())) 

     tidy.parse(reader, writer); // run tidy, providing an input and output stream 
     return writer.toString() 
    } 

    def soupClean(def html) { 

     // clean the html 
     Document dirty = Jsoup.parseBodyFragment(html); 
     Cleaner cleaner = new Cleaner(createWhitelist()); 
     Document clean = cleaner.clean(dirty); 

     // now hunt down all style attributes and ensure we only have those that render with iTextPDF 
     Elements styledNodes = clean.select("[style]"); // every element that carries a style attribute
     styledNodes.each { element -> 
      def style = element.attr("style"); 
      def tag = element.tagName().toLowerCase() 
      def newstyle = "" 
      CSSOMParser parser = new CSSOMParser(); 
      InputSource is = new InputSource(new StringReader(style)) 
      CSSStyleDeclaration styledeclaration = parser.parseStyleDeclaration(is) 
      boolean hasProps = false 
      for (int i=0; i < styledeclaration.getLength(); i++) { 
       def propname = styledeclaration.item(i) 
       def propval = styledeclaration.getPropertyValue(propname) 
       propval = propval ? propval.trim() : "" 

       if (["padding-left", "text-decoration", "text-align", "font-weight", "font-style"].contains(propname)) { 
        newstyle = newstyle + propname + ": " + propval + ";" 
        hasProps = true 
       } 

       // standardize table widths, itextPDF won't render tables if there is only width in the 
       // style attribute. Here we ensure the width is in its own attribute, and change the value so 
       // it is in percentage and no larger than 100% to avoid end users from creating really goofy 
       // tables that they can't edit properly because they have made the width too large.
       // 
       // width of the display area in the editor is about 740px, so let's ensure everything 
       // is relative to that 
       // 
       // TODO could get into trouble with nested tables and widths within as we assume 
       // one table (e.g. could have nested tables both with widths of 500) 
       if (tag.equals("table") && propname.equals("width")) { 
        if (propval.endsWith("%")) { 
         // ensure it is <= 100% 
         propval = propval.replaceAll(~"[^0-9]", "") 
         propval = Math.min(100, propval.toInteger()) 
        } 
        else { 
         // else we have measurement in px or assumed px, clean up and
         // get integer value, then calculate a percentage
         // (multiply before dividing: casting px/740 to int first truncates to 0)
         propval = propval.replaceAll(~"[^0-9]", "")
         propval = Math.min(100, (int) (propval.toInteger() * 100 / 740))
        } 
        element.attr("width", propval + "%") 
       } 
      } 
      if (hasProps) { 
       element.attr("style", newstyle) 
      } else { 
       element.removeAttr("style") 
      } 

     } 

     return clean.body().html(); 
    } 

    /** 
    * Returns a JSoup whitelist suitable for sane HTML output and iTextPDF 
    */ 
    def createWhitelist() { 
     Whitelist wl = new Whitelist(); 

     // iText supported tags 
     wl.addTags(
      "br", "div", "p", "pre", "span", "blockquote", "q", "hr",
      "h1", "h2", "h3", "h4", "h5", "h6",
      "u", "strike", "s", "strong", "sub", "sup", "em", "i", "b",
      "ul", "ol", "li",
      "table", "tbody", "td", "tfoot", "th", "thead", "tr"
      );

     // iText attributes recognized which we care about 
     // padding-left (div/p/span indentation) 
     // text-align (for table right/left align) 
     // text-decoration (for span/div/p underline, strikethrough) 
     // font-weight (for span/div/p bolder etc) 
     // font-style (for span/div/p italic etc) 
     // width (for tables) 
     // colspan/rowspan (for tables) 

     ["span", "div", "p", "table", "ul", "ol", "pre", "td", "th"].each { tag -> 
      ["style", "padding-left", "text-decoration", "text-align", "font-weight", "font-style"].each { attr -> 
       wl.addAttributes(tag, attr) 
      } 
     } 

     ["td", "th"].each { tag -> 
      ["colspan", "rowspan", "width"].each { attr -> 
       wl.addAttributes(tag, attr) 
      } 
     } 
     wl.addAttributes("table", "width", "style", "cellpadding") 

     // img support 
     // wl.addAttributes("img", "align", "alt", "height", "src", "title", "width") 


     return wl 
    } 
}
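
The table-width normalization described in the comments above can also be looked at in isolation. The following is a minimal plain-Java sketch, not the service's own code: the names TableWidths and toPercent are hypothetical, 740 is the editor width in pixels mentioned in the comments, and returning null for values without digits (such as "auto") is an assumption of this sketch. Note that it multiplies by 100 before dividing, so integer division does not truncate small pixel widths to zero.

```java
// Hypothetical helper: converts a CSS table width into a percentage capped at 100%.
class TableWidths {
    // Approximate width of the editor's display area in pixels (from the answer's comments).
    static final int EDITOR_WIDTH_PX = 740;

    static String toPercent(String cssWidth) {
        String digits = cssWidth.replaceAll("[^0-9]", "");
        if (digits.isEmpty()) {
            return null; // nothing numeric to work with, e.g. "auto" (assumption of this sketch)
        }
        int value = Integer.parseInt(digits);
        int percent = cssWidth.trim().endsWith("%")
                ? value                           // already a percentage
                : value * 100 / EDITOR_WIDTH_PX;  // px, or unitless treated as px
        return Math.min(100, percent) + "%";
    }
}
```

With these assumptions, a width of "370px" becomes "50%", while "150%" and "1480px" are both capped at "100%".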

Source

2011-05-11 18:45:29

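If you only want the text content of an HTML document, use an XML API such as SAX or DOM to emit only the text nodes of the document. If you know your way around the DOM, this is trivial with the DocumentTraversal API. If I had my IDE running, I'd paste a sample ...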
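
A minimal sketch of that approach with the JDK's built-in DOM parser might look like the following. The names TextNodeExtractor and extractText are hypothetical, and the sketch makes two assumptions worth stating loudly: the input must be well-formed XML (real RTE output often is not, and HTML entities such as &nbsp; are not defined in plain XML), and the fragment is wrapped in an artificial root element so that it parses as a single document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

// Hypothetical helper: concatenates the text nodes of a well-formed XHTML fragment.
class TextNodeExtractor {
    static String extractText(String xhtmlFragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap the fragment in a synthetic root so several top-level elements still parse.
            Document doc = builder.parse(
                    new InputSource(new StringReader("<root>" + xhtmlFragment + "</root>")));

            // The JDK's default DOM implementation supports the Traversal feature.
            DocumentTraversal traversal = (DocumentTraversal) doc;
            NodeIterator it = traversal.createNodeIterator(
                    doc.getDocumentElement(), NodeFilter.SHOW_TEXT, null, true);

            StringBuilder text = new StringBuilder();
            for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
                text.append(n.getNodeValue()); // elements, attributes and comments are skipped
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Because only text nodes are visited, tags, attributes and comments never reach the output. For markup that is not well-formed, a lenient HTML parser would be needed in front of this step.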

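Also, the removeHtml method shown is inefficient. Compile the pattern once with Pattern.compile, cache it in a static variable, and use the Matcher API to write the result into a StringBuffer (or a StringBuilder, if you use one). That way you aren't creating a bunch of intermediate strings and throwing them away.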
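
A minimal sketch of that suggestion is shown below. The class name HtmlStripper and the exact pattern are assumptions of this example, not code from the question. The pattern also removes HTML comments in DOTALL mode, since Word paste markup usually arrives wrapped in comments, and it uses [^>]* rather than .*? because '.' does not match line breaks by default, which is a likely reason multi-line Word tags survive the original method. Regex stripping remains a rough heuristic, not a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rewrite of removeHTML with a cached pattern and a single output buffer.
class HtmlStripper {
    // Compiled once and reused. Matches comments, tags (including multi-line ones)
    // and the &nbsp; entity, which the original method also drops.
    private static final Pattern MARKUP =
            Pattern.compile("<!--.*?-->|<[^>]*>|&nbsp;", Pattern.DOTALL);

    static String removeHTML(String htmlString) {
        if (htmlString == null) {
            return "";
        }
        Matcher matcher = MARKUP.matcher(htmlString);
        StringBuilder out = new StringBuilder(htmlString.length());
        int last = 0;
        while (matcher.find()) {
            out.append(htmlString, last, matcher.start()); // copy the text before the match
            last = matcher.end();                          // skip the matched markup
        }
        out.append(htmlString, last, htmlString.length()); // copy the remaining tail
        return out.toString();
    }
}
```

The call site from the question would stay the same apart from the class name, for example new Paragraph(HtmlStripper.removeHTML(content), font).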

Source

2011-04-20 05:08:00 les2

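Hi, thanks for the reply. I am not getting the text content from an HTML document but from the database. When the user submits content from the RTE, it is first saved to the database, then retrieved from the database and used to generate the PDF. – ashishjmeshram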
