Text cleanup after html2txt

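I am using lxml to convert HTML to text. I have the parsing, conversion, and cleanup (tabs, spaces, empty lines) parts ready, and running the program gets me almost where I want to be.

However, after trying my code on a hundred HTML files (all from different sites), I found a few exceptions, namely lines like these: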

#wrapper #PrimaryNav {margin:0;*overflow:hidden;} 
a.scbbtnred{background-position:right -44px;} 
a.scbbtnblack{background-position:right -176px;} 
.ghsearch{width:58px;height:21px;line-height:21px;background-position:0 -80px;} 
a.scbbtnred span span{background-color:#f00;background-position:0 -22px;}

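I assume these are CSS? Or some other web programming stuff. I am completely unfamiliar with these, though.

Question: what are these lines? And any suggestions on how to get rid of them?

Edit: here is how I did the parts before asking this question, as a reference for anyone who lands on this post in the future. I am new to Python, so a lot could be improved here, but it works okay for me: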

# Function for html2txt using lxml
# Author:
# http://groups.google.com/group/cn.bbs.comp.lang.python/browse_thread/thread/781a357e2ce66ce8
import lxml.etree

def html2text(html):
    # Accept either raw HTML (str or bytes) or an already parsed tree
    if isinstance(html, (str, bytes)):
        tree = lxml.etree.fromstring(html, lxml.etree.HTMLParser())
    else:
        tree = html
    # Drop elements whose content is not readable text. with_tail=False keeps
    # the text that follows each removed element, which remove() would discard.
    lxml.etree.strip_elements(tree, 'script', 'iframe', 'style', with_tail=False)
    # return lxml.etree.tounicode(tree, method='text')
    return lxml.etree.tostring(tree, encoding='unicode', method='text')



# Function to clean up the text:
# 1: clean up 1) tabs, 2) spaces, 3) empty lines;
# 2: remove short lines
import os

def textcleanup(text):
    # Temporary list of the lines to keep
    text_list = []
    for s in text.splitlines():
        # Strip leading and trailing spaces and tabs
        s = s.strip()
        # Keep only lines longer than 35 characters
        if len(s) > 35:
            text_list.append(s)
    cleaned = os.linesep.join(text_list)
    # Get rid of empty lines
    cleaned = os.linesep.join([s for s in cleaned.splitlines() if s])
    return cleaned

2011-10-22 Flake

It is indeed CSS. You are getting a document like this:

<style> 
#wrapper #PrimaryNav {margin:0;*overflow:hidden;} 
a.scbbtnred{background-position:right -44px;} 
a.scbbtnblack{background-position:right -176px;} 
.ghsearch{width:58px;height:21px;line-height:21px;background-position:0 -80px;} 
a.scbbtnred span span{background-color:#f00;background-position:0 -22px;} 
</style> 
<div> 
    <p>This bit is HTML</p> 
</div>

You need to remove all style tags before extracting the text.
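
A minimal sketch of that idea follows. It uses only the standard library's html.parser instead of lxml, a substitution made here so the example has no third-party dependency. The names TextExtractor and html_to_text are hypothetical, and the set of skipped tags is an assumption modelled on the question's skip list.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything nested inside SKIP_TAGS."""

    # Assumed skip list, mirroring the tags named in the question
    SKIP_TAGS = {"script", "style", "iframe"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, without CSS or scripts."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = """<style>
a.scbbtnred{background-position:right -44px;}
</style>
<div>
    <p>This bit is HTML</p>
</div>"""

print(html_to_text(sample).strip())  # prints: This bit is HTML
```

The CSS rule never reaches the output because it is dropped while the parser is still inside the style element, so no pattern matching on the extracted text is needed afterwards.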

2011-10-22 22:34:01 Eric

Hi Eric, that is exactly what I was looking for. Thanks! – Flake
