Extracting readable text from HTML using Python?

I know about utilities such as html2text and BeautifulSoup, but the problem is that they also extract the JavaScript and add it to the text, which makes it hard to separate the two.

htmlDom = BeautifulSoup(webPage) 

htmlDom.findAll(text=True)

Alternatively:

from stripogram import html2text 
extract = html2text(webPage)

Both of these also extract all the JavaScript on the page, which is not what I want.

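I just want to extract the readable text, the kind you could copy from a browser.

2010-07-03 demos

With BeautifulSoup, use something along these lines:

def _extract_text(t):
    # Recursively collect the text of a BeautifulSoup node.
    if not t:
        return ""
    # Python 3: text nodes (NavigableString) are str subclasses.
    if isinstance(t, str):
        # Collapse newlines and runs of spaces into single spaces.
        return " ".join(filter(None, t.replace("\n", " ").split(" ")))
    if t.name.lower() == "br":
        return "\n"
    if t.name.lower() == "script":  # skip JavaScript entirely
        return "\n"
    return "".join(extract_text(c) for c in t)

def extract_text(t):
    # Strip leading and trailing whitespace from every line.
    return '\n'.join(x.strip() for x in _extract_text(t).split('\n'))

print(extract_text(htmlDom))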

2010-07-03 18:39:25

Thanks. This works perfectly. – demos

@demos, you're welcome, good to hear. BTW, why an accept (and thanks for that!) without an upvote? Seems odd. ;-) –

@Alex Martelli The first upvote came from me. What a pity that this answer went 19 months without a single upvote! – eyquem

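If you want to avoid extracting any of the contents of script tags with BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

gets the root's immediate children that are not script tags, and a separate htmlDom.findAll(recursive=False, text=True) gets the strings that are immediate children of the root. You need to do this recursively, for example as a generator: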

def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
    for s in root.childGenerator():
        if hasattr(s, 'name'):      # then it's a tag
            if s.name == 'script':  # skip it!
                continue
            for x in getStrings(s): yield x
        else:                       # it's a string!
            yield s

I'm using childGenerator (instead of findAll) so that I can just get all the children in order and do my own filtering.
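For comparison, the same idea (walk the document in order and skip everything inside a script element) can be sketched with only the standard library's html.parser. This is not the approach from the answer above, just a minimal illustration under a few assumptions: the names VisibleTextParser and visible_text are hypothetical, and style elements are skipped along with script, which the question did not ask for.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text in document order, ignoring <script>/<style> bodies."""

    # Assumption: <style> is treated like <script>; the question only mentions scripts.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def visible_text(html):
    """Return the collected text fragments, one per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = """<html><head><script>var x = 1;</script></head>
<body><p>Hello <b>world</b></p><script>alert("hi");</script><p>Bye</p></body></html>"""
print(visible_text(sample))
```

Because the parser streams through the markup, nothing has to be removed from a parsed tree; text inside a skipped element is simply never collected.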

2010-07-03 18:32:10

You can remove script tags in Beautiful Soup, something like:

for script in soup("script"): 
    script.extract()

Removing Elements

2010-07-03 18:35:37 jkyle

Looks like a quick solution, but what is the penalty for extracting tags? – demos

Try these out:

http://code.google.com/p/boilerpipe/

http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

2012-02-07 18:38:50 saravanan
