How do I strip out the text inside HTML tags in Python?

Possible duplicate: Strip html from strings in python

While making a small browser-like application, I am facing the problem of splitting up the different tags. Consider the string

<html> <h1> good morning </h1> welcome </html>

I need the following output:

['good morning', 'welcome']

How can I do that in Python?

Source

2012-10-08 Anonymous

You can use one of Python's HTML/XML parsers.

Beautiful Soup is popular. lxml is popular too.

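import xml.etree.ElementTree as ET

def get_text(etree):
    # Yield the text inside each direct child element (child.text)
    # and the text that follows its closing tag (child.tail),
    # stripped of surrounding whitespace; empty chunks are skipped.
    for child in etree:
        if child.text and child.text.strip():
            yield child.text.strip()
        if child.tail and child.tail.strip():
            yield child.tail.strip()

# ET.fromstring() expects well-formed XML, which this sample string is.
root = ET.fromstring('<html> <h1> good morning </h1> welcome </html>')
print(list(get_text(root)))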
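
The standard library also has an HTML parser, html.parser.HTMLParser, which does not require the markup to be well-formed XML. The following is only a minimal sketch of that route: the names TextCollector and extract_text are made up for illustration and appear nowhere in the answers here.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects non-empty text chunks found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for text between tags; keep it only if
        # something is left after stripping whitespace.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(markup):
    # Hypothetical wrapper: feed the markup and return the collected chunks.
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks


print(extract_text('<html> <h1> good morning </h1> welcome </html>'))
```

Because this parser is event-based, it should also cope with markup such as an unclosed <br>, which ET.fromstring() would reject.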

Source

2012-10-08 18:10:50 dm03514

I would use xml.etree.ElementTree from the standard library, although third-party packages are available as well. With their help it takes just a few lines:

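from bs4 import BeautifulSoup

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup('<html> <h1> good morning </h1> welcome </html>', 'html.parser')
# stripped_strings yields each text node with surrounding whitespace removed.
print(list(soup.stripped_strings))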

Source

2012-10-08 18:19:40 mgilson

I would use the Python library Beautiful Soup to achieve your goal, as above. The standard library works too:

http://docs.python.org/library/xml.etree.elementtree.html
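
If the tags can be nested more deeply than in the sample string, ElementTree's itertext() method walks the whole tree in document order. A minimal sketch, assuming the input is well-formed XML; the helper name iter_stripped_text is hypothetical:

```python
import xml.etree.ElementTree as ET


def iter_stripped_text(markup):
    # Hypothetical helper: parse the markup and yield every non-empty
    # text chunk, at any nesting depth, with whitespace stripped.
    root = ET.fromstring(markup)  # raises ParseError on malformed input
    for chunk in root.itertext():
        text = chunk.strip()
        if text:
            yield text


print(list(iter_stripped_text('<html> <h1> good morning </h1> welcome </html>')))
```

Text split by an inner tag comes back as separate chunks, so '<h1> good <b>morning</b> </h1>' would give 'good' and 'morning' as two items.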

Source

2012-10-08 18:29:44 halex
