Python HTML removal

How do I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

blah blah link

Thanks!

Source

2009-02-28 user29772

Depending on your purpose this may be overkill, but if your strings contain more complex or malformed HTML, give BeautifulSoup a try. Caveat: I don't think it is available for Python 3.0 yet. – bernie

You can use a regular expression to remove all the tags:

>>> import re 
>>> s = 'blah blah <a href="blah">link</a>' 
>>> re.sub('<[^>]*>', '', s) 
'blah blah link'

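Source

2009-02-28 22:43:17

You could simplify the regex to '<.*?>' and get the same result, but that assumes properly formatted HTML, just as your regex does. – UnkwnTech

Do you need to check for a quoted >, or is that not allowed? Or could the tag contain something else? –

@Unkwntech: I prefer <[^>]*> over <.*?>, because the former does not need to keep backtracking to find the end of the tag. –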
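To make the comments above concrete, here is a minimal sketch comparing the two patterns. The sample strings and the names NEGATED and LAZY are made up for illustration and do not come from the thread. It suggests that both patterns agree on simple input, that they differ once a tag spans a line break (`.` does not match a newline by default), and that both break when an attribute value contains a quoted '>'.

```python
import re

NEGATED = re.compile(r'<[^>]*>')  # negated character class: no backtracking needed
LAZY = re.compile(r'<.*?>')       # non-greedy wildcard: '.' skips newlines by default

simple = 'blah blah <a href="blah">link</a>'
multiline = '<a\nhref="blah">link</a>'
tricky = 'blah <a title="a > b" href="blah">link</a>'

# Same result on simple, single-line tags
print(NEGATED.sub('', simple))      # blah blah link
print(LAZY.sub('', simple))         # blah blah link

# A tag that spans a line break is only removed by the negated class
print(repr(NEGATED.sub('', multiline)))
print(repr(LAZY.sub('', multiline)))

# A quoted '>' inside an attribute value defeats both patterns
print(repr(NEGATED.sub('', tricky)))
```

Passing re.DOTALL to the lazy pattern would remove the newline difference, but neither pattern can tell a quoted '>' from the end of a tag.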

Try Beautiful Soup. Throw away everything except the text. Regex solutions hit a wall when an attribute contains '>'.

Source

2009-02-28 22:52:16

>>> import re 
>>> s = 'blah blah <a href="blah">link</a>' 
>>> q = re.compile(r'<.*?>', re.IGNORECASE) 
>>> re.sub(q, '', s) 
'blah blah link'

Source

2009-02-28 23:23:36 riza

Try BeautifulSoup; it makes this very easy (and reliable).

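# BeautifulSoup 3 import; with the newer bs4 package it is: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

html = "<a> Keep me </a>"
soup = BeautifulSoup(html)

# Collect every text node and join them, which drops all tags
text_parts = soup.findAll(text=True)
text = ''.join(text_parts)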

Source

2009-03-01 02:00:18 Triptych

BeautifulSoup hits the same wall too. http://stackoverflow.com/questions/598817/python-html-removal/600471#600471 – jfs

There is a small library called stripogram that can be used to strip some or all HTML tags.

You can use it like this:

from stripogram import html2text, html2safehtml 
# Only allow <b>, <a>, <i>, <br>, and <p> tags 
clean_html = html2safehtml(original_html,valid_tags=("b", "a", "i", "br", "p")) 
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces 
# and a page that's 80 characters wide. 
text = html2text(original_html,ignore_tags=("img",),indent_width=4,page_width=80)

So if you simply want to strip all HTML, pass valid_tags=() to the first function.

You can find the documentation here.

Source

2009-03-01 14:45:46 MrTopf

html2text will do something like this.

Source

2009-03-01 18:38:03 RexE

html2text is well suited to producing nicely formatted, readable output without extra steps. If every HTML string you need to convert is as simple as the example, BeautifulSoup is the way to go. For more complex input, html2text does a great job of preserving the readable intent of the original. –

Regexes, BeautifulSoup, and html2text do not work if an attribute has '>' in it. See Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?

An 'HTML/XML parser'-based solution might help in such cases; for example, stripogram, suggested by @MrTopf, does work.

Here is an ElementTree-based solution:

# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree

str_ = 'blah blah <a href="blah">link</a> END'
# Wrap the fragment in a root element so it parses as a single document
root = etree.fromstring('<html>%s</html>' % str_)
print(''.join(root.itertext()))  # lxml or ElementTree 1.3+

Output:

blah blah link END
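The commented-out stdlib import in the snippet above suggests the same approach should work without lxml. A minimal sketch, assuming Python 3 and input that is well-formed XML; the function name strip_tags_xml and the second sample string are hypothetical:

```python
from xml.etree import ElementTree as ET


def strip_tags_xml(fragment):
    """Return the text content of a well-formed XML/XHTML fragment."""
    # Wrap the fragment in a root element so that bare text plus
    # inline tags parses as one XML document.
    root = ET.fromstring('<html>%s</html>' % fragment)
    return ''.join(root.itertext())


print(strip_tags_xml('blah blah <a href="blah">link</a> END'))
# A quoted '>' inside an attribute value is legal XML, so it parses fine
print(strip_tags_xml('blah <a title="a > b" href="blah">link</a> END'))
```

Because this is a strict XML parser, typical real-world HTML (an unclosed <br>, a bare &) raises xml.etree.ElementTree.ParseError instead of being cleaned up.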

Source

2009-03-01 20:42:41 jfs
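For input that is not well-formed XML, another parser-based option is the standard library's html.parser. This is only a sketch under the assumption of Python 3; the class TextExtractor and the function strip_tags are hypothetical names, not something proposed in the thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags (character references already decoded)
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(strip_tags('blah blah <a href="blah">link</a> END'))
print(strip_tags('blah <a title="a > b" href="blah">link</a>'))
```

Unlike the regex patterns, the parser reads quoted attribute values as a unit, so the '>' inside title="a > b" does not end the tag.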

I just wrote this. I need it. It uses html2text and takes a file path, although I would prefer a URL. The output of html2text is stored in TextFromHtml2Text.text, to be stored or fed to your pet canary.

import html2text

class TextFromHtml2Text:

    def __init__(self, url=''):
        if url == '':
            raise TypeError("Needs a URL")
        self.text = ""
        self.url = url  # despite the name, this is a local file path
        self.html = ""
        self.gethtmlfile()
        self.maytheswartzbewithyou()

    def gethtmlfile(self):
        # Read the whole file; the with-block closes it afterwards
        with open(self.url) as file:
            self.html = file.read()

    def maytheswartzbewithyou(self):
        # Convert the loaded HTML to plain text
        self.text = html2text.html2text(self.html)

Source

2012-06-29 17:41:43

You could also write it like this: 'import urllib, html2text [break] def get_text_from_html_url(url): [break] return html2text.html2text(urllib.urlopen(url).read())' shorter and cleaner –

There is a simple way to do this:

def remove_html_markup(s):
    tag = False    # True while inside <...>
    quote = False  # True while inside a quoted attribute value
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you are interested in the class (about smart debugging with Python), here is the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)

Source

2013-01-22 17:31:08 Medeiros
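One limitation of the remove_html_markup state machine above is that it toggles the quote flag on either quote character, so an apostrophe inside a double-quoted attribute value (for example title="it's") confuses it. A possible variation, not the answer author's code, remembers which character opened the quote. The function name and test strings below are hypothetical.

```python
def remove_html_markup_quote_aware(s):
    in_tag = False
    open_quote = None  # quote character that opened the current attribute value
    out = []

    for c in s:
        if open_quote is not None:
            # Inside a quoted attribute value: only the same quote closes it
            if c == open_quote:
                open_quote = None
        elif c == '<':
            in_tag = True
        elif c == '>':
            in_tag = False
        elif in_tag and c in ('"', "'"):
            open_quote = c
        elif not in_tag:
            out.append(c)

    return ''.join(out)


print(remove_html_markup_quote_aware('blah blah <a href="blah">link</a>'))
print(remove_html_markup_quote_aware('<a title="it\'s > here">x</a>'))
print(remove_html_markup_quote_aware('"quoted" text stays'))
```

Like the original, this is a character-level sketch rather than a real HTML parser: it does not handle comments, scripts, or entities, and a stray '>' in plain text is still dropped.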
