2017-11-02

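Using HTML entities in BeautifulSoup

I'm doing some simple crawling in Python (using BeautifulSoup4) and I'm having trouble finding tags whose text contains HTML entities. Is there a way to make find() work with HTML entities? The two searches in the example below print:

None 
<a class="navi navi-next-chap" href="..." title="Next Chapter ]&gt;">Next Chapter ]&gt;</a> 

This small example shows the problem: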

import requests 
from bs4 import BeautifulSoup 

start_url = "..." 
next_chapter_bad = "Next Chapter ]&gt;" 
next_chapter_good = "Next Chapter ]>" 

""" 
<td class="comic_navi_right"> 
    <a href="..." class="navi navi-next-chap" title="Next Chapter ]&gt;">Next Chapter ]&gt;</a> 
    <a href="..." class="navi comic-nav-next navi-next" title="Next Page &gt;">Next Page &gt;</a> 
    <a href="..." class="navi navi-last" title="Most Recent Page &gt;&gt;">Most Recent Page &gt;&gt;</a> 
</td> 
""" 
page = requests.get(start_url) 
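if page.status_code != requests.codes.ok: 
    # the original snippet did `return ''` here, inside a function 
    raise SystemExit('request failed') 

soup = BeautifulSoup(page.text, 'html.parser') 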
# get the url for the "Next chapter" link 
next_link = soup.find('a', href=True, string=next_chapter_bad) 
print(next_link) 
next_link = soup.find('a', href=True, string=next_chapter_good) 
print(next_link) 

(The output shown above has only the actual URLs removed.)

Answer

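&gt; is an escaped >, so you need to unescape it (https://stackoverflow.com/a/2087433/4183498) before searching for it. BeautifulSoup has already decoded the entities in the parsed text, so the tag's string contains a literal >.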

# Python 3.4+; on Python 2 use HTMLParser().unescape from the HTMLParser module instead 
from html import unescape 

... 

soup = BeautifulSoup(page.text, 'html.parser') 
# get the url for the "Next chapter" link 
# unescape() turns "&gt;" into ">", which matches the already-decoded tag text 
next_link = soup.find('a', href=True, string=unescape(next_chapter_bad)) 
print(next_link)
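
If you want to try this without hitting the site, here is a minimal self-contained sketch of the same idea. It assumes Python 3.4+ (for html.unescape) and a bs4 version that accepts the string argument to find(). SAMPLE_HTML is just the markup from the question pasted inline, and the names escaped, miss, and hit are made up for illustration.

```python
from html import unescape

from bs4 import BeautifulSoup

# Inline copy of the navigation markup from the question (URLs elided as in the original).
SAMPLE_HTML = """
<td class="comic_navi_right">
    <a href="..." class="navi navi-next-chap" title="Next Chapter ]&gt;">Next Chapter ]&gt;</a>
    <a href="..." class="navi comic-nav-next navi-next" title="Next Page &gt;">Next Page &gt;</a>
</td>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

escaped = "Next Chapter ]&gt;"

# The parser has already decoded entities in text nodes,
# so searching for the escaped form finds nothing.
miss = soup.find("a", href=True, string=escaped)

# Unescaping the search string first makes it equal to the decoded text node.
hit = soup.find("a", href=True, string=unescape(escaped))

print(miss)
print(hit)
```

The first print should show None and the second the "Next Chapter" link, the same pattern as the output in the question. If the link text is likely to change, matching on the class (for example navi-next-chap) may be more robust than matching on the text, but that is a separate design choice.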