Parsing a site using BeautifulSoup

For my school project I need to scrape some data from this site (http://www.boxofficemojo.com/monthly/?view=releasedate&chart=&month=1&yr=2006). I have looked through the BeautifulSoup documentation, but I am stuck. Here is my setup so far:

import urllib 
import re 
from bs4 import BeautifulSoup 

# Python 2 API; in Python 3 the equivalent call is urllib.request.urlopen
html_source = urllib.urlopen('http://www.boxofficemojo.com/monthly/?view=releasedate&chart=&month=1&yr=2006').read()

soup = BeautifulSoup(html_source, 'lxml')

I first tried to get just the first title, but when I looked at the HTML I saw that the tags have no id or class. So I started with soup.find_all(href=re.compile("movies")), because the tags look like <td><b><font size="2"><a href="/movies/?id=bigmommashouse2.htm">Big Momma's House 2</a></font></b></td> and the href always starts with "/movies" in the title column. However, this returned not only the titles but also unwanted values from links at the top of the page, whose tags look almost identical: <a href="/movies/?id=beautyandthebeast2017.htm">#1 Movie: 'Beauty and the Beast'</a>

Then I tried soup.select("td b font a"), but those links have the same nesting structure, so it returned the same garbage values. Is there a way to get only the titles? In the end I need to save the title, total gross, open and close columns of the table as a csv file for each month of every year.


2017-03-28 hklee93

You can use the lxml module to fetch the data from the given URL. The code further down produces this result:

[{'close': '6/1', 'gross': '$70,165,972', 'open': '1/27', 'title': "Big Momma's House 2"}, {'close': '3/12', 'gross': '$62,318,875', 'open': '1/20', 'title': 'Underworld: Evolution'}, {'close': '2/16', 'gross': '$47,326,473', 'open': '1/6', 'title': 'Hostel'}, {'close': '5/4', 'gross': '$47,144,110', 'open': '1/27', 'title': 'Nanny McPhee'}, {'close': '5/11', 'gross': '$42,647,449', 'open': '1/13', 'title': 'Glory Road'}, {'close': '3/9', 'gross': '$38,399,961', 'open': '1/13', 'title': 'Last Holiday'}, {'close': '4/13', 'gross': '$17,127,992', 'open': '1/27', 'title': 'Annapolis'}, {'close': '3/30', 'gross': '$14,734,633', 'open': '1/13', 'title': 'Tristan and Isolde'}, {'close': '3/9', 'gross': '$11,967,000', 'open': '1/20', 'title': 'End of the Spear'}, {'close': '6/25', 'gross': '$10,407,978', 'open': '1/27', 'title': 'Roving Mars (IMAX)'}, {'close': '2/23', 'gross': '$6,090,172', 'open': '1/6', 'title': "Grandma's Boy"}, {'close': '1/22', 'gross': '$2,405,420', 'open': '1/6', 'title': 'BloodRayne'}, {'close': '4/6', 'gross': '$2,197,694', 'open': '1/27', 'title': 'Rang De Basanti'}, {'close': '5/18', 'gross': '$1,439,972', 'open': '1/20', 'title': 'Why We Fight'}, {'close': '5/4', 'gross': '$1,253,413', 'open': '1/27', 'title': 'Tristram Shandy: A Cock and Bull Story'}, {'close': '3/9', 'gross': '$888,975', 'open': '1/20', 'title': 'Looking for Comedy in the Muslim World'}, {'close': '3/23', 'gross': '$672,243', 'open': '1/27', 'title': 'Imagine Me and You'}, {'close': '1/29', 'gross': '$332,491', 'open': '1/13', 'title': 'Zinda'}, {'close': '3/9', 'gross': '$274,245', 'open': '1/20', 'title': 'Dirty'}, {'close': '5/4', 'gross': '$196,857', 'open': '1/6', 'title': 'Fateless'}, {'close': '2/23', 'gross': '$145,626', 'open': '1/27', 'title': 'Bubble'}, {'close': '3/23', 'gross': '$78,378', 'open': '1/27', 'title': 'Manderlay'}, {'close': '2/12', 'gross': '$65,429', 'open': '1/20', 'title': 'The Real Dirt on Farmer John'}, {'close': '2/26', 'gross': '$55,398', 
'open': '1/13', 'title': 'That Man: Peter Berlin'}, {'close': '8/24', 'gross': '$53,580', 'open': '1/27', 'title': 'La Petite Jerusalem'}, {'close': '2/2', 'gross': '$29,710', 'open': '1/13', 'title': 'Henri Cartier-Bresson: The Impassioned Eye'}, {'close': '4/6', 'gross': '$24,038', 'open': '1/13', 'title': 'When the Sea Rises'}, {'close': '1/16', 'gross': '$20,055', 'open': '1/11', 'title': 'State of Fear'}, {'close': '4/9', 'gross': '$17,341', 'open': '1/13', 'title': 'Film Geek'}, {'close': '3/30', 'gross': '$16,377', 'open': '1/13', 'title': "April's Shower"}, {'close': '1/29', 'gross': '$11,290', 'open': '1/27', 'title': 'Live Freaky! Die Freaky!'}, {'close': '1/26', 'gross': '$5,716', 'open': '1/20', 'title': 'Pizza'}]

The code that produces this output:

import requests
from lxml import html

url = "http://www.boxofficemojo.com/monthly/?view=releasedate&chart=&month=1&yr=2006"

response = requests.get(url)
soup = html.fromstring(response.content)  # an lxml element tree, not a BeautifulSoup object
result_list = []
# Take the first table under div#body/center and skip its first two rows.
for row in soup.xpath('//div[@id="body"]/center/table')[0].xpath('.//tr')[2:]:
    data = row.xpath('./td//text()')  # all text nodes in this row's cells
    # data[8] is read below, so a row needs at least 9 text nodes.
    if len(data) >= 9:
        result_list.append({'title': data[1].strip(), 'gross': data[3].strip(),
                            'open': data[7].strip(), 'close': data[8].strip()})

print(result_list)

For a better understanding, you can refer to the documentation for scraping and lxml.
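Since the question asks for one CSV per month and year, the same scraping code could be run over a series of URLs. A minimal sketch of generating them, assuming the `month` and `yr` query parameters work for other values the way they do in the URL above. `monthly_url` is a hypothetical helper name, and the year range is only an example:

```python
from urllib.parse import urlencode

BASE_URL = "http://www.boxofficemojo.com/monthly/"


def monthly_url(year, month):
    """Build the monthly chart URL for one year/month (hypothetical helper)."""
    query = urlencode({"view": "releasedate", "chart": "", "month": month, "yr": year})
    return BASE_URL + "?" + query


# Example range only: every month of 2006 and 2007.
urls = [monthly_url(year, month) for year in range(2006, 2008) for month in range(1, 13)]
print(urls[0])
```

Each URL could then be passed to `requests.get` in place of the fixed `url` used in the answer.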


2017-03-28 09:40:23

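This is what I was looking for. Thank you!! I will look through the documentation on xpath. – hklee93

By the way, is there a way to keep the result list from being sorted alphabetically? I want it in the order I added it.. – hklee93

As you can see in the code, no sorting logic is used; the results are in the same order as on the source page. –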
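If the concern is the order of the keys inside each dict rather than the order of the movies, one option is to fix the column order explicitly when writing the CSV (dicts in Python 3.7+ also keep insertion order on their own). A sketch using the standard `csv` module, with two rows copied from the output above. Writing into a `StringIO` buffer is only for illustration; a real script would open a file instead:

```python
import csv
import io

rows = [
    {'close': '6/1', 'gross': '$70,165,972', 'open': '1/27', 'title': "Big Momma's House 2"},
    {'close': '3/12', 'gross': '$62,318,875', 'open': '1/20', 'title': 'Underworld: Evolution'},
]

buffer = io.StringIO()
# fieldnames decides the column order, regardless of the key order in each dict.
writer = csv.DictWriter(buffer, fieldnames=['title', 'gross', 'open', 'close'])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The gross values contain commas, so the writer wraps them in double quotes.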

First of all, you need to find the right table. You can see that it sits inside a "center" tag. Good. Try: soup.find('center'). The table you want is the first one, so table = soup.find('center').find_all('table')[0].
Now look for the links in this table (coded properly, this is an order of magnitude faster with BeautifulSoup):

trs = iter(table.find_all('tr'))
next(trs)  # skip the first row
for tr in trs:
    try:
        link = tr.find_all('a')[0]  # first link in the row
        print(link['href'])
        print(link.get_text())
    except (IndexError, KeyError):
        print("can't find a")
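To pull out the other columns as well, the same table-first idea can be extended to read every cell of a row. A sketch on a small inline HTML sample: the sample markup, the second movie id, and the column positions are assumptions that only mimic the structure described in the question, so the indices would need checking against the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like the table described in the question.
sample_html = """
<center>
<table>
<tr><td>Rank</td><td>Title</td><td>Gross</td><td>Open</td><td>Close</td></tr>
<tr><td>1</td><td><b><font size="2"><a href="/movies/?id=bigmommashouse2.htm">Big Momma's House 2</a></font></b></td><td>$70,165,972</td><td>1/27</td><td>6/1</td></tr>
<tr><td>2</td><td><b><font size="2"><a href="/movies/?id=underworld2.htm">Underworld: Evolution</a></font></b></td><td>$62,318,875</td><td>1/20</td><td>3/12</td></tr>
</table>
</center>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('center').find_all('table')[0]

movies = []
for tr in table.find_all('tr')[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if tr.find('a') is None or len(cells) < 5:
        continue  # not a movie row
    movies.append({'title': cells[1], 'gross': cells[2],
                   'open': cells[3], 'close': cells[4]})

print(movies)
```

Rows without a link are skipped, which also filters out header or summary rows that do not describe a movie.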


2017-03-28 09:15:41 Artur

Actually, I did think of skipping the first one at first, but I could not get the other attributes nicely. But thank you!! – hklee93

Another approach (not a 'beautiful' one), which also works when the tags or the page structure change, even erroneously:

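# html_source is the page source read in the question (a str; in Python 3, decode the bytes first)
htmlSection = html_source[html_source.find('<a href="/movies/?id=')+21:]
# ^--- skip the first occurrence (it doesn't belong to the table of interest)
while htmlSection.find('<a href="/movies/?id=') != -1:
    # jump to the next movie link, then to the end of its opening tag
    htmlSection = htmlSection[htmlSection.find('<a href="/movies/?id='):]
    htmlSection = htmlSection[htmlSection.find('>'):]
    # the title is everything between that '>' and the closing </a>
    titleEndPos = htmlSection.find('</a>')
    strTitle = htmlSection[1:titleEndPos]
    print(strTitle)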

Based on the same .find('headerText') / .find('trailerText') principle, you can also extract the other information using only basic Python string operations.
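The header/trailer principle can be wrapped in a small helper. This is only a sketch: `extract_between` is a hypothetical name, and the sample string is the tag quoted in the question:

```python
def extract_between(text, header, trailer, start=0):
    """Return (value, next_pos) for the text between header and trailer.

    Returns (None, -1) when either marker is missing after `start`.
    """
    begin = text.find(header, start)
    if begin == -1:
        return None, -1
    begin += len(header)
    end = text.find(trailer, begin)
    if end == -1:
        return None, -1
    return text[begin:end], end + len(trailer)


sample = ('<td><b><font size="2"><a href="/movies/?id=bigmommashouse2.htm">'
          "Big Momma's House 2</a></font></b></td>")

# First the id inside the href, then the link text that follows it.
movie_id, pos = extract_between(sample, '<a href="/movies/?id=', '"')
title, pos = extract_between(sample, '>', '</a>', pos)
print(movie_id, title)
```

Passing the returned position back in as `start` lets the next call continue from where the previous one stopped, which is how several fields of one row could be read in sequence.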


2017-03-28 12:57:20 Claudio

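As I replied on the other answer, I did think of skipping the first one, but I could not format the result the way I wanted. But thank you! I will take a look at how htmlSection works. – hklee93

The 'not beautiful' approach shows its advantages when the html page is malformed and it is hard to find the cause of the trouble from the strange results or error messages of the other approaches. As long as the source code of the html page is 'beautiful', I see no reason not to use the xml/html modules. – Claudio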
