2016-10-15

Web scraping with BeautifulSoup

The following div tag is taken directly from espncricinfo.com:

<div id="rectPlyr_Playerlistt20" style="display: none; visibility: hidden; 
    background:url(http://i.imgci.com/espncricinfo/ciPlayerTablebottom-bg.gif) bottom left no-repeat;"> 
    <table class="playersTable" cellpadding="0" cellspacing="0" style="margin-top:15px; margin-bottom:14px;"> 
     <td class="divider"><a href="/ci/content/player/26421.html">R Ashwin</a></td> 
     <td class="divider"><a href="/ci/content/player/27223.html">STR Binny</a></td> 
     <td class=""><a href="/ci/content/player/625383.html">JJ Bumrah</a></td> 
    </tr> 
    <tr class="odd"> 
     <td class="divider"><a href="/ci/content/player/430246.html">YS Chahal</a></td> 
     <td class="divider"><a href="/ci/content/player/290727.html">R Dhawan</a></td> 
     <td class=""><a href="/ci/content/player/28235.html">S Dhawan</a></td> 
    </tr> 
    <tr class=""> 
     <td class="divider"><a href="/ci/content/player/28081.html">MS Dhoni</a></td> 
     <td class="divider"><a href="/ci/content/player/28671.html">FY Fazal</a></td> 
     <td class=""><a href="/ci/content/player/28763.html">G Gambhir</a></td> 
    </tr> 
    <tr class="odd"> 
     <td class="divider"><a href="/ci/content/player/234675.html">RA Jadeja</a></td> 
     <td class="divider"><a href="/ci/content/player/290716.html">KM Jadhav</a></td> 
     <td class=""><a href="/ci/content/player/253802.html">V Kohli</a></td> 
    </tr> 
    <tr class=""> 
     <td class="divider"><a href="/ci/content/player/277955.html">DS Kulkarni</a></td> 
     <td class="divider"><a href="/ci/content/player/326016.html">B Kumar</a></td> 
     <td class=""><a href="/ci/content/player/398506.html">Mandeep Singh</a></td> 
    </tr> 
    <tr class="odd"> 
     <td class="divider"><a href="/ci/content/player/31107.html">A Mishra</a></td> 
     <td class="divider"><a href="/ci/content/player/481896.html">Mohammed Shami</a></td> 
     <td class=""><a href="/ci/content/player/290630.html">MK Pandey</a></td> 
    </tr> 
    <tr class=""> 
     <td class="divider"><a href="/ci/content/player/554691.html">AR Patel</a></td> 
     <td class="divider"><a href="/ci/content/player/32540.html">CA Pujara</a></td> 
     <td class=""><a href="/ci/content/player/277916.html">AM Rahane</a></td> 
    </tr> 
    <tr class="odd"> 
     <td class="divider"><a href="/ci/content/player/422108.html">KL Rahul</a></td> 
     <td class="divider"><a href="/ci/content/player/33141.html">AT Rayudu</a></td> 
     <td class=""><a href="/ci/content/player/279810.html">WP Saha</a></td> 
    </tr> 
    <tr class=""> 
     <td class="divider"><a href="/ci/content/player/236779.html">I Sharma</a></td> 
     <td class="divider"><a href="/ci/content/player/34102.html">RG Sharma</a></td> 
     <td class=""><a href="/ci/content/player/537126.html">BB Sran</a></td> 
    </tr> 
    <tr class="odd"> 
     <td class="divider"><a href="/ci/content/player/390484.html">JD Unadkat</a></td> 
     <td class="divider"><a href="/ci/content/player/237095.html">M Vijay</a></td> 
     <td class=""><a href="/ci/content/player/376116.html">UT Yadav</a></td> 
    </tr> 
    <tr class=""> 
    </tr> 
    </table> 
</div> 

To scrape the HTML above, I am using:

from bs4 import BeautifulSoup 
import os 
import urllib2 
BASE_URL = "http://www.espncricinfo.com" 
espn_ = urllib2.urlopen("http://www.espncricinfo.com/ci/content/player/index.html?country=6") 

soup = BeautifulSoup(espn_ , 'html.parser') 

#print soup.prettify().encode('utf-8') 
t20 = soup.find_all('div' , {"id" : "rectPlyr_Playerlistt20"}) 
for row in t20: 
    print(row.find('tr' , {"class":"odd"})) 

Assume the HTML above was taken from the URL given in the code. When I scrape it, the output is None.

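Even when I print t20, I do not get the entire output: it shows only up to JJ Bumrah, i.e. the first <tr> tag. If this is not clear from the data above, go to the URL assigned to espn_, select team India, and open the T20 tab. I want to scrape the href links of all the players under the T20 tab.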

Answers

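The HTML is badly broken. You only need to look at the first few lines of the table to see it: the first three <td> cells follow <table> directly, with no opening <tr>.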

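# Note: .content assumes espn_ is a requests response (e.g. requests.get(url));
# with urllib2, as in the question, pass espn_ itself instead.
soup = BeautifulSoup(espn_.content, 'html5lib') 

# Select the anchors directly, then step and slice.
t20 = soup.select("#rectPlyr_Playerlistt20 .playersTable td.divider a") 
for a in t20[1::2]: 
    print(a) 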

Which gives you:

<a href="/ci/content/player/27223.html">STR Binny</a> 
<a href="/ci/content/player/290727.html">R Dhawan</a> 
<a href="/ci/content/player/28671.html">FY Fazal</a> 
<a href="/ci/content/player/290716.html">KM Jadhav</a> 
<a href="/ci/content/player/326016.html">B Kumar</a> 
<a href="/ci/content/player/481896.html">Mohammed Shami</a> 
<a href="/ci/content/player/32540.html">CA Pujara</a> 
<a href="/ci/content/player/33141.html">AT Rayudu</a> 
<a href="/ci/content/player/34102.html">RG Sharma</a> 
<a href="/ci/content/player/237095.html">M Vijay</a> 
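
The output above lists only every second player from the first two columns. That follows from two choices in the snippet: the `td.divider` selector does not match the third cell of each row (its class is empty in the markup), and `[1::2]` keeps every other match starting at index 1. A small sketch with plain lists, using names from the first two rows of the table, may make the slicing clearer. The variable names are made up for illustration.

```python
# Anchors matched by "td.divider a" for the first two rows, in document order.
# The third cell of each row (JJ Bumrah, S Dhawan) has class="" and is not matched.
divider_names = ["R Ashwin", "STR Binny", "YS Chahal", "R Dhawan"]

# [1::2] starts at index 1 and takes every second item: the second column only.
second_column = divider_names[1::2]
print(second_column)

# [0::2] (or [::2]) would give the first column instead.
first_column = divider_names[0::2]
print(first_column)
```

Dropping the slice would keep both divider columns, but the third column would still be missing unless the selector is widened as well.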

Your best bet is to use lxml or html5lib as the parser, look for the anchors directly, and step and slice.
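
If the goal is every player link under the T20 tab, as the question asks, one possible variation is to drop the slice and match the anchors in every cell. The sketch below is not the answer's exact code. It is written for Python 3, assumes `beautifulsoup4` and `html5lib` are installed, and parses a shortened inline copy of the table (first two rows only) instead of fetching the page. The helper name `player_links` is hypothetical; `BASE_URL` comes from the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.espncricinfo.com"

# Shortened copy of the markup from the question, including the missing
# opening <tr> before the first row.
BROKEN_HTML = """
<div id="rectPlyr_Playerlistt20">
  <table class="playersTable">
    <td class="divider"><a href="/ci/content/player/26421.html">R Ashwin</a></td>
    <td class="divider"><a href="/ci/content/player/27223.html">STR Binny</a></td>
    <td class=""><a href="/ci/content/player/625383.html">JJ Bumrah</a></td>
   </tr>
   <tr class="odd">
    <td class="divider"><a href="/ci/content/player/430246.html">YS Chahal</a></td>
    <td class="divider"><a href="/ci/content/player/290727.html">R Dhawan</a></td>
    <td class=""><a href="/ci/content/player/28235.html">S Dhawan</a></td>
   </tr>
  </table>
</div>
"""


def player_links(html, base_url=BASE_URL):
    """Return (name, absolute URL) pairs for every player anchor in the T20 table."""
    soup = BeautifulSoup(html, "html5lib")
    # Match anchors in every cell, not only td.divider, and do not slice.
    anchors = soup.select("#rectPlyr_Playerlistt20 .playersTable td a")
    return [(a.get_text(strip=True), urljoin(base_url, a["href"])) for a in anchors]


for name, url in player_links(BROKEN_HTML):
    print(name, url)
```

For the live page, the same function could be given the downloaded HTML instead of `BROKEN_HTML`; whether the site still serves this markup is not something the sketch can guarantee.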