2014-12-20 4 views

BeautifulSoup: parsing HTML tags

I need to parse an HTML file with BeautifulSoup. The HTML looks like this:

<div class="entry_container"> 

     <div class="entry lang_en-gb" id="turn-over_1"> 
      <span class="inline"> 
      <h1 class="hwd">turn over</h1> 
      </span> 
      <div class="hom" id="turn-over_1.1"> 
      <span class="gramGrp"><span class="pos">intransitive verb</span></span> 
      <div class="sense"><span class="bold">1 </span><span class="gramGrp"><span class="colloc"><span>[</span>person<span>]</span></span></span><span class="lbl"><span> (</span>in bed<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">se retourner</span></span><span class="cit" id="turn-over_1.2"><span>; </span></span></div> 

      <div class="sense"><span> <br/></span><span class="bold">2 </span><span class="gramGrp"><span class="colloc"><span>[</span>car<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">se retourner</span></span><span>, </span><span class="cit lang_fr"><span class="quote">faire un tonneau</span></span><span class="cit" id="turn-over_1.3"><span>; </span></span></div> 

      <div class="sense"><span> <br/></span><span class="bold">3 </span><span class="lbl"><span>(= </span>switch TV channels<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">changer de chaîne</span></span><span class="cit" id="turn-over_1.4"><span>; </span></span></div> 

      </div> 

      <div class="hom" id="turn-over_1.5"> 
      <span> <br/>▶ </span><span class="gramGrp"><span class="pos">transitive verb</span></span> 
      <div class="sense"> 
       <span class="bold">1 </span> 
       <div class="sense"><span class="bold"> a </span><span class="gramGrp"><span class="colloc"><span>[</span><span>+ </span>object<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">retourner</span></span><span class="cit" id="turn-over_1.6"><span>; </span></span></div> 

       <div class="sense"><span class="bold"> b </span><span class="gramGrp"><span class="colloc"><span>[</span><span>+ </span>page<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">tourner</span></span></div> 

       <div class="sense"><span class="bold"> c </span><span class="gramGrp"><span class="colloc"><span>[</span><span>+ </span>tape<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">changer de face</span></span><span class="cit" id="turn-over_1.7"><span>; </span></span></div> 

      </div> 

      <div class="sense"><span> <br/></span><span class="bold">2 </span><span class="lbl"><span>(= </span>hand over<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">remettre</span></span><span class="cit" id="turn-over_1.8"><span>; </span></span><span class="cit" id="turn-over_1.9"><span>; </span></span></div> 

      </div>  
     </div> 

    </div> 

As you can see, I need to process each div class="hom".

For each one, I need to retrieve the part of speech (span class="pos") and the senses (each <div class="sense">).


So far I have tried this code:

for gramGrp in entryContentHTML.find_all('div',attrs={"class":u"hom"}): 
    for pos in gramGrp.find('span',attrs={"class":u"gramGrp"}).find('span',attrs={"class":u"pos"}): 
        print(pos) 

But the output is only:

intransitive verb 
intransitive verb 
transitive verb 

Answer


You will have to tidy up the output, but this will get you what you need:

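from bs4 import BeautifulSoup 

# `html` holds the markup shown in the question 
soup = BeautifulSoup(html, "html.parser") 

# For each div.hom: strip every line of its text and drop the ";" separators 
res = ["\n".join(s.strip() for s in x.text.splitlines()).replace(";", "") 
       for x in soup.find_all("div", {"class": "hom"})] 
print("\n".join(res)) 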

Output:


intransitive verb 
1 [person] (in bed) se retourner 
2 [car] se retourner, faire un tonneau 
3 (= switch TV channels) changer de chaîne 

▶ transitive verb 

1 
a [+ object] retourner 
b [+ page] tourner 
c [+ tape] changer de face 

2 (= hand over) remettre
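
If you would rather have structured data than one block of text, something along these lines might work. It is only a sketch: `parse_homs` and `SAMPLE_HTML` are names made up for this example, the sample is a shortened ASCII-only copy of the HTML in the question, and it assumes that a `div.sense` containing further `div.sense` children is just a grouping wrapper, so it is skipped (which means the leading "1" of sub-senses a and b is dropped).

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (hypothetical sample for illustration).
SAMPLE_HTML = """
<div class="hom" id="turn-over_1.1">
<span class="gramGrp"><span class="pos">intransitive verb</span></span>
<div class="sense"><span class="bold">1 </span><span class="gramGrp"><span class="colloc"><span>[</span>person<span>]</span></span></span><span class="lbl"><span> (</span>in bed<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">se retourner</span></span><span class="cit" id="turn-over_1.2"><span>; </span></span></div>
<div class="sense"><span> <br/></span><span class="bold">2 </span><span class="gramGrp"><span class="colloc"><span>[</span>car<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">se retourner</span></span><span>, </span><span class="cit lang_fr"><span class="quote">faire un tonneau</span></span><span class="cit" id="turn-over_1.3"><span>; </span></span></div>
</div>
<div class="hom" id="turn-over_1.5">
<span> <br/></span><span class="gramGrp"><span class="pos">transitive verb</span></span>
<div class="sense">
<span class="bold">1 </span>
<div class="sense"><span class="bold"> a </span><span class="gramGrp"><span class="colloc"><span>[</span><span>+ </span>object<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">retourner</span></span><span class="cit" id="turn-over_1.6"><span>; </span></span></div>
<div class="sense"><span class="bold"> b </span><span class="gramGrp"><span class="colloc"><span>[</span><span>+ </span>page<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">tourner</span></span></div>
</div>
<div class="sense"><span> <br/></span><span class="bold">2 </span><span class="lbl"><span>(= </span>hand over<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">remettre</span></span><span class="cit" id="turn-over_1.8"><span>; </span></span></div>
</div>
"""


def parse_homs(html):
    """Return one dict per div.hom: its part of speech and its leaf senses."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for hom in soup.find_all("div", class_="hom"):
        pos_tag = hom.find("span", class_="pos")
        pos = pos_tag.get_text(strip=True) if pos_tag else None

        senses = []
        for sense in hom.find_all("div", class_="sense"):
            # Assumption: a sense that contains other senses is only a wrapper.
            if sense.find("div", class_="sense"):
                continue
            # Collapse runs of whitespace, then drop the ";" separators.
            text = " ".join(sense.get_text().split()).replace(";", "").strip()
            translations = [q.get_text(strip=True)
                            for q in sense.find_all("span", class_="quote")]
            senses.append({"text": text, "translations": translations})

        entries.append({"pos": pos, "senses": senses})
    return entries


for entry in parse_homs(SAMPLE_HTML):
    print(entry["pos"])
    for sense in entry["senses"]:
        print("   ", sense["text"], "->", sense["translations"])
```

Keeping the `span.quote` texts in a separate list means you do not have to split the French translations back out of the flattened sense text afterwards.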