2012-09-29 2 views
3

Parsing an HTML table with Nokogiri in Ruby

I need to parse an HTML table from a URL using Nokogiri. My HTML looks like this:

<table class="tbl" cellspacing="1" cellpadding="4" id="gvResult" style="width:100%;"> 
    <tbody> 
     <tr class="trh"> 
     <th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$1')">Фирма</a></th> 
     <th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$2')">Артикул</a></th> 
     <th scope="col">Инф.</th> 
     <th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$3')">Описание</a></th> 
     <th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$6')">Нал.</a></th> 
     <th scope="col" style="width:55px;"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$8')">Мин. заказ, шт</a></th> 
     <th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$5')">Ожидаемый срок, дн. </a><a href="/help/hint/default.aspx?id=43" onclick="javascript:ShowTipLayer(this, event,this.href,30,20);return false;"><img src="http://s.exist.ru/img/q2.gif" alt="&#1055;&#1086;&#1084;&#1086;&#1097;&#1100;" /></a></th> 
     <th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$7')">Цена</a></th> 
     <th scope="col">&nbsp;</th> 
     </tr> 
     <tr> 
     <td class="tabletitle" colspan="12">Запрошенный артикул</td> 
     </tr> 
     <tr onclick="colorize(this);" id="item_0" tcolor=""> 
     <td class="artMerge" id="item_0" rowspan="2"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=47721138-000d-40c7-99f1-02d2f0005c83">Knecht (Mahle Filter)</a></td> 
     <td class="artMerge" rowspan="2" style="white-space:nowrap;">O * * * D</td> 
     <td class="artMerge" align="center" rowspan="2" style="white-space:nowrap;"></td> 
     <td class="artMerge" rowspan="2" style="padding:10px 10px 0 10px;">Фильтр масляный</td> 
     <td align="center">99</td> 
     <td align="center">1</td> 
     <td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&amp;s=0100f98e-5c00-38be-0297-2f1100002de2" target="_blank">0</a></td> 
     <td class="price" align="right">56&nbsp;400 р.</td> 
     <td class="basket"><a title="&#1050;&#1091;&#1087;&#1080;&#1090;&#1100;" href="/profile/orders/basket.aspx?pid=83A07C7A&amp;in=0100f98e-5c00-38be-0297-2f1100002de2&amp;sr=-4"></a></td> 
     </tr> 
     <tr onclick="colorize(this);" id="item_1" tcolor=""> 
     <td align="center">1782</td> 
     <td align="center">1</td> 
     <td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=1&amp;s=0100f98e-5c00-b5be-0297-0f1200002de2" target="_blank">1</a></td> 
     <td class="price" align="right">55&nbsp;000 р.</td> 
     <td class="basket"><a title="&#1050;&#1091;&#1087;&#1080;&#1090;&#1100;" href="/profile/orders/basket.aspx?pid=83A07C7A&amp;in=0100f98e-5c00-b5be-0297-0f1200002de2&amp;sr=-4"></a></td> 
     </tr> 
     <tr> 
     <td class="tabletitle" colspan="12">Аналоги (заменители) для запрошенного артикула <a href="/news/newstext.aspx?id=1367" target="_blank"><img src="http://s.exist.ru/img/q2.gif" alt="&#1055;&#1086;&#1084;&#1086;&#1097;&#1100;" /></a></td> 
     </tr> 
     <tr onclick="colorize(this);" id="item_2" tcolor=""> 
     <td class="firmname" id="item_2"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=e0c712d8-215d-4000-9a64-f02c7200005c">Alco</a></td> 
     <td style="white-space:nowrap;">M * * * 5</td> 
     <td align="center" style="white-space:nowrap;"></td> 
     <td>Фильтр масляный</td> 
     <td align="center">1</td> 
     <td align="center">1</td> 
     <td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&amp;s=0100f98e-5c00-71fb-0241-720c00002c7d" target="_blank">0</a></td> 
     <td class="price" align="right">37&nbsp;700 р.</td> 
     <td class="basket"><a title="&#1050;&#1091;&#1087;&#1080;&#1090;&#1100;" href="/profile/orders/basket.aspx?pid=83A07C7A&amp;in=0100f98e-5c00-71fb-0241-720c00002c7d&amp;sr=-4"></a></td> 
     </tr> 
     <tr onclick="colorize(this);" id="item_3" tcolor=""> 
     <td class="firmname" id="item_3"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=b113459e-001c-4000-a33e-5021bd20005c">Bosch</a></td> 
     <td style="white-space:nowrap;">1 * * * 9</td> 
     <td align="center" style="white-space:nowrap;"></td> 
     <td>Фильтр масляный</td> 
     <td align="center">8</td> 
     <td align="center">1</td> 
     <td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&amp;s=0100f98e-5c00-3495-0002-bd11000021c2" target="_blank">0</a></td> 
     <td class="price" align="right">30&nbsp;200 р.</td> 
     <td class="basket"><a title="&#1050;&#1091;&#1087;&#1080;&#1090;&#1100;" href="/profile/orders/basket.aspx?pid=83A07C7A&amp;in=0100f98e-5c00-3495-0002-bd11000021c2&amp;sr=-4"></a></td> 
     </tr> 
     <tr onclick="colorize(this);" id="item_4" tcolor=""> 
     <td class="firmname" id="item_4"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=40f20c71-000a-400f-bd7d-02c720005c78">Champion</a></td> 
     <td style="white-space:nowrap;">X * * * 6</td> 
     <td align="center" style="white-space:nowrap;"></td> 
     <td>Фильтр масляный</td> 
     <td align="center">1</td> 
     <td align="center">1</td> 
     <td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&amp;s=0100f98e-5c00-713d-0142-720c00002cb0" target="_blank">0</a></td> 
     <td class="price" align="right">59&nbsp;500 р.</td> 
     <td class="basket"><a title="&#1050;&#1091;&#1087;&#1080;&#1090;&#1100;" href="/profile/orders/basket.aspx?pid=83A07C7A&amp;in=0100f98e-5c00-713d-0142-720c00002cb0&amp;sr=-4"></a></td> 
     </tr> 
     <tr onclick="colorize(this);" id="item_5" tcolor=""> 
     <td class="firmname" id="item_5"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=50d11138-0008-4436-b40f-02d2f0005cc4">Clean filters</a></td> 
     <td style="white-space:nowrap;">M * * * 0</td> 
     <td align="center" style="white-space:nowrap;"></td> 
     <td>Фильтр масляный</td> 
     <td align="center">100</td> 
     <td align="center">1</td> 
     <td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&amp;s=0100f98e-5c00-385f-012b-2f1100002df5" target="_blank">0</a></td> 
     <td class="price" align="right">32&nbsp;500 р.</td> 
     <td class="basket"><a title="&#1050;&#1091;&#1087;&#1080;&#1090;&#1100;" href="/profile/orders/basket.aspx?pid=83A07C7A&amp;in=0100f98e-5c00-385f-012b-2f1100002df5&amp;sr=-4"></a></td> 
     </tr> 
     <tr onclick="colorize(this);" id="item_6" tcolor=""> 
     <td class="firmname" id="item_6"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=e0cb777d-0ecd-4000-8f30-802be890005c">Filtron</a></td> 
     <td style="white-space:nowrap;">O * * * 1</td> 
     <td align="center" style="white-space:nowrap;"></td> 
     <td>Фильтр масляный</td> 
     <td align="center">10</td> 
     <td align="center">1</td> 
     <td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&amp;s=0100f98e-5c00-b781-0297-e80c00002be2" target="_blank">0</a></td> 
     <td class="price" align="right">29&nbsp;000 р.</td> 
     <td class="basket"><a title="&#1050;&#1091;&#1087;&#1080;&#1090;&#1100;" href="/profile/orders/basket.aspx?pid=83A07C7A&amp;in=0100f98e-5c00-b781-0297-e80c00002be2&amp;sr=-4"></a></td> 
     </tr> 
     <tr onclick="colorize(this);" id="item_7" tcolor=""> 
     <td class="firmname" id="item_7"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=64e20c71-000c-40c6-8e51-02c720005c0f">Fram</a></td> 
     <td style="white-space:nowrap;">C * * * O</td> 
     <td align="center" style="white-space:nowrap;"></td> 
     <td>Фильтр масляный</td> 
     <td align="center">31</td> 
     <td align="center">1</td> 
     <td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&amp;s=0100f98e-5c00-7151-00e8-720c00002c21" target="_blank">0</a></td> 
     <td class="price" align="right">45&nbsp;500 р.</td> 
     <td class="basket"><a title="&#1050;&#1091;&#1087;&#1080;&#1090;&#1100;" href="/profile/orders/basket.aspx?pid=83A07C7A&amp;in=0100f98e-5c00-7151-00e8-720c00002c21&amp;sr=-4"></a></td> 
     </tr> 
     <tr onclick="colorize(this);" id="item_8" tcolor=""> 
     <td class="firmname" id="item_8"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=a1135e09-11ba-4000-a216-d02b3670005c">Hengst</a></td> 
     <td style="white-space:nowrap;">E * * * 8</td> 
     <td align="center" style="white-space:nowrap;"></td> 
     <td>Фильтр масляный</td> 
     <td align="center">10</td> 
     <td align="center">1</td> 
     <td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&amp;s=0100f98e-5c00-35e4-0024-361100002b92" target="_blank">0</a></td> 
     <td class="price" align="right">48&nbsp;900 р.</td> 
     <td class="basket"><a title="&#1050;&#1091;&#1087;&#1080;&#1090;&#1100;" href="/profile/orders/basket.aspx?pid=83A07C7A&amp;in=0100f98e-5c00-35e4-0024-361100002b92&amp;sr=-4"></a></td> 
     </tr> 
     <tr onclick="colorize(this);" id="item_9" tcolor=""> 
     <td class="firmname" id="item_9"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" 
     ... 
     </tr> 
    </tbody> 
</table> 

The page also contains Russian characters.

My Ruby code is as follows:

html = open('http://exist.by/price.aspx?pcode=ox143d')
require 'nokogiri'
require 'pp'
doc = Nokogiri::HTML(html)
doc.encoding = 'utf-8'

rows = doc.xpath('//table[@id="gvResultTable"]/tbody/tr[@id="item_1"]')
@details = rows.collect do |row|
  detail = {}
  [
    [:firmname, 'td[1]/text()'],
    [:price, 'td[8]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp @details
logger.warn("!!!!!!!!!!")
logger.warn(@details)

I don't know how to correctly get the data from the tr rows by their item id.

+1

Can you explain specifically which part is causing trouble and what your question is?

+0

@Mark Thomas I need to get data from the table (you can see the URL in the code). From it I need the brand and the price for my app. Part of the table code is shown above. – byCoder

Answers

3
  1. The HTML is flawed: more than one element has the same id attribute (<tr onclick="colorize(this);" id="item_2" tcolor=""> <td class="firmname" id="item_2">).
  2. The id of the table element in the HTML is "gvResult", but your Ruby code asks Nokogiri to find a table with id "gvResultTable".
  3. Nokogiri uses UTF-8 internally to store strings, so the Russian characters should not be a problem.

Provided the HTML can be fixed, this works fine:

HTML :

<table id="gvResult"> 
    <tbody> 
    <tr id="item_1"> 
     <td class="firmname">Example1</td> 
     <td class="price">42.00</td> 
    </tr> 
    <tr id="item_2"> 
     <td class="firmname">Example2</td> 
     <td class="price">24.00</td> 
    </tr> 
    </tbody> 
</table> 

Ruby:

require 'open-uri'
require 'nokogiri'
require 'pp'

# URI.open is provided by open-uri; on Ruby older than 2.5 use plain open instead.
html = URI.open('http://www.example.com/page')

doc = Nokogiri::HTML(html)
doc.encoding = 'utf-8'

# Select every tr whose id starts with "item_".
rows = doc.search('//tr[starts-with(@id, "item_")]')
@details = rows.collect do |row|
  detail = {}
  [
    [:firmname, 'td[1]/text()'],
    [:price, 'td[2]/text()'],
  ].each do |name, xpath|
    # Each XPath is evaluated relative to the current row.
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp @details

I assumed you want the data from every tr element whose id looks like "item_\d+", so I used doc.search('//tr[starts-with(@id, "item_")]'). Change it to suit your needs.

+0

I can't change the HTML... it's not my page – byCoder

+0

I did say "provided the HTML can be fixed..." :-) The code will still work.

+0

OK, I'll try it later, thanks... – byCoder