2013-12-12 10 views
0

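Scrapy crawler scraping extra information

I am a beginner with Scrapy and tried to crawl Hacker News. I can get all the links and titles from the site, but items with an empty title and link are also scraped along with the real data. How can I avoid this, or is there perhaps an error in how I declared my XPaths?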

spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from hn.items import HnItem

class HNSpider(BaseSpider):
    name = "hn"
    allowed_domains = ["https://news.ycombinator.com/"]
    start_urls = [
        "https://news.ycombinator.com/"
    ]

    def parse(self, response):
        selector = Selector(response)
        sites = selector.xpath('//td[@class="title"]')
        items = []
        for site in sites:
            item = HnItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            items.append(item)
        for item in items:
            yield item

The output is:

2013-12-12 11:50:46+0530 [hn] DEBUG: Crawled (200) <GET https://news.ycombinator.com/> (referer: None) 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=11171475'], 
     'title': [u'Backpacker stripped of tech gear at Auckland Airport']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://sivers.org/ws'], 'title': [u'Why was this secret?']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.theatlantic.com/politics/archive/2013/12/how-americans-were-deceived-about-cell-phone-location-data/282239/'], 
     'title': [u'How Americans Were Deceived About Cell Phone Location Data']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.rockpapershotgun.com/2013/12/11/youtube-blocks-game-videos-industry-offers-help/'], 
     'title': [u'YouTube Blocks Game Videos, Industry Offers Help']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blog.fsck.com/2013/12/better-and-better-keyboards.html'], 
     'title': [u'Prototype ergonomic mechanical keyboards']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.timmins.net/2013/12/11/how-att-verizon-and-comcast-are-working-together-to-screw-you-by-discontinuing-landline-service/'], 
     'title': [u'How AT&T, Verizon, and Comcast are working together to screw you']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blog.samaltman.com/h5n1'], 'title': [u'H5N1']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.digitaltrends.com/gadgets/parents-dislike-infant-seat-ipad-mount/'], 
     'title': [u'Parents Revolt Over Fisher-Price Infant Seat With Face-Level iPad Mount ']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'https://www.fsf.org/news/reform-corporate-surveillance'], 
     'title': [u'Reform corporate surveillance']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://googledrive.blogspot.com/2013/12/newsheets.html?m=1'], 
     'title': [u'New Google Sheets: faster, more powerful, and works offline']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blogs.marketwatch.com/thetell/2013/12/11/fidelity-now-allows-clients-to-put-bitcoins-in-iras/'], 
     'title': [u'Fidelity now allows clients to put bitcoins in IRAs']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://bitmason.blogspot.ca/2013/09/what-are-containers-anyway.html'], 
     'title': [u'What are Linux containers and how did they come about?']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.cbc.ca/news/canada/ottawa/canada-post-to-phase-out-urban-home-mail-delivery-1.2459618'], 
     'title': [u'Canada Post to phase out urban home mail delivery']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.reuters.com/article/2013/12/11/fda-antibiotic-idUSL3N0JQ36T20131211'], 
     'title': [u'U.S. FDA to phase out some antibiotic use in animal production']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'https://lists.gnu.org/archive/html/guix-devel/2013-12/msg00061.html'], 
     'title': [u'GNU Guix 0.5 released']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'https://sites.google.com/site/ancientbharat/home'], 
     'title': [u'Ancient Indian Texts']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.creativebloq.com/responsive-design-tools-8134180'], 
     'title': [u'Responsive design tools']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.keacher.com/1216/how-i-introduced-a-27-year-old-computer-to-the-web/'], 
     'title': [u'How I introduced a 27-year-old computer to the web']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blog.sendtoinc.com/2013/12/11/silicon-valley-internship-j1-visa/'], 
     'title': [u'How to intern in Silicon Valley with a J1 visa']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'https://www.crowdtilt.com/campaigns/project-marilyn-part-i?utm_source=HackerNews&utm_medium=HNPost&utm_campaign=ProjectMarilyn'], 
     'title': [u'Project Marilyn Part I: Non-Patented Cancer Pharmaceutical']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://steamcommunity.com/groups/steamuniverse#announcements/detail/1930088300965516570'], 
     'title': [u'Steam Machines and Steam Controller shipping to beta participants December 13th']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blog.alexmaccaw.com/an-engineers-guide-to-stock-options'], 
     'title': [u'An Engineer\u2019s guide to Stock Options']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.vim3d.com/'], 
     'title': [u'Vim3D \u2013 A new 3D vi clone [video]']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://da-data.blogspot.com/2013/12/briefly-profitable-alt-coin-mining-on.html'], 
     'title': [u'Briefly profitable alt-coin mining on Amazon through better code']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blog.jetbrains.com/idea/2013/12/intellij-idea-13-brings-a-full-bag-of-goodies-to-android-developers/'], 
     'title': [u'IntelliJ IDEA 13 Brings a Full Bag of Goodies to Android Developers']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://crowdmed.theresumator.com/apply/'], 
     'title': [u'CrowdMed (YC W13) is hiring a VP of Marketing + Web Dev and Design Interns']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://jh3y.github.io/tyto/'], 'title': [u'Show HN: tyto']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.washingtonpost.com/blogs/the-switch/wp/2013/12/10/nsa-uses-google-cookies-to-pinpoint-targets-for-hacking/'], 
     'title': [u'NSA uses Google cookies to pinpoint targets for hacking']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'https://access.redhat.com/site/products/Red_Hat_Enterprise_Linux/Get-Beta?intcmp=70160000000cINoAAM'], 
     'title': [u'Red Hat Enterprise Linux 7 Beta']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://thenextweb.com/dd/2013/12/11/digia-releases-qt-5-2-android-ios-support-previews-windows-rt-launches-qt-mobile-edition/'], 
     'title': [u'Digia releases Qt 5.2 with Android and iOS support']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'news2'], 'title': [u'More']} 
2013-12-12 11:50:46+0530 [hn] INFO: Closing spider (finished) 

As you may have noticed in the output, items with an empty title [] and link [] are repeated all the way through.

How can I fix this? Please help.

Answers

1

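There are essentially two ways to do it:

  1. Skip the item in the spider: only add it to the items collection when it actually has a title and a link. Because extract() returns an empty list when nothing matches, the keys are always set, so test the values rather than the presence of the keys:

    if item['title'] and item['link']:
        items.append(item)

  2. Use a Scrapy item pipeline (http://doc.scrapy.org/en/latest/topics/item-pipeline.html): add a simple pipeline that drops any item whose title or link is empty:

    from scrapy.exceptions import DropItem

    class DropEmptyPipeline(object):
        def process_item(self, item, spider):
            # extract() yields [] when nothing matches, so check truthiness,
            # not key presence
            if item.get("title") and item.get("link"):
                return item
            else:
                raise DropItem("Missing title or link in %s" % item)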
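
Since the question also asks whether the XPath itself is at fault, here is a minimal sketch of why the empty items probably appear and how a narrower path avoids them. It assumes, based on the alternating pattern in the output, that each story row has two cells with class "title": a rank cell with no link and the story cell with the link. The ROW markup and the helper names extract_all_cells and extract_link_cells are hypothetical, and the standard library's ElementTree stands in for Scrapy's Selector so the snippet runs on its own. In the spider, the corresponding change would be to select '//td[@class="title"]/a' and read 'text()' and '@href' from each match.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one Hacker News row: both the rank
# cell and the story cell carry class="title", but only the latter has a link.
ROW = """
<table>
  <tr>
    <td class="title">1.</td>
    <td class="title"><a href="http://sivers.org/ws">Why was this secret?</a></td>
  </tr>
</table>
"""


def extract_all_cells(root):
    """Mimic the original spider: one item per td.title, link or not."""
    items = []
    for cell in root.findall(".//td[@class='title']"):
        links = cell.findall("a")
        items.append({
            "title": [a.text for a in links],
            "link": [a.get("href") for a in links],
        })
    return items


def extract_link_cells(root):
    """Select the anchors directly, so cells without a link never match."""
    return [
        {"title": [a.text], "link": [a.get("href")]}
        for a in root.findall(".//td[@class='title']/a")
    ]


root = ET.fromstring(ROW)
all_items = extract_all_cells(root)    # includes one empty item
link_items = extract_link_cells(root)  # only the real story

# The empty item still has both keys, so only a truthiness test filters it.
kept = [it for it in all_items if it["title"] and it["link"]]

print(all_items)
print(link_items)
```

Selecting the anchors directly removes the empty items at the source, so no filtering step is needed for them. The "More" pagination link at the bottom of the page (the last item in the output above) would still match this path and would need to be excluded separately if it is unwanted.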