I'm getting started with Scrapy to crawl a site, but when I test the code I hit an error I can't figure out how to resolve. It looks like it may be my parse function and callback; I get a NotImplementedError exception when crawling with Scrapy.
...
2012-12-18 02:07:19+0000 [dmoz] DEBUG: Crawled (200) <GET http://MYURL.COM> (referer: None)
2012-12-18 02:07:19+0000 [dmoz] ERROR: Spider error processing <GET http://MYURL.COM>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 1178, in mainLoop
self.runUntilCurrent()
File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 368, in callback
self._startRunCallbacks(result)
File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 464, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 551, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.3-py2.7.egg/scrapy/spider.py", line 57, in parse
raise NotImplementedError
exceptions.NotImplementedError:
2012-12-18 02:07:19+0000 [dmoz] INFO: Closing spider (finished)
2012-12-18 02:07:19+0000 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 357,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 20704,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 12, 18, 2, 7, 19, 595977),
'log_count/DEBUG': 7,
'log_count/ERROR': 1,
'log_count/INFO': 4,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2012, 12, 18, 2, 7, 18, 836322)}
That is the error output. I tried removing the rule and it worked, but then only a single URL gets crawled, and I need to crawl the entire site. Here is my code; any tips pointing me in the right direction would be appreciated:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item

class DmozSpider(BaseSpider):
    name = "dmoz"
    start_urls = ["http://MYURL.COM"]
    rules = (Rule(SgmlLinkExtractor(allow_domains=('http://MYURL.COM',)), callback='parse_l', follow=True),)

    def parse_l(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class=\'content\']')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('//div[@class=\'gig-title-g\']/h1').extract()
            item['link'] = site.select('//ul[@class=\'gig-stats prime\']/li[@class=\'queue \']/div[@class=\'big-txt\']').extract()
            item['desc'] = site.select('//li[@class=\'thumbs\'][1]/div[@class=\'gig-stats-numbers\']/span').extract()
            items.append(item)
        return items
Thank you.
See the second answer here: http://stackoverflow.com/questions/5264829/why-does-scrapy-throw-an-error-for-me-when-trying-to-spider-and-parse-a-site – Pspi
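For reference, the linked answer boils down to this: the `rules` attribute is only honored by `CrawlSpider`, while `BaseSpider` leaves `parse()` abstract, so when a response arrives with no overridden callback, Scrapy hits the `raise NotImplementedError` visible in the traceback. A minimal stdlib-only sketch of that dispatch pattern (the class names here are hypothetical stand-ins, not real Scrapy code):

```python
class BaseSpiderSketch:
    """Stand-in for Scrapy's BaseSpider: parse() is the default
    callback for every response, and it is deliberately abstract."""
    def parse(self, response):
        raise NotImplementedError

class CrawlSpiderSketch(BaseSpiderSketch):
    """Stand-in for Scrapy's CrawlSpider: it overrides parse() to
    apply the rules and dispatch to the user-named callback."""
    def parse(self, response):
        # CrawlSpider reserves parse() for its own rule handling,
        # which is why user callbacks must use another name (parse_l).
        return self.parse_l(response)

    def parse_l(self, response):
        return "parsed: " + response

# Subclassing the base class without overriding parse() reproduces
# the error from the question's traceback:
try:
    BaseSpiderSketch().parse("http://MYURL.COM")
except NotImplementedError:
    print("BaseSpider without a parse() override -> NotImplementedError")

# The CrawlSpider-style subclass dispatches to the named callback:
print(CrawlSpiderSketch().parse("http://MYURL.COM"))
```

In real Scrapy, the fix this suggests is to subclass `CrawlSpider` instead of `BaseSpider` when using `rules`, and to keep the callback named something other than `parse`.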