-1
도움이 필요합니다. 특정 웹 사이트 (underminejournal)에 대한 크롤러를 만들고 싶었습니다. 나는 주로 콘솔에서 작동하기 때문에 콘솔 출력을 생성하기 위해이 데이터를 사이트에서 가져오고 싶습니다. 자주 바꾸지 않으려합니다. 다른 지점은 내가 데이터베이스 (SQL은 아무 문제가 없습니다)에 데이터를 밀어 싶습니다.치료 초보자는 예외가 발생합니다
# -*- coding: utf-8 -*-
import scrapy
class JournalSpider(scrapy.Spider):
name = "journal"
allowed_domains = ["theunderminejournal.com"]
start_urls = (
'theunderminejournal.com/#eu/eredar/item/124442',
)
def parse(self, response):
page = respinse.url.split("/")[-2]
filename = 'journal-%s.html' % page
with open(filename, 'wb') as f:
f.write(response.body)
self.log('Saved file %s' % filename)
pass
누군가가 힌트를 알고
2016-10-05 10:55:23 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine)
2016-10-05 10:55:23 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-10-05 10:55:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'}
2016-10-05 10:55:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-10-05 10:55:23 [boto] DEBUG: Retrieving credentials from metadata server.
2016-10-05 10:55:24 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-10-05 10:55:24 [boto] ERROR: Unable to read instance data, giving up
2016-10-05 10:55:24 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-10-05 10:55:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-10-05 10:55:24 [scrapy] INFO: Enabled item pipelines:
2016-10-05 10:55:24 [scrapy] INFO: Spider opened
2016-10-05 10:55:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-05 10:55:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-05 10:55:24 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
request = next(slot.start_requests)
File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
yield self.make_requests_from_url(url)
File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: theunderminejournal.com/#eu/eredar/item/124442
2016-10-05 10:55:24 [scrapy] INFO: Closing spider (finished)
2016-10-05 10:55:24 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 710944),
'log_count/DEBUG': 2,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'start_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 704378)}
2016-10-05 10:55:24 [scrapy] INFO: Spider closed (finished)
내 거미는 이것이다 :하지만 크롤러를 실행하려고 할 때 어떻게 든 난 그냥이 표시 얻을, 튜토리얼 내가 생각하는 정말 도움이되지 않습니다?
편집
2016-10-05 11:21:35 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine)
2016-10-05 11:21:35 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-10-05 11:21:35 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'}
2016-10-05 11:21:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-10-05 11:21:35 [boto] DEBUG: Retrieving credentials from metadata server.
2016-10-05 11:21:36 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-10-05 11:21:36 [boto] ERROR: Unable to read instance data, giving up
편집상의 오류는 완전히 관련이 없으며 어딘가에 연결할 수없는'boto' 패키지 때문에 발생합니다. 당신은 그것을 무시할 수 있습니다. 거미 자체가 작동합니까? – Granitosaurus