
I need some help. I wanted to build a crawler for a specific website (theunderminejournal.com). I mostly work in the console, so I want to pull the data from the site and produce console output, and I don't plan to change that very often. Another point is that I would eventually like to push the data into a database (SQL is no problem). As a Scrapy beginner, though, I keep running into an exception.

# -*- coding: utf-8 -*-
import scrapy


class JournalSpider(scrapy.Spider):
    name = "journal"
    allowed_domains = ["theunderminejournal.com"]
    start_urls = (
        'theunderminejournal.com/#eu/eredar/item/124442',
    )

    def parse(self, response):
        # dump the raw page body to a local file
        page = response.url.split("/")[-2]
        filename = 'journal-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Does anyone have a hint? This is what I get when I run it:

2016-10-05 10:55:23 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine) 
2016-10-05 10:55:23 [scrapy] INFO: Optional features available: ssl, http11, boto 
2016-10-05 10:55:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'} 
2016-10-05 10:55:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-10-05 10:55:23 [boto] DEBUG: Retrieving credentials from metadata server. 
2016-10-05 10:55:24 [boto] ERROR: Caught exception reading instance data 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url 
    r = opener.open(req, timeout=timeout) 
    File "/usr/lib/python2.7/urllib2.py", line 429, in open 
    response = self._open(req, data) 
    File "/usr/lib/python2.7/urllib2.py", line 447, in _open 
    '_open', req) 
    File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain 
    result = func(*args) 
    File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open 
    return self.do_open(httplib.HTTPConnection, req) 
    File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open 
    raise URLError(err) 
URLError: <urlopen error timed out> 
2016-10-05 10:55:24 [boto] ERROR: Unable to read instance data, giving up 
2016-10-05 10:55:24 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-10-05 10:55:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-10-05 10:55:24 [scrapy] INFO: Enabled item pipelines: 
2016-10-05 10:55:24 [scrapy] INFO: Spider opened 
2016-10-05 10:55:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-10-05 10:55:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-10-05 10:55:24 [scrapy] ERROR: Error while obtaining start requests 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request 
    request = next(slot.start_requests) 
    File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests 
    yield self.make_requests_from_url(url) 
    File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url 
    return Request(url, dont_filter=True) 
    File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__ 
    self._set_url(url) 
    File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url 
    raise ValueError('Missing scheme in request url: %s' % self._url) 
ValueError: Missing scheme in request url: theunderminejournal.com/#eu/eredar/item/124442 
2016-10-05 10:55:24 [scrapy] INFO: Closing spider (finished) 
2016-10-05 10:55:24 [scrapy] INFO: Dumping Scrapy stats: 
{'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 710944), 
'log_count/DEBUG': 2, 
'log_count/ERROR': 3, 
'log_count/INFO': 7, 
'start_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 704378)} 
2016-10-05 10:55:24 [scrapy] INFO: Spider closed (finished) 

That is my spider above, but when I try to run the crawler I somehow just get the output shown; the tutorial doesn't really help me with this, I think.

EDIT:

2016-10-05 11:21:35 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine) 
2016-10-05 11:21:35 [scrapy] INFO: Optional features available: ssl, http11, boto 
2016-10-05 11:21:35 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'} 
2016-10-05 11:21:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-10-05 11:21:35 [boto] DEBUG: Retrieving credentials from metadata server. 
2016-10-05 11:21:36 [boto] ERROR: Caught exception reading instance data 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url 
    r = opener.open(req, timeout=timeout) 
    File "/usr/lib/python2.7/urllib2.py", line 429, in open 
    response = self._open(req, data) 
    File "/usr/lib/python2.7/urllib2.py", line 447, in _open 
    '_open', req) 
    File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain 
    result = func(*args) 
    File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open 
    return self.do_open(httplib.HTTPConnection, req) 
    File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open 
    raise URLError(err) 
URLError: <urlopen error timed out> 
2016-10-05 11:21:36 [boto] ERROR: Unable to read instance data, giving up 

Answer


ValueError: Missing scheme in request url: theunderminejournal.com/#eu/eredar/item/124442

This is raised because request URLs must always start with a scheme, i.e. either http:// or https://:

start_urls = (
    'theunderminejournal.com/#eu/eredar/item/124442', 
    #^should be: 
    'http://theunderminejournal.com/#eu/eredar/item/124442', 
) 
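
For reference, here is a minimal sketch of the whole spider with the scheme added (assuming Scrapy 1.0.x and that, as in the original parse method, you only want to dump the raw page body to a file):

# -*- coding: utf-8 -*-
import scrapy


class JournalSpider(scrapy.Spider):
    name = "journal"
    allowed_domains = ["theunderminejournal.com"]
    # the scheme is required, otherwise Request() raises
    # "ValueError: Missing scheme in request url"
    start_urls = (
        'http://theunderminejournal.com/#eu/eredar/item/124442',
    )

    def parse(self, response):
        # save the raw body, exactly as the original spider does
        page = response.url.split("/")[-2]
        filename = 'journal-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Note that everything after the # is a URL fragment, which is never sent to the server, so the response you get back will be whatever the server returns for the bare http://theunderminejournal.com/ page rather than the item view.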

The errors in your edit are completely unrelated; they are caused by the `boto` package not being able to connect to something. You can just ignore them. Does the spider itself work now? – Granitosaurus
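
If the boto messages bother you, one commonly suggested workaround for Scrapy 1.0.x is to disable the S3 download handler in your project's settings.py so that boto is never used; this is optional, the spider runs fine without it:

# settings.py -- optional: disable the s3 download handler so Scrapy
# does not try to use boto (silences the "Caught exception reading
# instance data" messages); assumes you never fetch s3:// URLs
DOWNLOAD_HANDLERS = {'s3': None}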
