
CrawlerProcess isn't saving data with CrawlSpider

When the code below runs, the output file is created without any errors, but no data is saved to the JSON file.

I disabled AutoThrottle, which had been getting in the way of downloading the data, but that did not fix the problem.
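(For context, AutoThrottle is controlled through ordinary Scrapy settings; the exact settings used for that attempt are not shown, so the following is only a minimal sketch of how it is typically turned off:)

# Illustrative only -- not part of the SETTINGS dict used further below.
# AUTOTHROTTLE_ENABLED and DOWNLOAD_DELAY are standard Scrapy settings.
THROTTLE_SETTINGS = {
    'AUTOTHROTTLE_ENABLED': False,  # disable AutoThrottle entirely
    'DOWNLOAD_DELAY': 0,            # and no fixed download delay either
}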

Scrapy == 1.4.0

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "spidy"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://www.cnn.com"]

    rules = [Rule(LinkExtractor(allow=['cnn.com/.+']), callback='parse_item', follow=True)]

    def parse_item(self, response):
        print('went to: {}'.format(response.url))
        yield {'url': response.url}

FILE_NAME = 'my_data.json'
SETTINGS = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': FILE_NAME,
}

process = CrawlerProcess(SETTINGS)
process.crawl(MySpider)
process.start()
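One way to sanity-check the feed export (not something tried in the question) is to let the crawl shut down cleanly after a handful of items, since the JSON exporter only finalizes the file when the spider closes. A minimal sketch using Scrapy's CLOSESPIDER_ITEMCOUNT setting, run instead of the block above:

# Sketch: stop the spider cleanly after ~10 items so the feed exporter
# gets a chance to flush and close my_data.json properly.
TEST_SETTINGS = dict(SETTINGS, CLOSESPIDER_ITEMCOUNT=10)

process = CrawlerProcess(TEST_SETTINGS)
process.crawl(MySpider)
process.start()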

EDIT: As you can see in the log below, the scraper is receiving data:

2017-11-21 11:07:55 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot) 
2017-11-21 11:07:55 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)', 'FEED_URI': 'my_data.json', 'FEED_FORMAT': 'json'} 
2017-11-21 11:07:55 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.memusage.MemoryUsage', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats', 
'scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.feedexport.FeedExporter'] 
2017-11-21 11:07:55 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-11-21 11:07:55 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-11-21 11:07:55 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-11-21 11:07:55 [scrapy.core.engine] INFO: Spider opened 
2017-11-21 11:07:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-11-21 11:07:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6041 
2017-11-21 11:07:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com> (referer: None) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com/us> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com/specials/politics/congress-capitol-hill> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com/specials/politics/president-donald-trump-45> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com/specials/politics/us-security> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com/specials/politics/trumpmerica> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com/specials/politics/state-cnn-politics-magazine> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com/specials/opinion/opinion-social-issues> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com/specials/opinions/cnnireport> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com/specials/vr/vr-archives> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com/middle-east> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://imagesource.cnn.com> from <GET http://www.cnn.com/collection> 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnn.com/specials/politics/supreme-court-nine> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://transcripts.cnn.com/TRANSCRIPTS/> from <GET http://www.cnn.com/transcripts> 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.cnn.com/pf/> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.cnn.com/luxury/> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.cnn.com/data/markets/> (referer: http://www.cnn.com) 
2017-11-21 11:07:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.cnn.com/technology/> (referer: http://www.cnn.com) 
went to: http://www.cnn.com/us 
2017-11-21 11:07:56 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cnn.com/us> 
{'url': 'http://www.cnn.com/us'} 
2017-11-21 11:07:56 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.cnn.com/us> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates) 
2017-11-21 11:07:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.cnn.com/email/subscription> from <GET http://www.cnn.com/newsletters> 
... 

We can see the scraper visiting URLs, crawling further URLs found on each page, fetching the responses (see the "went to:" print output), and yielding data such as {'url': 'http://www.cnn.com/us'}.


@eLRuLL See the edit, I've added the log. – ethanenglish


@ethanenglish Have you tried making 'FEED_URI' an absolute path '/directory/subdirectory/file.json', or an actual URI 'file:///directory/subdirectory/file.json'? –
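(For clarity, the suggestion above amounts to something like the following; the directory is hypothetical, so substitute your own:)

# Either form is accepted by Scrapy's feed exports; '/tmp' is just an example.
FILE_NAME = '/tmp/my_data.json'           # absolute filesystem path
# FILE_NAME = 'file:///tmp/my_data.json'  # or an explicit file:// URI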


@JonClements Thanks for the suggestion. I added it, but alas it made no difference. – ethanenglish

Answer


Your code works fine, so I assume you are stopping or killing the process before it finishes, which is why the JSON file ends up empty. I would change two things.

First, use jsonlines instead of json. That way, even if I kill the spider, I don't lose too many of the items already scraped, and I can keep appending to the same file because each line is valid JSON on its own; with plain json, interrupting the program partway through leaves you with invalid JSON.

Second, I set CONCURRENT_ITEMS (the default is 100) to a lower value, as shown below, so that items are exported more often.

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "spidy"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://www.cnn.com"]

    rules = [Rule(LinkExtractor(allow=['cnn.com/.+']), callback='parse_item', follow=True)]

    def parse_item(self, response):
        print('went to: {}'.format(response.url))
        yield {'url': response.url}


FILE_NAME = 'my_data.jsonl'
SETTINGS = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'jsonlines',
    'FEED_URI': FILE_NAME,
    'CONCURRENT_ITEMS': 1,
}

process = CrawlerProcess(SETTINGS)
process.crawl(MySpider)
process.start()

With CONCURRENT_ITEMS set to a low value, items are exported more often, and you will find that the data gets written to the file just fine.
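(To illustrate why the jsonlines format is more forgiving, here is a minimal sketch of reading the output back; the file name matches the FEED_URI above:)

import json

# Each line of a .jsonl file is an independent JSON object, so even a
# partially written file from an interrupted crawl can still be parsed.
with open('my_data.jsonl') as f:
    items = [json.loads(line) for line in f if line.strip()]

print('{} items exported'.format(len(items)))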


Really strange. The code should work, and scrapy creates the file and fetches the data, but it doesn't save anything to the file. I'm completely lost here. I uninstalled and reinstalled the module and tried different ways of importing, but nothing has worked so far. – ethanenglish


What OS are you using? –


The latest version of macOS Sierra – ethanenglish
