Scrapy: unable to crawl pages: "TCP connection timed out: 110: Connection timed out."


I cannot scrape content from some of the domains that belong to the same website.

For example, I can scrape it.example.com, es.example.com and pt.example.com, but when I try to do the same with fr.example.com or us.example.com, I get:

2017-12-17 14:20:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6025
2017-12-17 14:21:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-17 14:22:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-17 14:22:38 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://fr.example.com/robots.txt> (failed 1 times): TCP connection timed out: 110: Connection timed out.

Here is what I tried. The spider, some.py:

import itertools

import scrapy


class SomeSpider(scrapy.Spider):
    name = 'some'
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ['fr.example.com']

    def start_requests(self):
        categories = ['thing1', 'thing2', 'thing3']
        base = "https://fr.example.com/things?t={category}&p={index}"

        # Request pages 1 to 10 of every category
        for category, index in itertools.product(categories, range(1, 11)):
            yield scrapy.Request(base.format(category=category, index=index))

    def parse(self, response):
        response.selector.remove_namespaces()
        info1 = response.css("span.info1").extract()
        info2 = response.css("span.info2").extract()

        # Pair the two lists positionally and emit one item per pair
        for item in zip(info1, info2):
            scraped_info = {
                'info1': item[0],
                'info2': item[1],
            }

            yield scraped_info

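I also tried:

Running the spider from a different IP and adding a pool of IPs (same problem with the same domains).

These settings in settings.py, found somewhere on Stack Overflow (did not work):

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'

ROBOTSTXT_OBEY = False

Any ideas are welcome!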

Source

2017-12-17 Rawhide

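Did you check this URL in a browser today? Maybe the server has a problem and is not responding. – furas

Check it in the browser first, then look at Scrapy. Some sites require an IP address from a specific country –

The site is online and I can access it from my current location without any problem. – Rawhide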
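The reachability question raised in these comments can also be checked below the HTTP layer. The following is only a minimal sketch using the standard library; the helper name `can_connect`, the port 443 and the 5-second timeout are assumptions chosen for illustration, not something taken from the thread.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connect timeouts, refused connections and DNS lookup failures
        return False


# Example use (hosts taken from the question; results depend on your network):
# for host in ("it.example.com", "fr.example.com", "www.fr.example.com"):
#     print(host, can_connect(host))
```

If a host connects from a browser but not from the machine running the spider, the cause is more likely network or host-name related than something in the spider code.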

Try accessing the page with the requests package instead of Scrapy and see whether that works.

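import requests

# requests needs the scheme; a bare 'fr.example.com' raises MissingSchema
url = 'https://fr.example.com'

# A timeout keeps the call from hanging if the host is unreachable
response = requests.get(url, timeout=10)
print(response.text)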

Source

2017-12-17 14:32:21 laguittemh

It worked like a charm. What should I change in my script to scrape the information I need? – Rawhide

EDIT: For whatever reason, the same _base = "https://fr.example.com/things?t={category}&p={index}"_ that worked for the other domains did not work for FR and US. I just added www. in front of fr.example.com and it worked: _base = "https://www.fr.example.com/things?t={category}&p={index}"_ does the job. I don't know why. – Rawhide

@Rawhide Glad my answer was still useful. – laguittemh
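Since the fix came down to the host name, one way to keep the spider reusable across subdomains is to make the host a parameter of the URL template. This is a rough sketch only: the function name `build_urls` and the `BASE` constant are hypothetical, and which hosts need the "www." prefix has to be found by trial, as in the EDIT above.

```python
import itertools

# Same URL pattern as in the question, with the host made a placeholder
BASE = "https://{host}/things?t={category}&p={index}"


def build_urls(host, categories, pages=range(1, 11)):
    """Build one URL per (category, page) pair for the given host."""
    return [
        BASE.format(host=host, category=category, index=index)
        for category, index in itertools.product(categories, pages)
    ]


# Example: the host that worked for the French site according to the EDIT
urls = build_urls("www.fr.example.com", ["thing1", "thing2", "thing3"])
```

Inside start_requests, each URL from such a list could then be passed to scrapy.Request as in the original spider.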
