Scrapy does not make POST requests

I am writing a Scrapy spider that has to handle some sites through AJAX. In theory it should work, and it does work when I call fetch() manually in the Scrapy shell, but when I run "scrapy crawl ..." no POST requests appear in the log and no items are scraped. What could be causing this?

import scrapy
from scrapy import Request, FormRequest
import json


class ExpertSpider(scrapy.Spider):
    name = "expert"
    allowed_domains = ["expert.fi"]
    start_urls = (
        'http://www.expert.fi/',
    )

    def parse(self, response):
        categories = response.xpath('//div[@id="categories-navigation"]//a/@href').extract()
        for cat in categories:
            yield Request(response.urljoin(cat), callback=self.parseCat)

    def parseCat(self, response):
        catMenu = response.xpath('//div[@id="category-left-menu"]')
        if catMenu:
            subCats = catMenu.xpath('.//a[@class="category"]/@href').extract()
            for subCat in subCats:
                yield Request(response.urljoin(subCat), callback=self.parseCat)
        else:
            self.parseProdPage(response)
            print "I`ve reached this point" # debug

    def parseProdPage(self, response):
        catId = response.css...
        url = 'https://www.expert.fi/Umbraco/Api/Product/ProductsByCategory'

        data = dict()
        ...
        jsonDict = json.dumps(data)

        heads = dict()
        heads['Content-Type'] = 'application/json;charset=utf-8'
        heads['Content-Length'] = len(jsonDict)
        heads['Accept'] = 'application/json, text/plain, */*'
        heads['Referer'] = response.url

        return Request(url=url, method="POST", body=jsonDict, headers=heads, callback=self.startItemProc)

    def startItemProc(self, response):
        resDict = json.loads(response.body)

        item = dict()
        for it in resDict['Products']:
            # Product data
            ...
            item['Category Path'] = it['Breadcrumb'][-1]['Name'] + ''.join([' > ' + crumb['Name']
                                    for crumb in it['Breadcrumb'][-2::-1]])
            # Make the new request for delivery price
            url = 'https://www.expert.fi/Umbraco/Api/Cart/GetFreightOptionsForProduct'
            data = dict()
            ...
            jsonDict = json.dumps(data)

            heads = dict()
            heads['Content-Type'] = 'application/json;charset=utf-8'
            heads['Content-Length'] = len(jsonDict)
            heads['Accept'] = 'application/json, text/plain, */*'
            heads['Referer'] = item['Product URL']

            req = Request(url=url, method="POST", body=jsonDict, headers=heads, callback=self.finishItemProc)
            req.meta['item'] = item
            yield req

    def finishItemProc(self, response):
        item = response.meta['item']
        ansList = json.loads(response.body)
        for delivery in ansList:
            if delivery['Name'] == ...
                item['Delivery price'] = delivery['Price']
        return item

The log is as follows:

2016-10-09 01:11:16 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/exception_count': 9, 
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 1, 
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 8, 
'downloader/request_bytes': 106652, 
'downloader/request_count': 263, 
'downloader/request_method_count/GET': 263, 
'downloader/response_bytes': 5644786, 
'downloader/response_count': 254, 
'downloader/response_status_count/200': 252, 
'downloader/response_status_count/301': 1, 
'downloader/response_status_count/302': 1, 
'dupefilter/filtered': 19, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 10, 8, 22, 11, 16, 949472), 
'log_count/DEBUG': 265, 
'log_count/INFO': 11, 
'request_depth_max': 3, 
'response_received_count': 252, 
'scheduler/dequeued': 263, 
'scheduler/dequeued/memory': 263, 
'scheduler/enqueued': 263, 
'scheduler/enqueued/memory': 263, 
'start_time': datetime.datetime(2016, 10, 8, 22, 7, 7, 811163)} 
2016-10-09 01:11:16 [scrapy] INFO: Spider closed (finished)

Source

2016-10-08 vchslv13

As far as I understand so far, one problem is calling one method from another, as in 'self.myMethodName(response)', which does not work at all. But why, and what should I do to avoid simply copying the code of one method into the other? – vchslv13

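The request returned by the parseProdPage method is not being used inside the parseCat method. You should start by yielding it: yield self.parseProdPage(response)

Also, you probably want to set dont_filter=True on that same request; otherwise most of them will be filtered out (because they all have the same URL).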

Source

2016-10-10 01:03:47 elacuesta
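The point made in the answer above can be reproduced without Scrapy. The following is only a minimal sketch: FakeRequest, build_post, callback_discarding and callback_yielding are hypothetical stand-ins (not Scrapy APIs and not the asker's code) that mimic a generator callback which either throws away or yields the request built by a helper.

```python
from collections import namedtuple

# Hypothetical stand-in for scrapy.Request so the sketch runs without Scrapy.
FakeRequest = namedtuple("FakeRequest", ["url", "method"])


def build_post(url):
    """Plays the role of parseProdPage: builds and returns one request."""
    return FakeRequest(url=url, method="POST")


def callback_discarding(url):
    """Mirrors the question's parseCat: the returned request is thrown away."""
    if url.endswith("/menu"):
        yield FakeRequest(url=url + "/sub", method="GET")
    else:
        build_post(url)  # result is never yielded, so the caller never sees it


def callback_yielding(url):
    """Same logic, but the request built by the helper is yielded."""
    if url.endswith("/menu"):
        yield FakeRequest(url=url + "/sub", method="GET")
    else:
        yield build_post(url)


# The caller (the engine, in Scrapy's case) only receives what is yielded.
discarded = list(callback_discarding("https://example.invalid/leaf"))
yielded = list(callback_yielding("https://example.invalid/leaf"))
```

Here discarded ends up empty while yielded contains the single POST request, which is consistent with the stats in the question showing only GET requests.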

Well, thanks, but you're not right. Requests can be 'return'ed or 'yield'ed; both work, although of course a method that uses return can return only one request. I have now finished a working version of the script. The problem really was the 'self.parseProdPage(response)' call in the 'parseCat' method. It does not work. 'yield Request(response.url, callback=self.parseProdPage, dont_filter=True)' works fine instead, but I don't like making one more duplicate request. – vchslv13

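That's what I meant. The line inside the 'parseCat' method that contains only 'self.parseProdPage(response)' has no effect, because the request is not being returned. I suggested yielding it because, if you returned it, Python would complain about "return with value inside generator" (you are already yielding other requests). – elacuesta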
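The complaint mentioned in this comment is specific to Python 2, which the question's print statement suggests was in use: there, 'return' with an argument inside a generator is a SyntaxError. In Python 3 the same code is legal, but the returned value is not delivered to whoever iterates over the generator. A minimal sketch, with a hypothetical function name and placeholder strings instead of real requests:

```python
def generator_with_return():
    yield "first request"
    # Legal in Python 3; a SyntaxError in Python 2.
    return "second request"


# Plain iteration only collects yielded values; the returned one is dropped.
seen = list(generator_with_return())

# The returned value is only reachable through StopIteration.value.
gen = generator_with_return()
next(gen)
try:
    next(gen)
    returned = None
except StopIteration as stop:
    returned = stop.value
```

So even where the interpreter accepts it, returning a request from a callback that also yields would leave that request unseen by code that simply iterates over the callback's output.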

Yes, we are talking about the same thing, but as far as I remember 'yield self.parseProdPage(response)' does not work. You actually have to make a new request with yield Request(response.url, callback=self.parseProdPage, dont_filter=True). – vchslv13
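The thread does not show the version of parseProdPage for which 'yield self.parseProdPage(response)' failed, so the following is only one possible explanation, not a diagnosis. If the helper had itself been turned into a generator (using yield instead of return), then yielding the call would hand over a single generator object rather than the requests inside it. A sketch with hypothetical names and placeholder strings:

```python
import types


def helper(n):
    """Hypothetical helper written as a generator (uses yield, not return)."""
    for i in range(n):
        yield "request-%d" % i


def outer_wrong(n):
    # Yields ONE object: the generator itself, not the requests inside it.
    yield helper(n)


def outer_loop(n):
    # Re-yield every item; works on both Python 2 and Python 3.
    for req in helper(n):
        yield req


def outer_delegate(n):
    # Equivalent delegation, available from Python 3.3.
    yield from helper(n)


wrong = list(outer_wrong(2))
looped = list(outer_loop(2))
delegated = list(outer_delegate(2))
```

Under that assumption, looping over the helper (or delegating with yield from) passes the requests through without issuing an extra request for the same page.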
