Scrapy ignores the start page and continues on to the next page

I have a Scrapy spider that handles pagination, but whenever I start the crawl it seems to skip the start page (page 1) and go straight to page 2.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# JobsItems is the project's own Item class (its import is not shown in the question).


class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["jobscentral.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-accounting',
    ]

    # Follow only the "Next" pagination link and send each followed page to parse_item.
    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info("Response %d for %r" % (response.status, response.url))
        #self.logger.info("base url %s", get_base_url(response))
        items = []
        self.logger.info("Visited Outer Link %s", response.url)

        for loop in response.xpath('//div[@class="col-md-11"]'):
            item = JobsItems()
            t = loop.xpath('./div[@class="col-xs-12 col-md-3 px-0"]/div[@class="posted-date text-muted hidden-sm-down"]//text()').extract()[1].strip()

....
more code here

Source

2017-09-17 dythe

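Yes, that is correct. When you use start_urls, the response goes to parse the first time. CrawlSpider defines this method internally to run the crawl rules, so the rule callback is never called for the start page. If you also need to process the first response, you have to handle it yourself, as I did below: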

class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["jobscentral.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-accounting',
    ]
    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",), restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)), callback='parse_item', follow=True),
    )

    # True until the first (start URL) response has been handled
    first_response = True

    # Relies on CrawlSpider implementing parse() itself, as it did in the
    # Scrapy release current in 2017; newer releases may differ.
    def parse(self, response):
        if self.first_response:
            # use it or pass it to some other function
            # (parse_item must yield or return an iterable for this loop to work)
            for r in self.parse_item(response):
                yield r
            self.first_response = False

        # Pass the response to crawlspider
        for r in super(IT, self).parse(response):
            yield r


    def parse_item(self, response):

        self.logger.info("Response %d for %r" % (response.status, response.url))
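
Why page 1 gets skipped can be illustrated with a simplified, Scrapy-free model of the dispatch described above. This is only a sketch: MiniCrawlSpider, FixedSpider, handle_start_response, handle_followed_response, crawl and the dict-based fake responses are all hypothetical stand-ins, not Scrapy's real internals. The one name taken from Scrapy is parse_start_url, the CrawlSpider hook that receives the responses for start_urls and returns an empty list by default.

```python
class MiniCrawlSpider:
    """Toy model of how a CrawlSpider routes responses (not Scrapy's real code)."""

    def parse_start_url(self, response):
        # Default hook for start URL responses: produces no items.
        return []

    def parse_item(self, response):
        # Stand-in for the rule callback: one "item" per job on the page.
        return [f"{response['url']}:{job}" for job in response["jobs"]]

    def handle_start_response(self, response):
        # Start URLs go to parse_start_url, never to the rule callback.
        return list(self.parse_start_url(response))

    def handle_followed_response(self, response):
        # Links extracted by a rule go to that rule's callback.
        return list(self.parse_item(response))


class FixedSpider(MiniCrawlSpider):
    def parse_start_url(self, response):
        # Send the first page through the same callback as later pages.
        return self.parse_item(response)


# Hypothetical pages: page1 is the start URL, page2 is reached via "Next".
page1 = {"url": "page1", "jobs": ["a", "b"]}
page2 = {"url": "page2", "jobs": ["c"]}


def crawl(spider):
    return spider.handle_start_response(page1) + spider.handle_followed_response(page2)


default_items = crawl(MiniCrawlSpider())
fixed_items = crawl(FixedSpider())
print(default_items)
print(fixed_items)
```

In this model the default spider only ever produces items from page2, while the subclass that overrides parse_start_url also produces the page1 items. The first_response flag in the answer above gets the same effect by intercepting parse instead of overriding the hook.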

Source

2017-09-17 18:36:08

Marked as the correct answer. – dythe

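What happens if you keep everything as it is and add parse_start_url to your spider so that it hands the start page to parse_item, as in the code below?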

class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["jobscentral.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-accounting',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    def parse_start_url(self, response):
        # Just add this method and see if it fixes the issue:
        # CrawlSpider calls parse_start_url for every start_urls response,
        # so the first page now goes through parse_item as well.
        return self.parse_item(response)

    def parse_item(self, response):
        self.logger.info("Response %d for %r" % (response.status, response.url))
        items = []
        self.logger.info("Visited Outer Link %s", response.url)

        for loop in response.xpath('//div[@class="col-md-11"]'):
            item = JobsItems()
            t = loop.xpath('./div[@class="col-xs-12 col-md-3 px-0"]/div[@class="posted-date text-muted hidden-sm-down"]//text()').extract()[1].strip()

....
more code here
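
One Python detail matters for this approach: a hook such as parse_start_url only takes effect when it is defined on the spider class. A name assigned inside a method body is just a local variable and changes nothing. The sketch below shows the difference; Base, LocalName, ClassAlias and handle_start are hypothetical names standing in for CrawlSpider and its dispatch, and no Scrapy code is involved.

```python
class Base:
    """Hypothetical stand-in for CrawlSpider; only the hook lookup is modelled."""

    def parse_start_url(self, response):
        # Default hook: nothing is extracted from the start page.
        return []

    def handle_start(self, response):
        # Looks the hook up on the instance, so class-level overrides are found.
        return list(self.parse_start_url(response))


class LocalName(Base):
    def parse_item(self, response):
        # Local variable only: it is discarded when the method returns
        # and never replaces the hook on the class.
        parse_start_url = self.parse_item
        return [response]


class ClassAlias(Base):
    def parse_item(self, response):
        return [response]

    # Class attribute: the hook now points at parse_item.
    parse_start_url = parse_item


local_result = LocalName().handle_start("page 1")
alias_result = ClassAlias().handle_start("page 1")
print(local_result, alias_result)
```

Defining an explicit parse_start_url method that returns self.parse_item(response) is equivalent to the class-level alias and can be placed anywhere in the class body.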

Source

2017-09-19 04:56:18 SIM

I will try this solution as well, but @Tarun Lalwani's solution also works well. – dythe

Tarun Lalwani is a legend when it comes to Scrapy. His solutions rarely fail. – SIM
