What is wrong with this Scrapy spider? It scrapes only the last URL

In parse(), the spider crawls 4 URLs and then sends each one to parse_dir_contents() to scrape some data. However, only the 4th URL gets scraped. I don't understand why the other 3 URLs are not being scraped.

import scrapy
from v_one.items import VOneItem
import json


class linkedin(scrapy.Spider):
    name = "linkedin"
    allowed_domains = ["linkedin.com"]
    start_urls = [
        "https://in.linkedin.com/directory/people-s-1-2-4/",
    ]

    def parse(self, response):
        # Follow every directory link and hand each page to parse_dir_contents().
        for href in response.xpath('//*[@id="seo-dir"]/div/div/div/ul/li/a/@href'):
            url = response.urljoin(href.extract())
            print("________________" + url)
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # One item per id="profile" match; a page with no match yields nothing.
        for sel in response.xpath('//*[@id="profile"]'):
            url = response.url
            print("____________" + url)
            item = VOneItem()
            item['name'] = sel.xpath('//*[@id="name"]/text()').extract()
            item['headline'] = sel.xpath('//*[@id="topcard"]/div/div/div/p/span/text()').extract()
            item['current'] = sel.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/span/text()').extract()
            item['education'] = sel.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/a/text()').extract()
            item['link'] = url
            yield item

Source

2016-06-25 Siddharth Bhavsar

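I don't think your code is failing to scrape each URL, since it already visits every page/link. Also, your XPaths are very brittle; there are plenty of class names you could use to target the data more precisely. And tbody is usually added by the browser, so it may not actually be present in the response.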
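To illustrate the tbody point, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without Scrapy, and the markup is invented for the example (it is not LinkedIn's real HTML). The same idea should carry over to response.xpath(): a path that names tbody finds nothing when the server's markup has no tbody.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw markup as a server might send it: no <tbody> inside <table>.
RAW_HTML = """
<div id="topcard">
  <table>
    <tr><td><ol><li><span>Example Corp</span></li></ol></td></tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)  # root is the <div id="topcard"> element

# Path as copied from browser dev tools, including the <tbody> the browser added.
with_tbody = [e.text for e in root.findall("./table/tbody/tr/td/ol/li/span")]

# Same lookup without assuming <tbody> exists: skip the intermediate levels.
without_tbody = [e.text for e in root.findall("./table//li/span")]

print(with_tbody)
print(without_tbody)
```

The first lookup comes back empty and the second finds the text, so dropping tbody from the paths (or skipping over it with //) is worth trying before anything else.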

From inspecting the pages, I don't think the for loop in parse_dir_contents() is needed. Write the function like this:

def parse_dir_contents(self, response):
    item = VOneItem()
    item['name'] = response.xpath('//*[@id="name"]/text()').extract()
    item['headline'] = response.xpath('//*[@id="topcard"]/div/div/div/p/span/text()').extract()
    item['current'] = response.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/span/text()').extract()
    item['education'] = response.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/a/text()').extract()
    item['link'] = response.url
    return item

Then check whether this solves your problem.
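One possible reason the loop matters, which this thread does not confirm: a loop over the id="profile" matches yields one item per match, so a page where that element is missing yields no item at all. The sketch below mimics both versions with xml.etree.ElementTree instead of Scrapy so it can run standalone; the two pages, their URLs, and the function names parse_with_loop and parse_without_loop are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: only the first one contains an element with id="profile".
PAGES = {
    "https://example.com/a":
        '<html><body><div id="profile"><h1 id="name">Alice</h1></div></body></html>',
    "https://example.com/b":
        '<html><body><div id="top"><h1 id="name">Bob</h1></div></body></html>',
}


def parse_with_loop(markup, url):
    """Mimics the looped version: one item per id="profile" match, maybe none."""
    root = ET.fromstring(markup)
    for _ in root.findall(".//*[@id='profile']"):
        names = [e.text for e in root.findall(".//*[@id='name']")]
        yield {"name": names, "link": url}


def parse_without_loop(markup, url):
    """Mimics the loop-free version: always exactly one item per page."""
    root = ET.fromstring(markup)
    names = [e.text for e in root.findall(".//*[@id='name']")]
    return {"name": names, "link": url}


loop_items = [item for url, markup in PAGES.items()
              for item in parse_with_loop(markup, url)]
direct_items = [parse_without_loop(markup, url) for url, markup in PAGES.items()]

print(len(loop_items), "item(s) with the loop")
print(len(direct_items), "item(s) without the loop")
```

If the real pages behave like the second sample page, the looped version would silently skip them, which would look like "only one URL gets scraped" even though every URL was visited.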

Source

2016-06-25 10:53:32

What is misleading about this approach? Please don't downvote silently; give a reason.
