Scrapy spider crawls only one link per page

I want to scrape all event data from http://www.nyhistory.org/programs/upcoming-public-programs. The events are paginated, five per page. I created two rules: one follows the next-page link and the other follows each event's detail page. So I expect the spider to first visit each event's URL and collect all the data I need there, then move on to the next page, visit each event's URL there, and so on. However, for some reason the spider visits only one event on each page, and it is always the first one. Please see my code below.

```python
import scrapy
from nyhistory.items import EventItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from datetime import datetime
from w3lib.html import remove_tags
from scrapy.selector import Selector
import re


class NyhistorySpider(CrawlSpider):
    name = "events"
    start_urls = ['http://www.nyhistory.org/programs/upcoming-public-programs',]

    rules = [
        # Rule 1: follow the "next page" link of the paginated listing.
        Rule(LinkExtractor(allow='.*?page=.*', restrict_xpaths='//li[@class="pager-next"]'), follow=True),
        # Rule 2: follow each event's detail page and parse it.
        Rule(LinkExtractor(restrict_xpaths='//div[@class="view-content"]/div[contains(@class,"views-row")]'),
             callback='parse_event_details', follow=True),
    ]

    def parse_event_details(self, response):
        base_url = 'http://www.nyhistory.org'
        item = EventItem()
        item['title'] = response.xpath('//div[@class="views-field-title"]//text()')[2].extract()
        item['eventWebsite'] = response.url

        # Split the body into sections on the headings used by the site.
        details_area = response.xpath('//div[@class="body-programs"]')
        details_area_str = " ".join(details_area.extract())
        details_area_str_split = re.split('EVENT DETAILS|LOCATION|PURCHASING TICKETS', details_area_str)

        speakers_names_area = details_area_str_split[1]
        speakersNames = Selector(text=speakers_names_area).xpath('strong').extract()
        try:
            item['speaker1FirstName'] = speakersNames[0].split()[0]
            item['speaker1LastName'] = speakersNames[0].split()[1]
        except:
            item['speaker1FirstName'] = ''
            item['speaker1LastName'] = ''

        description = remove_tags(details_area_str_split[1]).strip()
        item['description'] = description

        try:
            address_line = remove_tags(details_area_str_split[2]).strip()
            item['location'] = address_line.split(',')[0]
            item['city'] = address_line.split(',')[-2].strip()
            item['state'] = address_line.split(',')[-1].split()[0]
            item['zip'] = address_line.split(',')[-1].split()[1]
            item['street'] = address_line.split(',')[1].strip()
        except:
            item['location'] = ''
            item['city'] = ''
            item['state'] = ''
            item['zip'] = ''
            item['street'] = ''

        try:
            item['dateFrom'] = self.date_converter(response.xpath('//span[@class="date-display-single"]/text()').extract_first(default='').rstrip(' - '))
        except:
            try:
                item['dateFrom'] = response.xpath('//span[@class="date-display-single"]/text()').extract()[1].split('|')[0]
            except:
                item['dateFrom'] = ''

        try:
            item['startTime'] = self.time_converter(response.xpath('//span[@class="date-display-start"]/text()')[1].extract())
            # item['endTime'] = self.time_converter(response.xpath('//span[@class="date-display-end"]/text()')[1].extract())
        except:
            try:
                item['startTime'] = self.time_converter(response.xpath('//span[@class="date-display-single"]/text()').extract()[1].split(' | ')[1])
            except:
                item['startTime'] = ''

        item['In_group_id'] = ''

        try:
            item['ticketUrl'] = base_url + response.xpath('//a[contains(@class,"btn-buy-tickets")]/@href').extract_first()
        except:
            item['ticketUrl'] = ''

        item['eventImage'] = response.xpath('//div[@class="views-field-field-speaker-photo-1"]/div/div/img/@src').extract_first(default='')
        item['organization'] = "New York Historical Society"
        yield item

    @staticmethod
    def date_converter(raw_date):
        # Try the numeric format first, then fall back to the long month-name format.
        try:
            raw_date_datetime_object = datetime.strptime(raw_date.replace(',', ''), '%a %m/%d/%Y')
            final_date = raw_date_datetime_object.strftime('%d/%m/%Y')
            return final_date
        except:
            raw_date_datetime_object = datetime.strptime(raw_date.replace(',', '').replace('th', '').strip(), '%a %B %d %Y')
            final_date = raw_date_datetime_object.strftime('%d/%m/%Y')
            return final_date

    @staticmethod
    def time_converter(raw_time):
        raw_time_datetime_object = datetime.strptime(raw_time, '%I:%M %p')
        final_time = raw_time_datetime_object.strftime('%I:%M %p')
        return final_time
```

Source

2017-12-24 Ostap Didenko

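When you use a CrawlSpider, the rules are meant to follow each link until you reach the ones you actually want to extract items from.

But how does the spider (or a rule) know when to stop? That is what the callback and follow attributes are for. If you use callback, you don't need follow (because callback specifies that the linked page should be processed as a response), and if you use follow, you don't need callback, because the spider will keep exploring for new links.

You should define better rules, specifying which ones to follow and which ones to hand to a callback.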

Source

2017-12-24 20:10:42 eLRuLL

Thank you! Following your suggestion, I modified the code as shown below, and now it works! rules = [Rule(LinkExtractor(restrict_xpaths='//li[@class="pager-next"]')), Rule(LinkExtractor(restrict_xpaths=...), callback='parse_event_details'),]
