
How can I convert the working example below into a CrawlSpider, so that it crawls in depth and not just the first main page? The example runs fine without errors, but I want to use a CrawlSpider instead of an InitSpider and crawl deeply. In short: how do I convert this InitSpider into a CrawlSpider in Scrapy?

from scrapy.contrib.spiders.init import InitSpider 
from scrapy.http import Request, FormRequest 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.contrib.spiders import Rule 

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 

from linkedpy.items import LinkedpyItem 

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results"]

    def init_request(self):
        #"""This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        #"""Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'session_key': '[email protected]', 'session_password': 'xxxxx'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        #"""Check the response returned by a login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites:
            item = LinkedpyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items

Output:

2013-07-11 15:50:01-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: linkedpy) 
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-11 15:50:01-0500 [LinkedPy] INFO: Spider opened 
2013-07-11 15:50:01-0500 [LinkedPy] INFO: Crawled 0 pages (at 0 pages/min), scra 
ped 0 items (at 0 items/min) 
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2013-07-11 15:50:02-0500 [LinkedPy] DEBUG: Crawled (200) <GET https://www.linkedin.com/uas/login> (referer: None)
2013-07-11 15:50:02-0500 [LinkedPy] DEBUG: Redirecting (302) to <GET http://www.linkedin.com/nhome/> from <POST https://www.linkedin.com/uas/login-submit>
2013-07-11 15:50:04-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/nhome/> (referer: https://www.linkedin.com/uas/login)
2013-07-11 15:50:04-0500 [LinkedPy] DEBUG: 


    Successfully logged in. Let's start crawling! 



2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/csearch/results> (referer: http://www.linkedin.com/nhome/)
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: 


    We got data! 



2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1009/IBM?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'IBM']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1033/Accenture?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Accenture']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1038/Deloitte?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Deloitte']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1035/Microsoft?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Microsoft']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1025/Hewlett-Packard?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Hewlett-Packard']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1028/Oracle?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Oracle']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1093/Dell?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Dell']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1123/Bank+of+America?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Bank of America']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1015/GE?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'GE']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1441/Google?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Google']}
2013-07-11 15:50:05-0500 [LinkedPy] INFO: Closing spider (finished) 
2013-07-11 15:50:05-0500 [LinkedPy] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 2243, 
    'downloader/request_count': 4, 
    'downloader/request_method_count/GET': 3, 
    'downloader/request_method_count/POST': 1, 
    'downloader/response_bytes': 91349, 
    'downloader/response_count': 4, 
    'downloader/response_status_count/200': 3, 
    'downloader/response_status_count/302': 1, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2013, 7, 11, 20, 50, 5, 177000), 
    'item_scraped_count': 10, 
    'log_count/DEBUG': 22, 
    'log_count/INFO': 4, 
    'request_depth_max': 2, 
    'response_received_count': 3, 
    'scheduler/dequeued': 4, 
    'scheduler/dequeued/memory': 4, 
    'scheduler/enqueued': 4, 
    'scheduler/enqueued/memory': 4, 
    'start_time': datetime.datetime(2013, 7, 11, 20, 50, 1, 649000)} 
2013-07-11 15:50:05-0500 [LinkedPy] INFO: Spider closed (finished) 

Also, could you use a generator instead of building a list? –


@John I wrote some code and posted a question, but nobody answered it: [previous post](http://stackoverflow.com/questions/17578727/scrapy-log-into-a-site-and-do-a-crawlspider-but-no-working) – Gio

Answer


Thanks to inheritance, you can simply override start_requests instead of init_request:

def start_requests(self):
    yield Request(
        url=self.login_page,
        callback=self.login,
        dont_filter=True
    )
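
One thing to note: CrawlSpider has no initialized() method, so after a successful login the spider has to kick off the crawl itself by requesting the start_urls (their responses then go through CrawlSpider's own parse, which applies the crawling rules). A minimal sketch of how check_login_response could change, assuming the same "Sign Out" marker as in the question:

def check_login_response(self, response):
    if "Sign Out" in response.body:
        self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
        # No self.initialized() here: hand the start URLs to the default
        # callback, CrawlSpider.parse, which applies the rules.
        for url in self.start_urls:
            yield Request(url, dont_filter=True)
    else:
        self.log("\n\n\nFailed, Bad times :(\n\n\n")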

Also rename the parse method to something else, because parse is the method CrawlSpider itself uses to crawl the extracted links. (And in general: when you have a question, first write code that attempts it, then ask about that code here.)

def parse_page(self, response):
    self.log("\n\n\n We got data! \n\n\n")

    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id=\'result-set\']/li')

    for site in sites:
        item = LinkedpyItem()
        item['title'] = site.select('./h2/a/text()').extract()
        item['link'] = site.select('./h2/a/@href').extract()

        yield item
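
Finally, to actually crawl deeper than the first results page, the CrawlSpider needs rules describing which links to follow. The allow pattern below is only a placeholder (an assumption about the result-page URLs), to be adjusted to the links you really want to follow; parse_start_url is overridden so the first results page gets scraped as well:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class LinkedPySpider(CrawlSpider):
    # name, allowed_domains, login_page, start_urls, start_requests, login,
    # check_login_response and parse_page stay as shown above.

    rules = (
        # Follow links to further result pages and scrape each of them.
        Rule(SgmlLinkExtractor(allow=r'/csearch/results'),
             callback='parse_page', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider.parse only applies the rules to the start_urls responses;
        # this hook makes the first results page go through parse_page too.
        return self.parse_page(response)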