Scrapy: converting an InitSpider to a CrawlSpider

How can I convert the working example below to a CrawlSpider so that it crawls in depth, rather than stopping at the first main page? The example runs without errors, but I want to use CrawlSpider instead of InitSpider and crawl deeply. (A sketch of one possible conversion follows the log output below.)
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from linkedpy.items import LinkedpyItem


class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'session_key': '[email protected]',
                      'session_password': 'xxxxx'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we
        are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id="result-set"]/li')
        items = []
        for site in sites:
            item = LinkedpyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items
Output:
2013-07-11 15:50:01-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: linkedpy)
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-11 15:50:01-0500 [LinkedPy] INFO: Spider opened
2013-07-11 15:50:01-0500 [LinkedPy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-11 15:50:02-0500 [LinkedPy] DEBUG: Crawled (200) <GET https://www.linkedin.com/uas/login> (referer: None)
2013-07-11 15:50:02-0500 [LinkedPy] DEBUG: Redirecting (302) to <GET http://www.linkedin.com/nhome/> from <POST https://www.linkedin.com/uas/login-submit>
2013-07-11 15:50:04-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/nhome/> (referer: https://www.linkedin.com/uas/login)
2013-07-11 15:50:04-0500 [LinkedPy] DEBUG:
Successfully logged in. Let's start crawling!
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/csearch/results> (referer: http://www.linkedin.com/nhome/)
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG:
We got data!
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1009/IBM?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'IBM']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1033/Accenture?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Accenture']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1038/Deloitte?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Deloitte']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1035/Microsoft?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Microsoft']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1025/Hewlett-Packard?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Hewlett-Packard']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1028/Oracle?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Oracle']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1093/Dell?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Dell']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1123/Bank+of+America?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Bank of America']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1015/GE?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'GE']}
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin.com/csearch/results>
    {'link': [u'/companies/1441/Google?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'],
     'title': [u'Google']}
2013-07-11 15:50:05-0500 [LinkedPy] INFO: Closing spider (finished)
2013-07-11 15:50:05-0500 [LinkedPy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2243,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 91349,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 11, 20, 50, 5, 177000),
'item_scraped_count': 10,
'log_count/DEBUG': 22,
'log_count/INFO': 4,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2013, 7, 11, 20, 50, 1, 649000)}
2013-07-11 15:50:05-0500 [LinkedPy] INFO: Spider closed (finished)
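For reference, here is a minimal, untested sketch of how the conversion might look. The Rule's allow pattern and the parse_item name are assumptions, not from the original post. Note that CrawlSpider reserves parse() for its own rule dispatching, so the item callback must use a different name, and the login flow has to be wired in manually since CrawlSpider has no init_request hook:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.http import Request, FormRequest
    from scrapy.selector import HtmlXPathSelector
    from linkedpy.items import LinkedpyItem


    class LinkedPySpider(CrawlSpider):
        name = 'LinkedPy'
        allowed_domains = ['linkedin.com']
        login_page = 'https://www.linkedin.com/uas/login'
        start_urls = ["http://www.linkedin.com/csearch/results"]

        rules = (
            # Follow further result pages and extract items from each one.
            # The allow pattern is a placeholder; adjust it to the links
            # you actually want to follow in depth.
            Rule(SgmlLinkExtractor(allow=r'/csearch/results'),
                 callback='parse_item', follow=True),
        )

        def start_requests(self):
            # Log in before anything else; the real crawl starts only
            # after check_login_response confirms the session.
            return [Request(url=self.login_page, callback=self.login)]

        def login(self, response):
            return FormRequest.from_response(response,
                formdata={'session_key': '[email protected]',
                          'session_password': 'xxxxx'},
                callback=self.check_login_response)

        def check_login_response(self, response):
            if "Sign Out" in response.body:
                self.log("Successfully logged in. Let's start crawling!")
                # Feed start_urls through CrawlSpider's own parse() so
                # the rules above are applied to each response.
                for url in self.start_urls:
                    yield self.make_requests_from_url(url)
            else:
                self.log("Login failed.")

        def parse_item(self, response):
            # Same extraction as the InitSpider version.
            hxs = HtmlXPathSelector(response)
            for site in hxs.select('//ol[@id="result-set"]/li'):
                item = LinkedpyItem()
                item['title'] = site.select('h2/a/text()').extract()
                item['link'] = site.select('h2/a/@href').extract()
                yield item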
Also, could you use a generator instead of building a list? –
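(On the generator question: Scrapy accepts generator callbacks, so parse() can yield each item as it is built instead of accumulating a list. A small sketch of the same extraction rewritten that way:)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//ol[@id="result-set"]/li'):
            item = LinkedpyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            yield item  # yielded one at a time; no intermediate list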
@John I wrote some code and posted a question, but no one answered it ([previous post](http://stackoverflow.com/questions/17578727/)) – Gio