2013-07-01

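Hello, I'm trying to use XPath to get the title and link of the cells with the listCell class. I believe I'm doing it right because I get no errors, but nothing shows up in the CSV output file. I tested my Scrapy spider on other websites such as Amazon and it worked fine there, but it doesn't work on this one. Please help! I can't retrieve anything with XPath using Scrapy.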

def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//form[@id="listForm"]/table/tbody/tr')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//td[@class="listCell"]/a/text()').extract()
        item['link'] = site.select('.//td[@class="listCell"]/a/@href').extract()
        items.append(item)
    return items

Here is my HTML. Could it be that this isn't possible because the HTML has JavaScript in it?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> 
<title> Carrier IQ DIS 2.4 :: All Devices</title> 
<script type="text/javascript" src="/dis/js/main.js"> 
<script type="text/javascript" src="/dis/js/validate.js"> 
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css"> 
<link rel="stylesheet" type="text/css" href="/dis/css/style.css"> 
<script type="text/javascript"> 

    .... 

<form id="listForm" name="listForm" method="POST" action=""> 
<table> 
<thead> 
<tbody> 
<tr> 
<td class="crt">1</td> 
<td class="listCell" align="center"> 
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a> 
</td> 
<td class="listCell" align="center"> 
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a> 
</td> 
<td class="listCell" align="center"> 
<td class="listCell" align="center"> 
<td class="cell" align="center">2013-07-01 13:39:38.820</td> 
<td class="cell" align="left">1 - SMS_PullRequest_CS</td> 
<td class="listCell" align="right"> 
<td class="listCell" align="center"> 
<td class="listCell" align="center"> 
</tr> 
</tbody> 
</table> 
</form> 

Output

C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv -t csv
2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp> (referer: None)
2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/dis/login>
2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp)
2013-07-01 10:50:20-0500 [dis] DEBUG: 


    Successfully logged in. Let's start crawling! 



2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-01 10:50:21-0500 [dis] DEBUG: 


    We got data! 



2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished) 
2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 1382, 
    'downloader/request_count': 4, 
    'downloader/request_method_count/GET': 3, 
    'downloader/request_method_count/POST': 1, 
    'downloader/response_bytes': 147888, 
    'downloader/response_count': 4, 
    'downloader/response_status_count/200': 3, 
    'downloader/response_status_count/302': 1, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000), 
    'log_count/DEBUG': 12, 
    'log_count/INFO': 4, 
    'request_depth_max': 2, 
    'response_received_count': 3, 
    'scheduler/dequeued': 4, 
    'scheduler/dequeued/memory': 4, 
    'scheduler/enqueued': 4, 
    'scheduler/enqueued/memory': 4, 
    'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)} 
2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished) 

Answer

Try simplifying the XPath:

sites = hxs.select('//form[@id="listForm"]//tr') 

tbody is an element that, in many cases, is not present in the HTML source but is generated by your browser.
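
To illustrate the difference, here is a minimal sketch. It uses Python's standard-library ElementTree as a stand-in for Scrapy's selector so it runs without Scrapy installed, and RAW_HTML is a hypothetical, shortened version of the page as the server might send it, assuming there is no tbody in the raw markup. The path that requires tbody matches nothing, while the relaxed path finds the row.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw markup (an assumption): same table as in the question,
# but without <tbody>, and with a shortened href.
RAW_HTML = """
<html><body>
<form id="listForm" name="listForm" method="POST" action="">
<table>
<tr>
<td class="crt">1</td>
<td class="listCell" align="center"><a href="/dis/packages.jsp?view=list">6505550000</a></td>
<td class="listCell" align="center"><a href="/dis/packages.jsp?view=list">probe0</a></td>
<td class="cell" align="center">2013-07-01 13:39:38.820</td>
</tr>
</table>
</form>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path that insists on <tbody>: finds no rows in the raw markup.
strict_rows = root.findall(".//form[@id='listForm']/table/tbody/tr")

# Relaxed path: any <tr> below the form, whether or not <tbody> exists.
relaxed_rows = root.findall(".//form[@id='listForm']//tr")


def extract_links(rows):
    """Collect (text, href) pairs from the anchors inside listCell cells."""
    links = []
    for row in rows:
        for anchor in row.findall("./td[@class='listCell']/a"):
            links.append((anchor.text, anchor.get("href")))
    return links


print(len(strict_rows))             # number of rows found via tbody
print(extract_links(relaxed_rows))  # links found via the relaxed path
```

If the relaxed path still returns nothing in the real spider, the rows are probably not in the response that parse() receives at all.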

I tried what you suggested, Sjaak, and it didn't work. I still don't see anything extracted, and there are no errors, same as before. – Gio
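
On the JavaScript question: Scrapy does not run JavaScript, so rows that a script builds in the browser never appear in the downloaded markup. One way to check is to count the listCell cells in the raw page text. The sketch below assumes Python 3; CellCounter, count_cells and both sample pages are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Count <td> start tags that carry a given class in raw markup."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.css_class in classes:
            self.count += 1


def count_cells(markup, css_class="listCell"):
    """Return how many matching cells exist in the markup as downloaded."""
    parser = CellCounter(css_class)
    parser.feed(markup)
    return parser.count


# Hypothetical page where the rows are in the served HTML.
static_page = '<table><tr><td class="listCell"><a href="/x">probe0</a></td></tr></table>'

# Hypothetical page where a script would add the rows in the browser.
scripted_page = '<table id="devices"></table><script>buildRows();</script>'

print(count_cells(static_page))
print(count_cells(scripted_page))
```

A count of zero for the real page text would suggest that the table is filled in by a script (or that the spider landed on a different page than the one inspected in the browser), in which case no XPath will find the rows.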
