I downloaded the source of a page from <a href="https://www.indeed.cl/trabajo?q=Data%20scientist&l=" rel="nofollow noreferrer">Indeed</a> with Scrapy, and I am trying to get all the job titles from it using this XPath:

response.xpath('//*[@class=" row result"]//*[@class="jobtitle"]//text()').extract()

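The problem is that each title is not returned as a single string, so I get this result:

[u'\n ',
u'Data',
u' ',
u'Scientist',
u' Experto SQL con conocimiento en R',
u'\n ',
u'\n ',
u'Data',
u' Analytic con Python',
u'\n ',
u'\n ',
u'Data',
u' Analytic con R',

What I would like is to select one job at a time and process it, because I am having trouble mapping the titles to the rest of the data. Something similar to extract_first():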

response.xpath('//*[@class=" row result"]').extract_first()

but with the option of passing an index, so that I can keep processing the data. I tried this:

current_job = response.xpath('//*[@class=" row result"]').extract_first() 
current_job = TextResponse(url='',body=current_job,encoding='utf-8')

but it only works for the first result, and it does not look Pythonic to me.

Source

2017-12-30 Luis Ramon Ramirez Rodriguez

Something like `if list_item.strip()`? –

@KlausD. I am looking for something built into Scrapy, so that I do not have to use TextResponse() every time. I am not sure it exists. –

Can't you use a 'for' loop? – furas

First I would get every a element (without text() and extract()), then use a 'for' loop to handle each a separately with text() and extract(), and use join() to concatenate the pieces into the title string.

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = ['https://www.indeed.cl/trabajo?q=Data%20scientist&l=']

    def parse(self, response):
        print('url:', response.url)

        # Select the <a> elements themselves; do not extract text yet.
        results = response.xpath('//h2[@class="jobtitle"]/a')
        print('number:', len(results))

        for item in results:
            # Relative path: only the text nodes inside this <a>.
            title = ''.join(item.xpath('.//text()').extract())
            print('title:', title)


# --- it runs without a project; titles are only printed, nothing is saved ---

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(MySpider)
c.start()

Result:

number: 10 
title: Data Scientist 
title: CONSULTOR DATA SCIENCE SANTIAGO DE CHILE 
title: Líder Análisis de Datos MCoE Minerals Americas 
title: Ingeniero Inteligencia Mercado, BI 
title: Ingeniero Inteligencia de Mercado, Business Intelligence 
title: Data Scientist 
title: Data Scientist 
title: Data Scientist (Machine Learning) 
title: Data Scientist/Ml Scientist 
title: Young Professional - Spanish LatAm

Source

2017-12-30 06:21:23 furas

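Thanks. If the dot is left out of the second XPath, the result is different. Why is that? –

With the dot it is a relative path, so it searches only inside the element held in 'item'. Without the dot the path is absolute and searches the whole document, so it can return all the text in the HTML. – furas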

Give this a go. You will need to adjust the script slightly to fit your project, but it should solve the problem mentioned above.

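import requests
from scrapy import Selector

res = requests.get("https://www.indeed.cl/trabajo?q=Data%20scientist")
# Build the selector from the downloaded HTML text.
sel = Selector(text=res.text)
for item in sel.css("h2.jobtitle a"):
    # Join the text fragments of this link into one title.
    title = ' '.join(item.css("::text").extract())
    print(title)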

Output:

Data Scientist 
CONSULTOR DATA SCIENCE SANTIAGO DE CHILE 
Líder Análisis de Datos MCoE Minerals Americas 
Ingeniero Inteligencia Mercado, BI 
Ingeniero Inteligencia de Mercado, Business Intelligence 
Data Scientist 
Data Scientist 
Young Professional - Spanish LatAm 
Data Scientist (Machine Learning) 
Data Scientist/Ml Scientist
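One detail worth checking when joining fragments: the list shown in the question contains whitespace-only strings, so joining with a space can leave doubled spaces inside a title. A minimal sketch of one way to normalize this, using the first fragments from the question; the `normalize_title` helper is hypothetical.

```python
# First title's fragments, copied from the output shown in the question.
fragments = ['\n ', 'Data', ' ', 'Scientist',
             ' Experto SQL con conocimiento en R', '\n ']


def normalize_title(parts):
    """Concatenate the fragments, then collapse every run of whitespace."""
    return ' '.join(''.join(parts).split())


joined_with_space = ' '.join(fragments)   # keeps the extra whitespace
clean = normalize_title(fragments)

print(repr(joined_with_space))
print(repr(clean))
```

`str.split()` with no arguments splits on any run of whitespace and discards empty strings, which is why the newline and the lone space disappear.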

Source

2017-12-30 09:17:45 SIM
