Scrapy Json Rule SgmlLink Extractor

How can I set up a rule when a website sends a JSON response instead of HTML? The first response from the start URL is HTML, but when I navigate through the pages it returns JSON. Here is my rule:

Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="GridTimeline-items"]'), tags=('div'), 
            attrs=('data-min-position'), allow=(r''), process_value=my_process_value_friends), 
            callback='parse_friends', follow=True),

My question is: how can I apply XPath to a JSON response?

You cannot parse JSON with XPath or CSS selectors.

Source

2016-09-06 Rocky

You should use `scrapy.linkextractors.LinkExtractor`; `SgmlLinkExtractor` has been deprecated for a while now, and the two are essentially the same thing. – Granitosaurus

Thanks :) – Rocky

You can unpack the JSON into a Python dictionary instead:

import json

def parse(self, response):
    data = json.loads(response.body)
    # then just parse it, e.g.
    item = dict()
    item['name'] = data['name']
    # ...

Alternatively, instead of unpacking the JSON into a Python dictionary, you can convert it to XML and then parse it with Scrapy selectors. Many packages do this, but I'll highlight dicttoxml in my example:

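import json
from dicttoxml import dicttoxml
from scrapy import Selector

def parse(self, response):
    data = json.loads(response.body)
    # dicttoxml returns bytes, so decode before building an XML selector
    data_xml = dicttoxml(data).decode('utf-8')
    sel = Selector(text=data_xml, type='xml')
    # then parse it
    item = dict()
    # .get() returns the first match as a string (or None if nothing matches)
    item['name'] = sel.xpath('//name/text()').get()
    # ...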

Source

2016-09-06 06:34:25 Granitosaurus

Thanks, but I'm looking for a solution for the rules, not for the parsing step. – Rocky

@Reymark You can't use `restrict_xpaths` on a JSON source without extending how CrawlSpider works. The easy way, though, is to do it manually as described in my answer. Use a `parse` callback in your LinkExtractor rule and check whether the page is JSON. If you find a JSON URL, continue as normal. – Granitosaurus
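The manual approach from that comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names `is_json_response` and `next_page_url`, the JSON key `min_position`, and the query parameter `max_position` are hypothetical and not taken from the site in question. It also assumes the JSON body is an object rather than a list. The helpers use only the standard library so they can be tried outside a spider.

```python
import json
from urllib.parse import urlencode, urlsplit, urlunsplit


def is_json_response(content_type: str) -> bool:
    """Return True when a Content-Type header value denotes JSON."""
    return content_type.split(";")[0].strip().lower() == "application/json"


def next_page_url(current_url: str, body: str, key: str = "min_position"):
    """Build the next page URL from a JSON body.

    `key` and the `max_position` query parameter are hypothetical names;
    replace them with whatever the target site actually uses.
    Returns None when the body carries no cursor value.
    """
    data = json.loads(body)  # assumed to be a JSON object (dict)
    cursor = data.get(key)
    if not cursor:
        return None
    parts = urlsplit(current_url)
    query = urlencode({"max_position": cursor})
    # Keep scheme, host and path; replace the query string, drop the fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


# Hypothetical use inside the rule's callback:
#     ctype = response.headers.get("Content-Type", b"").decode()
#     if is_json_response(ctype):
#         url = next_page_url(response.url, response.text)
#         if url:
#             yield scrapy.Request(url, callback=self.parse_friends)
```

The point of the design is that the link extractor only handles the HTML pages, while JSON pages are followed by yielding requests by hand from the callback.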
