Why is Scrapy not crawling or parsing?

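I am trying to scrape the Library of Congress/Thomas website. This Python script is intended to access a sample of 40 bills from the site (identifiers 1-40 in the URLs). I want to parse the body of each piece of legislation, search the body/content, extract the links to potential multiple versions, and follow them.

Once on the version page(s), I want to parse the body of each piece of legislation, search the body/content, extract the links to potential sections, and follow them.

Once on a section page, I want to parse the body of each section of a bill.

I believe there is a problem with the Rules/LinkExtractor segment of my code. The Python code crawls the start URLs but does not parse them or perform any of the subsequent tasks.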

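Three issues:

1. Some bills do not have multiple versions (and ergo have no links in the body portion of the URL).
2. Some bills do not have linked sections because they are so short.
3. Some section links ...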

My question is, again: why is Scrapy not crawling or parsing?

from scrapy.item import Item, Field 
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 

class BillItem(Item): 
    title = Field() 
    body = Field() 

class VersionItem(Item): 
    title = Field() 
    body = Field() 

class SectionItem(Item): 
    body = Field() 

class Lrn2CrawlSpider(CrawlSpider): 
    name = "lrn2crawl" 
    allowed_domains = ["thomas.loc.gov"] 
    start_urls = ["http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.%s:" % bill for bill in xrange(000001,00040,00001) ### Sample of 40 bills; Total range of bills is 1-5767 

    ] 

rules = (
     # Extract links matching /query/ fragment (restricting tho those inside the content body of the url) 
     # and follow links from them (since no callback means follow=True by default). 
     # Desired result: scrape all bill text & in the event that there are multiple versions, follow them & parse. 
     Rule(SgmlLinkExtractor(allow=(r'/query/'), restrict_xpaths=('//div[@id="content"]')), callback='parse_bills', follow=True), 

     # Extract links in the body of a bill-version & follow them. 
     #Desired result: scrape all version text & in the event that there are multiple sections, follow them & parse. 
     Rule(SgmlLinkExtractor(restrict_xpaths=('//div/a[2]')), callback='parse_versions', follow=True) 
    ) 

def parse_bills(self, response): 
    hxs = HtmlXPathSelector(response) 
    bills = hxs.select('//div[@id="content"]') 
    scraped_bills = [] 
    for bill in bills: 
     scraped_bill = BillItem() ### Bill object defined previously 
     scraped_bill['title'] = bill.select('p/text()').extract() 
     scraped_bill['body'] = response.body 
     scraped_bills.append(scraped_bill) 
    return scraped_bills 

def parse_versions(self, response): 
    hxs = HtmlXPathSelector(response) 
    versions = hxs.select('//div[@id="content"]') 
    scraped_versions = [] 
    for version in versions: 
     scraped_version = VersionItem() ### Version object defined previously 
     scraped_version['title'] = version.select('center/b/text()').extract() 
     scraped_version['body'] = response.body 
     scraped_versions.append(scraped_version) 
    return scraped_versions 

def parse_sections(self, response): 
    hxs = HtmlXPathSelector(response) 
    sections = hxs.select('//div[@id="content"]') 
    scraped_sections = [] 
    for section in sections: 
     scraped_section = SectionItem() ## Segment object defined previously 
     scraped_section['body'] = response.body 
     scraped_sections.append(scraped_section) 
    return scraped_sections 

spider = Lrn2CrawlSpider()

Source

2013-07-12 DV Hughes

I just fixed the indentation, removed the spider = Lrn2CrawlSpider() line at the end of the script, and ran the spider via scrapy runspider lrn2crawl.py. It scrapes, follows links, and returns items, so your rules work.

Here is what I'm running:

from scrapy.item import Item, Field 
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 

class BillItem(Item): 
    title = Field() 
    body = Field() 

class VersionItem(Item): 
    title = Field() 
    body = Field() 

class SectionItem(Item): 
    body = Field() 

class Lrn2CrawlSpider(CrawlSpider): 
    name = "lrn2crawl" 
    allowed_domains = ["thomas.loc.gov"] 
    # Sample of 40 bills (1-40); the total range of bills is 1-5767.
    # Decimal bounds are used because Python 2 reads zero-padded literals such as 00040 as octal (32).
    start_urls = ["http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.%s:" % bill for bill in xrange(1, 41)]

    rules = (
        # Extract links matching the /query/ fragment (restricted to those inside the content body of the page)
        # and follow links from them (follow=True is explicit, since a rule with a callback does not follow by default).
        # Desired result: scrape all bill text & in the event that there are multiple versions, follow them & parse.
        Rule(SgmlLinkExtractor(allow=(r'/query/'), restrict_xpaths=('//div[@id="content"]')), callback='parse_bills', follow=True),

        # Extract links in the body of a bill-version & follow them.
        # Desired result: scrape all version text & in the event that there are multiple sections, follow them & parse.
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div/a[2]')), callback='parse_versions', follow=True)
    )

    def parse_bills(self, response):
        hxs = HtmlXPathSelector(response)
        bills = hxs.select('//div[@id="content"]')
        scraped_bills = []
        for bill in bills:
            scraped_bill = BillItem()  # Bill object defined previously
            scraped_bill['title'] = bill.select('p/text()').extract()
            scraped_bill['body'] = response.body
            scraped_bills.append(scraped_bill)
        return scraped_bills

    def parse_versions(self, response):
        hxs = HtmlXPathSelector(response)
        versions = hxs.select('//div[@id="content"]')
        scraped_versions = []
        for version in versions:
            scraped_version = VersionItem()  # Version object defined previously
            scraped_version['title'] = version.select('center/b/text()').extract()
            scraped_version['body'] = response.body
            scraped_versions.append(scraped_version)
        return scraped_versions

    def parse_sections(self, response):
        hxs = HtmlXPathSelector(response)
        sections = hxs.select('//div[@id="content"]')
        scraped_sections = []
        for section in sections:
            scraped_section = SectionItem()  # Section object defined previously
            scraped_section['body'] = response.body
            scraped_sections.append(scraped_section)
        return scraped_sections
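
One detail of the start_urls line in the question may be worth checking. The use of xrange implies Python 2, where an integer literal with a leading zero is read as octal, so 00040 would be 32 and the sample would cover bills 1 to 31 rather than 1 to 40. The sketch below reproduces that arithmetic with int(text, 8) so that it also runs on Python 3, where such literals are a syntax error. The variable names are only illustrative.

```python
# Python 2 parses a zero-prefixed integer literal as octal; int(text, 8)
# reproduces that reading on any Python version.
start = int("000001", 8)
stop = int("00040", 8)
step = int("00001", 8)

# This is the range the question's xrange(000001,00040,00001) would produce.
bill_numbers = list(range(start, stop, step))
print(stop)               # 32
print(len(bill_numbers))  # 31
print(bill_numbers[-1])   # 31

# Plain decimal bounds give the intended sample of bills 1 through 40.
intended = list(range(1, 41))
print(len(intended))      # 40
```

If the full set of bills is needed later, the same decimal form extends naturally, for example an upper bound of 5768 for bills 1-5767.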

Hope that helps.

Source

2013-07-12 07:28:08 alecxe

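Yes, it helps. Removing the last line "spider = [...]" lets the script run. I am still confused as to why, though. When I ran the script in debug mode there was a syntax error at "rules ([...]", which is why I believed the problem was there. Did I just learn that the script was failing in an odd way, running but not doing any work, and that the debugger was pointing me in the wrong direction? Perhaps I am wrong. In any case, yes, this helps a lot.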

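Just for the record, the problem with your script is that the variable rules is not inside the scope of the Lrn2CrawlSpider class, because it does not share the same indentation as the rest of the class body. When alecxe fixed the indentation, rules became an attribute of the class. Later, the inherited __init__() method reads that attribute, compiles the rules, and enforces them.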

def __init__(self, *a, **kw): 
    super(CrawlSpider, self).__init__(*a, **kw) 
    self._compile_rules()

Removing the last line had nothing to do with it.
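
The scoping effect described above can be reproduced without Scrapy. The following is a minimal sketch, not Scrapy's actual implementation: MiniCrawlSpider, the stand-in Rule class, BrokenSpider, and FixedSpider are all hypothetical names invented for illustration. It only assumes that the base class reads rules through self during __init__, as the snippet above suggests.

```python
class Rule(object):
    """Hypothetical stand-in for a crawl rule: a URL pattern and a callback name."""

    def __init__(self, pattern, callback=None):
        self.pattern = pattern
        self.callback = callback


class MiniCrawlSpider(object):
    """Hypothetical, simplified model of a spider that compiles rules on init."""

    rules = ()  # class-level default: no rules

    def __init__(self):
        self._compile_rules()

    def _compile_rules(self):
        # `rules` is looked up through self, so only class or instance
        # attributes are seen; callback names are resolved to bound methods.
        self._rules = [
            (rule.pattern, getattr(self, rule.callback, None) if rule.callback else None)
            for rule in self.rules
        ]


class BrokenSpider(MiniCrawlSpider):
    name = "broken"

# Dedented: this is a module-level variable, not an attribute of BrokenSpider,
# so BrokenSpider still inherits the empty default from MiniCrawlSpider.
rules = (Rule(r"/query/", callback="parse_bills"),)


class FixedSpider(MiniCrawlSpider):
    name = "fixed"
    # Indented inside the class body, so it overrides MiniCrawlSpider.rules.
    rules = (Rule(r"/query/", callback="parse_bills"),)

    def parse_bills(self, response):
        return [response]


broken = BrokenSpider()
fixed = FixedSpider()
print(len(broken._rules))  # 0
print(len(fixed._rules))   # 1
```

In this model the broken spider starts without error and simply has nothing to follow or parse, which is consistent with a spider that fetches its start URLs and then stops.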

Source

2015-06-03 18:13:48
