2014-07-11 3 views
0

Problem with Scrapy array items

I have a small problem with the items from my Scrapy crawl. Below is the crawl's output, but I want a different format: I cannot save the extracted data to a PostgreSQL database in this form.

{'desc': [u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110\xa0', 
      u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N7010/N7110\ufeff - Li-ion 11.1V/5200mAh', 
      u'Batterie 6 Cellules Pour PC Portable Toshiba A200 - Li-ion 10.8V/5200mAh', 
      u'Batterie 6 Cellules Pour PC Portable HP ProBook 4510S - Li-ion 10.8V/5200mAh', 
      u'Batterie 6 Cellules Pour PC Portable HP Compaq CQ45/CQ50/CQ60 - HP Pavilion DV4/DV5/DV6\ufeff\ufeff - Li-ion 10.8V/5200mAh', 
      u'Batterie 6 Cellules Pour PC Portable HP Compaq CQ42/CQ62\ufeff - Li-ion 10.8V/5200mAh'], 
'link': [u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/8442-batterie-6-cellules-pour-pc-portable-dell-inspiron-n5010-n5110.html', 
      u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/8443-batterie-6-cellules-pour-pc-portable-dell-inspiron-n7010-n7110.html', 
      u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/7735-batterie-6-cellules-pour-pc-portable-toshiba-a200.html', 
      u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/7729-batterie-6-cellules-pour-pc-portable-hp-probook-4510s.html', 
      u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/7726-batterie-6-cellules-pour-pc-portable-hp-compaq-cq45-cq50-cq60-dv4-dv5-dv6.html', 
      u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/7725-batterie-6-cellules-pour-pc-portable-hp-compaq-cq42-cq62.html'], 
'price': [u'69,900 DT', 
      u'69,900 DT', 
      u'69,900 DT', 
      u'69,900 DT', 
      u'69,900 DT', 
      u'69,900 DT'], 
'title': [u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110', 
      u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N7010/N7110', 
      u'Batterie 6 Cellules Pour PC Portable Toshiba A200', 
      u'Batterie 6 Cellules Pour PC Portable HP ProBook 4510S', 
      u'Batterie 6 Cellules Pour PC Portable HP Compaq CQ45/CQ50/CQ60/DV4/DV5/DV6', 
      u'Batterie 6 Cellules Pour PC Portable HP Compaq CQ42/CQ62']} 

Instead, I would like to get output like this:

{'desc':'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110', 
'link':'http://www.tunisianet.com.tn/batterie-pour-pc-portable/7725-batterie-6-cellules-pour-pc-portable-hp-compaq-cq42-cq62.html'} ... 

'price':'69,900 DT' 
'title':'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110' 

My spider code:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 
from ecommerce.items import ArticleItem 

class Tunisianet_Spider(CrawlSpider): 
    name = 'tunisianet' 
    start_urls = ('http://www.tunisianet.com.tn/', 
    ) # urls from which the spider will start crawling 
    rules = [ 
    Rule(SgmlLinkExtractor(allow=[r'\d{3}-\w+-\w+-\w+$']), callback='parse_Article_tunisianet'), 
    Rule(SgmlLinkExtractor(allow=[r'\d{3}-\w+-\w+-\w+-\w+$']),  callback='parse_Article_tunisianet'), 
    Rule(SgmlLinkExtractor(allow=[r'\d{3}-\w+$']), callback='parse_Article_tunisianet'), 
    # r'\d{4}/\d{2}/\w+' : regular expression for http://tunisianet.com.tn/220-telephone-portable-tunisie
    ] 
    def parse_Article_tunisianet(self, response):
        hxs = HtmlXPathSelector(response)
        item = ArticleItem()
        # Extract title, description, price and link.
        # Each absolute XPath matches every product on the page,
        # so each field ends up holding a list of values.
        item['title'] = hxs.select('//*[@id="produit_liste_texte"]/div/h2/a/text()').extract()
        item['desc'] = hxs.select('//*[@id="produit_liste_texte"]/div/p[1]/a/text()').extract()
        item['price'] = hxs.select('//*[@id="produit_liste_prix"]/div[1]/span/text()').extract()
        item['link'] = hxs.select('//*[@id="produit_liste_texte"]/div/h2/a/@href').extract()

        return item

Answers

1

You should loop over each <li> of <ul class="clear" id="product_list">.

For each list item:

  • instantiate a new ArticleItem
  • apply relative XPath selectors

Something like:

def parse_Article_tunisianet(self, response):
    hxs = HtmlXPathSelector(response)

    # Build one item per <li> of the product list
    for li in hxs.select('//ul[@id="product_list"]/li'):
        item = ArticleItem()
        # The leading './/' makes each XPath relative to the current <li>
        item['title'] = li.select('.//*[@id="produit_liste_texte"]/div/h2/a/text()').extract()
        item['desc'] = li.select('.//*[@id="produit_liste_texte"]/div/p[1]/a/text()').extract()
        item['price'] = li.select('.//*[@id="produit_liste_prix"]/div[1]/span/text()').extract()
        item['link'] = li.select('.//*[@id="produit_liste_texte"]/div/h2/a/@href').extract()

        yield item
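
Note that extract() still returns a list for each field, even when only one node matches, while the output asked for in the question holds plain strings. A minimal sketch of two ways to get there, assuming the values look like the crawl output shown in the question. Both helpers, first_text and rows_from_columns, are hypothetical names for illustration and are not part of Scrapy:

```python
# Hypothetical helpers for illustration; neither is part of Scrapy.

def first_text(values):
    """Return the first non-empty cleaned string, or None if there is none."""
    for value in values:
        # Drop the non-breaking spaces and byte-order marks seen in the crawl output.
        cleaned = value.replace(u'\xa0', u' ').replace(u'\ufeff', u'').strip()
        if cleaned:
            return cleaned
    return None


def rows_from_columns(columns):
    """Turn {'title': [a, b], 'price': [x, y]} into one dict per product."""
    keys = sorted(columns)
    # zip stops at the shortest list, so this assumes the lists are aligned.
    return [dict(zip(keys, values))
            for values in zip(*(columns[key] for key in keys))]


# Values copied from the first product in the question's crawl output.
scraped = {
    'title': [u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110'],
    'desc': [u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110\xa0'],
    'price': [u'69,900 DT'],
}
flat = {key: first_text(values) for key, values in scraped.items()}
print(flat)
```

Inside the loop this would read item['title'] = first_text(li.select(...).extract()). rows_from_columns is only a fallback for data that was already crawled in the column-wise shape; it relies on every list having the same length and order, which the per-<li> loop does not need.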

+0

Paul, a bit off topic: I checked out parslepy, and it is a really great piece of work. Well done, and thank you. I think I have a good use case for it. And of course, +1 for the answer. – alecxe

+0

Thanks, @alecxe. Let me know how it goes with parslepy. Contributions welcome ;-) –
