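I have a small problem processing crawled items. Below is the output of my crawl, but I want a different format: I cannot store the extracted data in a PostgreSQL database in its current form. The problem is that every field of the item is an array.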
{'desc': [u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110\xa0',
u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N7010/N7110\ufeff - Li-ion 11.1V/5200mAh',
u'Batterie 6 Cellules Pour PC Portable Toshiba A200 - Li-ion 10.8V/5200mAh',
u'Batterie 6 Cellules Pour PC Portable HP ProBook 4510S - Li-ion 10.8V/5200mAh',
u'Batterie 6 Cellules Pour PC Portable HP Compaq CQ45/CQ50/CQ60 - HP Pavilion DV4/DV5/DV6\ufeff\ufeff - Li-ion 10.8V/5200mAh',
u'Batterie 6 Cellules Pour PC Portable HP Compaq CQ42/CQ62\ufeff - Li-ion 10.8V/5200mAh'],
'link': [u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/8442-batterie-6-cellules-pour-pc-portable-dell-inspiron-n5010-n5110.html',
u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/8443-batterie-6-cellules-pour-pc-portable-dell-inspiron-n7010-n7110.html',
u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/7735-batterie-6-cellules-pour-pc-portable-toshiba-a200.html',
u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/7729-batterie-6-cellules-pour-pc-portable-hp-probook-4510s.html',
u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/7726-batterie-6-cellules-pour-pc-portable-hp-compaq-cq45-cq50-cq60-dv4-dv5-dv6.html',
u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/7725-batterie-6-cellules-pour-pc-portable-hp-compaq-cq42-cq62.html'],
'price': [u'69,900 DT',
u'69,900 DT',
u'69,900 DT',
u'69,900 DT',
u'69,900 DT',
u'69,900 DT'],
'title': [u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110',
u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N7010/N7110',
u'Batterie 6 Cellules Pour PC Portable Toshiba A200',
u'Batterie 6 Cellules Pour PC Portable HP ProBook 4510S',
u'Batterie 6 Cellules Pour PC Portable HP Compaq CQ45/CQ50/CQ60/DV4/DV5/DV6',
u'Batterie 6 Cellules Pour PC Portable HP Compaq CQ42/CQ62']}
Instead, I would like to get output like this:
{'desc': 'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110',
'link': 'http://www.tunisianet.com.tn/batterie-pour-pc-portable/7725-batterie-6-cellules-pour-pc-portable-hp-compaq-cq42-cq62.html',
'price': '69,900 DT',
'title': 'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110'} ...
My spider code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from ecommerce.items import ArticleItem


class Tunisianet_Spider(CrawlSpider):
    name = 'tunisianet'
    start_urls = ('http://www.tunisianet.com.tn/',)  # URLs from which the spider will start crawling
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'\d{3}-\w+-\w+-\w+$']), callback='parse_Article_tunisianet'),
        Rule(SgmlLinkExtractor(allow=[r'\d{3}-\w+-\w+-\w+-\w+$']), callback='parse_Article_tunisianet'),
        Rule(SgmlLinkExtractor(allow=[r'\d{3}-\w+$']), callback='parse_Article_tunisianet'),
        # r'\d{4}/\d{2}/\w+' : regular expression for http://tunisianet.com.tn/220-telephone-portable-tunisie
    ]

    def parse_Article_tunisianet(self, response):
        hxs = HtmlXPathSelector(response)
        item = ArticleItem()
        # Each XPath matches every product on the listing page, so extract()
        # returns a list per field and a single item holds all products.
        item['title'] = hxs.select('//*[@id="produit_liste_texte"]/div/h2/a/text()').extract()
        item['desc'] = hxs.select('//*[@id="produit_liste_texte"]/div/p[1]/a/text()').extract()
        item['price'] = hxs.select('//*[@id="produit_liste_prix"]/div[1]/span/text()').extract()
        item['link'] = hxs.select('//*[@id="produit_liste_texte"]/div/h2/a/@href').extract()
        return item
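For illustration, here is a minimal plain-Python sketch of the reshaping step, going from one item whose fields are parallel lists to one dict per product. The helper name split_item is hypothetical (it is not part of Scrapy), and the sample data is a two-product excerpt of the crawl output above. It assumes the four lists are in the same order and of the same length; zip() silently truncates to the shortest list otherwise.

```python
def split_item(item):
    """Turn one item holding parallel lists into a list of one-product dicts.

    Hypothetical helper for illustration; assumes all lists are aligned.
    """
    keys = sorted(item)
    products = []
    # zip(*lists) yields one tuple per product: (desc, link, price, title).
    for row in zip(*(item[key] for key in keys)):
        product = {}
        for key, value in zip(keys, row):
            # Drop the non-breaking spaces (\xa0) and BOM characters (\ufeff)
            # that appear in the crawl output, then trim whitespace.
            cleaned = value.replace(u'\xa0', u' ').replace(u'\ufeff', u'')
            product[key] = cleaned.strip()
        products.append(product)
    return products


# Two-product excerpt of the crawl output shown in the question.
item = {
    'desc': [u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110\xa0',
             u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N7010/N7110\ufeff - Li-ion 11.1V/5200mAh'],
    'link': [u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/8442-batterie-6-cellules-pour-pc-portable-dell-inspiron-n5010-n5110.html',
             u'http://www.tunisianet.com.tn/batterie-pour-pc-portable/8443-batterie-6-cellules-pour-pc-portable-dell-inspiron-n7010-n7110.html'],
    'price': [u'69,900 DT', u'69,900 DT'],
    'title': [u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N5010/N5110',
              u'Batterie 6 Cellules Pour PC Portable Dell Inspiron N7010/N7110'],
}

for product in split_item(item):
    print(product)
```

Each resulting dict maps onto one database row. Inside the spider, a more robust variant would probably loop over each product node and build one item per node, so that a product with a missing field cannot shift the columns out of alignment.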
I checked out parslepy, and it is really great work, well done. Thank you. I think I have a good use case for it. +1 for the answer, of course. – alecxe
Thanks, @alecxe. Let me know how it goes with parslepy. Contributions welcome ;-) –