2013-04-18 2 views
2

I am trying to scrape details from Wikipedia using Scrapy. The scrape itself works, but the results are very messy and of poor quality. I am new to Python and Scrapy, so I am having trouble fixing this. How can I get good results from Scrapy?

Here is my code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from wikipedia.items import WikipediaItem


class WikipediaSpider(BaseSpider):
    name = "wiki"
    allowed_domains = ["wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table[@id="mp-upper"]/tr')
        items = []
        for site in sites:
            item = WikipediaItem()
            item['title'] = site.select('.//a/text()').extract()
            item['link'] = site.select('.//a/@href').extract()
            item['details'] = site.select('.//p/text()').extract()
            items.append(item)
        return items

This is the result:

2013-04-19 02:18:48+0800 [wiki] DEBUG: Scraped from <200 http://en.wikipedia.org/wiki/Main_Page> 

{'details': [u' is a fungal species found in moist habitats in ', 

       u'. The species produces brown ', 
       u' with ', 

       u' of varying shapes up to 40 millimetres (1.6\xa0in) across, and tall, thin ', 

       u' up to 62 millimetres (2.4\xa0in) long, at the base of which is a large and well-defined "bulb". The stem varies in colour, with whitish, pale yellow-brown, pale red-brown, pale brown and grey-brown all observed. The species produces unusually shaped, irregular ', 

       u', each with a few thick protrusions. This feature helps differentiate it from other species that would otherwise be similar in appearance and ', 

       u'. It grows in ', 

       u' association with ', 

       u', and it is for this that the species is named. However, particular species favoured by the fungus are unclear and may include ', 

       u' and ', 

       u' taxa. The mushrooms grow from the ground, often among mosses or ', 

       u'. The species was first described in 2009, and within the genus ', 

       u', it is a part of the ', 

       u' ', 

       u'. The ', 

       u' ', 

       u' was collected from the shore of a lake near ', 

       u', Finland. The species has also been recorded in Sweden and, at least in some areas, it is relatively common. (', 

       u')', 

       u'Recently featured: ', 

       u'\xa0\u2013 ', 

       u'\xa0\u2013 ', 

       u': ', 

       u' ', 

       u' ', 

       u'More anniversaries: ', 

       u' ', 

       u' '], 

    'link': [u'/wiki/File:Inocybe_saliceticola.jpg', 

       u'/wiki/Inocybe_saliceticola', 

       u'/wiki/Nordic_countries', 

       u'/wiki/Mushrooms', 

       u'/wiki/Pileus_(mycology)', 

       u'/wiki/Stipe_(mycology)', 

       u'/wiki/Spore', 

       u'/wiki/Habit_(biology)', 

       u'/wiki/Mycorrhizal', 

       u'/wiki/Willow', 

       u'/wiki/Beech', 

       u'/wiki/Alder', 

       u'/wiki/Detritus', 

       u'/wiki/Section_(botany)', 

       u'/wiki/Holotype', 

       u'/wiki/Nurmes', 

       u'/wiki/Inocybe_saliceticola', 

       u'/wiki/Thistle,_Utah', 

       u'/wiki/Be_Here_Now_(album)', 

       u'/wiki/Sumatran_rhinoceros', 

       u'/wiki/Wikipedia:Today%27s_featured_article/April_2013', 

       u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l', 

       u'/wiki/Wikipedia:Featured_articles', 

       u'/wiki/Wikipedia:Recent_additions', 

       u'/wiki/File:Ezra_Meeker_1921_crop.jpg', 

       u'/wiki/Ezra_Meeker', 

       u'/wiki/Oregon_Trail', 

       u'/wiki/Bullock_cart', 

       u'/wiki/Italy_at_the_2009_Mediterranean_Games', 

       u'/wiki/2009_Mediterranean_Games_medal_table', 

       u'/wiki/Cossack_hetman', 

       u'/wiki/Ivan_Petrizhitsky-Kulaga', 

       u'/wiki/Cossacks', 

       u'/wiki/Fokus_(magazine)', 

       u'/wiki/Amir_Garrett', 

       u'/wiki/College_basketball', 


       u'/wiki/Fastball', 

       u'/wiki/Armenian_Genocide', 

       u'/wiki/Karin_dialect', 

       u'/wiki/Scottish_American', 

       u'/wiki/Daniel_Pennie_House', 

       u'/wiki/Wikipedia:Recent_additions', 

       u'/wiki/Wikipedia:Your_first_article', 

       u'/wiki/Template_talk:Did_you_know', 

       u'/wiki/Slang', 

       u'/wiki/Hammer', 

       u'/wiki/Church_(building)', 

       u'/wiki/Wikipedia:Today%27s_articles_for_improvement', 

       u'/wiki/File:2013_Boston_Marathon_aftermath_people.jpg', 

       u'/wiki/West_fertilizer_plant_explosion', 

       u'/wiki/West,_Texas', 

       u'/wiki/Texas', 

       u'/wiki/Moment_magnitude_scale', 

       u'/wiki/2013_Sistan_and_Baluchestan_earthquake', 

       u'/wiki/Sistan_and_Baluchestan_Province', 

       u'/wiki/15_April_2013_Iraq_attacks', 

       u'/wiki/Boston_Marathon_bombings', 

       u'/wiki/2013_Boston_Marathon', 

       u'/wiki/Death_and_state_funeral_of_Hugo_Ch%C3%A1vez', 

       u'/wiki/Nicol%C3%A1s_Maduro', 

       u'/wiki/Venezuelan_presidential_election,_2013', 

       u'/wiki/List_of_Presidents_of_Venezuela', 

       u'/wiki/Adam_Scott_(golfer)', 

       u'/wiki/2013_Masters_Tournament', 

       u'/wiki/Government_of_India', 

       u'/wiki/Bollywood', 

       u'/wiki/Pran', 

       u'/wiki/Dadasaheb_Phalke_Award', 

       u'/wiki/Deaths_in_2013', 

       u'/wiki/Colin_Davis', 

       u'/wiki/Maria_Tallchief', 

       u'/wiki/Jonathan_Winters', 

       u'//en.wikinews.org/wiki/Main_Page', 

       u'/wiki/Portal:Current_events', 

       u'/wiki/April_18', 

       u'/wiki/File:Stpetes.JPG', 

       u'/wiki/1506', 

       u'/wiki/St._Peter%27s_Basilica', 

       u'/wiki/Vatican_City', 

       u'/wiki/Old_St._Peter%27s_Basilica', 

       u'/wiki/1689', 

       u'/wiki/Militia_(United_States)', 

       u'/wiki/Boston', 

       u'/wiki/1689_Boston_revolt', 

       u'/wiki/Dominion_of_New_England', 

       u'/wiki/1923', 

       u'/wiki/New_York_Yankees', 

       u'/wiki/Major_League_Baseball', 

       u'/wiki/Yankee_Stadium_(1923)', 

       u'/wiki/1938', 

       u'/wiki/Superman', 

       u'/wiki/Jerry_Siegel', 

       u'/wiki/Joe_Shuster', 

       u'/wiki/Action_Comics_1', 

       u'/wiki/Superhero', 

       u'/wiki/Comic_book', 

       u'/wiki/1947', 

       u'/wiki/List_of_the_largest_artificial_non-nuclear_explosions', 

       u'/wiki/Royal_Navy', 

       u'/wiki/Tonne', 

       u'/wiki/Ammunition', 

       u'/wiki/Heligoland', 

       u'/wiki/1949', 

       u'/wiki/Republic_of_Ireland', 

       u'/wiki/Commonwealth_of_Nations', 

       u'/wiki/1996', 

       u'/wiki/1996_shelling_of_Qana', 

       u'/wiki/Qana', 

       u'/wiki/Operation_Grapes_of_Wrath', 

       u'/wiki/United_Nations_Interim_Force_in_Lebanon', 

       u'/wiki/April_17', 

       u'/wiki/April_18', 

       u'/wiki/April_19', 

       u'/wiki/Wikipedia:Selected_anniversaries/April', 

       u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l', 

       u'/wiki/List_of_historical_anniversaries', 

       u'/wiki/Coordinated_Universal_Time', 

       u'//en.wikipedia.org/w/index.php?title=Main_Page&action=purge'], 
    'title': [u'Inocybe saliceticola', 

       u'Nordic countries', 

       u'mushrooms', 

       u'caps', 

       u'stems', 

       u'spores', 

       u'habit', 

       u'mycorrhizal', 

       u'willow', 

       u'beech', 

       u'alder', 

       u'detritus', 

       u'section', 

       u'holotype', 

       u'Nurmes', 

       u'Thistle, Utah', 

       u'Be Here Now', 

       u'Sumatran rhinoceros', 

       u'Archive' 

       u'List of historical anniversaries', 

       u'UTC', 

       u'Reload this page']} 
+1

Wikipedia provides an API: http://www.mediawiki.org/wiki/API:Main_page – dm03514

+0

If you still want to go with Scrapy, please edit your post and describe the format in which you want the scraped results. – alecxe

Answers

2

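I can't access the same page you did, but the results you get are probably so irregular because Wikipedia text contains so many links. When you call site.select('.//p/text()'), you select only the text nodes directly under the <p> node. That means the content inside a child node such as <a href=..>text</a> is not scraped. The link tags split the paragraph into pieces, so you end up with a strange list of fragments. What you want is to retrieve all the nodes.

You should get everything inside the <p> tags (including the <a> tags) with: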

contents = site.select('.//p/node()').extract() 
item['details'] = ''.join(contents) 

You can do it that way. If you only want the text without the link tags, you could use strip_html(item['details']) (in fact, contents = site.select('.//p//text()').extract() may work better and is more XPath-oriented).
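
To see why the fragments appear, here is a minimal sketch that does not need Scrapy at all. It uses the standard library's xml.etree.ElementTree on a made-up fragment shaped like the featured-article paragraph. The `html` string and the helper `direct_text_nodes` are my own illustrative assumptions, not part of your page or of Scrapy's API. The helper imitates what './/p/text()' returns (only the text nodes that are immediate children of <p>), while `itertext()` plays the role of './/p//text()' (text at any depth).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a paragraph whose words are partly wrapped in links.
html = (
    '<td><p><a href="/wiki/Inocybe_saliceticola">Inocybe saliceticola</a>'
    ' is a fungal species found in moist habitats in the '
    '<a href="/wiki/Nordic_countries">Nordic countries</a>.</p></td>'
)

root = ET.fromstring(html)
p = root.find('.//p')


def direct_text_nodes(elem):
    """Return the text nodes that are immediate children of elem.

    This imitates the XPath step elem/text(): text inside child
    elements (such as <a>) is skipped.
    """
    parts = []
    if elem.text:                 # text before the first child element
        parts.append(elem.text)
    for child in elem:
        if child.tail:            # text that follows each child element
            parts.append(child.tail)
    return parts


# Like './/p/text()': the link text is missing, leaving loose fragments.
direct = direct_text_nodes(p)

# Like ''.join of './/p//text()': text from every depth, in document order.
everything = ''.join(p.itertext())

print(direct)
print(everything)
```

The first print shows two disconnected fragments with the linked words missing, which is the same pattern as your 'details' list. The second print shows the whole sentence, which is what joining the result of './/p//text()' should give you in the spider.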

+0

Thanks, it works. However, the Wikipedia content I wanted to scrape has changed, so I did not get the exact result. – Apple