Multithreading through a list of URLs using WorkerPool

I'm using multiple threads to go through a txt file of URLs and scrape the contents found at each one. This works for about 20 URLs (not consistently the same number), but then it consistently gets stuck on the last URL in the file. It doesn't seem to be working through them in order.

I'm not sure why it's getting stuck, or where to start looking.

from bs4 import BeautifulSoup, SoupStrainer 
import urllib3 
import urllib2 
import io 
import os 
import re 
import workerpool 
from urllib2 import Request, urlopen, URLError 

NUM_SOCKETS = 3 
NUM_WORKERS = 5 

urlfile = open("dailynewsurls.txt",'r') # read one line at a time until end of file 
http = urllib3.PoolManager(maxsize=NUM_SOCKETS) 
workers = workerpool.WorkerPool(size=NUM_WORKERS) 

class MyJob(workerpool.Job):
    def __init__(self, url):
        self.url = url

    def run(self):
        r = http.request('GET', self.url)
        req = urllib2.Request(url)
        try:
            page = urllib2.urlopen(req)
        except:
            print "had to skip one"
            return
        pagecontent = page.read() # get a file-like object at this url

        # this tells it to soup the page that is at the url above
        soup = BeautifulSoup(pagecontent)

        # this tells it to find the string in the first instance of each of the tags in the parenthesis
        title = soup.find_all('title')
        article = soup.find_all('article')

        try:
            title = str(title[0].get_text().encode('utf-8'))
        except:
            print "had to skip one"
            return
        try:
            article = str(article[0].get_text().encode('utf-8'))
        except:
            print "had to skip one"
            return

        try:
            # make the file using the things above
            output_files_pathname = 'DailyNews/' # path where output will go
            new_filename = title + ".txt"

            # write each of the things defined into the text file
            outfile = open(output_files_pathname + new_filename, 'w')
            outfile.write(title)
            outfile.write("\n")
            outfile.write(article)
            outfile.close()
            print "%r added as a text file" % title
            return

        except:
            print "had to skip one"
            return

        return

for url in urlfile: 
    workers.put(MyJob(url)) 

workers.shutdown() 
workers.wait() 

print "All done." 

Here is a sample of the list of URLs:

http://www.nydailynews.com/entertainment/tv-movies/x-factor-season-2-episode-2-recap-oops-britney-spears-article-1.1159546 
http://www.nydailynews.com/new-york/brooklyn/lois-mclohon-resurfaced-iconic-daily-news-coney-island-cheesecake-photo-brings-back-memories-50-year-long-romance-article-1.1160457 
http://www.nydailynews.com/new-york/uptown/espaillat-linares-rivals-bitter-history-battle-state-senate-seat-article-1.1157994 
http://www.nydailynews.com/sports/baseball/mlb-power-rankings-yankees-split-orioles-tumble-rankings-nationals-shut-stephen-strasburg-hang-top-spot-article-1.1155953 
http://www.nydailynews.com/news/national/salon-sell-internet-online-communities-article-1.1150614 
http://www.nydailynews.com/sports/more-sports/jiyai-shin-wins-women-british-open-dominating-fashion-record-nine-shot-victory-article-1.1160894 
http://www.nydailynews.com/entertainment/music-arts/justin-bieber-offered-hockey-contract-bakersfield-condors-minor-league-team-article-1.1157991 
http://www.nydailynews.com/sports/baseball/yankees/umpire-blown-call-9th-inning-dooms-yankees-5-4-loss-baltimore-orioles-camden-yards-article-1.1155141 
http://www.nydailynews.com/entertainment/gossip/kellie-pickler-shaving-head-support-best-friend-cancer-fight-hair-article-1.1160938 
http://www.nydailynews.com/new-york/secret-103-000-settlement-staffers-accused-assemblyman-vito-lopez-sexual-harassment-included-penalty-20k-involved-talked-details-article-1.1157849 
http://www.nydailynews.com/entertainment/tv-movies/ricki-lake-fun-adds-substance-new-syndicated-daytime-show-article-1.1153301 
http://www.nydailynews.com/sports/college/matt-barkley-loyalty-usc-trojans-contention-bcs-national-championship-article-1.1152969 
http://www.nydailynews.com/sports/daily-news-sports-photos-day-farewell-andy-roddick-world-1-u-s-open-champ-retires-loss-juan-martin-del-potro-article-1.1152827 
http://www.nydailynews.com/entertainment/gossip/britney-spears-made-move-relationship-fiance-jason-trawick-reveals-article-1.1152722 
http://www.nydailynews.com/new-york/brooklyn/brooklyn-lupus-center-tayumika-zurita-leads-local-battle-disease-difficult-adversary-article-1.1153494 
http://www.nydailynews.com/life-style/fashion/kate-middleton-prabal-gurung-dress-sells-hour-myhabit-site-sold-1-995-dress-599-article-1.1161583 
http://www.nydailynews.com/news/politics/obama-romney-campaigns-vie-advantage-president-maintains-lead-article-1.1161540 
http://www.nydailynews.com/life-style/free-cheap-new-york-city-tuesday-sept-11-article-1.1155950 
http://www.nydailynews.com/news/world/dozens-storm-embassy-compound-tunis-article-1.1159663 
http://www.nydailynews.com/opinion/send-egypt-message-article-1.1157828 
http://www.nydailynews.com/sports/more-sports/witnesses-feel-sheryl-crow-lance-amstrong-activities-article-1.1152899 
http://www.nydailynews.com/sports/baseball/yankees/hiroki-kuroda-replacing-cc-sabathia-yankees-ace-pitcher-real-possibility-playoffs-looming-article-1.1161812 
http://www.nydailynews.com/life-style/eats/finland-hosts-pop-down-restaurant-belly-earth-262-feet-underground-article-1.1151523 
http://www.nydailynews.com/sports/more-sports/mighty-quinn-sept-23-article-1.1165584 
http://www.nydailynews.com/sports/more-sports/jerry-king-lawler-stable-condition-suffering-heart-attack-wwe-raw-broadcast-monday-night-article-1.1156915 
http://www.nydailynews.com/news/politics/ambassador-chris-stevens-breathing-libyans-found-american-consulate-rescue-article-1.1161454 
http://www.nydailynews.com/news/crime/swiss-banker-bradley-birkenfeld-104-million-reward-irs-blowing-whistle-thousands-tax-dodgers-article-1.1156736 
http://www.nydailynews.com/sports/hockey/nhl-board-governors-votes-favor-lockout-league-players-association-fail-reach-agreement-cba-article-1.1159131 
http://www.nydailynews.com/news/national/iphone-5-works-t-network-article-1.1165543 
http://www.nydailynews.com/sports/baseball/yankees/yankees-broadcasters-michael-kay-ken-singleton-opportunity-important-statement-article-1.1165479 
http://www.nydailynews.com/news/national/boss-year-michigan-car-dealer-retires-employees-1-000-year-service-article-1.1156763 
http://www.nydailynews.com/entertainment/tv-movies/hero-denzel-washington-clint-eastwood-article-1.1165538 
http://www.nydailynews.com/sports/football/giants/ny-giants-secondary-roasted-tony-romo-dallas-cowboys-offense-article-1.1153055 
http://www.nydailynews.com/news/national/hide-and-seek-tragedy-3-year-old-suffocates-hiding-bean-bag-article-1.1160138 

Can you tell me where urllib3 comes from / where to import it? I'm trying to make a request to a URL using your approach, but I keep getting undefined errors for http and request! Thanks for your help. – user1788736
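
A note on the comment above, with a minimal sketch that is not from the original post: urllib3 is a third-party package installed separately (pip install urllib3), and the http object in the question is a module-level PoolManager. The URL below is just an example.

import urllib3

http = urllib3.PoolManager(maxsize=3)   # the question builds its `http` object the same way
resp = http.request('GET', 'http://www.nydailynews.com/')
print resp.status                       # HTTP status code
print len(resp.data)                    # resp.data holds the response body as a byte string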

Answers


I would try using the threading module; here is what I think is working:

from bs4 import BeautifulSoup, SoupStrainer 
import threading 
import urllib2 

def fetch_url(url):
    urlHandler = urllib2.urlopen(url)
    html = urlHandler.read()
    # this tells it to soup the page that is at the url above
    soup = BeautifulSoup(html)

    # this tells it to find the string in the first instance of each of the tags in the parenthesis
    title = soup.find_all('title')
    article = soup.find_all('article')

    try:
        title = str(title[0].get_text().encode('utf-8'))
    except:
        print "had to skip one bad title\n"
        return
    try:
        article = str(article[0].get_text().encode('utf-8'))
    except:
        print "had to skip one bad article"
        return

    try:
        # make the file using the things above
        output_files_pathname = 'DailyNews/' # path where output will go
        new_filename = title + ".txt"

        # write each of the things defined into the text file
        outfile = open(output_files_pathname + new_filename, 'w')
        outfile.write(title)
        outfile.write("\n")
        outfile.write(article)
        outfile.close()
        print "%r added as a text file" % title
        return

    except:
        print "had to skip one cant write file"
        return

    return

with open("dailynewsurls.txt", 'r') as urlfile:
    # read one line at a time until end of file
    threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urlfile]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
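
As a side note (a sketch added here, not part of the original answer): starting one thread per URL is fine for a few dozen links, but for a larger list you can cap how many run at once with a thread-backed pool from multiprocessing.dummy, reusing the same fetch_url function.

from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool, same API as multiprocessing.Pool

with open("dailynewsurls.txt", 'r') as urlfile:
    urls = [line.strip() for line in urlfile if line.strip()]  # drop trailing newlines and blank lines

pool = ThreadPool(5)        # at most 5 fetches in flight at a time
pool.map(fetch_url, urls)   # calls the fetch_url defined above for each URL
pool.close()
pool.join()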

Liam!! Thank you! –
