Web scraping using Selenium

My goal is to get the name, location, post time, review title, and full review text from a web page (http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061).

My code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
firefox_capabilities['binary'] = '/etc/firefox'

driver = webdriver.Firefox(capabilities=firefox_capabilities)
driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
soup = BeautifulSoup(driver.page_source, "lxml")
for link in soup.select(".profile"):
    try:
        profile = link.select("p:nth-of-type(1) a")[0]
        profile1 = link.select("p:nth-of-type(2)")[0]
    except:
        pass
    print(profile.text, profile1.text)
driver = webdriver.Firefox(capabilities=firefox_capabilities)
driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
soup1 = BeautifulSoup(driver.page_source, "lxml")
for link in soup1.select(".col-10.review"):
    try:
        profile2 = link.select("small:nth-of-type(1)")[0]
        profile3 = link.select("span:nth-of-type(3)")[0]
        profile4 = link.select("a:nth-of-type(1)")[0]
    except:
        pass
    print(profile2.text, profile3.text, profile4.text)
driver = webdriver.Firefox(capabilities=firefox_capabilities)
driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
soup2 = BeautifulSoup(driver.page_source, "lxml")
for link in soup2.select(".more.review"):
    try:
        containers = page_soup.findAll("div", {"class": "more reviewdata"})
        count = len(containers)
        for index in range(count):
            count1 = len(containers[index].p)
            for i in range(count1):
                profile5 = link.select("p:nth-of-type(i)")[0]
    except:
        pass
    print(profile5.text)
driver.quit()

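I get the name, location, time, and review title as output, but I cannot get the user's full review. I would be grateful if someone could help me get this output while also optimizing my code, i.e. so that the web page is loaded only once and all the required data is extracted from it. It would also be very helpful if someone could help me extract all customer reviews of Jio from every page of the website.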

Source

2017-11-18 Monisha

You can get the same result with just a few lines of code and less pain. Here I have defined only the three main fields: name, review_title, and review_data. The remaining fields can be added very easily with small tweaks.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061")
wait = WebDriverWait(driver, 10)

# Wait until the review blocks are present, then handle them one by one.
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".review-article"))):
    # Click the link in the review body to expand the full review.
    link = item.find_element(By.CSS_SELECTOR, ".reviewdata a")
    link.click()
    time.sleep(2)  # crude pause while the expanded text loads

    name = item.find_element(By.CSS_SELECTOR, "p a").text
    review_title = item.find_element(By.CSS_SELECTOR, "strong a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews]").text
    # Collapse whitespace inside each review block, then join the blocks.
    review_data = ' '.join(' '.join(block.text.split()) for block in item.find_elements(By.CSS_SELECTOR, ".reviewdata"))
    print("Name: {}\nReview_Title: {}\nReview_Data: {}\n".format(name, review_title, review_data))

driver.quit()
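
The `review_data` expression above does two things at once: it collapses runs of whitespace inside each review block, then joins the blocks with single spaces. As a minimal standalone sketch of just that step (the helper name `normalize_review` and the sample strings are made up for illustration and do not come from the site):

```python
def normalize_review(chunks):
    """Collapse whitespace inside each chunk, then join the chunks with single spaces."""
    return ' '.join(' '.join(chunk.split()) for chunk in chunks)


# Hypothetical strings standing in for the .text of each ".reviewdata" element.
sample = ["Good   network\n\n coverage.", "\tBut slow   at night. "]
print(normalize_review(sample))  # Good network coverage. But slow at night.
```

Keeping this in a small function makes it easy to reuse for any other multi-line field you decide to scrape.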

Alternatively, the same scraping can be done with a combination of the two (Selenium + BS4):

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061")
wait = WebDriverWait(driver, 10)

# First pass (Selenium): expand every review by clicking the link in its body.
for article in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".review-article"))):
    link = article.find_element(By.CSS_SELECTOR, ".reviewdata a")
    link.click()
    time.sleep(2)  # crude pause while the expanded text loads

# Second pass (BS4): parse the page source once and read every field from it.
soup = BeautifulSoup(driver.page_source, "lxml")
for item in soup.select(".review-article"):
    name = item.select("p a")[0].text
    review_title = item.select("strong a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews]")[0].text
    # Collapse whitespace inside each review block, then join the blocks.
    review_data = ' '.join(' '.join(block.text.split()) for block in item.select(".reviewdata"))
    print("Name: {}\nReview_Title: {}\nReview_Data: {}\n".format(name, review_title, review_data))

driver.quit()

You can go with either of the two approaches.
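
One thing to watch in both versions: indexing a selector result with `[0]` (or calling `find_element`) raises an error when a review block happens to lack that field. A small guard can return a default instead. This is only a sketch: `first_text` is a hypothetical helper, and `FakeTag` is a stand-in used here so the example runs without a browser or a parsed page.

```python
def first_text(matches, default=""):
    """Return the stripped text of the first match, or `default` if nothing matched."""
    return matches[0].text.strip() if matches else default


class FakeTag:
    """Stand-in for a parsed element; only the .text attribute is used here."""

    def __init__(self, text):
        self.text = text


print(first_text([FakeTag("  Alice ")]))  # first match found
print(first_text([], default="N/A"))      # no match, falls back to the default
```

With BS4 this could be called as `first_text(item.select("p a"))`. With Selenium, the plural `find_elements` returns a list (empty when nothing matches), so the same helper should work on its result as well.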

Source

2017-11-19 10:56:23 SIM
