
Scraping URLs from the references section of a Wikipedia page

I am trying to create a program that scrapes the URLs from the references section of a Wikipedia page, but I am having trouble isolating the tag/class in question.

## Import required packages ## 
from urllib.request import urlopen 
from urllib.error import HTTPError 
from bs4 import BeautifulSoup 
import re 

selectWikiPage = input(print("Please enter the Wikipedia page you wish to scrape from")) 
isWikiFound = re.findall(selectWikiPage, 'wikipedia') 
if "wikipedia" in selectWikiPage: 
    print("Input accepted") 
    html = urlopen(selectWikiPage) 
    bsObj = BeautifulSoup(html, "lxml") 
    findReferences = bsObj.findAll("#References") 
    for wikiReferences in findReferences: 
     print(wikiReferences.get_text()) 

else: 
    print("Error: Please enter a valid Wikipedia URL") 

This is the output of the program:

Please enter the Wikipedia page you wish to scrape from 
Nonehttp://wikipedia.org/wiki/randomness 
Input accepted 

Your findAll isn't returning anything. One way is to select the references section first and then search within that section, for example `bsObj.find("ol", {"class": "references"}).findAll('a')` – Lexxxxx
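A minimal sketch of the approach from that comment, using the same test article as the answer below. The original findAll("#References") matches nothing because findAll expects a tag name (or attribute filters), not a CSS selector, so the citation list has to be located by its class instead:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://en.wikipedia.org/wiki/Randomness")  # sample article used in the answer below
bsObj = BeautifulSoup(html, "lxml")

# bsObj.findAll("#References") returns [] because "#References" is read as a tag name.
# The citation list carries class="references", so select it and collect its links:
references = bsObj.find("ol", {"class": "references"})
if references is not None:
    for a in references.findAll("a", href=True):
        print(a["href"])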

Answer


I used this link as a test case: https://en.wikipedia.org/wiki/Randomness

I changed your code slightly to use the requests library. If you want to retrieve only the source links of the text used in the wiki page:

import requests 
from bs4 import BeautifulSoup 

session = requests.Session()  
selectWikiPage = input(print("Please enter the Wikipedia page you wish to scrape from")) 
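# note: input(print(...)) prints the prompt and then passes None (print's return value)
# to input(), which is why the output below shows "None" in front of the typed URL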

if "wikipedia" in selectWikiPage: 
    html = session.post(selectWikiPage) 
    bsObj = BeautifulSoup(html.text, "html.parser") 
    findReferences = bsObj.findAll('span', {'class': 'reference-text'}) 
    href = BeautifulSoup(str(findReferences), "html.parser") 
    links = [a["href"] for a in href.find_all("a", href=True)] 
    for link in links: 
        print("Link: " + link) 
else: 
    print("Error: Please enter a valid Wikipedia URL") 

Output:

Please enter the Wikipedia page you wish to scrape from 
Nonehttps://en.wikipedia.org/wiki/Randomness 
Link: /wiki/Oxford_English_Dictionary 
Link: http://www.people.fas.harvard.edu/~junliu/Workshops/workshop2007/ 
Link: /wiki/International_Standard_Book_Number_(identifier) 
Link: /wiki/Special:BookSources/0-19-512332-8 
Link: /wiki/International_Standard_Book_Number_(identifier) 
Link: /wiki/Special:BookSources/0-674-01517-7 
Link: /wiki/International_Standard_Book_Number_(identifier) 
Link: /wiki/Special:BookSources/0-387-98844-0 
Link: http://www.nature.com/nature/journal/v446/n7138/abs/nature05677.html 
Link: /w/index.php?title=Bell%27s_aspect_experiment&action=edit&redlink=1 
Link: /wiki/Nature_(journal) 
Link: /wiki/John_Gribbin 
Link: https://www.academia.edu/11720588/No_entailing_laws_but_enablement_in_the_evolution_of_the_biosphere 
Link: /wiki/International_Standard_Book_Number 
Link: /wiki/Special:BookSources/9781450311786 
Link: /wiki/Digital_object_identifier 
Link: //doi.org/10.1145%2F2330784.2330946 
Link: https://www.academia.edu/11720575/Extended_criticality_phase_spaces_and_enablement_in_biology 
Link: /wiki/Digital_object_identifier 
Link: //doi.org/10.1016%2Fj.chaos.2013.03.008 
Link: /wiki/PubMed_Identifier 
Link: //www.ncbi.nlm.nih.gov/pubmed/7059501 
Link: /wiki/Digital_object_identifier 
Link: //doi.org/10.1111%2Fj.1365-2133.1982.tb00897.x 
Link: http://webpages.uncc.edu/yonwang/papers/thesis.pdf 
Link: http://www.lbl.gov/Science-Articles/Archive/pi-random.html 
Link: http://www.ciphersbyritter.com/RES/RANDTEST.HTM 
Link: http://dx.doi.org/10.1038/nature09008 
Link: https://www.nytimes.com/2008/06/08/books/review/Johnson-G-t.html?_r=1 

If you want to retrieve all of the URL links within the references section:

import requests 
from bs4 import BeautifulSoup 

session = requests.Session() 
selectWikiPage = input(print("Please enter the Wikipedia page you wish to scrape from")) 

if "wikipedia" in selectWikiPage: 
    html = session.post(selectWikiPage) 
    bsObj = BeautifulSoup(html.text, "html.parser") 
    findReferences = bsObj.find('ol', {'class': 'references'}) 
    href = BeautifulSoup(str(findReferences), "html.parser") 
    links = [a["href"] for a in href.find_all("a", href=True)] 
    for link in links: 
        print("Link: " + link) 
else: 
    print("Error: Please enter a valid Wikipedia URL") 

Output:

Please enter the Wikipedia page you wish to scrape from 
Nonehttps://en.wikipedia.org/wiki/Randomness 
Link: #cite_ref-1 
Link: /wiki/Oxford_English_Dictionary 
Link: #cite_ref-2 
Link: http://www.people.fas.harvard.edu/~junliu/Workshops/workshop2007/ 
Link: #cite_ref-3 
Link: /wiki/International_Standard_Book_Number_(identifier) 
Link: /wiki/Special:BookSources/0-19-512332-8 
Link: #cite_ref-4 
Link: /wiki/International_Standard_Book_Number_(identifier) 
Link: /wiki/Special:BookSources/0-674-01517-7 
Link: #cite_ref-5 
Link: /wiki/International_Standard_Book_Number_(identifier) 
Link: /wiki/Special:BookSources/0-387-98844-0 
Link: #cite_ref-6 
Link: http://www.nature.com/nature/journal/v446/n7138/abs/nature05677.html 
Link: /w/index.php?title=Bell%27s_aspect_experiment&action=edit&redlink=1 
Link: /wiki/Nature_(journal) 
Link: #cite_ref-7 
Link: /wiki/John_Gribbin 
Link: #cite_ref-8 
Link: https://www.academia.edu/11720588/No_entailing_laws_but_enablement_in_the_evolution_of_the_biosphere 
Link: /wiki/International_Standard_Book_Number 
Link: /wiki/Special:BookSources/9781450311786 
Link: /wiki/Digital_object_identifier 
Link: //doi.org/10.1145%2F2330784.2330946 
Link: #cite_ref-9 
Link: https://www.academia.edu/11720575/Extended_criticality_phase_spaces_and_enablement_in_biology 
Link: /wiki/Digital_object_identifier 
Link: //doi.org/10.1016%2Fj.chaos.2013.03.008 
Link: #cite_ref-10 
Link: /wiki/PubMed_Identifier 
Link: //www.ncbi.nlm.nih.gov/pubmed/7059501 
Link: /wiki/Digital_object_identifier 
Link: //doi.org/10.1111%2Fj.1365-2133.1982.tb00897.x 
Link: #cite_ref-11 
Link: http://webpages.uncc.edu/yonwang/papers/thesis.pdf 
Link: #cite_ref-12 
Link: http://www.lbl.gov/Science-Articles/Archive/pi-random.html 
Link: #cite_ref-13 
Link: #cite_ref-14 
Link: #cite_ref-15 
Link: http://www.ciphersbyritter.com/RES/RANDTEST.HTM 
Link: #cite_ref-16 
Link: http://dx.doi.org/10.1038/nature09008 
Link: #cite_ref-NYOdds_17-0 
Link: #cite_ref-NYOdds_17-1 
Link: https://www.nytimes.com/2008/06/08/books/review/Johnson-G-t.html?_r=1 
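
The "#cite_ref-..." entries in the output above are the back-reference anchors inside each item of the ol.references list; the span.reference-text approach in the first snippet skips them, which is why they only appear here. If they are not wanted with this second approach either, a minimal sketch of one way to drop them (the startswith check is an assumption based on the anchor names visible in this output, not a MediaWiki guarantee):

links = [a["href"] for a in href.find_all("a", href=True)
         if not a["href"].startswith("#cite_ref")]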

If the OP just wants all of the hyperlinks, `links = [a["href"] for a in soup.find_all("a", href=True)]` is enough; there is no need to dig into the `span` elements. – Tony


Hi Tony, thanks for the reply. I have edited my answer to fit the scope of the OP's question. – Ali
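
For comparison, a minimal sketch of the two variants discussed in these comments, assuming bsObj is the parsed page from the answer above; the page-wide one-liner collects every hyperlink in the article (navigation and sidebar links included), while scoping to the references list keeps only citation links:

# Page-wide: every <a href=...> in the parsed document.
all_links = [a["href"] for a in bsObj.find_all("a", href=True)]

# Scoped: only links inside the references list, as in the answer above.
ref_list = bsObj.find("ol", {"class": "references"})
ref_links = [a["href"] for a in ref_list.find_all("a", href=True)] if ref_list else []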
