
I'm trying to print the place names (e.g. BANKHEAD) and their URL links from the following page in Python:

https://www.saa.gov.uk/search.php?SEARCHED=1&ST=&SEARCH_TERM=aberdeen%2C+Aberdeen+City&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&x=0&y=0&PAGE=0&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=aberdeen&DRILL_SEARCH_TERM=ABERDEEN%2C+Aberdeen+City&DD_UNITARY_AUTHORITY=Aberdeen+City&DD_TOWN=ABERDEEN#results

I've written the code below, but when I run it I can't extract the links: it produces blank output for a while and then the cmd prompt just returns. I've also tried the other print statement, the one I've commented out with a #. Any ideas? Thanks in advance.

import requests 
from bs4 import BeautifulSoup 
import csv 

# connection header 
header={'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 
'Accept-Encoding':'gzip, deflate, sdch, br', 
'Accept-Language':'en-US,en;q=0.8,fr;q=0.6', 
'Connection':'keep-alive', 
'Cookie':'mdtp=y4Ts2Vvql5V9MMZNjqB9T+7S/vkQKPqjHHMIq5jk0J1l5l131dU0YXsq7Rr15GDyghKHrS/qcD2vdsMCVtzKByJEDZFI+roS6tN9FN5IS70q8PkCCBjgFPDZjlR1A3H9FJ/zCWXMNJbaXqF8MgqE+nhR3/lji+eK4mm/GP9b8oxlVdupo9KN9SKanxu/JFEyNXutjyN+BsxRztNem1Z+ExSQCojyxflI/tc70+bXAu3/ppdP7fIXixfEOAWezmOh3ywchn9DV7Af8wH45t8u4+Y=; mdtpdi=mdtpdi#f523cd04-e09e-48bc-9977-73f974d50cea#1484041095424_zXDAuNhEkKdpRUsfXt+/1g==; seen_cookie_message=yes; _ga=GA1.4.666959744.1484041122; _gat=1', 
'Host':'https://www.saa.gov.uk/', 
'Referer':'https://www.saa.gov.uk/', 
'Upgrade-Insecure-Requests':'1', 
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.51 Safari/537.36' 
} 

session = requests.session() 

user_agent = {'User-Agent': 'Mozilla/5.0'} 
url = 'https://www.saa.gov.uk/search.php?SEARCHED=1&ST=&SEARCH_TERM=aberdeen%2C+Aberdeen+City&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&x=0&y=0&PAGE=0&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=aberdeen&DRILL_SEARCH_TERM=ABERDEEN%2C+Aberdeen+City&DD_UNITARY_AUTHORITY=Aberdeen+City&DD_TOWN=ABERDEEN#results' 



response = session.get(url, headers=header) 


soup = BeautifulSoup(response.content,"lxml") 

for link in soup.findAll('a'): 
    # print(link.get('href')) 
    print(link.find('a')['href']) 

Why are you calling link.find('a') again? Haven't you already found all the anchor elements? –
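Following up on that comment: each link in the loop is already an a tag, so calling link.find('a') on it searches for a nested anchor, typically finds none, and returns None, at which point None['href'] raises a TypeError. A minimal corrected loop, using the same soup object as above:

for link in soup.findAll('a'):
    href = link.get('href')  # read the href off the anchor itself
    if href:  # some anchors carry no href attribute at all
        print(href)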

Answer


This web page has no hard restrictions, so a plain requests.get without the session and the extra headers is enough. And since the place names and links you are looking for sit in td tags with no extra attributes, a CSS selector is better suited for parsing them. So in this case:

import requests 
from bs4 import BeautifulSoup as soup 

url = 'https://www.saa.gov.uk/search.php?SEARCHED=1&ST=&SEARCH_TERM=aberdeen%2C+Aberdeen+City&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&x=0&y=0&PAGE=0&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=aberdeen&DRILL_SEARCH_TERM=ABERDEEN%2C+Aberdeen+City&DD_UNITARY_AUTHORITY=Aberdeen+City&DD_TOWN=ABERDEEN#results' 
baseurl = 'https://www.saa.gov.uk' 

# a plain GET is enough here; no session or custom headers needed
response = requests.get(url) 
html = soup(response.text, 'lxml') 

# every place-name link sits inside a plain td cell
for link in html.select('td a'): 
    print(link.text, baseurl + link['href']) 

And the result:

... 
... 
BOYD ORR CLOSE https://www.saa.gov.uk/search.php?SEARCHED=1&ST=&SEARCH_TERM=aberdeen%2C+ABERDEEN%2C+Aberdeen+City&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&x=0&y=0&PAGE=0&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=aberdeen&DRILL_SEARCH_TERM=BOYD+ORR+CLOSE%2C+ABERDEEN%2C+Aberdeen+City&DD_UNITARY_AUTHORITY=Aberdeen+City&DD_TOWN=ABERDEEN&DD_STREET=BOYD+ORR+CLOSE#results 
DON TERRACE https://www.saa.gov.uk/search.php?SEARCHED=1&ST=&SEARCH_TERM=aberdeen%2C+ABERDEEN%2C+Aberdeen+City&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&x=0&y=0&PAGE=0&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=aberdeen&DRILL_SEARCH_TERM=DON+TERRACE%2C+ABERDEEN%2C+Aberdeen+City&DD_UNITARY_AUTHORITY=Aberdeen+City&DD_TOWN=ABERDEEN&DD_STREET=DON+TERRACE#results 
BYRON CRESCENT https://www.saa.gov.uk/search.php?SEARCHED=1&ST=&SEARCH_TERM=aberdeen%2C+ABERDEEN%2C+Aberdeen+City&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&x=0&y=0&PAGE=0&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=aberdeen&DRILL_SEARCH_TERM=BYRON+CRESCENT%2C+ABERDEEN%2C+Aberdeen+City&DD_UNITARY_AUTHORITY=Aberdeen+City&DD_TOWN=ABERDEEN&DD_STREET=BYRON+CRESCENT#results 
MARINE TERRACE https://www.saa.gov.uk/search.php?SEARCHED=1&ST=&SEARCH_TERM=aberdeen%2C+ABERDEEN%2C+Aberdeen+City&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&x=0&y=0&PAGE=0&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=aberdeen&DRILL_SEARCH_TERM=MARINE+TERRACE%2C+ABERDEEN%2C+Aberdeen+City&DD_UNITARY_AUTHORITY=Aberdeen+City&DD_TOWN=ABERDEEN&DD_STREET=MARINE+TERRACE#results 
... 
... 
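A side note on building the absolute URLs: concatenating baseurl + link['href'] works as long as every href is a root-relative path, but urllib.parse.urljoin handles relative and already-absolute hrefs alike. A minimal sketch, assuming Python 3 and the html and baseurl objects from above:

from urllib.parse import urljoin

for link in html.select('td a'):
    # urljoin copes with relative paths, missing slashes and absolute hrefs
    print(link.text, urljoin(baseurl, link['href']))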

EDIT: if you only need the links from the first table, first locate that table with first_table = html.find('table'), then search for the links within it:

html = soup(response.text, 'lxml') 

# grab the first table on the page, then select links only inside it
first_table = html.find('table') 

for link in first_table.select('td a'): 
    print(link.text, baseurl + link['href']) 
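The question imports csv but never uses it; if the end goal is to save the scraped names and links, here is a minimal sketch, assuming Python 3, the html and baseurl objects from above, and a hypothetical output file links.csv:

import csv

with open('links.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['place', 'url'])  # header row
    for link in html.select('td a'):
        writer.writerow([link.text, baseurl + link['href']])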