Problem scraping a specific website

This is my first question on Stack Overflow, so please bear with me. I have a problem scraping a specific website: http://www.normattiva.it/

I am using the code below (and similar permutations):

import requests

user_agent = {'User-agent': 'Mozilla/5.0', 'Connection': 'keep-alive'}

url = 'http://www.normattiva.it/atto/caricaArticolo?art.progressivo=0&art.idArticolo=1&art.versione=1&art.codiceRedazionale=047U0001&art.dataPubblicazioneGazzetta=1947-12-27&atto.tipoProvvedimento=COSTITUZIONE&art.idGruppo=1&art.idSottoArticolo1=10&art.idSottoArticolo=1&art.flagTipoArticolo=0#art'

# Request the article URL directly, without visiting any other page first
session = requests.session()
response = session.get(url, headers=user_agent)
# print(response.text)
print(response.url)
print(response.headers)
print(response.request.headers)

I am trying to automatically download (i.e. scrape) the text of some Italian laws from the website. As you can see, I am trying to load the "caricaArticolo" query.

However, the output is a page saying that my search is invalid ("session is not valid or expired").

It seems the page recognizes that I am not using a browser and loads a "breakout" JavaScript function:

<body onload="javascript:breakout();">
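One way to tell from a script whether a response is this "session expired" page rather than the law text is to look for the breakout handler on the body tag. The following is a minimal sketch using only the standard library; the helper name `is_breakout_page` and the two sample HTML strings are illustrative assumptions, not something taken from the site.

```python
from html.parser import HTMLParser


class _BodyOnloadParser(HTMLParser):
    """Records the onload attribute of the first <body> tag, if any."""

    def __init__(self):
        super().__init__()
        self.onload = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        if tag == "body" and self.onload is None:
            self.onload = dict(attrs).get("onload") or ""


def is_breakout_page(html):
    """Return True if <body> calls breakout() on load (hypothetical helper)."""
    parser = _BodyOnloadParser()
    parser.feed(html)
    return parser.onload is not None and "breakout()" in parser.onload


# Made-up sample pages for illustration
blocked = '<html><body onload="javascript:breakout();"><p>expired</p></body></html>'
normal = '<html><body><pre>Art. 1</pre></body></html>'
print(is_breakout_page(blocked))  # True
print(is_breakout_page(normal))   # False
```

A check like this could be run on `response.text` before trying to parse the page any further.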

I have tried "browser" simulator Python tools such as selenium and robobrowser, but the result is the same.

Is there anyone willing to spend 10 minutes looking at the page output and help?


2016-09-15 terzim

@terzin To access the page, you first need to be an authorized user. There is no valid session.

I tried the same code and I am getting the desired output.

Try "beautifulsoup", a powerful library that can do what you need.

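If you click one of the links on the page with dev tools open, under the Doc tab in Network, you can see three links. The first is the one we click on, the second returns the HTML that lets you jump to a specific article, and the last contains the article text.

In the source returned from the first link, you can see two iframe tags: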

<div id="alberoTesto"> 
     <iframe 
      src="/atto/caricaAlberoArticoli?atto.dataPubblicazioneGazzetta=2016-08-31&atto.codiceRedazionale=16G00182&atto.tipoProvvedimento=DECRETO LEGISLATIVO" 
      name="leftFrame" scrolling="auto" id="leftFrame" title="leftFrame" height="100%" style="width: 285px; float:left;" frameborder="0"> 
     </iframe> 

     <iframe 
      src="/atto/caricaArticoloDefault?atto.dataPubblicazioneGazzetta=2016-08-31&atto.codiceRedazionale=16G00182&atto.tipoProvvedimento=DECRETO LEGISLATIVO" 
      name="mainFrame" id="mainFrame" title="mainFrame" height="100%" style="width: 800px; float:left;" scrolling="auto" frameborder="0"> 
     </iframe>

The first is for navigating the articles; the second, with /caricaArticoloDefault and the id mainFrame, is what we want.

You need the cookies from the initial request, so you can do it using a Session object and parsing the pages with bs4:

import io
import os

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin  # Python 2; on Python 3 use: from urllib.parse import urljoin

user_agent = {'User-agent': 'Mozilla/5.0', 'Connection': 'keep-alive'}

# The article URL from the question, kept for reference; it is not requested directly below
url = 'http://www.normattiva.it/atto/caricaArticolo?art.progressivo=0&art.idArticolo=1&art.versione=1&art.codiceRedazionale=047U0001&art.dataPubblicazioneGazzetta=1947-12-27&atto.tipoProvvedimento=COSTITUZIONE&art.idGruppo=1&art.idSottoArticolo1=10&art.idSottoArticolo=1&art.flagTipoArticolo=0#art'

with requests.session() as s:
    s.headers.update(user_agent)
    # Visit the home page first so the session picks up the site's cookies
    r = s.get("http://www.normattiva.it/")
    soup = BeautifulSoup(r.content, "lxml")
    # Get all the links from the initial page
    for a in soup.select("div.testo p a[href^=http]"):
        soup = BeautifulSoup(s.get(a["href"]).content, "lxml")
        # The link to the text is in an iframe tag returned from the previous get
        text_src_link = soup.select_one("#mainFrame")["src"]

        # Pick something to make the names unique
        with io.open(os.path.basename(text_src_link), "w", encoding="utf-8") as f:
            # The text is in the pre tag inside the div with the wrapper_pre class
            page = s.get(urljoin("http://www.normattiva.it", text_src_link))
            text = BeautifulSoup(page.content, "html.parser").select_one("div.wrapper_pre pre").text
            f.write(text)

A snippet of the first text file:

   IL PRESIDENTE DELLA REPUBBLICA 
    Visti gli articoli 76, 87 e 117, secondo comma, lettera d), della 
Costituzione; 
    Vistala legge 28 novembre 2005, n. 246 e, in particolare, 
l'articolo 14: 
    comma 14, cosi' come sostituito dall'articolo 4, comma 1, lettera 
a), della legge 18 giugno 2009, n. 69, con il quale e' stata 
conferita al Governo la delega ad adottare, con le modalita' di cui 
all'articolo 20 della legge 15 marzo 1997, n. 59, decreti legislativi 
che individuano le disposizioni legislative statali, pubblicate 
anteriormente al 1° gennaio 1970, anche se modificate con 
provvedimenti successivi, delle quali si ritiene indispensabile la 
permanenza in vigore, secondo i principi e criteri direttivi fissati 
nello stesso comma 14, dalla lettera a) alla lettera h); 
    comma 15, con cui si stabilisce che i decreti legislativi di cui 
al citato comma 14, provvedono, altresi', alla semplificazione o al 
riassetto della materia che ne e' oggetto, nel rispetto dei principi 
e criteri direttivi di cui all'articolo 20 della legge 15 marzo 1997, 
n. 59, anche al fine di armonizzare le disposizioni mantenute in 
vigore con quelle pubblicate successivamente alla data del 1° gennaio 
1970; 
    comma 22, con cui si stabiliscono i termini per l'acquisizione del 
prescritto parere da parte della Commissione parlamentare per la 
semplificazione; 
    Visto il decreto legislativo 30 luglio 1999, n. 300, recante 
riforma dell'organizzazione del Governo, a norma dell'articolo 11 
della legge 15 marzo 1997, n. 59 e, in particolare, gli articoli da 
20 a 22;
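The iframe lookup step can be tried offline against the HTML fragment shown earlier in this answer. The following is a minimal sketch that uses the standard library's HTMLParser instead of bs4 (a swapped-in technique, so that it runs without extra packages); the class name `IframeSrcParser` is a hypothetical helper, and the fragment is abbreviated to a single query parameter per URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every iframe, keyed by its id."""

    def __init__(self):
        super().__init__()
        self.sources = {}

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            attributes = dict(attrs)
            if "id" in attributes and "src" in attributes:
                self.sources[attributes["id"]] = attributes["src"]


# Abbreviated copy of the fragment from the answer (assumption: one parameter kept)
fragment = '''
<div id="alberoTesto">
  <iframe src="/atto/caricaAlberoArticoli?atto.codiceRedazionale=16G00182" id="leftFrame"></iframe>
  <iframe src="/atto/caricaArticoloDefault?atto.codiceRedazionale=16G00182" id="mainFrame"></iframe>
</div>
'''

parser = IframeSrcParser()
parser.feed(fragment)
main_src = parser.sources["mainFrame"]
# The src is relative, so join it with the site root before requesting it
full_url = urljoin("http://www.normattiva.it", main_src)
print(full_url)
```

The resulting URL would still have to be requested through the same session that visited the home page, since a fresh request without the session's cookies is what produced the "session expired" page in the question.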


2016-09-15 13:13:19

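Thank you!!!! See below! – terzim

Great, great, great, Padraic. It works. I had to modify the imports a little, but it works beautifully. Thank you very much. I am only just discovering the potential of Python, and you have made my journey much easier with this specific task. I would not have solved it on my own.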

import io
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

user_agent = {'User-agent': 'Mozilla/5.0', 'Connection': 'keep-alive'}

# The article URL from the question, kept for reference; it is not requested directly below
url = 'http://www.normattiva.it/atto/caricaArticolo?art.progressivo=0&art.idArticolo=1&art.versione=1&art.codiceRedazionale=047U0001&art.dataPubblicazioneGazzetta=1947-12-27&atto.tipoProvvedimento=COSTITUZIONE&art.idGruppo=1&art.idSottoArticolo1=10&art.idSottoArticolo=1&art.flagTipoArticolo=0#art'

with requests.session() as s:
    s.headers.update(user_agent)
    # Visit the home page first so the session picks up the site's cookies
    r = s.get("http://www.normattiva.it/")
    soup = BeautifulSoup(r.content, "lxml")
    # Get all the links from the initial page
    for a in soup.select("div.testo p a[href^=http]"):
        soup = BeautifulSoup(s.get(a["href"]).content, "lxml")
        # The link to the text is in an iframe tag returned from the previous get
        text_src_link = soup.select_one("#mainFrame")["src"]

        # Pick something to make the names unique
        with io.open(os.path.basename(text_src_link), "w", encoding="utf-8") as f:
            # The text is in the pre tag inside the div with the wrapper_pre class
            page = s.get(urljoin("http://www.normattiva.it", text_src_link))
            text = BeautifulSoup(page.content, "html.parser").select_one("div.wrapper_pre pre").text
            f.write(text)
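One caveat with the code above: os.path.basename(text_src_link) keeps the whole query string in the file name, including the ? and & characters, and ? is not allowed in Windows file names. The following is a minimal sketch of one way to build a safer name from the query parameter values; the helper name `filename_from_src` and the naming scheme are assumptions for illustration, not something the answer or the site prescribes.

```python
import re
from urllib.parse import parse_qsl, urlsplit


def filename_from_src(src):
    """Build a file name from an iframe src (hypothetical naming scheme)."""
    parts = urlsplit(src)
    # Keep only the parameter values, in the order they appear in the query
    values = [value for _, value in parse_qsl(parts.query)]
    # Fall back to the last path segment when there is no query string
    stem = "_".join(values) or parts.path.rsplit("/", 1)[-1]
    # Replace anything that is not a letter, digit, dot, dash or underscore
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", stem)
    return safe + ".txt"


# The mainFrame src from the HTML fragment in the answer
src = ("/atto/caricaArticoloDefault?atto.dataPubblicazioneGazzetta=2016-08-31"
       "&atto.codiceRedazionale=16G00182&atto.tipoProvvedimento=DECRETO LEGISLATIVO")
print(filename_from_src(src))  # 2016-08-31_16G00182_DECRETO_LEGISLATIVO.txt
```

In the loop, this would replace the os.path.basename(...) call passed to io.open; whether the three values are enough to keep names unique across all laws is untested here.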


2016-09-15 19:39:14 terzim
