2016-07-06 1 views
2

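Python requests returning different HTML than the page source

I'm trying to extract fanfiction from an Archive of Our Own URL so that I can run some linguistic analyses on it with the NLTK library. However, every attempt at scraping the HTML from the URL returns everything but the fanfic (and the comments form, which I don't need).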

First I tried the built-in urllib library (together with BeautifulSoup):

from urllib import request
from bs4 import BeautifulSoup  
html = request.urlopen("http://archiveofourown.org/works/6846694").read() 
soup = BeautifulSoup(html,"html.parser") 
soup.prettify() 

Then I found out about the Requests library, and that the User-Agent could be part of the problem, so I tried this, with the same results:

import requests 
headers = { 
     'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36', 
     'Content-Type': 'text/html', 
} 
requests.get("http://archiveofourown.org/works/6846694",headers=headers,timeout=5).text 

Then I found out about Selenium and PhantomJS, so I installed those and tried this, but again got the same result:

from selenium import webdriver 
from bs4 import BeautifulSoup 
browser = webdriver.PhantomJS() 
browser.get("http://archiveofourown.org/works/6846694") 
soup = BeautifulSoup(browser.page_source, "html.parser") 
soup.prettify() 

Am I doing something wrong in any of these attempts, or is this an issue with the server?

+0

Are the results consistent? –

Answers

2

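If you need the complete page source, with all JavaScript executed and all asynchronous requests finished, your last approach is a step in the right direction. You were just missing one thing: you need to give PhantomJS time to load the page before reading the source (pun intended).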

from bs4 import BeautifulSoup 

from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 


driver = webdriver.PhantomJS() 
driver.get("http://archiveofourown.org/works/6846694") 

wait = WebDriverWait(driver, 10) 

# click proceed 
proceed = wait.until(EC.presence_of_element_located((By.LINK_TEXT, "Proceed"))) 
proceed.click() 

# wait for the content to be present 
wait.until(EC.presence_of_element_located((By.ID, "workskin"))) 

soup = BeautifulSoup(driver.page_source, "html.parser") 
soup.prettify() 
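
To make the waiting step less opaque, here is a minimal sketch of the polling idea behind an explicit wait. It does not use Selenium at all: `wait_until` and `content_is_present` are hypothetical names invented for this illustration, and the real `WebDriverWait` differs in details (for example, it raises Selenium's own timeout exception).

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if the condition is still falsy after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Stand-in for a page whose content only shows up on the third check.
attempts = {"count": 0}


def content_is_present():
    attempts["count"] += 1
    return "workskin" if attempts["count"] >= 3 else None


found = wait_until(content_is_present, timeout=5.0, poll_interval=0.01)
print(found, attempts["count"])
```

Reading `page_source` immediately after `get()` is like calling the condition once and giving up, which is why the content can appear to be missing.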
+0

Thanks for this. I'm very new to web scraping, so these are concerns I hadn't thought of (since I was already logged in). –

1

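alecxe explained why your code did not give you what you wanted. Normally you also need to click "Proceed" to agree to view adult content, but if all you want is the text, it is available in the source once you add the param view_adult=true: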

import requests 
from bs4 import BeautifulSoup 
url = "http://archiveofourown.org/works/6846694?view_adult=true" 


r= requests.get(url) 
soup = BeautifulSoup(r.content, "lxml") 
chap = soup.select_one("#chapter-1") 
preface = soup.select_one("div.preface.group") 


print(preface) 
print(chap) 

Which will give you:

<div class="preface group"> 
<h2 class="title heading"> 
     The Complete Works of Emmanuel Allen 
    </h2> 
<h3 class="byline heading"> 
<a href="http://archiveofourown.org/users/violue/pseuds/violue" rel="author">violue</a> 
</h3> 
<div class="summary module" role="complementary"> 
<h3 class="heading">Summary:</h3> 
<blockquote class="userstuff"> 
<p>Dean Winchester, reluctant business owner, reluctant home owner, and reluctant cat owner, is striking up a very promising friendship with the author of his favorite book series.</p><p>And he has no idea.</p> 
</blockquote> 
</div> 
<div class="notes module" role="complementary"> 
<h3 class="heading">Notes:</h3> 
<blockquote class="userstuff"> 
<p>Oh yeah, I've got notes.</p><p> 
<s>1.) This is complete, though later chapters are still being beta'd. I'll be posting a chapter at a time, whenever the hell I feel like it. Probably every day/every other day because it's hard to just SIT ON ALL THESE CHAPTERS I HAVE WHEN THEY'RE READY TO POST!!!</s> 
</p><p>2.) This is of the mostly aimless domestic fluff variety, in that there's no big overarching storyline. But that's pretty common with my stories. ¯\_(ツ)_/¯ </p><p>3.) There's a bit of <i>me</i> in this story. I am a depressed and surly cat owner living in the Pacific Northwest, and so is Dean, but most of this is just my imagination.</p><p>4.) Thanks to <a href="http://archiveofourown.org/users/Tennyo/works">TENNYO</a>, <a href="http://chiwalker.tumblr.com/">CHIWALKER</a>, <a href="http://buckysbuckhole.tumblr.com/">CASFUCKER</a>, and <a href="http://kelisab.tumblr.com">KELISAB</a> for beta'ing! If you find mistakes in the story, it's all their fault, and you should throw soggy tomatoes at them.</p><p>5.) No, I think that's it. Start reading.</p> 
</blockquote> 
</div> 
</div> 
<div class="chapter" id="chapter-1"> 
<!-- chapter management --> 
<div class="chapter preface group" role="complementary"> 
<h3 class="title"> 
<a href="/works/6846694/chapters/15628576">Chapter 1</a>: Prologue 
    </h3> 
<!-- only display byline if different from the main byline --> 
</div> 
<!--main content--> 
<div class="userstuff module" role="article"> 
<h3 class="landmark heading" id="work">Chapter Text</h3> 
<p>“Wow, that’s beautiful!”</p><p>Dean doesn’t even have to look up from his book to know what this customer is talking about. Winchester General Store has a lot of things; food, beer, toiletries, camping gear, used books and more, but the only thing that could be considered “beautiful” in this store is the hand-carved, ornate wooden house sitting in a display case mounted on the wall behind Dean. Actually, “house” isn’t the right word. It started as a house in Dean’s mind, but by the time he was done carving, sanding, polishing, and in some places hot gluing the white oak structure, it had become a mausoleum. A beautiful, <em>inviting </em>mausoleum, but a mausoleum nonetheless. Dean had even borrowed some acrylic paints from Charlie to color the climbing ivy painstakingly carved onto the sides.</p><p>“Thanks, man,” Dean says, setting his book down. Might as well let the guy know this was <em>his </em>hard work.</p><p>The man’s eyes widen. “You <em>made </em>this?”</p><p>“Sure did. Worked on it for two months.” Dean nods toward the twelve pack of Mountain Dew the customer is holding. “You all set?”</p><p>The man puts the case on the counter by the register, and Dean rings it up. “How much?”</p><p>“Eight ninety-nine for the Dew.”</p><p>The man shakes his head. “No, I mean the sculpture. My wife and I just bought a place up in Cougar Falls, and that would look <em>great </em>in the front room.”</p><p>Dean blinks, surprised. He’s gotten a lot of compliments on the mausoleum in the past ten or so months, but no one’s ever assumed it was for sale before.</p><p>“Sorry, man, not for sale.”</p><p>“Come on. Name your price.” Dean gets all sorts of customers here. Locals, people out in the area for camping, people up here to go rafting down Filbert River, and of course, people just passing through on their way to some place bigger and better. This guy falls into the last category.</p><p>“No can do, that thing’s got something important inside. 
Can’t part with it.”</p><p>“Important? Like what?”</p><p>Dean shrugs. “My parents.”</p><p>“W… what?” the man stammers.</p><p>“Yeah. There’s an urn inside. Kinda had to glue the top of the building on to get the urn in there, but you can’t really tell unless you’re real close and looking at just the right angle.”</p><p>“<em>Both </em>of your parents?”</p><p>“Well, my mom died ages ago, and my dad kept her ashes the rest of his life.” Dean turns to look at his carving fondly. “And when my dad died, we had him cremated too. One night I got real drunk, I was still kind of in mourning, and I decided my parents should be together. So I dumped my dad’s ashes into my mom’s urn, and then I gave the urn a good shake,” Dean says, shaking an imaginary urn. “My brother was <em>pissed </em>when I told him, but he’s over it now. Anyway, I made this here structure to keep them in. Sort of an apology gift.”</p><p>The bell over the front door jingles, and Dean turns back to see the customer has taken off. “Don’t you want your Mountain Dew?” he yells, even though the guy’s already outside.</p><p>Jeez. What a wimp. Dean reaches into the display case, patting the top of the mausoleum gently. “What a baby. Am I right, guys?”</p><p>The urn full of Winchester ashes stays silent of course. Dean snickers, picks his book up off the counter, and gets back to reading.</p><p><br/> 
<br/> 
</p><p> </p> 
</div> 
<!--/main--> 
</div> 
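
Since the goal is to feed the story to NLTK, the markup above still has to be reduced to plain text. With BeautifulSoup, calling `get_text()` on the selected chapter element is probably the shortest route. The sketch below shows the same idea with only the standard library, so it runs without extra packages. `ParagraphText` is a hypothetical helper written for this illustration, and `sample` is an abbreviated stand-in for the real chapter HTML.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p> element, ignoring inline tags such as <em>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:  # skip empty paragraphs such as <p> </p>
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


# Abbreviated stand-in for the chapter markup (an assumption for the demo).
sample = (
    '<div class="userstuff module" role="article">'
    '<h3 class="landmark heading" id="work">Chapter Text</h3>'
    '<p>Sure did. Worked on it for two months.</p>'
    '<p>A beautiful, <em>inviting </em>mausoleum, but a mausoleum nonetheless.</p>'
    '<p> </p>'
    '</div>'
)

parser = ParagraphText()
parser.feed(sample)
parser.close()
paragraphs = parser.paragraphs
print("\n".join(paragraphs))
```

The resulting list of paragraph strings can then be passed to whatever tokenizer you use.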

Hopefully that is everything you need.
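
If you want to scrape more than one work, it may be cleaner to add the view_adult parameter programmatically than to paste it into each URL. A minimal sketch using only the standard library follows. `with_query_param` is a hypothetical helper name, and whether the site keeps honouring this parameter is an assumption. With requests you could pass `params={"view_adult": "true"}` to `requests.get` instead.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_query_param(url, key, value):
    """Return `url` with the query parameter `key` set to `value`."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))  # keep any existing parameters
    query[key] = value
    return urlunparse(parts._replace(query=urlencode(query)))


work_url = with_query_param(
    "http://archiveofourown.org/works/6846694", "view_adult", "true"
)
print(work_url)
```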

+1

And congrats on reaching 100k! That's a big deal. – alecxe

+0

@alecxe, cheers :) –

+1

This is great too. I'll mark it as the answer if I can. Thank you very much! –
