Extracting text from an HTML file using bs4

I want to extract the text from my HTML file. When I hard-code a specific file, as below:

import bs4, sys 
from urllib import urlopen 
#filin = open(sys.argv[1], 'r') 
filin = '/home/iykeln/Desktop/R_work/file1.html' 
webpage = urlopen(filin).read().decode('utf-8') 
soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8')

This works. But when I try an arbitrary file, opening it with open(sys.argv[1], 'r') as below:

import bs4, sys 
from urllib import urlopen 
filin = open(sys.argv[1], 'r') 
#filin = '/home/iykeln/Desktop/R_work/file1.html' 
webpage = urlopen(filin).read().decode('utf-8') 
soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8')

or

import bs4, sys 
from urllib import urlopen 
with open(sys.argv[1], 'r') as filin: 
    webpage = urlopen(filin).read().decode('utf-8') 
    soup = bs4.BeautifulSoup(webpage) 
    for node in soup.findAll('html'): 
     print u''.join(node.findAll(text=True)).encode('utf-8')

I get the error below:

Traceback (most recent call last): 
    File "/home/iykeln/Desktop/py/clean.py", line 5, in <module> 
    webpage = urlopen(filin).read().decode('utf-8') 
    File "/usr/lib/python2.7/urllib.py", line 87, in urlopen 
    return opener.open(url) 
    File "/usr/lib/python2.7/urllib.py", line 180, in open 
    fullurl = unwrap(toBytes(fullurl)) 
    File "/usr/lib/python2.7/urllib.py", line 1057, in unwrap 
    url = url.strip() 
AttributeError: 'file' object has no attribute 'strip'

Source

2013-08-04 Iykeln

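Don't call open; pass the file name directly to urlopen. urlopen expects a URL or path string, not a file object, which is why the traceback ends in url.strip() failing: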

import bs4, sys 
from urllib import urlopen 

webpage = urlopen(sys.argv[1]).read().decode('utf-8') 
soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8')
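
To illustrate why the original call fails: the traceback shows urllib calling url.strip(), which is a string method that a file object does not have. The following is only a minimal sketch in Python 3 (the thread itself uses Python 2), and the temporary sample file is a hypothetical stand-in for sys.argv[1]:

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as tmp:
    tmp.write("<html><body>hello</body></html>")
    path = tmp.name

with open(path, "r") as filin:
    # A file object has no string methods, so url.strip() cannot work on it.
    file_has_strip = hasattr(filin, "strip")

# The path itself is a plain string, which is what urlopen expects.
path_has_strip = hasattr(path, "strip")

os.remove(path)
print(file_has_strip, path_has_strip)  # prints: False True
```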

As a side note, you don't need urllib at all to open a local file:

import bs4, sys 

with open(sys.argv[1], 'r') as f: 
    webpage = f.read().decode('utf-8') 

soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8')
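
For readers on Python 3, the same idea might look roughly like the sketch below. It assumes bs4 is installed; the helper names extract_text and extract_text_from_file are hypothetical and not part of the original answer. It also uses get_text() with the built-in "html.parser" instead of joining findAll(text=True), which should give equivalent output for simple pages:

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return all text nodes of an HTML string, joined together."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text()


def extract_text_from_file(path):
    """Read a local HTML file as UTF-8 and return its text (hypothetical helper)."""
    # In Python 3, open() decodes for us, so no .decode('utf-8') is needed.
    with open(path, "r", encoding="utf-8") as f:
        return extract_text(f.read())
```

A script could then call print(extract_text_from_file(sys.argv[1])) after importing sys, mirroring the command-line usage in the question.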

Hope that helps.

Source

2013-08-04 12:01:38 alecxe

Yes! That's right. Thanks, alecxe. – Iykeln

Yes! That helped, @alecxe. Thanks. – Iykeln
