2014-02-17 2 views

Please help me parse a document from the Internet. How do I parse a remote document?

import pprint 
import xml.dom.minidom 
from xml.dom.minidom import Node 

import requests 

addr = requests.get('http://fh79272k.bget.ru/py_test/books.xml') 
print(addr.status_code) 

doc = xml.dom.minidom.parse(str(addr))   # load doc into object 
                # usually parsed up front 


mapping = {} 
for node in doc.getElementsByTagName("book"):  # traverse DOM object 
    isbn = node.getAttribute("isbn")    # via DOM object API 
    L = node.getElementsByTagName("title") 
    for node2 in L: 
     title = "" 
     for node3 in node2.childNodes: 
      if node3.nodeType == Node.TEXT_NODE: 
       title += node3.data 
     mapping[isbn] = title 

# mapping now has the same value as in the SAX example 
pprint.pprint(mapping) 

This script does not work. The error message is as follows:

Traceback (most recent call last):
  File "C:\VINT\OPENSERVER\OpenServer\domains\localhost\python\parse_html\1\dombook.py", line 14, in <module>
    doc = xml.dom.minidom.parse(str(addr)) # load doc into object
  File "C:\Python33\lib\xml\dom\minidom.py", line 1960, in parse
    return expatbuilder.parse(file)
  File "C:\Python33\lib\xml\dom\expatbuilder.py", line 908, in parse
    fp = open(file, 'rb')
OSError: [Errno 22] Invalid argument: ''

XML:

<catalog> 
<book isbn="0-596-00128-2"> 
<title>Python &amp; XML</title> 
<date>December 2001</date> 
<author>Jones, Drake</author> 
</book> 
<book isbn="0-596-15810-6"> 
<title>Programming Python, 4th Edition</title> 
<date>October 2010</date> 
<author>Lutz</author> 
</book> 
<book isbn="0-596-15806-8"> 
<title>Learning Python, 4th Edition</title> 
<date>September 2009</date> 
<author>Lutz</author> 
</book> 
<book isbn="0-596-15808-4"> 
<title>Python Pocket Reference, 4th Edition</title> 
<date>October 2009</date> 
<author>Lutz</author> 
</book> 
<book isbn="0-596-00797-3"> 
<title>Python Cookbook, 2nd Edition</title> 
<date>March 2005</date> 
<author>Martelli, Ravenscroft, Ascher</author> 
</book> 
<book isbn="0-596-10046-9"> 
<title>Python in a Nutshell, 2nd Edition</title> 
<date>July 2006</date> 
<author>Martelli</author> 
</book> 
<!-- 
plus many more Python books that should appear here 
--> 
</catalog> 

Answers


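You are building the XML from the response object, not from the text of the body. Use addr.text instead of str(addr). Note that xml.dom.minidom.parse() expects a file name or a file object, so a string that already holds the document has to go through parseString():

doc = xml.dom.minidom.parseString(addr.text) 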
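
As a minimal sketch of this fix, the following parses XML held in a string and builds the same ISBN-to-title mapping as the loop in the question. It uses `xml.dom.minidom.parseString`, since `parse` treats a string argument as a file name. The helper name `titles_by_isbn` and the inline `SAMPLE` document are illustrative assumptions, not part of the original script.

```python
import xml.dom.minidom
from xml.dom.minidom import Node

# Inline stand-in for the downloaded body (assumption: same shape as books.xml).
SAMPLE = """<catalog>
<book isbn="0-596-00128-2"><title>Python &amp; XML</title></book>
<book isbn="0-596-15810-6"><title>Programming Python, 4th Edition</title></book>
</catalog>"""


def titles_by_isbn(xml_text):
    """Hypothetical helper: map each book's isbn attribute to its title text."""
    # parseString takes the document itself, not a path to it.
    doc = xml.dom.minidom.parseString(xml_text)
    mapping = {}
    for book in doc.getElementsByTagName("book"):
        isbn = book.getAttribute("isbn")
        for title in book.getElementsByTagName("title"):
            # Concatenate the text nodes directly under <title>.
            mapping[isbn] = "".join(
                child.data
                for child in title.childNodes
                if child.nodeType == Node.TEXT_NODE
            )
    return mapping


mapping = titles_by_isbn(SAMPLE)
```

With the script from the question, the call would presumably be `titles_by_isbn(addr.text)` after checking `addr.status_code`.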

Also, using an XML parser to process HTML is a pain. Try Beautiful Soup.
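
To illustrate why a strict XML parser is awkward for HTML, the sketch below feeds the same sloppy markup to `minidom` and to a lenient HTML parser. Beautiful Soup is a third-party package, so this example swaps in the standard library's `html.parser.HTMLParser` to stay dependency-free; it is not Beautiful Soup's API. The `HeadingCollector` class and the `SLOPPY_HTML` string are made up for illustration.

```python
import xml.dom.minidom
from html.parser import HTMLParser
from xml.parsers.expat import ExpatError

# Typical real-world HTML: a <br> without a closing tag and a <p> that is never closed.
SLOPPY_HTML = "<html><body><h1>Books</h1><p>First line<br>second line</body></html>"


class HeadingCollector(HTMLParser):
    """Hypothetical helper: collect the text inside every <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings[-1] += data


# The XML parser rejects the markup because the tags are not balanced.
try:
    xml.dom.minidom.parseString(SLOPPY_HTML)
    xml_ok = True
except ExpatError:
    xml_ok = False

# The HTML parser tolerates the same input and still reports the heading.
collector = HeadingCollector()
collector.feed(SLOPPY_HTML)
collector.close()
headings = collector.headings
```

Beautiful Soup offers a higher-level tree on top of this kind of lenient parsing, which is why it is the usual suggestion for HTML pages.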
