Parsing XML with namespaces using Python's ElementTree

102

I have the following XML, which I want to parse using Python's ElementTree:

<rdf:RDF xml:base="http://dbpedia.org/ontology/" 
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
    xmlns:owl="http://www.w3.org/2002/07/owl#" 
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" 
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" 
    xmlns="http://dbpedia.org/ontology/"> 

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague"> 
     <rdfs:label xml:lang="en">basketball league</rdfs:label> 
     <rdfs:comment xml:lang="en"> 
      a group of sports teams that compete against each other 
      in Basketball 
     </rdfs:comment> 
    </owl:Class> 

</rdf:RDF>

I want to find all owl:Class tags and then extract the value of every rdfs:label instance inside them. I am using the following code:

import xml.etree.ElementTree as ET

tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')

Because of the namespaces, I get the following error:

SyntaxError: prefix 'owl' not found in prefix map

I tried reading the documentation at http://effbot.org/zone/element-namespaces.htm, but I still can't get this to work because the XML above declares multiple nested namespaces.

Please let me know how to change the code to find all the owl:Class tags.

2013-02-13 Sudar

153

ElementTree is not too smart about namespaces. You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:

namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed 

root.findall('owl:Class', namespaces)

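Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl#}Class instead. You can of course use the same syntax yourself: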

root.findall('{http://www.w3.org/2002/07/owl#}Class')
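
To get from the matched owl:Class elements to the rdfs:label values the question asks for, the same dictionary can be reused for the inner search. The following is a minimal sketch, assuming a trimmed copy of the question's XML inlined as a string; the names XML, NAMESPACES, labels and qualified are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question, inlined so the sketch is self-contained.
XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

# Prefix-to-URI map used only by the queries below.
NAMESPACES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

root = ET.fromstring(XML)

# Collect the text of every rdfs:label directly inside every owl:Class.
labels = []
for owl_class in root.findall("owl:Class", NAMESPACES):
    for label in owl_class.findall("rdfs:label", NAMESPACES):
        labels.append(label.text)

# The fully qualified form selects the same elements without a prefix map.
qualified = root.findall("{http://www.w3.org/2002/07/owl#}Class")

print(labels)
```

Both spellings select the same elements; the dictionary form is usually easier to read when several namespaces are involved.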

If you can switch to the lxml library, things are better; that library supports the same ElementTree API, but collects namespaces for you in an .nsmap attribute on elements.

2013-02-13 12:18:22

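Thank you. Especially for the second part, where you can give the namespace directly. – Sudar

Thank you. Any idea how I can get the namespaces directly from the XML, without hard-coding them? Or how can I ignore them? I've tried findall('{*}Class') but it doesn't work in my case. – Kostanos

You'd have to scan the tree for xmlns attributes yourself; as stated in the answer, lxml does this for you, the xml.etree.ElementTree module does not. But if you are trying to match a specific (already hard-coded) element, then you are also trying to match a specific element in a specific namespace. That namespace is not going to change between documents any more than the element name is. You may as well hard-code it along with the element name. –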
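
On the question of ignoring namespaces: newer versions of xml.etree.ElementTree (Python 3.8 and later) accept star wildcards in the namespace position, which was not the case when this thread was written. A minimal sketch, assuming Python 3.8+ and a hypothetical three-element document:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one namespaced Class, one plain Class, one other owl element.
doc = ET.fromstring(
    '<root xmlns:owl="http://www.w3.org/2002/07/owl#">'
    '<owl:Class/><Class/><owl:Thing/>'
    '</root>'
)

# "{*}Class" matches Class in any namespace, or in none (Python 3.8+).
any_ns = doc.findall("{*}Class")

# "{uri}*" matches every element in the given namespace (Python 3.8+).
in_owl = doc.findall("{http://www.w3.org/2002/07/owl#}*")

print([el.tag for el in any_ns])
print([el.tag for el in in_owl])
```

On older interpreters these patterns are treated as literal tag names and match nothing, which may explain the behaviour described in the comment above.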

Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):

from lxml import etree 
tree = etree.parse("filename") 
root = tree.getroot() 
root.findall('owl:Class', root.nsmap)

2014-11-07 18:22:52

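Works well. –

The full namespace URL *is* the namespace identifier you're supposed to hard-code. The local prefix (owl) can change from file to file. Therefore doing what this answer suggests is a really bad idea. –

@MattiVirkkunen Exactly: if the owl definition can change from file to file, shouldn't we use the definition given in each file instead of hard-coding it? –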
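
The disagreement above comes down to which part is stable. A small sketch, assuming two hypothetical documents (doc_a and doc_b) that bind the same OWL URI to different prefixes, shows that a query keyed on the URI matches both, while a prefix map copied from one document does not carry the prefix the query uses:

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"

# Two hypothetical documents binding the same URI to different prefixes.
doc_a = ET.fromstring(f'<r xmlns:owl="{OWL}"><owl:Class/></r>')
doc_b = ET.fromstring(f'<r xmlns:o="{OWL}"><o:Class/></r>')

# The prefix in this map is local to the query; only the URI has to match.
query_ns = {"owl": OWL}

count_a = len(doc_a.findall("owl:Class", query_ns))
count_b = len(doc_b.findall("owl:Class", query_ns))

# A map mirroring doc_b's own declarations has no "owl" prefix,
# so the same query string fails with the error from the question.
try:
    doc_b.findall("owl:Class", {"o": OWL})
    failed = False
except SyntaxError:
    failed = True

print(count_a, count_b, failed)
```

In other words, the prefix written in the query and the prefix written in the file are independent; the URI is what ties them together.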

Note: this answer is useful for Python's ElementTree standard library when you don't want to hard-code namespaces.

To extract the namespace prefixes and URIs from XML data, you can use the ElementTree.iterparse function, parsing only the namespace start events (start-ns):

>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...     xmlns:owl="http://www.w3.org/2002/07/owl#"
...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
...     xmlns="http://dbpedia.org/ontology/">
...
...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
...         <rdfs:comment xml:lang="en">
...           a group of sports teams that compete against each other
...           in Basketball
...         </rdfs:comment>
...     </owl:Class>
...
... </rdf:RDF>'''
>>> my_namespaces = dict([
...     node for _, node in ElementTree.iterparse(
...         StringIO(my_schema), events=['start-ns']
...     )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#'}

Then the dictionary can be passed as an argument to the search functions:

root.findall('owl:Class', my_namespaces)
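
The recipe above reads the document once for the namespaces, and a root element still has to be obtained separately. One possible variation, sketched here under the assumption that a single pass is preferred, asks iterparse for both "start" and "start-ns" events; parse_with_namespaces and sample are hypothetical names, not part of the standard library.

```python
import xml.etree.ElementTree as ET
from io import StringIO


def parse_with_namespaces(source):
    """Return (root, prefix-to-URI dict) for an XML file name or file-like object."""
    namespaces = {}
    root = None
    for event, value in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = value          # "start-ns" yields a (prefix, uri) tuple
            namespaces[prefix] = uri     # a later redeclaration overwrites an earlier one
        elif root is None:
            root = value                 # the first "start" event is the root element
    return root, namespaces


# Hypothetical sample, reduced from the XML in the question.
sample = StringIO(
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:owl="http://www.w3.org/2002/07/owl#">'
    '<owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague"/>'
    '</rdf:RDF>'
)

root, ns = parse_with_namespaces(sample)
classes = root.findall("owl:Class", ns)
print(ns)
print(len(classes))
```

Because the dictionary is flat, a prefix that is bound to different URIs in different parts of the document keeps only the last binding seen, so this approach fits documents that declare their namespaces once.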

2016-05-24 09:09:56

This is useful for those who don't have access to lxml and don't want to hard-code namespaces. – delrocco

I got the error 'ValueError: write to closed file' for this line: my_namespaces = dict([ET.iterparse(StringIO(my_schema), events=['start-ns']). Any idea what is going wrong? – Yuli

The error is probably related to the class io.StringIO, which refuses ASCII strings. I tested my recipe with Python 3. After adding the unicode string prefix 'u' to the sample string, it also works with Python 2 (2.7). –

I know I'm a few years late, but I just created a package that handles converting dictionaries to valid XML with namespaces. The package is hosted on PyPI at https://pypi.python.org/pypi/xmler.

Using this package you can take a dictionary that looks like this:

myDict = {
    "RootTag": {                     # The root tag. Will not necessarily be root. (see #customRoot)
        "@ns": "soapenv",            # The namespace for the RootTag. The RootTag will appear as <soapenv:RootTag ...>
        "@attrs": {                  # @attrs takes a dictionary. Each key-value pair will become an attribute
            "xmlns:soapenv": "http://schemas.xmlsoap.org/soap/envelope/"
        },
        "childTag": {
            "@attrs": {
                "someAttribute": "colors are nice"
            },
            "grandchild": "This is a text tag"
        }
    }
}

and get XML output that looks like this:

<soapenv:RootTag xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
    <childTag someAttribute="colors are nice">
        <grandchild>This is a text tag</grandchild>
    </childTag>
</soapenv:RootTag>

Hope this is useful to people in the future.

2016-08-09 21:38:02 watzon

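I've been using code similar to this and have found that it's always worth reading the documentation... as usual!

findall() only finds elements that are direct children of the current tag. So, not really ALL.

If you're dealing with big and complex XML files, where sub-sub-elements (and so on) need to be included as well, it may be worth getting your code working with the following instead. If you already know where the elements sit in your XML, then I suppose findall() will be fine! I just thought this was worth remembering.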

root.iter()

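Ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements "Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element's text content. Element.get() accesses the element's attributes."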
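
The difference between direct children and all descendants can be seen in a short sketch. It assumes a hypothetical document with one owl:Class at the top level and one nested a level deeper; note that iter() takes the fully qualified tag and has no namespaces argument.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
NS = {"owl": OWL}

# Hypothetical document: one owl:Class directly under the root, one nested deeper.
root = ET.fromstring(
    '<root xmlns:owl="http://www.w3.org/2002/07/owl#">'
    '<owl:Class id="top"/>'
    '<group><owl:Class id="nested"/></group>'
    '</root>'
)

# Direct children only.
direct = root.findall("owl:Class", NS)

# Any depth, using the ".//" descendant axis with the same prefix map.
anywhere = root.findall(".//owl:Class", NS)

# Any depth, using iter() with the {uri}tag form.
via_iter = list(root.iter("{" + OWL + "}Class"))

print([el.get("id") for el in direct])
print([el.get("id") for el in anywhere])
print([el.get("id") for el in via_iter])
```

The ".//" form keeps the prefix dictionary, which can make it the more convenient choice when the rest of the code already uses one.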

2016-08-16 09:51:36 MJM
