Problem deserializing XML to objects: unwanted split at special characters

When I try to deserialize XML into objects, I run into encoding problems with various items in the XML tree.

XML example:
<?xml version="1.0" encoding="utf-8"?>
<results>
<FlightTravel>
<QuantityOfPassengers>6</QuantityOfPassengers>
<Id>N5GWXM</Id>
<InsuranceId>330992</InsuranceId>
<TotalTime>3h 00m</TotalTime>
<TransactionPrice>540.00</TransactionPrice>
<AdditionalPrice>0</AdditionalPrice>
<InsurancePrice>226.56</InsurancePrice>
<TotalPrice>9561.31</TotalPrice>
<CompanyName>XXXXX</CompanyName>
<TaxID>111-11-11-111</TaxID>
<InvoiceStreet>Jagiellońska</InvoiceStreet>
<InvoiceHouseNo>8</InvoiceHouseNo>
<InvoiceZipCode>Jagiellońska</InvoiceZipCode>
<InvoiceCityName>Warszawa</InvoiceCityName>
<PayerStreet>Jagiellońska</PayerStreet>
<PayerHouseNo>8</PayerHouseNo>
<PayerZipCode>11-111</PayerZipCode>
<PayerCityName>Warszawa</PayerCityName>
<PayerEmail>[email protected]</PayerEmail>
<PayerPhone>123123123</PayerPhone>
<Segments>
<Segment0>
<DepartureAirport>WAW</DepartureAirport>
<DepartureDate>śr. 06 lip</DepartureDate>
<DepartureTime>07:50</DepartureTime>
<ArrivalAirport>VIE</ArrivalAirport>
<ArrivalDate>śr. 06 lip</ArrivalDate>
<ArrivalTime>09:15</ArrivalTime>
</Segment0>
<Segment1>
<DepartureAirport>VIE</DepartureAirport>
<DepartureDate>śr. 06 lip</DepartureDate>
<DepartureTime>10:00</DepartureTime>
<ArrivalAirport>SZG</ArrivalAirport>
<ArrivalDate>śr. 06 lip</ArrivalDate>
<ArrivalTime>10:50</ArrivalTime>
</Segment1>
</Segments>
</FlightTravel>
</results>
XML deserialization code:
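# -*- coding: utf-8 -*-
from lxml import etree
import codecs


class TitleTarget(object):
    def __init__(self):
        self.text = []
        self.is_title = False

    def start(self, tag, attrib):
        # Originally meant to select only <Title>; currently accepts every tag.
        self.is_title = True  # if tag == 'Title' else False

    def end(self, tag):
        pass

    def data(self, data):
        # Append every piece of character data the parser reports.
        if self.is_title:
            self.text.append(data)

    def close(self):
        return self.text


parser = etree.XMLParser(target=TitleTarget())
infile = 'Flights.xml'
# With a target parser, etree.parse() returns the value of target.close().
results = etree.parse(infile, parser)

# Write the collected strings as UTF-8, one per line.
out = codecs.open('wynik.txt', 'w', encoding='utf-8')
out.write(u'\n'.join(results))
out.close()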
Output:
['6', 'N5GWXM', '330992', '3h 00m', '540.00', '0', '226.56', '9561.31', 'XXXXX', '111-11-11-111', 'Jagiello', 'ń', 'ska', '8', 'Jagiello', 'ń', 'ska', 'Warszawa', 'Jagiello', 'ń', 'ska', '8', '11-111', 'Warszawa', '[email protected]', '123123123', 'WAW', 'ś', 'r. 06 lip', '07:50', 'VIE', 'ś', 'r. 06 lip', '09:15', 'VIE', 'ś', 'r. 06 lip', '10:00', 'SZG', 'ś', 'r. 06 lip', '10:50']

The item 'Jagiellońska' contains the special character 'ń'. When the parser appends the data to the array, 'ń' seems to act as some kind of split character, and my question is why this happens. The remaining items are appended to the array correctly. Exactly the same thing happens with the item 'śr. 06 lip'.
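
One possible explanation, offered as an assumption rather than a confirmed diagnosis: a target parser does not promise to deliver an element's text in a single data() call, so the text may arrive in several pieces (here apparently around 'ń' and 'ś'). A minimal sketch of a workaround is to buffer the pieces in data() and join them in end(). The names JoinedTextTarget and extract_texts are hypothetical, the sample document is a shortened version of the XML above, and the fallback to the standard library's xml.etree.ElementTree (which exposes the same target interface) is only there so the sketch runs without lxml installed.

```python
# -*- coding: utf-8 -*-
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib parser accepts the same kind of target object.
    import xml.etree.ElementTree as etree


class JoinedTextTarget(object):
    """Hypothetical target that collects one string per element with text."""

    def __init__(self):
        self.texts = []     # finished strings, one per element that has text
        self._chunks = []   # pieces received since the last start/end event

    def start(self, tag, attrib):
        # A new element begins: drop whitespace seen between tags.
        self._chunks = []

    def data(self, data):
        # May be called several times for a single text node.
        self._chunks.append(data)

    def end(self, tag):
        # Join the buffered pieces so split text becomes one string again.
        text = u''.join(self._chunks).strip()
        if text:
            self.texts.append(text)
        self._chunks = []

    def close(self):
        return self.texts


def extract_texts(xml_bytes):
    # With a target parser, fromstring() returns the value of target.close().
    parser = etree.XMLParser(target=JoinedTextTarget())
    return etree.fromstring(xml_bytes, parser)


sample = (u'<results><FlightTravel>'
          u'<InvoiceStreet>Jagiellońska</InvoiceStreet>'
          u'<DepartureDate>śr. 06 lip</DepartureDate>'
          u'</FlightTravel></results>').encode('utf-8')

texts = extract_texts(sample)
```

Because the pieces are joined only when the element ends, it no longer matters how many times data() was called for 'Jagiellońska' or 'śr. 06 lip'; each ends up as a single entry in texts.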