urllib.urlretrieve and urllib2 corrupt files

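I'm at my wits' end with an XBMC add-on I'm working on. Both urllib.urlretrieve and urllib2 are corrupting the files I download.

In short: if I download the file with Firefox, IE, etc., it is valid and works fine, but if I download it with urllib or urllib2 in Python, it is corrupted.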

The file in question is http://re.zoink.it/00b007c479 (007960DAD4832AC714C465E207055F2BE18CAFF6.torrent).

Here are the checksums:

PY: 2d1528151c62526742ce470a01362ab8ea71e0a7 
IE: 60a93c309cae84a984bc42820e6741e4f702dc21

The checksums do not match (the Python download is corrupted; the IE/FF downloads are not).

Here is the function I wrote to do this:

import os
import re
import urllib2

def DownloadFile(uri, localpath):
    '''Downloads a file from the specified Uri to the local system.

    Keyword arguments:
    uri -- the remote uri to the resource to download
    localpath -- the local path to save the downloaded resource
    '''
    remotefile = urllib2.urlopen(uri)
    # Get the filename from the content-disposition header
    cdHeader = remotefile.info()['content-disposition']

    # typical header looks like: 'attachment; filename="Boardwalk.Empire.S05E00.The.Final.Shot.720p.HDTV.x264-BATV.[eztv].torrent"'
    # use RegEx to slice out the part we want (filename)
    filename = re.findall('filename=\"(.*?)\"', cdHeader)[0]
    filepath = os.path.join(localpath, filename)
    if (os.path.exists(filepath)):
        return

    data = remotefile.read()
    with open(filepath, "wb") as code:
        code.write(data) # this is resulting in a corrupted file

    #this is resulting in a corrupted file as well
    #urllib.urlretrieve(uri, filepath)

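What am I doing wrong? It is hit or miss: when I download with Python, some sources download correctly and others always result in a corrupted file. They all download correctly when I use a web browser.

Thanks in advance...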

Source

2014-10-22 Neal Bailey

The response you are downloading is gzip-encoded:

>>> import urllib2 
>>> remotefile = urllib2.urlopen('http://re.zoink.it/00b007c479') 
>>> remotefile.info()['content-encoding'] 
'gzip'

Browsers decode this for you, but urllib2 does not. You have to do it yourself first; once decompressed, the data matches your SHA-1 checksum:

>>> import zlib 
>>> import hashlib 
>>> data = remotefile.read() 
>>> hashlib.sha1(data).hexdigest() 
'2d1528151c62526742ce470a01362ab8ea71e0a7' 
>>> hashlib.sha1(zlib.decompress(data, zlib.MAX_WBITS + 16)).hexdigest() 
'60a93c309cae84a984bc42820e6741e4f702dc21'
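
The decompression step above can be sketched as a small helper. This is only an illustration, written for Python 3 rather than the Python 2 used in the question; the name `decode_body` and the sample bytes are made up, and the round trip uses locally compressed data instead of the real download.

```python
import gzip
import hashlib
import zlib


def decode_body(data, content_encoding):
    """Return the payload bytes, undoing a gzip Content-Encoding if present.

    decode_body is a hypothetical helper name, not part of any library.
    """
    if (content_encoding or "").strip().lower() == "gzip":
        # MAX_WBITS + 16 tells zlib to expect a gzip header and trailer.
        return zlib.decompress(data, zlib.MAX_WBITS + 16)
    return data


# Made-up bytes standing in for the real file contents.
original = b"d8:announce30:http://tracker.example/announcee"
wire_bytes = gzip.compress(original)  # what a gzip-encoding server would send

decoded = decode_body(wire_bytes, "gzip")
# Hashing the raw wire bytes gives a different digest than hashing the
# decoded payload, which is the mismatch described in the question.
raw_digest = hashlib.sha1(wire_bytes).hexdigest()
decoded_digest = hashlib.sha1(decoded).hexdigest()
expected_digest = hashlib.sha1(original).hexdigest()
```

Passing the `Content-Encoding` value in explicitly keeps the helper independent of whichever HTTP library produced the bytes.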

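In your function, decompress the data when the response is gzip-encoded:

import zlib 

data = remotefile.read() 
if remotefile.info().get('content-encoding') == 'gzip': 
    data = zlib.decompress(data, zlib.MAX_WBITS + 16)

Alternatively, you probably want to switch to the requests module, which handles content encoding transparently. Its decoded content matches your SHA-1 checksum exactly: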

>>> import requests 
>>> response = requests.get('http://re.zoink.it/00b007c479') 
>>> hashlib.sha1(response.content).hexdigest() 
'60a93c309cae84a984bc42820e6741e4f702dc21'
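
If staying with the standard library is preferable, the pieces above could be combined roughly as follows. This is a sketch under assumptions, not the asker's final code: it targets Python 3, `save_download` is a hypothetical name, and it takes the headers as a plain dict with lower-case keys plus the raw body bytes, so the demo runs on made-up data without a network call.

```python
import gzip
import os
import re
import tempfile
import zlib


def save_download(headers, body, localpath):
    """Write a downloaded body into localpath, decoding gzip first.

    headers -- dict of lower-case header names to values (assumed shape)
    body -- raw bytes exactly as read from the response
    """
    match = re.search(r'filename="(.*?)"', headers.get("content-disposition", ""))
    if match is None:
        raise ValueError("no filename in Content-Disposition header")
    # basename() guards against path separators in a server-supplied name.
    filepath = os.path.join(localpath, os.path.basename(match.group(1)))
    if os.path.exists(filepath):
        return filepath
    if headers.get("content-encoding", "").lower() == "gzip":
        body = zlib.decompress(body, zlib.MAX_WBITS + 16)
    with open(filepath, "wb") as out:
        out.write(body)
    return filepath


# Demo with made-up data instead of a live request.
payload = b"example torrent bytes"
fake_headers = {
    "content-disposition": 'attachment; filename="example.torrent"',
    "content-encoding": "gzip",
}
with tempfile.TemporaryDirectory() as tmp:
    saved = save_download(fake_headers, gzip.compress(payload), tmp)
    with open(saved, "rb") as fh:
        written = fh.read()
```

With a real response, the dict would be built from the response headers and `body` would come from `read()`.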

Source

2014-10-22 16:35:19

Thank you! I had just dumped the headers and noticed the gzip encoding (http://s12.postimg.org/x1f33j8kd/content_encoding.jpg). Argh... how frustrating.

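@NealBailey: In this particular case the server appears to ignore the 'Accept-Encoding' header (I tried sending an empty header and setting it to 'identity', and the server still used 'gzip' encoding. tsk.). In other cases, however, setting the header may avoid the problem.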
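
For servers that do honour it, setting the header as described in the comment above might look like the sketch below. It uses Python 3's `urllib.request` (in Python 2 the equivalent class is `urllib2.Request`); `build_identity_request` is a hypothetical helper name, and the request is only constructed here, not sent.

```python
import urllib.request


def build_identity_request(url):
    """Build a request asking the server not to compress the response body."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "identity"})


request = build_identity_request("http://re.zoink.it/00b007c479")
# urllib stores header names capitalized, so the lookup key is "Accept-encoding".
sent_value = request.get_header("Accept-encoding")
```

Because a server can ignore the header, as this one reportedly did, checking `Content-Encoding` on the response is still worthwhile before writing the data to disk.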
