Trouble merging on-disk tables with millions of rows

Hi there! I currently have two large HDFStores, each containing one node, and neither node fits in memory. The nodes don't contain NaN values. Now I would like to merge these two nodes using this approach. I first tested it on a small store where all the data fits in one chunk, and that worked fine. But now that the merge has to run chunk by chunk, it gives me the following error: TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype.

This is the code that I'm running:

>>> import pandas as pd 
>>> from pandas import HDFStore 
>>> print pd.__version__ 
0.12.0rc1 

>>> h5_1 ='I:/Data/output/test8\\var1.h5' 
>>> h5_3 ='I:/Data/output/test8\\var3.h5' 
>>> h5_1temp = h5_1.replace('.h5','temp.h5') 

>>> A = HDFStore(h5_1) 
>>> B = HDFStore(h5_3) 
>>> Atemp = HDFStore(h5_1temp) 

>>> print A 
<class 'pandas.io.pytables.HDFStore'> 
File path: I:/Data/output/test8\var1.h5 
/var1   frame_table (shape->12626172) 
>>> print B 
<class 'pandas.io.pytables.HDFStore'> 
File path: I:/Data/output/test8\var3.h5 
/var3   frame_table (shape->6313086) 

>>> nrows_a = A.get_storer('var1').nrows 
>>> nrows_b = B.get_storer('var3').nrows 
>>> a_chunk_size = 500000 
>>> b_chunk_size = 500000 
>>> for a in xrange(int(nrows_a/a_chunk_size) + 1): 
...  a_start_i = a * a_chunk_size 
...  a_stop_i = min((a + 1) * a_chunk_size, nrows_a) 
...  a = A.select('var1', start = a_start_i, stop = a_stop_i) 
...  for b in xrange(int(nrows_b/b_chunk_size) + 1): 
...   b_start_i = b * b_chunk_size 
...   b_stop_i = min((b + 1) * b_chunk_size, nrows_b) 
...   b = B.select('var3', start = b_start_i, stop = b_stop_i) 
...   Atemp.append('mergev13', pd.merge(a, b , left_index=True, right_index=True,how='inner')) 

... 
Traceback (most recent call last): 
    File "<interactive input>", line 9, in <module> 
    File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 658, in append 
    self._write_to_group(key, value, table=True, append=True, **kwargs) 
    File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 923, in _write_to_group 
    s.write(obj = value, append=append, complib=complib, **kwargs) 
    File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 3251, in write 
    return super(AppendableMultiFrameTable, self).write(obj=obj.reset_index(), data_columns=data_columns, **kwargs) 
    File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2983, in write 
    **kwargs) 
    File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2715, in create_axes 
    raise e 
TypeError: Cannot serialize the column [date] because 
its data contents are [empty] object dtype

Things I noticed: the store attributes mention pandas_version := '0.10.1', but my pandas version is 0.12.0rc1. Furthermore, here is some more specific information about the nodes:

>>> A.select_column('var1','date').unique() 
array([2006001, 2006009, 2006017, 2006025, 2006033, 2006041, 2006049, 
     2006057, 2006065, 2006073, 2006081, 2006089, 2006097, 2006105, 
     2006113, 2006121, 2006129, 2006137, 2006145, 2006153, 2006161, 
     2006169, 2006177, 2006185, 2006193, 2006201, 2006209, 2006217, 
     2006225, 2006233, 2006241, 2006249, 2006257, 2006265, 2006273, 
     2006281, 2006289, 2006297, 2006305, 2006313, 2006321, 2006329, 
     2006337, 2006345, 2006353, 2006361], dtype=int64) 

>>> B.select_column('var3','date').unique() 
array([2006001, 2006017, 2006033, 2006049, 2006065, 2006081, 2006097, 
     2006113, 2006129, 2006145, 2006161, 2006177, 2006193, 2006209, 
     2006225, 2006241, 2006257, 2006273, 2006289, 2006305, 2006321, 
     2006337, 2006353], dtype=int64) 

>>> A.get_storer('var1').levels 
['x', 'y', 'date'] 

>>> A.get_storer('var1').attrs 
/var1._v_attrs (AttributeSet), 12 attributes: 
    [CLASS := 'GROUP', 
    TITLE := '', 
    VERSION := '1.0', 
    data_columns := ['date', 'y', 'x'], 
    index_cols := [(0, 'index')], 
    levels := ['x', 'y', 'date'], 
    nan_rep := 'nan', 
    non_index_axes := [(1, ['x', 'y', 'date', 'var1'])], 
    pandas_type := 'frame_table', 
    pandas_version := '0.10.1', 
    table_type := 'appendable_multiframe', 
    values_cols := ['values_block_0', 'date', 'y', 'x']] 

>>> A.get_storer('var1').table 
/var1/table (Table(12626172,)) '' 
    description := { 
    "index": Int64Col(shape=(), dflt=0, pos=0), 
    "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1), 
    "date": Int64Col(shape=(), dflt=0, pos=2), 
    "y": Int64Col(shape=(), dflt=0, pos=3), 
    "x": Int64Col(shape=(), dflt=0, pos=4)} 
    byteorder := 'little' 
    chunkshape := (3276,) 
    autoIndex := True 
    colindexes := { 
    "date": Index(6, medium, shuffle, zlib(1)).is_CSI=False, 
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False, 
    "y": Index(6, medium, shuffle, zlib(1)).is_CSI=False, 
    "x": Index(6, medium, shuffle, zlib(1)).is_CSI=False} 

>>> B.get_storer('var3').levels 
['x', 'y', 'date'] 

>>> B.get_storer('var3').attrs 
/var3._v_attrs (AttributeSet), 12 attributes: 
    [CLASS := 'GROUP', 
    TITLE := '', 
    VERSION := '1.0', 
    data_columns := ['date', 'y', 'x'], 
    index_cols := [(0, 'index')], 
    levels := ['x', 'y', 'date'], 
    nan_rep := 'nan', 
    non_index_axes := [(1, ['x', 'y', 'date', 'var3'])], 
    pandas_type := 'frame_table', 
    pandas_version := '0.10.1', 
    table_type := 'appendable_multiframe', 
    values_cols := ['values_block_0', 'date', 'y', 'x']] 

>>> B.get_storer('var3').table 
/var3/table (Table(6313086,)) '' 
    description := { 
    "index": Int64Col(shape=(), dflt=0, pos=0), 
    "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1), 
    "date": Int64Col(shape=(), dflt=0, pos=2), 
    "y": Int64Col(shape=(), dflt=0, pos=3), 
    "x": Int64Col(shape=(), dflt=0, pos=4)} 
    byteorder := 'little' 
    chunkshape := (3276,) 
    autoIndex := True 
    colindexes := { 
    "date": Index(6, medium, shuffle, zlib(1)).is_CSI=False, 
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False, 
    "y": Index(6, medium, shuffle, zlib(1)).is_CSI=False, 
    "x": Index(6, medium, shuffle, zlib(1)).is_CSI=False} 

>>> print Atemp 
<class 'pandas.io.pytables.HDFStore'> 
File path: I:/Data/output/test8\var1temp.h5 
/mergev13   frame_table (shape->823446)

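Since the chunk size is 500000 and the shape of the node in Atemp is 823446, at least one chunk has been merged. But I cannot figure out where the error comes from, and I have run out of clues trying to discover where exactly it goes wrong. Any help is very much appreciated.

EDIT

Reducing the chunk size on my test store gives the same error. Not good, of course, but it now gives me the possibility to share. Click here for the code + HDFStores.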

Source

2013-07-17 Mattijn

The pandas_version attribute refers to how the metadata is stored; it hasn't changed in a while. I'll take a look. – Jeff

The merged frame may have no rows. Appending a zero-length frame is an error (though the message should be more informative).

Check the length before you append:

df = pd.merge(a, b , left_index=True, right_index=True,how='inner') 

if len(df): 
    Atemp.append('mergev46', df)

Results with your provided dataset:

<class 'pandas.io.pytables.HDFStore'> 
File path: var4.h5 
/var4   frame_table (shape->1334) 
<class 'pandas.io.pytables.HDFStore'> 
File path: var6.h5 
/var6   frame_table (shape->667) 
<class 'pandas.core.frame.DataFrame'> 
MultiIndex: 1334 entries, (928, 310, 2006001) to (1000, 238, 2006361) 
Data columns (total 1 columns): 
var4 1334 non-null values 
dtypes: float64(1) 
<class 'pandas.core.frame.DataFrame'> 
MultiIndex: 667 entries, (928, 310, 2006001) to (1000, 238, 2006353) 
Data columns (total 1 columns): 
var6 667 non-null values 
dtypes: float64(1) 
<class 'pandas.io.pytables.HDFStore'> 
File path: var4temp.h5 
/mergev46   frame_table (shape->977)

Note that you should close the files when you are done with them:

Closing remaining open files: var6.h5... done var4.h5... done var4temp.h5... done
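The length check can also be tried in isolation, without any HDF5 files. The following is only a minimal sketch in Python 3 with small in-memory frames: `chunk_bounds` and `make_frame` are hypothetical helpers that are not part of the original code, and a plain list stands in for the `Atemp.append(...)` call. It shows that some chunk pairs share no index values, so their inner merge is empty and has to be skipped. As a side note, generating the bounds with `range(0, nrows, chunk_size)` avoids the empty trailing chunk that `int(nrows / chunk_size) + 1` would appear to produce whenever the row count is an exact multiple of the chunk size.

```python
import pandas as pd


def chunk_bounds(nrows, chunk_size):
    """Yield (start, stop) row ranges covering nrows, with no empty tail."""
    for start in range(0, nrows, chunk_size):
        yield start, min(start + chunk_size, nrows)


def make_frame(name, dates):
    """Hypothetical stand-in for a stored node: (x, y, date) index, one column."""
    index = pd.MultiIndex.from_tuples(
        [(1, 1, d) for d in dates], names=["x", "y", "date"]
    )
    return pd.DataFrame({name: [float(d) for d in dates]}, index=index)


a_all = make_frame("var1", [2006001, 2006009, 2006017, 2006025])
b_all = make_frame("var3", [2006001, 2006017])

merged_parts = []  # stands in for the on-disk target node
skipped = 0        # number of chunk pairs whose merge was empty

for a_start, a_stop in chunk_bounds(len(a_all), 2):
    a = a_all.iloc[a_start:a_stop]
    for b_start, b_stop in chunk_bounds(len(b_all), 1):
        b = b_all.iloc[b_start:b_stop]
        df = pd.merge(a, b, left_index=True, right_index=True, how="inner")
        if len(df):
            # With a real store this would be: Atemp.append(key, df)
            merged_parts.append(df)
        else:
            # No common index values in this chunk pair: nothing to append
            skipped += 1

result = pd.concat(merged_parts)
print(len(result), "merged rows,", skipped, "empty chunk pairs skipped")
```

With the real stores, `a` and `b` would come from `A.select(...)` and `B.select(...)` with `start` and `stop`, as in the question.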

Source

2013-07-17 11:35:26 Jeff

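I added an issue about this. I think there is a subtle bug here in that HDFStore raises that error (rather than just letting the append go through, which I think would be correct); empty frames usually append successfully, except in this case: https://github.com/pydata/pandas/issues/4273 – Jeff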

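That is a clear example you provided on github. Very impressive how you discover the underlying issue from an error. I'll take some time this evening to check it on my dataset. One small thing I noticed in the results you provided: how can the node 'mergev46' have a bigger shape than 'var6' when doing an 'inner' merge? – Mattijn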

When you open the temp store (the one you are appending to), you need to open it with mode='w'; otherwise it appends to the results of previous runs of the code. – Jeff
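A quick way to sanity-check the row count discussed in these comments: when both indexes are unique, an inner index-on-index merge yields exactly one row per shared index value, so the result can never be larger than the smaller input. The sketch below assumes unique indexes and uses made-up in-memory frames; `expected_inner_rows` is a hypothetical helper, not a pandas function. A target node that reports more rows than this bound hints at leftover data from an earlier run.

```python
import pandas as pd


def expected_inner_rows(left_index, right_index):
    """Row count of an inner index-on-index merge, assuming unique indexes."""
    if not (left_index.is_unique and right_index.is_unique):
        raise ValueError("the row-count bound only holds for unique indexes")
    return len(left_index.intersection(right_index))


def make_frame(name, dates):
    """Hypothetical stand-in for a stored node with an (x, y, date) index."""
    index = pd.MultiIndex.from_tuples(
        [(928, 310, d) for d in dates], names=["x", "y", "date"]
    )
    return pd.DataFrame({name: [1.0] * len(dates)}, index=index)


left = make_frame("var4", [2006001, 2006009, 2006017, 2006025])
right = make_frame("var6", [2006001, 2006017])

merged = pd.merge(left, right, left_index=True, right_index=True, how="inner")
expected = expected_inner_rows(left.index, right.index)

# The merge result matches the index intersection and cannot exceed
# the smaller of the two inputs.
print(len(merged), expected, min(len(left), len(right)))
```

On the real stores, the number to compare against would be the row count of the merged node, for example the value reported by `get_storer(...).nrows`.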
