Pyspark: removing the UTF null character from a pyspark dataframe

I have a pyspark dataframe similar to the following, in which one of the values in column e contains the UTF null character \u0000:

from pyspark.sql import Row

# sql_context is assumed to be an existing SQLContext
df = sql_context.createDataFrame([
    Row(a=3, b=[4, 5, 6], c=[10, 11, 12], d='bar', e='utf friendly'),
    Row(a=2, b=[1, 2, 3], c=[7, 8, 9], d='foo', e=u'ab\u0000the')
])

If I try to load this df into a PostgreSQL database, I get the following error:

ERROR: invalid byte sequence for encoding "UTF8": 0x00

That makes sense. How can I efficiently remove the null character from a pyspark dataframe before loading the data into Postgres?

I first tried using some of the pyspark.sql.functions to clean the data, without success. encode, decode, and regexp_replace did not work:

from pyspark.sql.functions import col, decode, encode, regexp_replace

df.select(regexp_replace(col('e'), u'\u0000', ''))
df.select(encode(col('e'), 'UTF-8'))
df.select(decode(col('e'), 'UTF-8'))

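Ideally, I would like to clean the entire dataframe without specifying exactly which columns to clean or what the offending character is, since I won't necessarily know this ahead of time.

I am using a Postgres 9.4.9 database with UTF8 encoding.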


2016-12-14 Steve

Ah, wait. I think I have it. If I do something like this, it seems to work:

from pyspark.sql.functions import col, regexp_replace

null = u'\u0000'
new_df = df.withColumn('e', regexp_replace(df['e'], null, ''))

And then mapping it to all the string columns:

string_columns = ['d', 'e']
new_df = df.select(
    *(regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
      for c in df.columns)
)
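The question asked for a way to avoid listing the string columns by hand. The following is a minimal sketch of one way that might be done, assuming the string columns can be picked out from df.dtypes (which reports each column as a (name, type) pair). The helper names string_column_names and strip_null_chars are hypothetical and do not come from the thread. The replacement itself is the same regexp_replace call used above.

```python
def string_column_names(dtypes):
    """Return the names of columns whose reported type is 'string'.

    `dtypes` is a list of (column_name, type_name) pairs, the shape
    returned by DataFrame.dtypes in pyspark.
    """
    return [name for name, dtype in dtypes if dtype == "string"]


def strip_null_chars(df):
    """Hypothetical helper: remove \u0000 from every string column of df.

    pyspark is imported here so that the pure-Python helper above can be
    used and tested without a Spark installation.
    """
    from pyspark.sql.functions import col, regexp_replace

    targets = set(string_column_names(df.dtypes))
    return df.select(*[
        # Clean string columns; pass all other columns through unchanged.
        regexp_replace(col(c), u"\u0000", "").alias(c) if c in targets else col(c)
        for c in df.columns
    ])
```

With the example dataframe, new_df = strip_null_chars(df) would be expected to clean columns d and e and leave a, b, and c untouched. Strings nested inside array or struct columns are not handled by this sketch.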


2016-12-14 21:33:57 Steve

You can use DataFrame.fillna() to replace null values.

Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.

Parameters:

value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.

subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.


2016-12-15 06:57:07 Nirmal

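I don't think this will work here, because the cell in question is not actually null; it contains the UTF null character \u0000. If I run `df.fillna()` on my example df, it seems to return the same dataframe, since none of the cells are actually null. If I then try to load the resulting df into a Postgres table, I still get the same error message. – Steve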
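To illustrate the distinction drawn in the comment above, here is a small plain-Python sketch (not Spark code). It only shows that a string containing \u0000 is an ordinary non-null value, which is why a null-filling call has nothing to fill, whereas replacing the character itself removes it. The variable names are made up for illustration.

```python
# A value like the one in column e of the example: it contains U+0000.
value = "ab\u0000the"

# The value is an ordinary string, not a missing value (None).
is_missing = value is None

# Removing the character is a string replacement, not a null fill.
cleaned = value.replace("\u0000", "")

print(is_missing)     # False
print(len(value))     # 6
print(repr(cleaned))  # 'abthe'
```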
