Is it possible to subclass DataFrame in PySpark?

The PySpark documentation shows DataFrames being constructed from sqlContext, sqlContext.read(), and a variety of other methods.

(See https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html)

Is it possible to subclass DataFrame and instantiate the subclass independently? I would like to add methods and functionality to the base DataFrame class.

Source

2017-01-11 jerzy

It really depends on your goals.

Technically, it is possible. pyspark.sql.DataFrame is just a plain Python class. You can extend it or monkey-patch it if you need to.

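```
from pyspark.sql import DataFrame


class DataFrameWithZipWithIndex(DataFrame):
    def __init__(self, df):
        # Reuse the JVM DataFrame and SQL context of the wrapped DataFrame.
        # The class is named explicitly so that further subclassing does not
        # send super() into infinite recursion.
        super(DataFrameWithZipWithIndex, self).__init__(df._jdf, df.sql_ctx)

    def zipWithIndex(self):
        # Pair each row with its index, then put the index first as "_idx".
        return (self.rdd
                .zipWithIndex()
                .map(lambda row: (row[1],) + row[0])
                .toDF(["_idx"] + self.columns))
```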

Example usage:

```
df = sc.parallelize([("a", 1)]).toDF(["foo", "bar"])

with_zipwithindex = DataFrameWithZipWithIndex(df)

isinstance(with_zipwithindex, DataFrame)
```
```
True
```
```
with_zipwithindex.zipWithIndex().show()
```
```
+----+---+---+
|_idx|foo|bar|
+----+---+---+
|   0|  a|  1|
+----+---+---+
```
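The monkey-patching route mentioned earlier could look roughly like the sketch below. The names `zip_with_index` and `patch_dataframe` are hypothetical helpers made up for this illustration, not part of PySpark; the function body only relies on the object having `.rdd` and `.columns`, as a DataFrame does.

```python
def zip_with_index(self):
    """Return a new DataFrame with a 0-based `_idx` column prepended."""
    return (self.rdd
            .zipWithIndex()                                 # (row, index) pairs
            .map(lambda pair: (pair[1],) + tuple(pair[0]))  # index first, then row values
            .toDF(["_idx"] + self.columns))


def patch_dataframe(df_class):
    """Attach zip_with_index to df_class as the method `zipWithIndex`."""
    df_class.zipWithIndex = zip_with_index
    return df_class


# With Spark available, the patch would be applied once at start-up:
#
#   from pyspark.sql import DataFrame
#   patch_dataframe(DataFrame)
#   df.zipWithIndex().show()
```

Compared with the subclass, this makes the method available on every DataFrame, including ones returned by other DataFrame methods, at the cost of modifying a library class globally.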
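Practically speaking, you won't be able to do much here. DataFrame is a thin wrapper around a JVM object and does little more than provide docstrings, convert arguments to the form required natively, call JVM methods, and wrap the results using Python adapters where necessary.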

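With plain Python code you won't be able to get into the DataFrame/Dataset internals or modify the core behavior. If you are looking for a standalone, Python-only Spark DataFrame implementation, that is not possible.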
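One practical consequence of the thin-wrapper design can be shown without Spark. The sketch below uses made-up stand-in classes (`FakeJvmFrame`, `ThinFrame`, `ThinFrameWithCount` are all hypothetical) and assumes, as is the case in the PySpark versions I am aware of, that wrapper methods re-wrap their result in the base class by name. Under that assumption, any method inherited from the base class returns a plain base-class object and the added methods are lost.

```python
class FakeJvmFrame(object):
    """Hypothetical stand-in for the JVM-side object; holds rows in a list."""

    def __init__(self, rows):
        self.rows = list(rows)

    def limit(self, n):
        return FakeJvmFrame(self.rows[:n])


class ThinFrame(object):
    """Hypothetical stand-in for a thin Python wrapper such as DataFrame."""

    def __init__(self, jdf):
        self._jdf = jdf

    def limit(self, n):
        # Delegate to the wrapped object, then re-wrap using the base class by name.
        return ThinFrame(self._jdf.limit(n))

    def collect(self):
        return list(self._jdf.rows)


class ThinFrameWithCount(ThinFrame):
    """Subclass that adds one convenience method."""

    def __init__(self, frame):
        super(ThinFrameWithCount, self).__init__(frame._jdf)

    def row_count(self):
        return len(self.collect())


base = ThinFrame(FakeJvmFrame([("a", 1), ("b", 2)]))
extended = ThinFrameWithCount(base)
limited = extended.limit(1)

print(extended.row_count())                     # 2
print(isinstance(limited, ThinFrameWithCount))  # False
print(limited.collect())                        # [('a', 1)]
```

With a subclass like this, the result of every inherited transformation has to be wrapped again by hand before the extra methods can be used.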

Source

2017-01-11 18:54:05 user6910411
