Converting data into a list of class objects in Spark Scala

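I am trying to write Spark transformation code that converts the data below into a list of objects of the following class. I am new to Scala and Spark. I tried splitting the data and putting the pieces into a case class, but I could not combine them back into one object per player. Any help with this would be appreciated.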

Data:

FirstName,LastName,Country,match,Goals 
Cristiano,Ronaldo,Portugal,Match1,1 
Cristiano,Ronaldo,Portugal,Match2,1 
Cristiano,Ronaldo,Portugal,Match3,0 
Cristiano,Ronaldo,Portugal,Match4,2 
Lionel,Messi,Argentina,Match1,1 
Lionel,Messi,Argentina,Match2,2 
Lionel,Messi,Argentina,Match3,1 
Lionel,Messi,Argentina,Match4,2

Desired output:

PlayerStats{ String FirstName, 
    String LastName, 
    String Country, 
    Map <String,Int> matchandscore 
}


2016-12-24 Bhushan

First convert each line into a key-value pair, then apply groupByKey or reduceByKey. The key-value data would look like (Cristiano, rest of data). After applying groupByKey or reduceByKey, put the resulting values into the class. The well-known word count program is a good model for this pattern:

http://spark.apache.org/examples.html
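
The key-value idea described above can be sketched without a cluster. The following is only an illustration under stated assumptions: plain Scala collections stand in for an RDD, `groupBy` followed by `reduce` plays the role of `reduceByKey`, and the names `KeyValueSketch`, `lines`, and `aggregate` are made up for this example. Only four data rows from the question are included to keep it short.

```scala
// Plain Scala collections stand in for an RDD here; this is a sketch of the
// pattern, not Spark code. All names are illustrative.
object KeyValueSketch {
  case class PlayerStats(FirstName: String, LastName: String, Country: String,
                         matchandscore: Map[String, Int])

  // A subset of the sample data from the question, header included.
  val lines: List[String] = List(
    "FirstName,LastName,Country,match,Goals",
    "Cristiano,Ronaldo,Portugal,Match1,1",
    "Cristiano,Ronaldo,Portugal,Match2,1",
    "Lionel,Messi,Argentina,Match1,1",
    "Lionel,Messi,Argentina,Match2,2"
  )

  def aggregate(input: List[String]): List[PlayerStats] =
    input
      .filterNot(_.startsWith("FirstName"))      // drop the header row
      .map(_.split(",").map(_.trim))             // split and trim each field
      .collect { case Array(fn, ln, country, m, g) =>
        // key = player identity, value = single-entry map (like (word, 1))
        ((fn, ln, country), Map(m -> g.toInt))
      }
      .groupBy(_._1)                             // gather all values per key
      .map { case ((fn, ln, country), pairs) =>
        // merge the per-match maps, much as word count sums its 1s
        PlayerStats(fn, ln, country, pairs.map(_._2).reduce(_ ++ _))
      }
      .toList
}
```

Calling `KeyValueSketch.aggregate(KeyValueSketch.lines)` yields one `PlayerStats` per player. On a real RDD the same shape would use `map` to build the pairs and `reduceByKey(_ ++ _)` to merge them.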


2016-12-24 03:48:18

You could try something like the following:

// In spark-shell the implicits needed for toDF are already in scope;
// in a standalone application, import them from your session first.
val file = sc.textFile("myfile.csv")

val df = file
  .map(line => line.split(","))                      // split each line by comma
  .filter(lineSplit => lineSplit(0) != "FirstName")  // filter out the header row
  .map(lineSplit =>                                  // one row per match, with a single-entry map
    (lineSplit(0), lineSplit(1), lineSplit(2), Map(lineSplit(3) -> lineSplit(4).trim.toInt)))
  .toDF("FirstName", "LastName", "Country", "MatchAndScore")

df.schema 
// res34: org.apache.spark.sql.types.StructType = StructType(StructField(FirstName,StringType,true), StructField(LastName,StringType,true), StructField(Country,StringType,true), StructField(MatchAndScore,MapType(StringType,IntegerType,false),true)) 

df.show 

+---------+--------+---------+----------------+
|FirstName|LastName|  Country|   MatchAndScore|
+---------+--------+---------+----------------+
|Cristiano| Ronaldo| Portugal|Map(Match1 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match2 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match3 -> 0)|
|Cristiano| Ronaldo| Portugal|Map(Match4 -> 2)|
|   Lionel|   Messi|Argentina|Map(Match1 -> 1)|
|   Lionel|   Messi|Argentina|Map(Match2 -> 2)|
|   Lionel|   Messi|Argentina|Map(Match3 -> 1)|
|   Lionel|   Messi|Argentina|Map(Match4 -> 2)|
+---------+--------+---------+----------------+


2016-12-24 04:10:33 Psidom

Assuming you have already loaded the data into an RDD[String] named data:

import org.apache.spark.rdd.RDD

case class PlayerStats(FirstName: String, LastName: String, Country: String, matchandscore: Map[String, Int])

val result: RDD[PlayerStats] = data
  .filter(!_.startsWith("FirstName")) // remove header
  .map(_.split(","))
  .map { // map each line into a case class holding a single-match map
    case Array(fn, ln, cntry, mn, g) => PlayerStats(fn, ln, cntry, Map(mn -> g.trim.toInt))
  }
  .keyBy(p => (p.FirstName, p.LastName)) // key by player
  .reduceByKey((p1, p2) => p1.copy(matchandscore = p1.matchandscore ++ p2.matchandscore)) // merge the maps of the same player
  .map(_._2) // remove key
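
The merge step in the code above relies on how `++` combines two maps, so it may help to see that behavior in isolation. This is a small plain-Scala sketch, not part of the original answer. The object name `MergeSketch` and the helper `mergeSum` are hypothetical and exist only for illustration.

```scala
// Illustrates the map merge used inside reduceByKey. Plain Scala, no Spark.
object MergeSketch {
  // Disjoint keys: both entries survive the merge.
  val merged: Map[String, Int] = Map("Match1" -> 1) ++ Map("Match2" -> 1)

  // Same key on both sides: ++ keeps the right-hand value.
  val overwritten: Map[String, Int] = Map("Match1" -> 1) ++ Map("Match1" -> 2)

  // Hypothetical alternative: add the goals together when a match repeats.
  def mergeSum(x: Map[String, Int], y: Map[String, Int]): Map[String, Int] =
    y.foldLeft(x) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }
}
```

Spark expects the function passed to `reduceByKey` to be associative and commutative. With `++` that holds as long as each player has at most one row per match, as in the sample data. If duplicate match rows were possible, the result would depend on merge order, and a summing merge such as `mergeSum` might be the safer choice.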


2016-12-24 09:24:01

Thank you! – Bhushan

@Bhushan Glad it helped. You can accept and/or upvote the answer to let future readers know it was useful.
