0

PySpark에서 일부 조인을 수행하고 결과를 하이브에 저장하려고합니다. 작은 데이터 세트와 코드 조각은 작동하지만 데이터의 크기가 증가 내가 오류가 아래에 도착하면 다음하이브에 큰 RDD 쓰기 - 언로드 메모리를 저장 메모리에 전송하지 못했습니다.

소프트웨어 버전은

  • HDFS 2.7.3
  • 원사 2.7.3 하이브 1000년 2월 1일 있습니다
  • MapReduce2 2.7.3
  • Spark2 2.1.1
  • 파이썬 3.5.2
  • 호튼 웍스 2.6.2.0-205

코드 :

hdfs_df.write.mode("append").format("orc").save("HIVE PATH") 

예외 : 저장하기 전에

어쩌면
File "/grid/0/hadoop/yarn/local/usercache/pentaho/appcache/application_1512030580416_0001/container_e16_1512030580416_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save 
    File "/grid/0/hadoop/yarn/local/usercache/pentaho/appcache/application_1512030580416_0001/container_e16_1512030580416_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ 
    File "/grid/0/hadoop/yarn/local/usercache/pentaho/appcache/application_1512030580416_0001/container_e16_1512030580416_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco 
    File "/grid/0/hadoop/yarn/local/usercache/pentaho/appcache/application_1512030580416_0001/container_e16_1512030580416_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o439.save. 
: org.apache.spark.SparkException: Job aborted. 
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147) 
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) 
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) 
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) 
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121) 
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101) 
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) 
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) 
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) 
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) 
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) 
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
    at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:484) 
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:520) 
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215) 
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) 
    at py4j.Gateway.invoke(Gateway.java:280) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:214) 
    at java.lang.Thread.run(Thread.java:745) 
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 25.0 failed 4 times, most recent failure: Lost task 21.3 in stage 25.0 (TID 6957, dgsddevhdp14.mcs.local, executor 2): java.lang.AssertionError: assertion failed: transferring unroll memory to storage memory failed 
    at scala.Predef$.assert(Predef.scala:170) 
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:382) 
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) 
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1007) 
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:947) 
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1007) 
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:711) 
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) 
    at org.apache.spark.scheduler.Task.run(Task.scala:99) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 

답변

1

이 .coalesce하기 (200) 시도 (또는 다른 번호) 그것.

+0

hdfs_df.coalesce (100) .write.mode ("추가"). 형식 ("orc"). save ("HIVE_PATH") 행운 : ( –

+0

) 원인 : 원인 : ..... java.lang.AssertionError : 어설 션이 실패했습니다 : 저장소 메모리로 풀 메모리를 전송하지 못했습니다. 여기에서이 문제를 확인하십시오 : https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/ spark/storage/memory/MemoryStore.scala. 직원을 위해 더 많은 메모리가 필요해 보이는 것 ... – user3689574

관련 문제