
I have a map-only Hadoop job running on Amazon EMR with the latest AMI version, 3.0.4. From time to time it fails with the following exception while copying a bz2 input file from S3 into Hadoop:

Error: com.amazonaws.AmazonClientException: Unable to verify integrity of data download. Client calculated content length didn't match content length received from Amazon S3. The data may be corrupt. 
    at com.amazonaws.util.ContentLengthValidationInputStream.validate(ContentLengthValidationInputStream.java:144) 
    at com.amazonaws.util.ContentLengthValidationInputStream.read(ContentLengthValidationInputStream.java:81) 
    at java.io.FilterInputStream.read(FilterInputStream.java:133) 
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.read(EmrFileSystem.java:289) 
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) 
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275) 
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334) 
    at java.io.DataInputStream.read(DataInputStream.java:149) 
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273) 
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334) 
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) 
    at java.io.BufferedInputStream.read(BufferedInputStream.java:254) 
    at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.readAByte(CBZip2InputStream.java:195) 
    at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:866) 
    at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504) 
    at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333) 
    at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:423) 
    at org.apache.hadoop.io.compress.BZip2Codec.read(BZip2Codec.java:483) 
    at java.io.InputStream.read(InputStream.java:101) 
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211) 
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) 
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164) 
    at org.apache.hadoop.mapred.MapTask.nextKeyValue(MapTask.java:544) 
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80) 
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper.nextKeyValue(WrappedMapper.java:91) 
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) 
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:775) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342) 
    at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:162) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:415) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) 
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157) 

Is there a way to work around this? Why does it happen, and is it a network problem on Amazon's side? Re-running the same job usually succeeds, so the input files themselves should not be the problem. Is there a way to catch this exception, and why doesn't Hadoop recover from it automatically?
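For what it's worth, one mitigation to try while this is investigated (a sketch only, using standard Hadoop 2 property names that I am assuming apply on EMR, not anything confirmed in this post) is to raise the retry limits so that a transient S3 read failure re-runs the map task instead of failing the whole job. Inside run(), before Job.getInstance(conf), something like:

    // Assumption: "mapreduce.map.maxattempts" (default 4) controls how many times a
    // failed map task is re-attempted, and "fs.s3.maxRetries" is honored by the EMR
    // S3 filesystem for retrying individual read errors. Both names are my guesses
    // for this environment, not taken from the original post.
    conf.setInt("mapreduce.map.maxattempts", 8);
    conf.setInt("fs.s3.maxRetries", 10);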

My main class is as follows:

public class LogParserMapReduce extends Configured implements Tool { 
    private static final Log LOG = LogFactory.getLog(LogParserMapReduce.class); 

    @Override 
    public int run(String[] args) throws Exception { 
    Configuration conf = super.getConf(); 

    conf.setBoolean("mapred.compress.map.output", true); 
    conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class); 
    conf.setBoolean("keep.failed.task.files", true); 

    /* 
    * Instantiate a Job object for your job's configuration. 
    */ 
    Job job = Job.getInstance(conf); 

    /* 
    * The expected command-line arguments are the paths containing 
    * input and output data. Terminate the job if the number of 
    * command-line arguments is not exactly 2. 
    */ 
    if (args.length != 2) { 
     System.out.printf("Usage: LogParserMapReduce <input dir> <output dir>\n"); 
     System.exit(-1); 
    } 

    /* 
    * Specify the jar file that contains your driver, mapper, and reducer. 
    * Hadoop will transfer this jar file to nodes in your cluster running 
    * mapper and reducer tasks. 
    */ 
    job.setJarByClass(LogParserMapReduce.class); 

    /* 
    * Specify an easily-decipherable name for the job. 
    * This job name will appear in reports and logs. 
    */ 
    job.setJobName("LogParser"); 

    /* 
    * Specify the paths to the input and output data based on the 
    * command-line arguments. 
    */ 
    FileInputFormat.addInputPaths(job, args[0]); 
    FileOutputFormat.setOutputPath(job, new Path(args[1])); 
    FileOutputFormat.setCompressOutput(job, true); 
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); 

    /* 
    * Specify the mapper and reducer classes. 
    */ 
    job.setMapperClass(LogParserMapper.class); 

    /* 
    * For the SysLogEvent count application, the input file and output 
    * files are in text format - the default format. 
    * 
    * In text format files, each record is a line delimited by a 
    * line terminator. 
    * 
    * When you use other input formats, you must call the 
    * SetInputFormatClass method. When you use other 
    * output formats, you must call the setOutputFormatClass method. 
    */ 

    /* 
    * For the logs count application, the mapper's output keys and 
    * values have the same data types as the reducer's output keys 
    * and values: Text and IntWritable. 
    * 
    * When they are not the same data types, you must call the 
    * setMapOutputKeyClass and setMapOutputValueClass 
    * methods. 
    */ 

    /* 
    * Specify the job's output key and value classes. 
    */ 
    job.setOutputKeyClass(NullWritable.class); 
    job.setOutputValueClass(Text.class); 

    job.setNumReduceTasks(0); 

    LOG.info("LogParserMapReduce: waitingForCompletion"); 
    /* 
    * Start the MapReduce job and wait for it to finish. 
    * If it finishes successfully, return 0. If not, return 1. 
    */ 
    boolean success = job.waitForCompletion(true); 
    return success ? 0 : 1; 
    } 

} 
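The driver's entry point is not shown above; since the class implements Tool, the usual pattern is to launch it through ToolRunner, which parses the generic Hadoop options and then calls run(). A minimal sketch (this main method is my addition, not part of the original post):

    public static void main(String[] args) throws Exception { 
     // ToolRunner handles generic options such as -D key=value, then invokes 
     // LogParserMapReduce.run() with the remaining command-line arguments. 
     int exitCode = org.apache.hadoop.util.ToolRunner.run( 
      new org.apache.hadoop.conf.Configuration(), new LogParserMapReduce(), args); 
     System.exit(exitCode); 
    } 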

Hi, were you able to solve this? I'm running into the same problem! – itzhaki

Answer


The solution turned out to be very simple (as Amazon customer support told me): I upgraded to the latest AMI (currently 3.1.0), which ships with the latest Hadoop (2.4), and made sure to compile my Java code against that same Hadoop version. Since then I have not seen this kind of problem.
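One quick way to double-check the "same Hadoop version" part (my suggestion, not something the answer describes) is to log the client-side Hadoop version when the job starts and compare it with what the cluster reports, e.g. from hadoop version on the master node:

    // org.apache.hadoop.util.VersionInfo reports the Hadoop version on the job's classpath. 
    LOG.info("Compiled/running against Hadoop " + org.apache.hadoop.util.VersionInfo.getVersion()); 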
