How do I extract data from a Hadoop sequence file?

Hadoop sequence files are really weird. I pack an image into a sequence file and cannot recover the image. I did a simple test and found that the size in bytes is not even the same before and after going through the sequence file.

Configuration confHadoop = new Configuration();
FileSystem fs = FileSystem.get(confHadoop);

String fileName = args[0];
Path file = new Path(fs.getUri().toString() + "/" + fileName);
Path seqFile = new Path("/temp.seq");
SequenceFile.Writer writer = null;
FSDataInputStream in = null;
try {
    writer = SequenceFile.createWriter(confHadoop, Writer.file(seqFile),
            Writer.keyClass(Text.class),
            Writer.valueClass(BytesWritable.class));

    in = fs.open(file);
    byte buffer[] = IOUtils.toByteArray(in);

    System.out.println("original size ----> " + String.valueOf(buffer.length));
    writer.append(new Text(fileName), new BytesWritable(buffer));
    System.out.println(calculateMd5(buffer));
    writer.close();
} finally {
    IOUtils.closeQuietly(in);
}

SequenceFile.Reader reader = new SequenceFile.Reader(confHadoop, Reader.file(seqFile));

Text key = new Text();
BytesWritable val = new BytesWritable();

while (reader.next(key, val)) {
    System.out.println("size get from sequence file --->" + String.valueOf(val.getLength()));
    String md5 = calculateMd5(val.getBytes());
    Path readSeq = new Path("/write back.png");
    FSDataOutputStream out = null;
    out = fs.create(readSeq);
    out.write(val.getBytes()); // YES! GOT THE ORIGINAL IMAGE
    out.close();
    System.out.println(md5);
    .............
}

In the output I get the same number of bytes, and after writing the image back to disk I am sure I got the original image. So why are the MD5 values not the same?

What did I do wrong here?

14/04/22 16:21:35 INFO compress.CodecPool: Got brand-new compressor [.deflate] 
original size ----> 485709 
c413e36fd864b27d4c8927956298edbb 
14/04/22 16:21:35 INFO compress.CodecPool: Got brand-new decompressor [.deflate] 
size get from sequence file --->485709 
322cce20b732126bcb8876c4fcd925cb
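
The calculateMd5 helper called above is not shown in the question. A minimal sketch of what such a helper might look like, assuming it returns the lowercase hex MD5 of the array it is given (the name and the byte[] argument come from the call site; the class name Md5Sketch and the body are assumptions):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5Sketch {
    // Hypothetical stand-in for the question's calculateMd5.
    // It hashes every byte of the array it receives, whatever its length.
    static String calculateMd5(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes still print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every standard Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(calculateMd5(data));
    }
}
```

A helper written this way has no notion of how many bytes in the array are meaningful, so the result depends entirely on the exact array passed in.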

Source

2014-04-22 hakunami

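'in.available()' is the problem. Do you know how to read the file properly? –

@ThomasJungblut I just used this answer (http://stackoverflow.com/a/1264756/1285444) to change the way I read the file, and I can now get the original image back. The problem is that the MD5 is still different. Sorry for the trouble. – hakunami

Which MD5s are different? Those of the two sequence files? A seq file contains checksums and sync points, so the MD5 of the image and the MD5 of the sequence file are expected to differ. –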

I finally solved this strange problem and want to share the solution. First, let me show the wrong way to get the bytes out of a sequence file.

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path input = new Path(inPath);
Reader reader = new SequenceFile.Reader(conf, Reader.file(input));
Text key = new Text();
BytesWritable val = new BytesWritable();

while (reader.next(key, val)) {
    fileName = key.toString();
    byte[] data = val.getBytes(); // don't think you have got the data!
}

The reason is that getBytes() does not return an array of exactly the size of the original data. I put the data into the sequence file like this:

FSDataInputStream in = null; 
in = fs.open(input); 
byte[] buffer = IOUtils.toByteArray(in); 

Writer writer = SequenceFile.createWriter(conf, 
Writer.file(output), Writer.keyClass(Text.class), 
Writer.valueClass(BytesWritable.class)); 

writer.append(new Text(inPath), new BytesWritable(buffer)); 
writer.close();

I then checked the size of the output sequence file: it is the original size plus a header. I am not sure why getBytes() gives me more bytes than the original, but here is how to get the data out correctly.

Option #1: copy only as many bytes as you need.

byte[] rawdata = val.getBytes();
int length = val.getLength(); // exact size of the original data
byte[] data = Arrays.copyOfRange(rawdata, 0, length); // this is correct

Option #2

byte[] data = val.copyBytes();

This one is sweeter. :) Finally got it right.
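
A plausible explanation for the extra bytes, offered with some hedging: as far as I know, BytesWritable keeps its data in a backing array that is allowed to be larger than the data, and it over-allocates when it has to grow, which happens when a value is read back. getBytes() hands out that whole backing array, and only the first getLength() bytes are valid. The sketch below reproduces the effect without Hadoop. PaddedBuffer is a hypothetical stand-in, not the real class, and the 50% over-allocation in its constructor is an assumption made for illustration.

```java
import java.util.Arrays;

class PaddedBufferSketch {
    // Hypothetical stand-in for BytesWritable: the backing array may be
    // larger than the number of valid bytes it holds.
    static final class PaddedBuffer {
        private final byte[] backing;
        private final int length;

        PaddedBuffer(byte[] data) {
            // Over-allocate by 50% (assumption, for illustration only);
            // Arrays.copyOf fills the extra slots with zeros.
            this.backing = Arrays.copyOf(data, data.length * 3 / 2);
            this.length = data.length;
        }

        // Whole backing array, padding included.
        byte[] getBytes() { return backing; }

        // Number of valid bytes.
        int getLength() { return length; }

        // Trimmed copy holding only the valid bytes.
        byte[] copyBytes() { return Arrays.copyOf(backing, length); }
    }

    public static void main(String[] args) {
        byte[] original = {1, 2, 3, 4, 5, 6, 7, 8};
        PaddedBuffer val = new PaddedBuffer(original);

        System.out.println("getLength()        = " + val.getLength());
        System.out.println("getBytes().length  = " + val.getBytes().length);
        System.out.println("getBytes() equal?  " + Arrays.equals(original, val.getBytes()));
        System.out.println("copyBytes() equal? " + Arrays.equals(original, val.copyBytes()));
    }
}
```

With this stand-in, getLength() matches the original size while getBytes() returns a longer array, so any hash taken over getBytes() covers the padding too. Writing the padded array to disk can still give a file that opens as an image, since many decoders ignore trailing bytes, which would explain why the picture looked right while the MD5 did not.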

Source

2014-04-23 14:31:31 hakunami

Yeah, they must have been on an acid trip when they decided that `getBytes` was a good name for a method that doesn't get the bloody bytes!!! I went round the bend trying to figure this out. – samthebest

You saved me! – soulmachine
