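How can I use the Hadoop distributed cache to store files in memory?

As far as I know, the distributed cache copies files to every node, and the map or reduce tasks then read those files from the local file system.

My question is: is there a way to store files in memory using the Hadoop distributed cache, so that map or reduce can read them directly from memory?

My MapReduce program distributes a PNG image of about 1 MB to every node. Each map task then reads the image from the distributed cache and performs some image processing on it together with another image from the map input.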

2013-12-12 hequn8128

import java.io.BufferedReader; 
import java.io.FileReader; 
import java.io.IOException; 
import java.net.URI; 
import java.util.StringTokenizer; 

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.filecache.DistributedCache; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapreduce.Job; 
import org.apache.hadoop.mapreduce.Mapper; 
import org.apache.hadoop.mapreduce.Reducer; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 
import org.apache.hadoop.util.GenericOptionsParser; 

public class WordCount { 

    public static class TokenizerMapper 
     extends Mapper<Object, Text, Text, IntWritable>{ 

    private final static IntWritable one = new IntWritable(1); 
    private Text word = new Text(); 

    public void map(Object key, Text value, Context context 
        ) throws IOException, InterruptedException { 

      // Resolve the local paths of the files registered with the distributed cache.
      // Note: this runs inside map(), so the file is re-read for every input record.
      Path[] uris = DistributedCache.getLocalCacheFiles(context.getConfiguration());

      try {
        BufferedReader readBuffer1 = new BufferedReader(new FileReader(uris[0].toString()));
        String line;
        while ((line = readBuffer1.readLine()) != null) {
          System.out.println(line);
        }
        readBuffer1.close();
      } catch (Exception e) {
        System.out.println(e.toString());
      }

      // Standard word count: emit (token, 1) for each token in the input line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    } 
    } 

    public static class IntSumReducer 
     extends Reducer<Text,IntWritable,Text,IntWritable> { 
    private IntWritable result = new IntWritable(); 

    public void reduce(Text key, Iterable<IntWritable> values, 
         Context context 
         ) throws IOException, InterruptedException { 
     int sum = 0; 
     for (IntWritable val : values) { 
     sum += val.get(); 
     } 
     int length=key.getLength(); 
     System.out.println("length"+length); 
     result.set(sum); 
/*  key.set("lenght"+lenght);*/ 
     context.write(key, result); 


    } 
    } 

    public static void main(String[] args) throws Exception { 

     final String NAME_NODE = "hdfs://localhost:9000"; 
    Configuration conf = new Configuration(); 

    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); 
    if (otherArgs.length != 2) { 
     System.err.println("Usage: wordcount <in> <out>"); 
     System.exit(2); 
    } 
    Job job = new Job(conf, "word count"); 
    job.setJarByClass(WordCount.class); 
    job.setMapperClass(TokenizerMapper.class); 
    job.setCombinerClass(IntSumReducer.class); 
    job.setReducerClass(IntSumReducer.class); 
    job.setOutputKeyClass(Text.class); 
    job.setOutputValueClass(IntWritable.class); 


    DistributedCache.addCacheFile(new URI(NAME_NODE 
     + "/dataset1.txt"), 
     job.getConfiguration()); 



    FileInputFormat.addInputPath(job, new Path(otherArgs[0])); 
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); 
    System.exit(job.waitForCompletion(true) ? 0 : 1); 
    } 

}


2013-12-17 08:04:40

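Thanks. I know how to use the distributed cache. My question is how to keep the file in memory instead of on the local file system. In your program, every map call reads dataset1.txt from the local file system. It looks like Spark might meet my needs. – hequn8128

Load the image in setup(). – Malcolm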

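Great question. I am trying to solve a similar problem. I don't think Hadoop supports an in-memory cache out of the box. However, it should not be very hard to run a separate in-memory cache somewhere on the grid for this purpose. You can pass the location of the cache and the name of the parameter as part of the job configuration.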
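As a rough illustration of passing a cache location through the job configuration, the sketch below uses java.util.Properties as a stand-in for Hadoop's Configuration so that it runs without Hadoop on the classpath. The key names, the host name, and the default port are all hypothetical. In a real job the driver would presumably call set(...) on the job's Configuration and the task would read the values back from context.getConfiguration().

```java
import java.util.Properties;

/** Stand-in sketch: Properties plays the role of Hadoop's Configuration. */
class CacheLocationConfig {
    // Hypothetical key names; any strings agreed on by driver and tasks would do.
    static final String HOST_KEY = "example.memcache.host";
    static final String PORT_KEY = "example.memcache.port";

    /** Driver side: record where the external in-memory cache lives. */
    static Properties driverSide(String host, int port) {
        Properties conf = new Properties();
        conf.setProperty(HOST_KEY, host);
        conf.setProperty(PORT_KEY, Integer.toString(port));
        return conf;
    }

    /** Task side: recover the address, falling back to assumed defaults if unset. */
    static String taskSide(Properties conf) {
        String host = conf.getProperty(HOST_KEY, "localhost");
        int port = Integer.parseInt(conf.getProperty(PORT_KEY, "11211"));
        return host + ":" + port;
    }

    public static void main(String[] args) {
        Properties conf = driverSide("cache-node-1", 11211);
        System.out.println(taskSide(conf));
    }
}
```

Each task would then open one client connection to that address in setup() and reuse it for every record.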

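The WordCount code in the other answer does not answer the original question. It is also a suboptimal sample. Ideally you should access the cache file in the setup() method and keep whatever information map() needs in memory. As written, that example reads the cache file once for every key-value pair, which hurts the performance of the MapReduce job.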
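A minimal sketch of that load-once pattern for the PNG case from the question follows. It is a plain-Java stand-in rather than a real Mapper, so it runs without Hadoop: the class name, the method names, and the size comparison are placeholders. In an actual job, the decoding would go in the Mapper's setup(Context) and the file path would come from the distributed cache.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Stand-in for a Mapper: setup() runs once per task, sameSize() once per record. */
class ReferenceImageTask {
    private BufferedImage reference; // decoded once, then kept on the heap

    /** Decode the localized cache file into memory a single time. */
    void setup(File localCacheFile) throws IOException {
        reference = ImageIO.read(localCacheFile);
        if (reference == null) {
            throw new IOException("Not a decodable image: " + localCacheFile);
        }
    }

    /** Placeholder for the per-record image processing; touches memory only. */
    boolean sameSize(BufferedImage input) {
        return input.getWidth() == reference.getWidth()
            && input.getHeight() == reference.getHeight();
    }

    public static void main(String[] args) throws IOException {
        // Create a small PNG to play the role of the cached file.
        File png = File.createTempFile("reference", ".png");
        ImageIO.write(new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB), "png", png);

        ReferenceImageTask task = new ReferenceImageTask();
        task.setup(png);
        System.out.println(task.sameSize(new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB)));
        png.delete();
    }
}
```

The file still travels through the local file system once per task, but every record after that works against the decoded image held in the field.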


2014-05-10 06:22:59 Saket
