How do I sort the list of most frequently repeated words with Hadoop MapReduce WordCount?

Hello, I am new to MapReduce. How do I sort the WordCount output so that the most frequently repeated words come first?

Can anybody help me modify the code posted below so that it displays the output I want?

I have an input file containing:

Hi my name is John.Im doing my engineering.My parents stay at California

I get the output as:

Hi 1
my 3
name 1
is 1
is 1
John 1
doing 1
engineering 1
parents 1
stay 1
at 1
California 1

But I want the output sorted as:

my 3 Hi 1 etc.....

followed by all the other entries. The idea is that the words repeated the most times should be sorted to the top and displayed first.

I am running this job on a single node, and I launch it as:

$ hadoop jar job.jar input output

The code I am running is:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    private static final Log LOG = LogFactory.getLog(WordCount.class);

    // Emits (word, 1) for every whitespace-separated token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts for each word; also used as the combiner.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            //printKeyAndValues(key, values);
            for (IntWritable val : values) {
                sum += val.get();
                LOG.info("val = " + val.get());
            }
            LOG.info("sum = " + sum + " key = " + key);
            result.set(sum);
            context.write(key, result);
            //System.err.println(String.format("[reduce] word: (%s), count: (%d)", key, result.get()));
        }

        // a little method to print debug output
        private void printKeyAndValues(Text key, Iterable<IntWritable> values) {
            StringBuilder sb = new StringBuilder();
            for (IntWritable val : values) {
                sb.append(val.get() + ", ");
            }
            System.err.println(String.format("[reduce] key: (%s), value: (%s)", key, sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

I am running hadoop-2.0.0-cdh4.0.0, and I started the daemons with:

$ hadoop namenode -format
$ hadoop namenode
$ hadoop datanode
sbin$ ./yarn-daemon.sh start resourcemanager
sbin$ ./yarn-daemon.sh start resourcemanager

It would be great if somebody could work this out.

Source

2012-07-24 Anonymous

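The question is about how to print the most frequently occurring words first.

I believe the question is very clear. In the usual word count example, the output is sorted lexicographically by word. He/she wants the results sorted by count instead.

How about writing another MapReduce job to do the ordering ...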
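To make the "second job" suggestion a bit more concrete, here is a rough sketch of the logic such a job would need, written as plain Java with no Hadoop on the classpath so it can be run anywhere. It assumes the first job wrote its results as word<TAB>count lines (the default separator of TextOutputFormat). The class and method names (SortByCountSketch, parse, sortByCountDescending) are made up for illustration; in a real second job the parsing would live in a mapper that emits the count as the key, and the descending order would come from a custom sort comparator.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of a second "sort by count" pass (hypothetical names). */
final class SortByCountSketch {

    /** Parses one "word<TAB>count" line, as a second job's mapper would. */
    static Map.Entry<String, Integer> parse(String line) {
        String[] parts = line.split("\t");
        return new SimpleEntry<>(parts[0], Integer.valueOf(parts[1].trim()));
    }

    /** Orders entries by count descending; ties are broken by word. */
    static List<Map.Entry<String, Integer>> sortByCountDescending(List<String> lines) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (String line : lines) {
            entries.add(parse(line));
        }
        entries.sort((a, b) -> {
            // Compare b to a so that larger counts come first.
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        // Assumed sample of the first job's output, taken from the question.
        List<String> firstJobOutput = Arrays.asList("Hi\t1", "my\t3", "name\t1");
        for (Map.Entry<String, Integer> e : sortByCountDescending(firstJobOutput)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

For the three sample lines this prints "my 3" first, then "Hi 1" and "name 1". The tie-break on the word is only there to keep the order deterministic; the question does not say how ties should be ordered.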

How about decrementing the count each time you find a word? Starting from 0, the counts become negative numbers, so the most frequent word should come first.
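One way to read the negative-count idea, sketched in plain Java rather than Hadoop: MapReduce sorts by key in ascending order, so negative counts only help once the count itself is the sort key. Below, a TreeMap stands in for that ascending sort; the negated count is used as the key and flipped back when printing. The names NegatedCountSketch and mostFrequentFirst are hypothetical, and this is only meant to show why the trick works, not how the job must be written.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Shows that an ascending sort on negated counts lists the largest counts first. */
final class NegatedCountSketch {

    static List<String> mostFrequentFirst(Map<String, Integer> counts) {
        // TreeMap iterates keys in ascending order, standing in for the shuffle sort.
        TreeMap<Integer, List<String>> byNegatedCount = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            byNegatedCount
                .computeIfAbsent(-e.getValue(), k -> new ArrayList<>())
                .add(e.getKey());
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> group : byNegatedCount.entrySet()) {
            int count = -group.getKey(); // undo the negation for display
            for (String word : group.getValue()) {
                lines.add(word + " " + count);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed sample counts, taken from the question's output.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("Hi", 1);
        counts.put("my", 3);
        counts.put("name", 1);
        for (String line : mostFrequentFirst(counts)) {
            System.out.println(line);
        }
    }
}
```

Here -3 sorts before -1, so "my 3" comes out ahead of the words that occur once. Words with the same count keep the order in which they were inserted.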

Source

2012-07-24 15:42:19
