Data Deduplication with MapReduce
Published: 2019-06-14


1. Principle Analysis

  During the shuffle between map and reduce, MapReduce sorts the intermediate records and groups all values sharing the same key together, so removing duplicate lines falls out almost for free. The mapper needs no real logic: it writes each input line (the value it receives) to the context as the output key, untouched. The reducer likewise needs no logic: for each distinct key it receives, it writes that key to the output file exactly once.

  I originally assumed the map phase used a HashMap and relied on the uniqueness of hash values; that is almost certainly not the case — the deduplication comes from the shuffle's sort-and-group-by-key step, not from hashing in the mapper.

  The map method is called once per line of the input file: as many lines as there are, that many map invocations.
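  A tiny made-up trace of the mechanism, for a three-line input file containing a, b, a:

	map output:           (a, ""), (b, ""), (a, "")
	after shuffle/sort:   (a, ["", ""]), (b, [""])
	reduce output:        a, b

  The two duplicate "a" records are grouped under a single key during the shuffle, so the reducer emits each distinct line exactly once.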

2. Code

2.1 Mapper

package algorithm;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DuplicateRemoveMapper extends Mapper<LongWritable, Text, Text, Text> {
	// The input file happens to contain numbers, but it may contain other characters too,
	// so the output key type is Text rather than LongWritable.
	public void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// The value must not be null here, or a NullPointerException is thrown
		// (map output values are serialized during the shuffle).
		context.write(value, new Text());
	}
}


2.2 Reducer

package algorithm;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DuplicateRemoveReducer extends Reducer<Text, Text, Text, Text> {
	public void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		// The values are ignored; only the distinct key matters.
		// Unlike in the mapper, a null value is fine here: the output format writes just the key.
		context.write(key, null);
	}
}
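  As an aside, the more idiomatic way to express "no value" in Hadoop is NullWritable rather than a literal null. A minimal sketch of that variant (the class name DuplicateRemoveNullReducer is made up; the mapper would then emit NullWritable.get() and the driver would set the output value class to NullWritable.class):

package algorithm;

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical variant: NullWritable as the value type instead of writing a literal null.
public class DuplicateRemoveNullReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
	public void reduce(Text key, Iterable<NullWritable> values, Context context)
			throws IOException, InterruptedException {
		// NullWritable is a singleton; nothing is serialized for the value.
		context.write(key, NullWritable.get());
	}
}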


2.3 Main

package algorithm;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DuplicateMainMR {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = new Job(conf, "DuplicateRemove");
		job.setJarByClass(DuplicateMainMR.class);
		job.setMapperClass(DuplicateRemoveMapper.class);
		job.setReducerClass(DuplicateRemoveReducer.class);
		job.setOutputKeyClass(Text.class);
		// The reducer writes null values, but the output value class still cannot be set
		// arbitrarily, otherwise a type-mismatch error is reported.
		job.setOutputValueClass(Text.class);
		job.setNumReduceTasks(1);
		// The input directory name was mistyped on HDFS as "DupblicateRemove" (an extra "b");
		// since HDFS does not support in-place modification, the mistyped path is used as-is.
		FileInputFormat.addInputPath(job, new Path("hdfs://192.168.58.180:8020/ClassicalTest/DupblicateRemove/DuplicateRemove.txt"));
		FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.58.180:8020/ClassicalTest/DuplicateRemove/DuplicateRemoveOut"));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
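  For reference, the run shown in section 3.2 below is a local-mode run (note the job_local ID), i.e. the program was launched directly from the IDE. On a cluster, after packaging the three classes into a jar (the name dedup.jar here is made up), the job would be submitted with the standard hadoop jar command; no arguments are needed since the paths are hard-coded in main:

	hadoop jar dedup.jar algorithm.DuplicateMainMR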


3. Output Analysis

3.1 Input and Output

Nothing much to compare here, so the actual input and output files are not pasted.
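Still, to make the behavior concrete, here is a hypothetical input/output pair consistent with the record counters below (8 map input records, 6 reduce output records); the real file contents were not preserved:

	input (8 lines, one number per line):    1 2 2 3 4 4 5 6
	output (6 lines, duplicates removed):    1 2 3 4 5 6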

3.2 Console Output


doop.mapreduce.Job.updateStatus(Job.java:323)
 INFO - Job job_local4032991_0001 completed successfully
DEBUG - PrivilegedAction as:hxsyl (auth:SIMPLE) from:org.apache.hadoop.mapreduce.Job.getCounters(Job.java:765)
 INFO - Counters: 38
	File System Counters
		FILE: Number of bytes read=560
		FILE: Number of bytes written=501592
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=48
		HDFS: Number of bytes written=14
		HDFS: Number of read operations=13
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=4
	Map-Reduce Framework
		Map input records=8
		Map output records=8
		Map output bytes=26
		Map output materialized bytes=48
		Input split bytes=142
		Combine input records=0
		Combine output records=0
		Reduce input groups=6
		Reduce shuffle bytes=48
		Reduce input records=8
		Reduce output records=6
		Spilled Records=16
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=4
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=457179136
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=24
	File Output Format Counters
		Bytes Written=14
DEBUG - PrivilegedAction as:hxsyl (auth:SIMPLE) from:org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:323)
DEBUG - stopping client from cache: org.apache.hadoop.ipc.Client@37afeb11
DEBUG - removing client from cache: org.apache.hadoop.ipc.Client@37afeb11
DEBUG - stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@37afeb11
DEBUG - Stopping client
DEBUG - IPC Client (521081105) connection to /192.168.58.180:8020 from hxsyl: closed
DEBUG - IPC Client (521081105) connection to /192.168.58.180:8020 from hxsyl: stopped, remaining connections 0
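The counters confirm the deduplication: Map input records=8 and Reduce input records=8, but Reduce input groups=6 and Reduce output records=6. In other words, the shuffle collapsed the 8 input lines into 6 distinct keys, and the two duplicate lines were dropped.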


Reposted from: https://www.cnblogs.com/hxsyl/p/6127764.html
