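I. MapReduce Concepts
1. MapReduce overview
MapReduce is a distributed computing model proposed by Google. It was originally used mainly in search, to run computations over very large data sets.
A MapReduce job runs in a distributed fashion and consists of two stages: Map and Reduce.
Map stage: an independent program that runs on many nodes at the same time, with each node processing one part of the data.
Reduce stage: also an independent program that can run on many nodes. [For now, just think of reduce as a separate aggregation program.]
The framework provides default implementations for every step, so users only need to override the two functions map() and reduce() to get a distributed computation.
The input and output of both functions are <key, value> pairs, so take care to construct the <k, v> pairs correctly when using them.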
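To make the <k, v> contract concrete, here is a minimal single-process sketch in plain Java. It does not use Hadoop at all; the class name MiniMapReduce and its methods are made up for illustration, and the comma-separated input format is an assumption borrowed from the later cases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the map -> group -> reduce flow.
public class MiniMapReduce {

    // map(): one input line in, a list of <word, 1> pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(",")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    // reduce(): one key plus all of its values in, a single total out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: group map output by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        // TreeMap keeps keys sorted, similar to the framework's sort step.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("java,hadoop", "java,hive")));
    }
}
```

In a real job the grouping in run() is done by the framework across machines; only map() and reduce() are user code.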
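2. How MapReduce works
The processing flow can be divided into three phases: Map, Shuffle, and Reduce.
Map phase
Map task processing
1.1 The framework uses a subclass of InputFormat to divide the input file (or directory) into many InputSplits. By default, each HDFS block corresponds to one InputSplit. A RecordReader then parses each InputSplit into <k1,v1> pairs. By default, each line in an InputSplit becomes one <k1,v1>.
1.2 The framework calls the map(…) function of the Mapper class. Its input is a <k1,v1> pair and its output is <k2,v2> pairs. One InputSplit corresponds to one map task. The programmer can override map to implement custom logic.
1.3 (If a reduce exists) The framework partitions the <k2,v2> pairs output by map. The <k2,v2> pairs in different partitions are handled by different reduce tasks. By default there is only one partition.
(If no reduce exists) The framework writes the map result directly to HDFS.
1.4 (If a reduce exists) Within each partition, the framework sorts and groups the data by k2. Grouping means that the v2 values with the same k2 are put into one group. Note: grouping does not reduce the number of <k2,v2> pairs.
1.5 (If a reduce exists, optional) On the map node, the framework can run a reduce-style pre-aggregation (a combiner).
1.6 (If a reduce exists) The framework writes the <k2,v2> pairs output by the map task to files on the local Linux disk.
At this point the map phase is complete.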
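Step 1.3 decides which reduce task receives each key. The sketch below assumes the usual hash partitioning rule (key hash, masked to be non-negative, modulo the number of reduce tasks). It uses String.hashCode() as a stand-in, whereas Hadoop hashes the bytes of the Text key, so the partition numbers on a real cluster will differ. PartitionSketch and partitionFor are hypothetical names.

```java
// Hypothetical, Hadoop-free illustration of hash partitioning.
public class PartitionSketch {

    // Returns the index of the reduce task that would receive this key.
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With a single reduce task every key lands in partition 0.
        System.out.println(partitionFor("java", 1));
        // With two reduce tasks keys are spread over partitions 0 and 1.
        System.out.println(partitionFor("a", 2));
        System.out.println(partitionFor("b", 2));
    }
}
```

The important property is that the same key always maps to the same partition, so all values for one key meet in a single reduce task.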
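Shuffle process
Map side:
1. Each map task has a circular in-memory buffer that stores its output. The default size is 100 MB (the io.sort.mb property). Once the buffer reaches the threshold of 0.8 (io.sort.spill.percent), a background thread spills the contents to a newly created file under the configured local directory (mapred.local.dir). These are the Hadoop 1 property names; Hadoop 2 and later call them mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent, and mapreduce.cluster.local.dir.
2. Before writing to disk, the data is partitioned and sorted. If a combiner is configured, it runs on the sorted data.
3. After the last record has been written, all spill files are merged into a single file that is partitioned and sorted.
Reduce side:
1. The reducer fetches its own partition of each map output file over HTTP.
2. The sort stage merges the map outputs, and then the reduce stage runs.
3. When reduce finishes, the result is written to HDFS.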
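Two parts of the shuffle are easy to check by hand: the spill threshold arithmetic and the merge of already-sorted spill files. The following is a rough sketch in plain Java, not Hadoop's actual implementation; SpillMergeSketch and its methods are invented names, and keys are simplified to plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical, Hadoop-free illustration of spill threshold and spill-file merging.
public class SpillMergeSketch {

    // Buffer size in MB times the spill fraction gives the byte count that triggers a spill.
    static long spillThresholdBytes(int sortMb, double spillPercent) {
        return Math.round(sortMb * 1024.0 * 1024.0 * spillPercent);
    }

    // K-way merge of sorted runs (one run per spill file) into one sorted list.
    static List<String> merge(List<List<String>> sortedRuns) {
        // Each heap entry is {runIndex, positionInRun}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> sortedRuns.get(a[0]).get(a[1]).compareTo(sortedRuns.get(b[0]).get(b[1])));
        for (int i = 0; i < sortedRuns.size(); i++) {
            if (!sortedRuns.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<String> run = sortedRuns.get(top[0]);
            merged.add(run.get(top[1]));
            if (top[1] + 1 < run.size()) {
                heap.add(new int[] {top[0], top[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // 100 MB buffer with a 0.8 threshold spills at 80 MB.
        System.out.println(spillThresholdBytes(100, 0.8));
        System.out.println(merge(Arrays.asList(
                Arrays.asList("hadoop", "java"),
                Arrays.asList("hive", "spark"))));
    }
}
```

Because every spill file is already sorted, the merge only ever compares the current head of each run, which is why the final map output file can be produced in a single pass.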
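Reduce phase
Reduce task processing
2.1 The framework copies the output of the map tasks over the network to different reduce nodes, according to partition. This process is called shuffle.
2.2 On the reduce side, the framework merges, sorts, and groups the <k2,v2> data of the same partition received from the map tasks.
2.3 The framework calls the reduce method of the Reducer class. Its input is <k2,{v2…}> and its output is <k3,v3>. The reduce function is called once for each <k2,{v2…}>. The programmer can override reduce to implement custom logic.
2.4 The framework saves the reduce output to HDFS.
At this point the reduce phase is complete.
Example: implementing WordCountApp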
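II. MapReduce Code Implementation
Case 1: word frequency count with one word per line
1. Running wordcount with the MapReduce example bundled with Hadoop
In any directory on Linux, create a file with one arbitrary word per line, then upload it to HDFS.
Run wordcount using the MapReduce example that ships with Hadoop.
After it succeeds, check the result from the Hadoop client.
2. Implementing MapReduce in Java
Code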
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

// Word count
public class MR01 {
    /**
     * Map phase: <K1 = byte offset, V1 = one line of input> in, <K2, V2> out
     */
    public static class WordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Convert to String in order to process the line
            String line = value.toString();
            // One word per line: emit <line, 1>. The line is not trimmed here.
            context.write(new Text(line), new LongWritable(1));
        }
    }

    /**
     * Reduce phase: aggregates into <K3, V3>
     */
    public static class WordReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // long rather than int, to match LongWritable
            long count = 0;
            for (LongWritable value : values) {
                count += value.get();
            }
            context.write(key, new LongWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        // Configure the MapReduce job
        Job job = Job.getInstance();
        job.setJobName("first MR program: word count");
        job.setJarByClass(MR01.class);
        // Class that implements the map side
        job.setMapperClass(WordMapper.class);
        // Class that implements the reduce side
        job.setReducerClass(WordReducer.class);
        // Key/value output types of the map side
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Key/value output types of the reduce side
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Paths
        Path input = new Path("/words.txt");
        Path out = new Path("/output"); // the output path must not already exist
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, out);
        // Submit and block until the job completes
        job.waitForCompletion(true);
        System.out.println("MR job finished");
    }
}
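Package the project into a jar.
Upload the jar with Xftp and run the MapReduce job.
Check the result.
The extra entry containing a space appears because we did not handle whitespace. Everything else matches the output of the MapReduce example bundled with Hadoop.
Case 2: multiple words per line
1. Upload the data
2. MapReduce code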
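One possible way to get rid of the stray whitespace entry is to trim each line and skip blank ones before emitting. This is only a sketch of the idea in plain Java; LineCleaner, clean, and shouldEmit are hypothetical helpers that the original mapper does not contain.

```java
// Hypothetical helper showing how a mapper could normalize a line before emitting it.
public class LineCleaner {

    // Strips leading and trailing whitespace so "java " and "java" become the same key.
    static String clean(String line) {
        return line == null ? "" : line.trim();
    }

    // Blank lines produce no output pair at all.
    static boolean shouldEmit(String line) {
        return !clean(line).isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("[" + clean("java ") + "]");
        System.out.println(shouldEmit("   "));
    }
}
```

Inside map(), this would amount to calling clean(value.toString()) and only writing to the context when shouldEmit returns true.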
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Word count
 * map phase, then grouping:  java:1 java:1 hadoop:1 hadoop:1
 * reduce phase, aggregation: java:{1,1} hadoop:{1,1}  ->  java:2
 */
public class MR02 {
    public static class WordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Words on a line are separated by commas
            String[] words = line.split(",");
            for (String word : words) {
                context.write(new Text(word), new LongWritable(1));
            }
        }
    }

    // In between, the shuffle phase merges and sorts the map output
    // key:{1,1,1,1,1,1....}
    public static class WordReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // Accumulate the total for each word
            long count = 0;
            for (LongWritable value : values) {
                count += value.get();
            }
            context.write(key, new LongWritable(count));
        }
    }

    // The main method builds the MapReduce job through the Job class
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("word count, multiple words per line");
        job.setJarByClass(MR02.class);
        // Mapper, reducer, and their output types
        job.setMapperClass(WordMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setReducerClass(WordReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Paths. Note: the output path must not already exist
        Path input = new Path("/word.txt");
        Path output = new Path("/output");
        FileInputFormat.addInputPath(job, input);
        // Delete the output path by hand if it already exists
        FileSystem fileSystem = FileSystem.get(new Configuration());
        if (fileSystem.exists(output)) {
            fileSystem.delete(output, true); // true means delete nested directories recursively
        }
        FileOutputFormat.setOutputPath(job, output);
        // Submit and block until the job completes
        job.waitForCompletion(true);
        System.out.println("word count, multiple words per line");
    }
}
Package, upload, and run.
Job logs can be viewed in the web UI at master:8088.
Case 3: sum of ages per class, from the student table
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MRClazzAgeSum {
    public static class ClazzMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] split = line.split(",");
            // Integer.parseInt converts the age field from String to int
            int age = Integer.parseInt(split[2]);
            String clazz = split[4];
            context.write(new Text(clazz), new LongWritable(age));
        }
    }

    public static class ClazzReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Set the number of reduce tasks
        job.setNumReduceTasks(2);
        job.setJobName("sum of age per class");
        job.setJarByClass(MRClazzAgeSum.class);
        // Map side and its output types
        job.setMapperClass(ClazzMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Reduce side and its output types
        job.setReducerClass(ClazzReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Paths. Note: the output path must not already exist
        Path input = new Path("/data/student/students.txt");
        FileInputFormat.addInputPath(job, input);
        Path output = new Path("/out");
        // Delete the output path if it already exists
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);
        // Submit and block until the job completes
        job.waitForCompletion(true);
        System.out.println("sum of age per class");
    }
}
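The mapper above indexes split[2] and split[4] directly, so a short or malformed line would throw and fail the task. A defensive variant might look like the sketch below. It assumes the field order id, name, age, sex, class implied by the indexes used in the code; the sample record in main is invented, and StudentLineParser is a hypothetical helper, not part of the original job.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper: parse one student line into <clazz, age>, tolerating bad input.
public class StudentLineParser {

    // Returns <clazz, age>, or null when the line has too few fields or a non-numeric age.
    static Map.Entry<String, Long> parse(String line) {
        String[] fields = line.split(",");
        if (fields.length < 5) {
            return null;
        }
        try {
            long age = Long.parseLong(fields[2].trim());
            return new SimpleEntry<>(fields[4].trim(), age);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Invented sample record, only to show the shape of the result.
        System.out.println(parse("1001,Tom,22,m,class1"));
        System.out.println(parse("broken line"));
    }
}
```

In map(), a null result would simply mean skipping the record instead of writing to the context.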
Upload the jar and run it.
Case 4: a job with no reduce task, selecting the records of male students
package com.shujia.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MR04Sex {
    public static class SexMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] split = line.split(",");
            String sex = split[3];
            // "男" is the value stored in the data for male
            if ("男".equals(sex)) {
                context.write(new Text(line), NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        /**
         * If there is no reduce, the number of reduce tasks must be set to 0 explicitly.
         * MapReduce has a default reducer implementation, and the number of reduce tasks defaults to 1.
         */
        job.setNumReduceTasks(0);
        job.setJarByClass(MR04Sex.class);
        job.setJobName("sex");
        job.setMapperClass(SexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        Path input = new Path("/students.txt");
        FileInputFormat.addInputPath(job, input);
        // The output path must not already exist, so delete it by hand if it does
        Path output = new Path("/output");
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);
        // Submit and block until the job completes
        job.waitForCompletion(true);
        System.out.println("sex");
        // Specify the concrete class by hand when running:
        // hadoop jar xxxxx.jar com.shujia.mr.<class name>
    }
}
Package and upload.
Run the MR job.
Result
Case 5: pre-aggregation with a combiner (counting male students)
package com.shujia.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * combine: pre-aggregation
 * runs after map and before reduce
 */
public class MR05Gender {
    public static class GenderMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] split = line.split(",");
            // "男" is the value stored in the data for male
            if ("男".equals(split[3])) {
                context.write(new Text("男"), new LongWritable(1));
            }
        }
    }

    // combine: pre-aggregation that runs on the map side, before the data is sent to reduce
    public static class CombineReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long count = 0;
            for (LongWritable value : values) {
                count += value.get();
            }
            context.write(key, new LongWritable(count));
        }
    }

    public static class GenderReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long count = 0;
            for (LongWritable value : values) {
                count += value.get();
            }
            context.write(key, new LongWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("combine pre-aggregation");
        job.setJarByClass(MR05Gender.class);
        // Map side
        job.setMapperClass(GenderMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // combine: pre-aggregation
        job.setCombinerClass(CombineReducer.class);
        // Reduce side
        job.setReducerClass(GenderReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        Path input = new Path("/students.txt");
        FileInputFormat.addInputPath(job, input);
        // The output path must not already exist, so delete it by hand if it does
        Path output = new Path("/output");
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);
        // Submit and block until the job completes
        job.waitForCompletion(true);
        System.out.println("combine pre-aggregation");
    }
}
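Why is it safe to plug a summing reducer in as the combiner? Because addition gives the same total whether the 1s are summed all at once or first per map task and then across tasks. The plain-Java sketch below illustrates this; CombinerSketch and its method names are made up, and each inner list stands for the values that one map task emits for the key "男". Keep in mind that Hadoop may run a combiner zero, one, or several times, so the job must be correct without it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, Hadoop-free illustration of what a sum combiner changes and what it does not.
public class CombinerSketch {

    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Without a combiner every single 1 travels to the reducer.
    static List<Long> shuffledWithoutCombiner(List<List<Long>> mapOutputs) {
        List<Long> shuffled = new ArrayList<>();
        for (List<Long> oneTask : mapOutputs) {
            shuffled.addAll(oneTask);
        }
        return shuffled;
    }

    // With a combiner each map task sends at most one partial sum.
    static List<Long> shuffledWithCombiner(List<List<Long>> mapOutputs) {
        List<Long> shuffled = new ArrayList<>();
        for (List<Long> oneTask : mapOutputs) {
            if (!oneTask.isEmpty()) {
                shuffled.add(sum(oneTask));
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<List<Long>> mapOutputs = Arrays.asList(
                Arrays.asList(1L, 1L, 1L),
                Arrays.asList(1L, 1L));
        // Same final count, fewer records crossing the network.
        System.out.println(sum(shuffledWithoutCombiner(mapOutputs)) + " from "
                + shuffledWithoutCombiner(mapOutputs).size() + " records");
        System.out.println(sum(shuffledWithCombiner(mapOutputs)) + " from "
                + shuffledWithCombiner(mapOutputs).size() + " records");
    }
}
```

The same trick would not work for an operation such as averaging, where combining partial results changes the answer.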
Result
Case 6: File (reading multiple input files)
package com.shujia.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MR06Files {
    public static class FilesMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Pass every line through unchanged
            context.write(value, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Needed so the cluster can locate the jar that contains this class
        job.setJarByClass(MR06Files.class);
        // Map-only job
        job.setNumReduceTasks(0);
        // The original called setMapOutputValueClass(FilesMapper.class) here, which does not register the mapper
        job.setMapperClass(FilesMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Several input paths can be added to one job
        Path input1 = new Path("/students.txt");
        FileInputFormat.addInputPath(job, input1);
        Path input2 = new Path("/score.txt");
        FileInputFormat.addInputPath(job, input2);
        // The output path must not already exist, so delete it by hand if it does
        Path output = new Path("/output");
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);
        // Submit and block until the job completes
        job.waitForCompletion(true);
    }
}
Case 7: using a join to compute each student's total score
package com.shujia.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.ArrayList;

// Output: student info, total score
/**
 * map:
 * 1. Determine which file the record comes from
 * 2. Tag the record according to its source
 * reduce:
 * 3. Loop over the values
 * 4. Check the tag to see what kind of record it is
 * 5. Handle each kind of record differently
 */
public class MR07Join {
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Get the split from the context (some splits come from the students file, some from the score file)
            InputSplit inputSplit = context.getInputSplit();
            FileSplit fileSplit = (FileSplit) inputSplit;
            String path = fileSplit.getPath().toString();
            String line = value.toString();
            // Both files start with the student id, which becomes the join key
            String id = line.split(",")[0];
            // Tag by source: "$" for student info (path contains "students"), "#" for scores
            String tag = path.contains("students") ? "$" : "#";
            context.write(new Text(id), new Text(tag + line));
        }
    }

    // id:{$info, #1001,0,98, #score, #score}
    public static class JoinReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Debugging alternative: concatenate all values to inspect them
            // StringBuffer sb = new StringBuffer();
            // for (Text text : values) { sb.append(text.toString()); }
            // context.write(new Text(sb.toString()), NullWritable.get());
            String stu = null;
            ArrayList<Integer> scos = new ArrayList<Integer>();
            for (Text value : values) { // may be a score, may be student info
                String s = value.toString();
                if (s.startsWith("$")) { // student info
                    stu = s.substring(1);
                } else { // student score
                    int score = Integer.parseInt(s.split(",")[2]);
                    scos.add(score);
                }
            }
            // Sum first, then append
            long sum = 0;
            for (Integer sco : scos) {
                sum += sco;
            }
            // Skip ids that have scores but no student record, instead of writing "null,<sum>"
            if (stu != null) {
                context.write(new Text(stu + "," + sum), NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("join");
        job.setJarByClass(MR07Join.class);
        // map
        job.setMapperClass(JoinMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // reduce
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        Path input1 = new Path("/students.txt");
        FileInputFormat.addInputPath(job, input1);
        Path input2 = new Path("/score.txt");
        FileInputFormat.addInputPath(job, input2);
        // The output path must not already exist, so delete it by hand if it does
        Path output = new Path("/output");
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);
        // Submit and block until the job completes
        job.waitForCompletion(true);
    }
}
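The tagging idea behind this reduce-side join can be tried without a cluster. The sketch below replays what a single reduce call sees for one student id, using the same "$" and "#" tags as the job. JoinSketch and joinOne are hypothetical names, and the sample records are invented; the score layout id, course, score is an assumption based on the index used in the reducer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, Hadoop-free replay of one reduce call of the tagged join.
public class JoinSketch {

    // taggedValues holds every value shuffled to one student id.
    // Returns "studentInfo,totalScore", or null if no student record arrived.
    static String joinOne(List<String> taggedValues) {
        String stu = null;
        long sum = 0;
        for (String s : taggedValues) {
            if (s.startsWith("$")) {
                // Student info: strip the tag and keep the record.
                stu = s.substring(1);
            } else if (s.startsWith("#")) {
                // Score record: the third comma-separated field is the score.
                sum += Integer.parseInt(s.substring(1).split(",")[2]);
            }
        }
        return stu == null ? null : stu + "," + sum;
    }

    public static void main(String[] args) {
        // Invented sample values; arrival order within one key is not guaranteed.
        System.out.println(joinOne(Arrays.asList(
                "#1001,0,98",
                "$1001,Tom,22,m,class1",
                "#1001,1,50")));
    }
}
```

Because the student record can arrive before, between, or after the score records, the scores are accumulated first and only joined to the student info once the loop has finished.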
Reposted from: https://www.cnblogs.com/lycc0210/p/15582930.html