MapReduce

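I. MapReduce Concepts

1. MapReduce Overview

MapReduce is a distributed computing model proposed by Google. It was used mainly in the search domain, to solve the problem of computing over massive data sets.
A MapReduce job runs in a distributed fashion and consists of two phases: Map and Reduce.

Map phase: an independent program that runs on many nodes at the same time, with each node processing one portion of the data.

Reduce phase: an independent program that aggregates the map output. It can also run on many nodes at the same time, each handling one portion of the data. [For now, it is enough to think of reduce as a separate aggregation program.]
The MapReduce framework ships with default implementations for everything else; the user only needs to override the two functions map() and reduce() to get a distributed computation.
Both functions take <key, value> pairs as input and emit <key, value> pairs as output, so take care in how you construct each <k,v>.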
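To make the <k,v> contract concrete, here is a minimal plain-Java sketch of the flow, with no Hadoop classes involved. The class KvFlowSketch and its run method are made-up names that stand in for the framework; the comma-separated input format is an assumption for illustration only.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the <k,v> flow; this is not the Hadoop API.
class KvFlowSketch {

    // map(): one <offset, line> pair in, a list of <word, 1> pairs out.
    static List<Map.Entry<String, Long>> map(long offset, String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.split(",")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }
        return out;
    }

    // reduce(): one key with all of its values in, one <key, total> pair out.
    static Map.Entry<String, Long> reduce(String key, List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return new AbstractMap.SimpleEntry<>(key, sum);
    }

    // Stand-in for the framework: call map per line, group by key, call reduce per key.
    static Map<String, Long> run(List<String> lines) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Long> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            // Approximate offset of the next line (assumes one byte per char and '\n' endings).
            offset += line.length() + 1;
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            Map.Entry<String, Long> kv = reduce(group.getKey(), group.getValue());
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("java,hadoop", "java"))); // prints {hadoop=1, java=2}
    }
}
```

In a real job the framework does the grouping across machines; only map and reduce are written by the user.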

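2. How MapReduce Works

MapReduce is divided into three phases: the Map phase, the Shuffle phase, and the Reduce phase.

Map phase

Map task processing
1.1 The framework uses a subclass of InputFormat to divide the input file (or directory) into many InputSplits. By default, each HDFS block corresponds to one InputSplit. A RecordReader then parses each InputSplit into <k1,v1> pairs. By default, the framework turns every line of an InputSplit into one <k1,v1>.
1.2 The framework calls the map(…) function of the Mapper class. Its input is a <k1,v1> pair and its output is <k2,v2> pairs. One InputSplit corresponds to one map task. The programmer can override map to implement custom logic.
1.3 (If a reduce exists) the framework partitions the <k2,v2> pairs output by map. The pairs in different partitions are handled by different reduce tasks. By default there is only one partition.
(If no reduce exists) the framework writes the map results directly to HDFS.
1.4 (If a reduce exists) within each partition, the framework sorts and groups the data by k2. Grouping means putting the v2 values that share a k2 into one group. Note: grouping does not reduce the number of <k2,v2> pairs.
1.5 (If a reduce exists, optional) on the map node, the framework can run a reduce-style combine step.
1.6 (If a reduce exists) the framework writes the <k2,v2> output of the map task to files on the local Linux disk.

This completes the map phase.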
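Steps 1.3 and 1.4 can be sketched in plain Java. The formula in partitionFor is the one used by Hadoop's default HashPartitioner; here it is applied to String.hashCode(), which is an assumption for illustration (a real job hashes the Writable key, so actual partition numbers may differ). PartitionSketch and its methods are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of partitioning and sorting map output keys.
class PartitionSketch {

    // Clear the sign bit, then take the remainder by the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Assign every key to a partition, then sort the keys inside each partition.
    static Map<Integer, List<String>> partitionAndSort(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String key : keys) {
            int p = partitionFor(key, numReduceTasks);
            partitions.computeIfAbsent(p, k -> new ArrayList<>()).add(key);
        }
        for (List<String> partition : partitions.values()) {
            Collections.sort(partition);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("java", "hadoop", "java", "spark");
        // With one reduce task everything lands in partition 0.
        System.out.println(partitionAndSort(keys, 1));
        // With two reduce tasks the keys are spread by hash; equal keys always stay together.
        System.out.println(partitionAndSort(keys, 2));
    }
}
```

Note that the number of keys is the same before and after: partitioning and sorting rearrange the pairs but do not merge them.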

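Shuffle process

1. Each map task has a circular in-memory buffer that holds its output. The default size is 100 MB (property io.sort.mb). Once the buffer reaches the threshold of 0.8 (io.sort.spill.percent), a background thread spills the contents to a newly created file under the configured local directory (mapred.local.dir). These are the Hadoop 1 property names; in Hadoop 2 and later they are mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent, and mapreduce.cluster.local.dir.
2. Before the data is written to disk, it is partitioned and sorted. If a combiner is configured, it runs on the sorted data.
3. After the last record has been written, all spill files are merged into a single partitioned and sorted file.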
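The spill behaviour can be illustrated with a toy model in plain Java. SpillSketch is a made-up name, and the model is deliberately simplified: it counts only record characters as bytes and ignores the metadata that the real buffer also stores, so it is a sketch of the idea rather than of Hadoop's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the map-side buffer: spill a sorted run whenever the threshold is reached.
class SpillSketch {

    // Bytes at which a spill starts: buffer size times the spill threshold.
    static long spillTriggerBytes(int sortMb, double spillPercent) {
        return Math.round(sortMb * 1024L * 1024L * spillPercent);
    }

    // Collect records; when the buffered size reaches the trigger, sort and spill them.
    static List<List<String>> spill(List<String> records, long triggerBytes) {
        List<List<String>> spills = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        long buffered = 0;
        for (String record : records) {
            buffer.add(record);
            buffered += record.length();
            if (buffered >= triggerBytes) {
                Collections.sort(buffer);
                spills.add(buffer);
                buffer = new ArrayList<>();
                buffered = 0;
            }
        }
        if (!buffer.isEmpty()) { // final flush when the map task ends
            Collections.sort(buffer);
            spills.add(buffer);
        }
        return spills;
    }

    public static void main(String[] args) {
        // 100 MB buffer with a 0.8 threshold: spilling starts at 80 MB.
        System.out.println(spillTriggerBytes(100, 0.8)); // prints 83886080
        // Tiny trigger of 2 "bytes" so that the toy input produces several spill runs.
        System.out.println(spill(Arrays.asList("b", "a", "d", "c", "e"), 2)); // prints [[a, b], [c, d], [e]]
    }
}
```

Each inner list plays the role of one spill file; step 3 then merges these sorted runs into one file.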

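1. Each reducer fetches its own partition of the map output files over HTTP.
2. The sort phase merges the fetched map outputs, and then the reduce phase runs.
3. When reduce finishes, the result is written to HDFS.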
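Because every fetched map output is already sorted, the reduce side can combine them with a k-way merge instead of sorting from scratch. The following plain-Java sketch shows that idea with a priority queue; MergeSketch is a made-up name, and the real framework merges files on disk and in memory rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// K-way merge of individually sorted runs into one sorted list.
class MergeSketch {

    static List<String> merge(List<List<String>> runs) {
        // Each queue entry is {run index, position in that run}, ordered by the record it points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heads.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                // Advance within the run that just supplied the smallest record.
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("hadoop", "java"),
                Arrays.asList("hive", "spark"),
                Arrays.asList("java"));
        System.out.println(merge(runs)); // prints [hadoop, hive, java, java, spark]
    }
}
```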

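Reduce phase

Reduce task processing
2.1 The framework copies the output of the map tasks over the network to the reduce nodes, partition by partition. This process is called shuffle.
2.2 On the reduce side, the framework merges, sorts, and groups the <k2,v2> data of the same partition received from the map tasks.
2.3 The framework calls the reduce method of the Reducer class. Its input is <k2,{v2…}> and its output is <k3,v3>. reduce is called once for each <k2,{v2…}>. The programmer can override reduce to implement custom logic.
2.4 The framework saves the reduce output to HDFS.
This completes the reduce phase.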
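Steps 2.2 and 2.3 rely on the input being sorted: equal keys sit next to each other, so a group ends as soon as the key changes. The plain-Java sketch below walks a sorted list of pairs in that way and sums each group. GroupSketch is a made-up name and the summing reduce is just an example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Walk sorted <k2,v2> pairs, form <k2,{v2...}> groups, and reduce each group once.
class GroupSketch {

    static Map<String, Long> reduceSorted(List<Map.Entry<String, Long>> sortedPairs) {
        Map<String, Long> out = new LinkedHashMap<>();
        int i = 0;
        while (i < sortedPairs.size()) {
            String key = sortedPairs.get(i).getKey();
            long sum = 0;
            // Consume the run of pairs that share this key: this is one reduce call.
            while (i < sortedPairs.size() && sortedPairs.get(i).getKey().equals(key)) {
                sum += sortedPairs.get(i).getValue();
                i++;
            }
            out.put(key, sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        pairs.add(new AbstractMap.SimpleEntry<>("hadoop", 1L));
        pairs.add(new AbstractMap.SimpleEntry<>("java", 1L));
        pairs.add(new AbstractMap.SimpleEntry<>("java", 1L));
        System.out.println(reduceSorted(pairs)); // prints {hadoop=1, java=2}
    }
}
```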
Example: implementing WordCountApp

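II. MapReduce Code Implementation

Case 1: word frequency count with one word per line

1. Running wordcount with the MapReduce example bundled with Hadoop

Create a file in any directory on Linux, write a handful of words into it with one word per line, and upload it to HDFS.

Run wordcount with the MapReduce example bundled with Hadoop.

Once the job succeeds, view the result from the Hadoop client.

2. Implementing MapReduce in Java

Code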

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

// Word count
public class MR01 {
    /**
     * Map phase: K1 is the offset of the line, V1 is the line text; K2 and V2 are the map output
     */
    public static class WordMapper extends Mapper<LongWritable,Text,Text,LongWritable>{
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line=value.toString(); // convert the Text value to a String to process the line
            int v=1;
            context.write(new Text(line),new LongWritable(v));
        }
    }
    /**
     * Reduce phase: emits K3 and V3
     * Aggregates the values of each key
     */
    public static class WordReducer extends Reducer<Text,LongWritable,Text,LongWritable>{
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long count=0; // long matches LongWritable and avoids int overflow
            for (LongWritable value : values) {
                count+=value.get();
            }
            context.write(key,new LongWritable(count));
        }
    }
    public static void main(String[] args) throws Exception{
        // Configure the MapReduce job
        Job job = Job.getInstance();
        job.setJobName("first MR program: word count");
        job.setJarByClass(MR01.class);
        // Mapper class
        job.setMapperClass(WordMapper.class);
        // Reducer class
        job.setReducerClass(WordReducer.class);
        // Key and value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Key and value types of the reduce output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Input and output paths
        Path input = new Path("/words.txt");
        Path out = new Path("/output"); // the output path must not already exist
        FileInputFormat.addInputPath(job,input);
        FileOutputFormat.setOutputPath(job,out);
        // Submit the job and block until it finishes
        boolean ok = job.waitForCompletion(true);
        System.out.println(ok ? "MR job finished" : "MR job failed");
    }
}

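Packaging

Upload the jar with xftp and run the MapReduce job

View the result

The output contains one extra whitespace entry because the code does not handle whitespace; otherwise it matches the result of the wordcount example bundled with Hadoop.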
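One plausible way to get rid of the extra whitespace entry is to trim each line and skip blank ones before emitting it. The sketch below shows the idea in plain Java, without Hadoop. CleanLineSketch and its methods are made-up names, and it assumes the stray entry comes from blank lines or trailing spaces in the input. In MR01 the same check would go into WordMapper.map before the call to context.write.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Normalize each input line before counting it as a word.
class CleanLineSketch {

    // Returns the trimmed word, or null when the line holds only whitespace.
    static String clean(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    // Count one word per line, ignoring blank lines and surrounding spaces.
    static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            String word = clean(line);
            if (word != null) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "java " and "java" become the same key; the empty line is dropped.
        System.out.println(count(Arrays.asList("java", "java ", "", "hadoop"))); // prints {hadoop=1, java=2}
    }
}
```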

Case 2: multiple words per line

1. Upload the data

2. MapReduce implementation

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Word count
 * map phase: emit one pair per word, e.g. java:1 java:1 hadoop:1 hadoop:1
 * reduce phase: aggregate, e.g. java:{1,1} hadoop:{1,1} gives java:2 hadoop:2
 */
public class MR02 {

    public static class WordMapper extends Mapper<LongWritable,Text,Text,LongWritable>{
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                context.write(new Text(word),new LongWritable(1));
            }
        }
    }

    // In between, the shuffle phase merges and sorts the map output
    // key:{1,1,1,1,1,1....}
    public static class WordReduce extends Reducer<Text,LongWritable,Text,LongWritable>{
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // Accumulate the total count for each word
            long count = 0;
            for (LongWritable value : values) {
                count+=value.get();
            }
            context.write(key,new LongWritable(count));
        }
    }
    // Build the MapReduce job in the main method using the Job class
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("word count, multiple words per line");
        job.setJarByClass(MR02.class);

        // Mapper and its output types
        job.setMapperClass(WordMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Reducer and its output types
        job.setReducerClass(WordReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Input and output paths; note: the output path must not already exist
        Path input = new Path("/word.txt");
        Path output = new Path("/output");
        FileInputFormat.addInputPath(job,input);
        // Delete the output path with FileSystem if it already exists
        FileSystem fileSystem = FileSystem.get(new Configuration());
        if (fileSystem.exists(output)){
            fileSystem.delete(output,true); // true deletes nested directories recursively
        }
        FileOutputFormat.setOutputPath(job,output);

        // Submit the job and wait for it to finish
        job.waitForCompletion(true);
        System.out.println("word count, multiple words per line");
    }

}

Package, upload, and run

You can view the job logs in the web UI at master:8088

Case 3: sum of ages per class from the student table

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MRClazzAgeSum {
    public static class ClazzMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] split = line.split(",");
            int age = Integer.parseInt(split[2]); // Integer.parseInt converts the string to an int
            String clazz = split[4];
            context.write(new Text(clazz), new LongWritable(age));
        }
    }

    public static class ClazzReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Number of reduce tasks
        job.setNumReduceTasks(2);
        job.setJobName("sum of ages per class");
        job.setJarByClass(MRClazzAgeSum.class);
        // Mapper and its output types
        job.setMapperClass(ClazzMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Reducer and its output types
        job.setReducerClass(ClazzReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Input and output paths; note: the output path must not already exist
        Path input = new Path("/data/student/students.txt");
        FileInputFormat.addInputPath(job, input);
        Path output = new Path("/out");
        // Delete the output path if it already exists
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs.exists(output)) {
            fs.delete(output, true);
        }

        FileOutputFormat.setOutputPath(job, output);

        // Submit the job and wait for it to finish
        job.waitForCompletion(true);
        System.out.println("sum of ages per class");
    }
}
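The mapper above trusts every line to have at least five comma-separated fields with a numeric age in the third one; a single malformed line would throw and fail the task. The plain-Java sketch below shows one defensive way to parse such a record. StudentLineSketch, its methods, and the sample records are hypothetical; only the field positions (index 2 for age, index 4 for class) come from the mapper.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Defensive parsing of "id,name,age,sex,clazz" records (sample layout is hypothetical).
class StudentLineSketch {

    // Returns <clazz, age>, or null when the line is malformed.
    static Map.Entry<String, Integer> parseClazzAge(String line) {
        String[] fields = line.split(",");
        if (fields.length < 5) {
            return null; // too few columns
        }
        try {
            int age = Integer.parseInt(fields[2].trim());
            return new AbstractMap.SimpleEntry<>(fields[4].trim(), age);
        } catch (NumberFormatException e) {
            return null; // age is not a number
        }
    }

    // Sum the ages per class, skipping lines that cannot be parsed.
    static Map<String, Long> sumAges(List<String> lines) {
        Map<String, Long> sums = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> kv = parseClazzAge(line);
            if (kv != null) {
                sums.merge(kv.getKey(), (long) kv.getValue(), Long::sum);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1,A,20,男,c1",
                "2,B,22,女,c1",
                "bad line",
                "3,C,xx,男,c2",
                "4,D,21,女,c2");
        System.out.println(sumAges(lines)); // prints {c1=42, c2=21}
    }
}
```

Whether to skip bad lines silently, count them, or fail the job is a design choice that depends on how clean the data is expected to be.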

Upload the jar and run it

Case 4: a map-only job (no reduce task) that selects the records of male students

package com.shujia.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MR04Sex {
    public static class SexMapper extends Mapper<LongWritable,Text,Text,NullWritable>{
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] split = line.split(",");
            String sex = split[3];
            if("男".equals(sex)){ // "男" means male
                context.write(new Text(line),NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        /**
         * A job without a reduce step must set the number of reduce tasks to 0 explicitly:
         * MapReduce has a default reducer implementation, and the default number of reduce tasks is 1
         */
        job.setNumReduceTasks(0);
        job.setJarByClass(MR04Sex.class);
        job.setJobName("sex");
        job.setMapperClass(SexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        Path input = new Path("/students.txt");
        FileInputFormat.addInputPath(job,input);
        // The output path must not already exist,
        // so delete it with FileSystem if it does
        Path output = new Path("/output");
        FileSystem fs = FileSystem.get(new Configuration());
        if(fs.exists(output)){
            fs.delete(output,true);
        }
        FileOutputFormat.setOutputPath(job,output);

        // Submit the job and wait for it to finish
        job.waitForCompletion(true);
        System.out.println("sex");
        // When running the jar, name the class to run explicitly:
        // hadoop jar xxxxx.jar com.shujia.mr.<ClassName>
    }
}

Package and upload

Run the MR job

Result

Case 5: pre-aggregation (counting the male students)

package com.shujia.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * combine: pre-aggregation
 * runs after map and before reduce
 */
public class MR05Gender {
    public static class GenderMapper extends Mapper<LongWritable,Text,Text,LongWritable>{
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] split = line.split(",");
            if("男".equals(split[3])){ // "男" means male
                context.write(new Text("男"),new LongWritable(1));
            }
        }
    }
    // Combiner: pre-aggregation that runs on the map side, before the data is sent to the reducers
    public static class  CombineReducer extends Reducer<Text,LongWritable,Text,LongWritable>{
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long count=0;
            for (LongWritable value : values) {
                count+=value.get();
            }
            context.write(key,new LongWritable(count));
        }
    }
    public static class  GenderReducer extends Reducer<Text,LongWritable,Text,LongWritable>{
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long count=0;
            for (LongWritable value : values) {
                count+=value.get();
            }
            context.write(key,new LongWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("combine pre-aggregation");
        job.setJarByClass(MR05Gender.class);
        // Map side
        job.setMapperClass(GenderMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Combiner: pre-aggregation
        job.setCombinerClass(CombineReducer.class);
        // Reduce side
        job.setReducerClass(GenderReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        Path input = new Path("/students.txt");
        FileInputFormat.addInputPath(job,input);
        // The output path must not already exist,
        // so delete it with FileSystem if it does
        Path output = new Path("/output");
        FileSystem fs = FileSystem.get(new Configuration());
        if(fs.exists(output)){
            fs.delete(output,true);
        }
        FileOutputFormat.setOutputPath(job,output);

        // Submit the job and wait for it to finish
        job.waitForCompletion(true);
        System.out.println("combine pre-aggregation");
    }
}

Result
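A combiner works for this job because summing is associative and commutative: adding partial sums gives the same total as adding the raw 1s. The plain-Java sketch below compares the two paths and also shows an operation, the mean, where reusing the reducer as a combiner would give a wrong answer. CombinerSketch and its methods are made-up names, and "one record per map task" is the best case; the framework decides how often the combiner actually runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Compare shuffled record counts and results with and without pre-aggregation.
class CombinerSketch {

    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Records sent to the reducer for a single key: all of them, or one partial sum per map task.
    static int shuffledRecords(List<List<Long>> mapOutputs, boolean useCombiner) {
        int n = 0;
        for (List<Long> out : mapOutputs) {
            if (!out.isEmpty()) {
                n += useCombiner ? 1 : out.size();
            }
        }
        return n;
    }

    // Final reducer result, optionally pre-summing each map task's output first.
    static long reduceTotal(List<List<Long>> mapOutputs, boolean useCombiner) {
        List<Long> shuffled = new ArrayList<>();
        for (List<Long> out : mapOutputs) {
            if (out.isEmpty()) {
                continue;
            }
            if (useCombiner) {
                shuffled.add(sum(out));
            } else {
                shuffled.addAll(out);
            }
        }
        return sum(shuffled);
    }

    static double mean(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    public static void main(String[] args) {
        // Two map tasks emitting <"男",1> three times and twice.
        List<List<Long>> mapOutputs = Arrays.asList(Arrays.asList(1L, 1L, 1L), Arrays.asList(1L, 1L));
        System.out.println(shuffledRecords(mapOutputs, false) + " vs " + shuffledRecords(mapOutputs, true)); // prints 5 vs 2
        System.out.println(reduceTotal(mapOutputs, false) + " vs " + reduceTotal(mapOutputs, true)); // prints 5 vs 5

        // Counter-example: the mean of per-task means is not the overall mean.
        double overall = mean(Arrays.asList(1.0, 2.0, 3.0, 10.0));
        double ofMeans = mean(Arrays.asList(mean(Arrays.asList(1.0, 2.0, 3.0)), mean(Arrays.asList(10.0))));
        System.out.println(overall + " vs " + ofMeans); // prints 4.0 vs 6.0
    }
}
```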

Case 6: multiple input files

package com.shujia.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MR06Files {
    public static class FilesMapper extends Mapper<LongWritable,Text,Text,NullWritable>{
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            context.write(value,NullWritable.get());
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(MR06Files.class);
        job.setNumReduceTasks(0);
        job.setMapperClass(FilesMapper.class); // register the mapper
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        // Two input files are added to the same job
        Path input1 = new Path("/students.txt");
        FileInputFormat.addInputPath(job,input1);
        Path input2 = new Path("/score.txt");
        FileInputFormat.addInputPath(job,input2);
        // The output path must not already exist,
        // so delete it with FileSystem if it does
        Path output = new Path("/output");
        FileSystem fs = FileSystem.get(new Configuration());
        if(fs.exists(output)){
            fs.delete(output,true);
        }
        FileOutputFormat.setOutputPath(job,output);

        // Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}

Case 7: using a join to compute each student's total score

package com.shujia.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.ArrayList;

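// Output: student info, total score

/**
 * map:
 *      1. Determine which file the record comes from
 *      2. Tag the record according to its source
 * reduce:
 *      3. Iterate over the values
 *      4. Check the tag to see what kind of record each value is
 *      5. Handle each kind of record accordingly
 */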
public class MR07Join {
    public static class JoinMapper extends Mapper<LongWritable,Text,Text,Text>{
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Get the input split from the context (some splits belong to the student file, others to the score file)
            InputSplit inputSplit = context.getInputSplit();
            FileSplit fileSplit=(FileSplit)inputSplit;
            // There are two source paths: students and scores
            String path = fileSplit.getPath().toString();
            // Match on the path, e.g. /student.txt /data/student.txt /data/students/student.txt
            if(path.contains("students")){
                String id = value.toString().split(",")[0];
                String stu="$"+value.toString(); // "$" tags a student record
                context.write(new Text(id),new Text(stu));
            }else {
                String id = value.toString().split(",")[0];
                String sco="#"+value.toString(); // "#" tags a score record
                context.write(new Text(id),new Text(sco));
            }
        }
    }
    // id:{$student info,#1001,0,98,#score,#score}
    public static class JoinReducer extends Reducer<Text,Text,Text,NullWritable>{
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
//            StringBuffer sb=new StringBuffer();
//            for (Text text : values) {
//                sb.append(text.toString());
//            }
//            context.write(new Text(sb.toString()),NullWritable.get());
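            String stu=null;
            ArrayList<Integer> scores = new ArrayList<Integer>();
            for (Text value : values) { // each value is either a score record or a student record
                String s = value.toString();
                if(s.startsWith("$")){ // student record
                    stu=s.substring(1);
                }else { // score record: the third field is the score
                    int score = Integer.parseInt(s.split(",")[2]);
                    scores.add(score);
                }
            }
            // Sum the scores first, then append the total to the student record
            long sum=0;
            for (Integer score : scores) {
                sum+=score;
            }
            if(stu==null){
                return; // scores without a matching student record: skip rather than write "null,<sum>"
            }
            context.write(new Text(stu+","+sum),NullWritable.get());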
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("join");
        job.setJarByClass(MR07Join.class);
        //map
        job.setMapperClass(JoinMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        //reduce
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        Path input1 = new Path("/students.txt");
        FileInputFormat.addInputPath(job,input1);
        Path input2 = new Path("/score.txt");
        FileInputFormat.addInputPath(job,input2);
        // The output path must not already exist,
        // so delete it with FileSystem if it does
        Path output = new Path("/output");
        FileSystem fs = FileSystem.get(new Configuration());
        if(fs.exists(output)){
            fs.delete(output,true);
        }
        FileOutputFormat.setOutputPath(job,output);

        // Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}
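The tagging idea behind this reduce-side join can be tried out without a cluster. The plain-Java sketch below groups tagged records by id in memory and then applies the same reduce logic. JoinSketch and the sample records are hypothetical; the "$" and "#" tags and the position of the score (third field) follow the job above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of a reduce-side join with tagged values.
class JoinSketch {

    static Map<String, String> join(List<String> students, List<String> scores) {
        // "Map" step: key every record by id and tag it with its source.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String s : students) {
            grouped.computeIfAbsent(s.split(",")[0], k -> new ArrayList<>()).add("$" + s);
        }
        for (String s : scores) {
            grouped.computeIfAbsent(s.split(",")[0], k -> new ArrayList<>()).add("#" + s);
        }
        // "Reduce" step: per id, pick out the student record and sum the scores.
        Map<String, String> joined = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            String stu = null;
            long sum = 0;
            for (String value : group.getValue()) {
                if (value.startsWith("$")) {
                    stu = value.substring(1);
                } else {
                    sum += Integer.parseInt(value.split(",")[2]);
                }
            }
            if (stu != null) { // ids that appear only in the score file are dropped
                joined.put(group.getKey(), stu + "," + sum);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> students = Arrays.asList("1001,Tom", "1002,Ann");
        List<String> scores = Arrays.asList("1001,0,98", "1001,1,90", "1002,0,70", "1003,0,50");
        System.out.println(join(students, scores)); // prints {1001=1001,Tom,188, 1002=1002,Ann,70}
    }
}
```

Dropping unmatched ids makes this behave like an inner join; keeping them with a placeholder would be the outer-join variant.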

Reposted from: https://www.cnblogs.com/lycc0210/p/15582930.html
