一、开发环境中需要安装和配置如下
安装JDK,配置JDK环境变量(jdk1.8)
安装Scala,配置JDK环境变量(scala2.11.8)
最好安装一个Maven,虽然Idea已经集成自带的有Maven
测试环境中已经安装有Zookeeper集群,Kafka需要用到(3.4.5)
测试环境中已经安装有Kafka集群(1.1.0)
测试环境中已经安装有Spark集群(2.1.2)

二、创建Spark项目
1. 打开Idea
2.配置Maven
如果是Window系统 依次打开 File –> Settings –> Build,Execution,Deployment –> Build Tools –> Maven ;如果是 Mac 系统 IntelliJ IDEA –> Preferences –> Build,Execution,Deployment –> Build Tools

选定Mavne的安装目录(也可以用IDEA自带的,那就不用修改此路径)

勾选Override,后修改Maven本地仓库所在的位置

 

修改Maven设置文件,添加本地仓库位置的设置,同时添加国内镜像,Maven设置文件setting.xml默认会加载主目录下~/.m2/setting.xml的文件,所以以防设置没有生效而下载的依赖包又保存到C盘了,最好此目录下也放置一份,在<settings>标签里添加<localRepository>Maven_repo路径</localRepository>,然后再在<settings>标签内的<mirrors>中添加阿里的国内镜像

<!– 配置阿里云的镜像仓库 –>
<mirror>
<id>alimaven</id>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<mirrorOf>central</mirrorOf>
</mirror>

3.安装Scala插件
我们做Spark基本以Scala开发为主,因此先在Idea中安装一下Scala插件

依次打开 File –> Settings –> Plugins –> Browse Repositories… 搜索Scala,安装即可

 

4.创建一个Maven项目
一次点击 File –> New –> Project…

左侧选中Maven,右侧选择我们项目的JDK版本,以及把Create from archetype勾上,

在下面找到 org.apache.camel.archetypes:camel-archetype-scala,点击Next

填写GroupID(这个一般是公司域名反写)和ArtifactId(项目名或者模块名)

 

5.修改项目Scala环境
快捷键 Ctrl + Shift + Alt + S

选中左侧的Global Libraries 点击 + 将我们本地的Scala加载到项目环境中,也可以点击工具中的Download… 按钮选择对应的版本下载

 

6.修改pom文件
pom文件中添加如下依赖和配置

<?xml version=”1.0″ encoding=”UTF-8″?>
<project xmlns=”http://maven.apache.org/POM/4.0.0″ xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance”
xsi:schemaLocation=”http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd”>

<modelVersion>4.0.0</modelVersion>

<groupId>com.yore.spark</groupId>
<artifactId>spark-demo</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>

<name>spark-demo with integration Kafka</name>

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<scala.binary.version>2.11</scala.binary.version>
<scala.version>2.11.8</scala.version>
<spark.version>2.1.2</spark.version>
<!–<spark.version>2.2.1</spark.version>–>
<kafka.version>1.1.0</kafka.version>
<commons-lang.version>2.6</commons-lang.version>
</properties>

<repositories>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>

<dependencies>
<!– scala –>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang.modules</groupId>
<artifactId>scala-xml_${scala.binary.version}</artifactId>
<version>1.0.6</version>
</dependency>

<!– spark –>
<!– spark-core –>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!– spark-streaming –>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!– spark-sql –>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!– spark-streaming-kafka-0-10 –>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!– spark-streaming-flume –>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!–Spark中的RPC是使用Akka实现的, –>
<dependency>
<groupId>com.typesafe.akka</groupId>
<artifactId>akka-actor_2.11</artifactId>
<version>2.5.4</version>
</dependency>

<!– fastjson –>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.47</version>
</dependency>

<!– commons-lang –>
<dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>${commons-lang.version}</version>
</dependency>

<!– logging –>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>2.11.0</version>
<!–<scope>runtime</scope>–>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.11.0</version>
<!–<scope>runtime</scope>–>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>2.11.0</version>
<!–<scope>runtime</scope>–>
</dependency>

<!– specs –>
<dependency>
<groupId>org.specs</groupId>
<artifactId>specs</artifactId>
<version>1.4.3</version>
<scope>test</scope>
</dependency>

<!– testing –>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.4</version>
<scope>test</scope>
</dependency>
</dependencies>

<build>
<!–<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>–>
<plugins>
<!– the Maven compiler plugin will compile Java source files –>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.7.0</version>
<configuration>
<!–<source>1.8</source>–>
<!–<target>1.8</target>–>
<encoding>${project.build.sourceEncoding}</encoding>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-resources-plugin</artifactId>
<version>3.0.2</version>
<configuration>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.5.3</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>com.yore.spark.kafka.SparkKafkaDemo</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>

<!– the Maven Scala plugin will compile Scala source files –>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.3.2</version>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>

<!–surefire plugin,avoid messy code when mvn test console –>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.21.0</version>
<configuration>
<skipTests>true</skipTests>
<forkMode>once</forkMode>
<argLine>-Dfile.encoding=UTF-8</argLine>
</configuration>
</plugin>
</plugins>
<resources>
<resource>
<directory>${basedir}/src/main/scala</directory>
<includes>
<include>**/*.properties</include>
<include>**/*.xml</include>
</includes>
</resource>
<resource>
<directory>${basedir}/src/main/resources</directory>
</resource>
</resources>
</build>
<reporting>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>

</project>
7.增加项目配置文件
在Spark 的Maven项目里的resources目录下新建一个my.properties文件,配置如下内容

# kafka configs
kafka.bootstrap.servers=cdh6:6667,cdh5:6667,cdh4:6667
kafka.topic.source=spark-kafka-demo
kafka.topic.sink=spark-sink-test
kafka.group.id=spark_demo_gid1
8.新建包
在scala下创建com.yore.spark包

在上面包下分别新建kafka包和utils包

在utils包下新建Scala Class ,Name: PropertiesUtil,Kind:Object

package com.yore.spark.utils

import java.util.Properties

/**
* Properties的工具类
*
* Created by yore on 2017-11-9 14:05
*/
object PropertiesUtil {

/**
*
* 获取配置文件Properties对象
*
* @author yore
* @return java.util.Properties
*/
def getProperties() :Properties = {
val properties = new Properties()
//读取源码中resource文件夹下的my.properties配置文件
val reader = getClass.getResourceAsStream(“/my.properties”)
properties.load(reader)
properties
}

/**
*
* 获取配置文件中key对应的字符串值
*
* @author yore
* @return java.util.Properties
*/
def getPropString(key : String) : String = {
getProperties().getProperty(key)
}

/**
*
* 获取配置文件中key对应的整数值
*
* @author yore
* @return java.util.Properties
*/
def getPropInt(key : String) : Int = {
getProperties().getProperty(key).toInt
}

/**
*
* 获取配置文件中key对应的布尔值
*
* @author yore
* @return java.util.Properties
*/
def getPropBoolean(key : String) : Boolean = {
getProperties().getProperty(key).toBoolean
}

}
在kafka包下新建Scala Class ,Name: KafkaSink,Kind:Object和Name: SparkKafkaDemo,Kind:Object

KafkaSink.scala

package com.yore.spark.kafka

import java.util.concurrent.Future

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}

/**
* 手动实现一个KafkaSink类,将数据发送到Kafka<br/>
* 在构造时传入高阶函数,获得一个生产者<br/>
*
* 同时创建一个对应的伴生对象,定义apply方法,这样使用时不用new
*
* This is the key idea that allows us to work around running into NotSerializableExceptions.
*
* Created by yore on 2017-12-14 9:40
*/
class KafkaSink[K,V](createProducer: () => KafkaProducer[K, V]) extends Serializable {
lazy val producer = createProducer()

/** 发送消息 */
def send(topic : String, key : K, value : V) : Future[RecordMetadata] =
producer.send(new ProducerRecord[K,V](topic,key,value))
def send(topic : String, value : V) : Future[RecordMetadata] =
producer.send(new ProducerRecord[K,V](topic,value))
}

object KafkaSink {
import scala.collection.JavaConversions._
def apply[K, V](config: Map[String, Object]): KafkaSink[K, V] = {
val createProducerFunc = () => {
val producer = new KafkaProducer[K, V](config)
sys.addShutdownHook {
// Ensure that, on executor JVM shutdown, the Kafka producer sends
// any buffered messages to Kafka before shutting down.
producer.close()
}
producer
}
new KafkaSink(createProducerFunc)
}
def apply[K, V](config: java.util.Properties): KafkaSink[K, V] = apply(config.toMap)
}

SparkKafkaDemo.scala

package com.sinosig.spark.kafka

import java.util.Properties

import com.alibaba.fastjson.{JSON, JSONObject}
import com.sinosig.spark.utils.PropertiesUtil
import org.apache.commons.lang3.StringUtils
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

/**
* SparkStreaming
* kafka –> Spark –> Kafka
*
* Created by yore on 2018-06-29 9:44
*/
object SparkKafkaDemo extends App{
// default a Logger Object
val LOG = org.slf4j.LoggerFactory.getLogger(SparkKafkaDemo.getClass)

/*if (args.length < 2) {
System.err.println(s”””
|Usage: DirectKafkaWordCount <brokers> <topics>
| <brokers> is a list of one or more Kafka brokers
| <topics> is a list of one or more kafka topics to consume from
|
“””.stripMargin)
System.exit(1)
}*/
// 设置日志级别
Logger.getLogger(“org.apache.spark”).setLevel(Level.WARN)
Logger.getLogger(“org.apache.spark.sql”).setLevel(Level.WARN)

val Array(brokers, topics , outTopic) = /*args*/ Array(
PropertiesUtil.getPropString(“kafka.bootstrap.servers”),
PropertiesUtil.getPropString(“kafka.topic.source”) ,
PropertiesUtil.getPropString(“kafka.topic.sink”)
)

// Create context
/* 第一种方式 */
val sparkConf = new SparkConf().setMaster(“local[2]”)
sparkConf.setAppName(“spark-kafka-demo1”)
val ssc = new StreamingContext(sparkConf,Milliseconds(1000))

/* 第二种方式 */
/*val spark = SparkSession.builder()
.appName(“spark-kafka-demo1”)
.master(“local[2]”)
.getOrCreate()
// 引入隐式转换方法,允许ScalaObject隐式转换为DataFrame
import spark.implicits._
val ssc = new StreamingContext(spark.sparkContext,Seconds(1))*/

// 设置检查点
ssc.checkpoint(“spark_demo_cp1”)

// Create direct Kafka Stream with Brokers and Topics
// 注意:这个Topic最好是Array形式的,set形式的匹配不上
//var topicSet = topics.split(“,”)/*.toSet*/
val topicsArr = topics.split(“,”)

// set Kafka Properties
val kafkaParams = Map[String,Object](
“bootstrap.servers” -> brokers,
“key.deserializer” -> classOf[StringDeserializer],
“value.deserializer” -> classOf[StringDeserializer],
“group.id” -> PropertiesUtil.getPropString(“kafka.group.id”),
“auto.offset.reset” -> “latest”,
“enable.auto.commit” -> (false: java.lang.Boolean)
)

/**
* createStream是Spark和Kafka集成包0.8版本中的方法,它是将offset交给ZK来维护的
*
* 在0.10的集成包中使用的是createDirectStream,它是自己来维护offset,
* 速度上要比交给ZK维护要快很多,但是无法进行offset的监控。
* 这个方法只有3个参数,使用起来最为方便,但是每次启动的时候默认从Latest offset开始读取,
* 或者设置参数auto.offset.reset=”smallest”后将会从Earliest offset开始读取。
*
* 官方文档@see <a href=”http://spark.apache.org/docs/2.1.2/streaming-kafka-0-10-integration.html”>Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)</a>
*
*/
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topicsArr, kafkaParams)
)

/** Kafak sink */
// 广播KafkaSink
val kafkaProducer: Broadcast[KafkaSink[String, String]] = {
val kafkaProducerConfig = {
val p = new Properties()
p.setProperty(“bootstrap.servers”, brokers)
p.setProperty(“key.serializer”, classOf[StringSerializer].getName)
p.setProperty(“value.serializer”, classOf[StringSerializer].getName)
p
}
LOG.info(“kafka producer init done!”)
ssc.sparkContext.broadcast(KafkaSink[String, String](kafkaProducerConfig))
}

var jsonObject = new JSONObject()
stream.filter(record =>{
// 过滤掉不符合要求的数据
try {
// println(“$$$\t” + record.key + “\t” + record.value)
jsonObject = JSON.parseObject(record.value)
}catch {
case e : Exception =>{
LOG.error(“转换为JSON时发生了异常!\t{}”,e.getMessage)
}
}
// 如果不为空字符时,为null,返回false过滤,否则为true通过
StringUtils.isNotEmpty(record.value) && null != jsonObject
}).map(record =>{
/** 这个地方可以写自己的业务逻辑代码,因为本次是测试,简单返回一个元组 */
jsonObject = JSON.parseObject(record.value)
// 返出一个元组,(时间戳,json的数据日期,json的关系人姓名)
(System.currentTimeMillis(),
jsonObject.getString(“date_dt”),
jsonObject.getString(“relater_name”)
)
}).foreachRDD(rdd =>{
if(!rdd.isEmpty()){
rdd.foreach(kafkaTuple =>{
//向Kafka发送数据
kafkaProducer.value.send(
outTopic,
kafkaTuple._1 + “\t”+ kafkaTuple._2 + “\t” + kafkaTuple._3
)
//打印到控制台
println(kafkaTuple._1 + “\t”+ kafkaTuple._2 + “\t” + kafkaTuple._3)
})
}
})

// 启动
ssc.start()
//等待关闭
ssc.awaitTermination()

}
三、启动测试
(1)远程访问Kafka节点,创建两个Topic

[root@cdh6 kafka]# ./bin/sh kafka-topics.sh –create –zookeeper cdh2:2181,cdh3:2181,cdh4:2181,cdh5:2181,cdh6:2181 –partitions 3 –replication-factor 1 –topic spark-kafka-demo
[root@cdh6 kafka]# ./bin/sh kafka-topics.sh –create –zookeeper cdh2:2181,cdh3:2181,cdh4:2181,cdh5:2181,cdh6:2181 –partitions 1 –replication-factor 1 –topic spark-sink-test
(2)运行SparkKafkaDemo

 

(3)在Kafka节点上,以控制台形式启动一个生产者

[root@cdh6 kafka]# ./bin/kafka-console-producer.sh –broker-list cdh6:6667,cdh5:6667,cdh4:6667 –topic spark-kafka-demo

(4)复制一条Json数据到生产者的控制台,格式如

{“date_dt”: “1516095986393”,”relater_name”: “yore”}
回车发送

 

(5)查看Idea控制台运行情况,是否报错,打印的信息

 

(6)在Kafka节点上,以控制台形式启动一个消费者,看结果是否已经推进来

[root@cdh6 kafka]# ./bin/ kafka-console-consumer.sh –bootstrap-server cdh6:6667,cdh5:6667,cdh4:6667 –from-beginning –topic spark-sink-test

 

四、打包
使用IDEA中的Maven项目的打包工具直接打包

在项目中右侧找到Maven Projects –> 项目名字(spark-demo with integration Kafka)–> package 点击开始打包。如果右侧找不到Maven Projects 可以点击一下Idea工具左下角的图标。
打包完成会在项目的根目录下target/目录下生成两个jar包
spark-demo-1.0-SNAPSHOT.jar
spark-demo-1.0-SNAPSHOT-jar-with-dependencies.jar
分别是不带依赖的项目的源码编译的jar包,带所有依赖的jar

 

五、提交到集群
到target/将带依赖的jar包上传的Spark集群以standalone client 方式提交到集群运运行

[root@cdh4 spark2.1]# ./bin/spark-submit –master spark://cdh4:7077 –deploy-mode client \

–executor-memory 2g –total-executor-cores 2 –driver-memory 2G \

–class com.yore.spark.kafka.SparkKafkaDemo ~/yore/spark-demo-1.0-SNAPSHOT-jar-with-dependencies.jar
在source的Topic中推送一条数据,用client模式可以在控制台查看打印的运行结果,同时查看落数据Topic中是否有数据

————————————————
版权声明:本文为CSDN博主「YoreYuan」的原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接及本声明。
原文链接:https://blog.csdn.net/github_39577257/article/details/81151137