How to Deploy Spark Application on Cluster

Last updated on May 30 2022
Shakuntala Deskmukh


Apache Spark – Deployment

spark-submit is a shell command used to deploy a Spark application on a cluster. It works with all of the supported cluster managers through a consistent interface, so you do not need to configure your application separately for each one.
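For example, the same packaged application can be submitted to different cluster managers simply by changing the --master value. The commands below are a sketch: the placeholders and the host name are illustrative, and a YARN submission additionally needs the Hadoop configuration to be available on the client.

$ spark-submit --class <main-class> --master local[2] <application-jar>

$ spark-submit --class <main-class> --master spark://master-host:7077 <application-jar>

$ spark-submit --class <main-class> --master yarn <application-jar>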

Example

Let us take the same word count example that we ran earlier with shell commands. Here, we consider the same example as a Spark application.

Sample Input

The following text is the input data, and the file is named in.txt.

people are not as beautiful as they look,

as they walk or as they talk.

they are only as beautiful as they love,

as they care as they share.

Look at the following program −

SparkWordCount.scala

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object SparkWordCount {
   def main(args: Array[String]) {

      /* local = master URL; Word Count = application name;              */
      /* /usr/local/spark = Spark home; Nil = jars; Map() = environment  */
      /* Map() = variables passed to worker nodes                        */
      val sc = new SparkContext("local", "Word Count", "/usr/local/spark", Nil, Map(), Map())

      /* Create an input RDD by reading the text file (in.txt) through the Spark context */
      val input = sc.textFile("in.txt")

      /* Transform the input RDD into the count RDD */
      val count = input.flatMap(line => line.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)

      /* saveAsTextFile is an action that writes the RDD to disk */
      count.saveAsTextFile("outfile")

      println("OK")
   }
}

Save the above program into a file named SparkWordCount.scala and place it in a user-defined directory named spark-application.
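At this point the spark-application directory might look like the listing below. Placing in.txt next to the source file is an assumption, but it keeps the relative path in sc.textFile("in.txt") working when you run the application from this directory.

spark-application/
   SparkWordCount.scala
   in.txt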

Note − While transforming the input RDD into the count RDD, we use flatMap() to tokenize the lines (from the text file) into words, map() to pair each word with a count of 1, and reduceByKey() to sum the counts of each word.
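If you want to see what each transformation produces, you can run the same pipeline step by step in the Spark shell. The results shown in the comments assume the sample in.txt above and may appear in a different order on your machine.

scala> val input = sc.textFile("in.txt")
scala> input.flatMap(line => line.split(" ")).take(3)
// e.g. Array(people, are, not)
scala> input.flatMap(line => line.split(" ")).map(word => (word, 1)).take(3)
// e.g. Array((people,1), (are,1), (not,1))
scala> input.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()
// e.g. Array((as,8), (they,7), (are,2), ...)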

Use the following steps to submit this application. Execute all steps in the spark-application directory from the terminal.

Step 1: Download Spark Jar

The Spark core jar is required for compilation. Download spark-core_2.10-1.3.0.jar and move the jar file from the download directory to the spark-application directory.
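If you prefer fetching the jar from the command line, the artifact is normally published to Maven Central under the standard repository layout. The URL below follows that layout and is an assumption; fall back to a manual download if it is unreachable.

$ cd spark-application

$ wget https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.3.0/spark-core_2.10-1.3.0.jar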

Step 2: Compile program

Compile the above program using the command given below. This command should be executed from the spark-application directory. Here, /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar is a Hadoop support jar taken from the Spark library.

$ scalac -classpath "spark-core_2.10-1.3.0.jar:/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar" SparkWordCount.scala
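If the compilation succeeds, scalac writes the compiled classes into the current directory; you can confirm this before packaging (Scala 2.10 also emits additional anonymous-function classes alongside the two main ones):

$ ls SparkWordCount*.class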

Step 3: Create a JAR

Create a jar file of the Spark application using the following command. Here, wordcount is the name of the jar file.

jar -cvf wordcount.jar SparkWordCount*.class spark-core_2.10-1.3.0.jar /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar
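Before submitting, you can list the contents of the new jar to make sure the compiled classes were actually packaged:

$ jar -tvf wordcount.jar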

Step 4: Submit spark application

Submit the Spark application using the following command −

spark-submit --class SparkWordCount --master local wordcount.jar
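Spark's INFO log lines normally go to stderr while the program's own println output goes to stdout, so if you want to keep the full output shown below in a file (the file name here is arbitrary), you can redirect both streams:

$ spark-submit --class SparkWordCount --master local wordcount.jar > spark-output.log 2>&1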

If it is executed successfully, you will find the output given below. The OK in the output is for user identification; it is printed by the last line of the program. If you read the output carefully, you will find different things, such as −

  • successfully started service 'sparkDriver' on port 42954
  • MemoryStore started with capacity 267.3 MB
  • Started SparkUI at http://192.168.1.217:4040
  • Added JAR file:/home/hadoop/piapplication/count.jar
  • ResultStage 1 (saveAsTextFile at SparkPi.scala:11) finished in 0.566 s
  • Stopped Spark web UI at http://192.168.1.217:4040
  • MemoryStore cleared

15/07/08 13:56:04 INFO Slf4jLogger: Slf4jLogger started

15/07/08 13:56:04 INFO Utils: Successfully started service ‘sparkDriver’ on port 42954.

15/07/08 13:56:04 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.217:42954]

15/07/08 13:56:04 INFO MemoryStore: MemoryStore started with capacity 267.3 MB

15/07/08 13:56:05 INFO HttpServer: Starting HTTP Server

15/07/08 13:56:05 INFO Utils: Successfully started service ‘HTTP file server’ on port 56707.

15/07/08 13:56:06 INFO SparkUI: Started SparkUI at http://192.168.1.217:4040

15/07/08 13:56:07 INFO SparkContext: Added JAR file:/home/hadoop/piapplication/count.jar at http://192.168.1.217:56707/jars/count.jar with timestamp 1436343967029

15/07/08 13:56:11 INFO Executor: Adding file:/tmp/spark-45a07b83-42ed-42b3b2c2-823d8d99c5af/userFiles-df4f4c20-a368-4cdd-a2a7-39ed45eb30cf/count.jar to class loader

15/07/08 13:56:11 INFO HadoopRDD: Input split: file:/home/hadoop/piapplication/in.txt:0+54

15/07/08 13:56:12 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2001 bytes result sent to driver

15/07/08 13:56:12 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at SparkPi.scala:11), which is now runnable

15/07/08 13:56:12 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at SparkPi.scala:11)

15/07/08 13:56:13 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at SparkPi.scala:11) finished in 0.566 s

15/07/08 13:56:13 INFO DAGScheduler: Job 0 finished: saveAsTextFile at SparkPi.scala:11, took 2.892996 s

OK

15/07/08 13:56:13 INFO SparkContext: Invoking stop() from shutdown hook

15/07/08 13:56:13 INFO SparkUI: Stopped Spark web UI at http://192.168.1.217:4040

15/07/08 13:56:13 INFO DAGScheduler: Stopping DAGScheduler

15/07/08 13:56:14 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

15/07/08 13:56:14 INFO Utils: path = /tmp/spark-45a07b83-42ed-42b3-b2c2823d8d99c5af/blockmgr-ccdda9e3-24f6-491b-b509-3d15a9e05818, already present as root for deletion.

15/07/08 13:56:14 INFO MemoryStore: MemoryStore cleared

15/07/08 13:56:14 INFO BlockManager: BlockManager stopped

15/07/08 13:56:14 INFO BlockManagerMaster: BlockManagerMaster stopped

15/07/08 13:56:14 INFO SparkContext: Successfully stopped SparkContext

15/07/08 13:56:14 INFO Utils: Shutdown hook called

15/07/08 13:56:14 INFO Utils: Deleting directory /tmp/spark-45a07b83-42ed-42b3b2c2-823d8d99c5af

15/07/08 13:56:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!

Step 5: Checking output

After successful execution of the program, you will find the directory named outfile in the spark-application directory.

The following commands are used to open and check the list of files in the outfile directory.

$ cd outfile

$ ls

part-00000 part-00001 _SUCCESS

The command for checking the output in the part-00000 file is −

$ cat part-00000

(people,1)

(are,2)

(not,1)

(as,8)

(beautiful,2)

(they,7)

(look,1)

The command for checking the output in the part-00001 file is −

$ cat part-00001

(walk,1)

(or,1)

(talk,1)

(only,1)

(love,1)

(care,1)

(share,1)
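Because the counts are spread across the partition files, you can also print them all at once from the spark-application directory:

$ cat outfile/part-*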

Go through the following section to learn more about the 'spark-submit' command.

Spark-submit Syntax

spark-submit [options] <app jar | python file> [app arguments]

Options

The table given below describes the available options; a worked example follows the table.

S.No Option Description
1  --master  spark://host:port, mesos://host:port, yarn, or local.
2  --deploy-mode  Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client).
3  --class  Your application's main class (for Java / Scala apps).
4  --name  A name of your application.
5  --jars  Comma-separated list of local jars to include on the driver and executor classpaths.
6  --packages  Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths.
7  --repositories  Comma-separated list of additional remote repositories to search for the Maven coordinates given with --packages.
8  --py-files  Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
9  --files  Comma-separated list of files to be placed in the working directory of each executor.
10  --conf (prop=val)  Arbitrary Spark configuration property.
11  --properties-file  Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.
12  --driver-memory  Memory for driver (e.g. 1000M, 2G) (Default: 512M).
13  --driver-java-options  Extra Java options to pass to the driver.
14  --driver-library-path  Extra library path entries to pass to the driver.
15  --driver-class-path  Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath.
16  --executor-memory  Memory per executor (e.g. 1000M, 2G) (Default: 1G).
17  --proxy-user  User to impersonate when submitting the application.
18  --help, -h  Show this help message and exit.
19  --verbose, -v  Print additional debug output.
20  --version  Print the version of the current Spark.
21  --driver-cores NUM  Cores for the driver (Default: 1).
22  --supervise  If given, restarts the driver on failure.
23  --kill  If given, kills the driver specified.
24  --status  If given, requests the status of the driver specified.
25  --total-executor-cores  Total cores for all executors.
26  --executor-cores  Number of cores per executor (Default: 1 in YARN mode, or all available cores on the worker in standalone mode).
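Putting several of these options together, a typical submission to a standalone cluster might look like the example below. The host name, memory, and core values are illustrative and should be adapted to your cluster.

spark-submit \
  --class SparkWordCount \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --executor-memory 2G \
  --total-executor-cores 4 \
  --name "Word Count" \
  wordcount.jar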

 

So, this brings us to the end of this blog. This Tecklearn 'How to Deploy Spark Application on Cluster' blog helps you with commonly asked questions if you are looking for a job in Apache Spark and Scala or as a Big Data Developer. If you wish to learn Apache Spark and Scala and build a career in the Big Data Hadoop domain, then check out our interactive Apache Spark and Scala Training, which comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/apache-spark-and-scala-certification/

Apache Spark and Scala Training

About the Course

Tecklearn Spark training lets you master real-time data processing using Spark streaming, Spark SQL, Spark RDD and Spark Machine Learning libraries (Spark MLlib). This Spark certification training helps you master the essential skills of the Apache Spark open-source framework and Scala programming language, including Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Shell Scripting Spark. You will also understand the role of Spark in overcoming the limitations of MapReduce. Upon completion of this online training, you will hold a solid understanding and hands-on experience with Apache Spark.

Why Should you take Apache Spark and Scala Training?

  • The average salary for an Apache Spark developer ranges from approximately $93,486 per year for a Developer to $128,313 per year for a Data Engineer. – Indeed.com
  • Wells Fargo, Microsoft, Capital One, Apple, JPMorgan Chase & many other MNCs worldwide use Apache Spark across industries.
  • Global Spark market revenue will grow to $4.2 billion by 2022 with a CAGR of 67%. – Marketanalysis.com

What you will Learn in this Course?

Introduction to Scala for Apache Spark

  • What is Scala
  • Why Scala for Spark
  • Scala in other Frameworks
  • Scala REPL
  • Basic Scala Operations
  • Variable Types in Scala
  • Control Structures in Scala
  • Loop, Functions and Procedures
  • Collections in Scala
  • Array Buffer, Map, Tuples, Lists

Functional Programming and OOPs Concepts in Scala

  • Functional Programming
  • Higher Order Functions
  • Anonymous Functions
  • Class in Scala
  • Getters and Setters
  • Custom Getters and Setters
  • Constructors in Scala
  • Singletons
  • Extending a Class using Method Overriding

Introduction to Spark

  • Introduction to Spark
  • How Spark overcomes the drawbacks of MapReduce
  • Concept of In Memory MapReduce
  • Interactive operations on MapReduce
  • Understanding Spark Stack
  • HDFS Revision and Spark Hadoop YARN
  • Overview of Spark and Why it is better than Hadoop
  • Deployment of Spark without Hadoop
  • Cloudera distribution and Spark history server

Basics of Spark

  • Spark Installation guide
  • Spark configuration and memory management
  • Driver Memory Versus Executor Memory
  • Working with Spark Shell
  • Resilient distributed datasets (RDD)
  • Functional programming in Spark and Understanding Architecture of Spark

Playing with Spark RDDs

  • Challenges in Existing Computing Methods
  • Probable Solution and How RDD Solves the Problem
  • What is RDD, its Operations, Transformations & Actions
  • Data Loading and Saving Through RDDs
  • Key-Value Pair RDDs
  • Other Pair RDDs and Two Pair RDDs
  • RDD Lineage
  • RDD Persistence
  • Using RDD Concepts Write a Wordcount Program
  • Concept of RDD Partitioning and How It Helps Achieve Parallelization
  • Passing Functions to Spark

Writing and Deploying Spark Applications

  • Creating a Spark application using Scala or Java
  • Deploying a Spark application
  • Scala built application
  • Creating application using SBT
  • Deploying application using Maven
  • Web user interface of Spark application
  • A real-world example of Spark and configuring of Spark

Parallel Processing

  • Concept of Spark parallel processing
  • Overview of Spark partitions
  • File Based partitioning of RDDs
  • Concept of HDFS and data locality
  • Technique of parallel operations
  • Comparing coalesce and Repartition and RDD actions

Machine Learning using Spark MLlib

  • Why Machine Learning
  • What is Machine Learning
  • Applications of Machine Learning
  • Face Detection: USE CASE
  • Machine Learning Techniques
  • Introduction to MLlib
  • Features of MLlib and MLlib Tools
  • Various ML algorithms supported by MLlib

Integrating Apache Flume and Apache Kafka

  • Why Kafka, what is Kafka and Kafka architecture
  • Kafka workflow and Configuring Kafka cluster
  • Basic operations and Kafka monitoring tools
  • Integrating Apache Flume and Apache Kafka

Apache Spark Streaming

  • Why Streaming is Necessary
  • What is Spark Streaming
  • Spark Streaming Features
  • Spark Streaming Workflow
  • Streaming Context and DStreams
  • Transformations on DStreams
  • Describe Windowed Operators and Why it is Useful
  • Important Windowed Operators
  • Slice, Window and ReduceByWindow Operators
  • Stateful Operators

Improving Spark Performance

  • Learning about accumulators
  • The common performance issues and troubleshooting the performance problems

DataFrames and Spark SQL

  • Need for Spark SQL
  • What is Spark SQL
  • Spark SQL Architecture
  • SQL Context in Spark SQL
  • User Defined Functions
  • Data Frames and Datasets
  • Interoperating with RDDs
  • JSON and Parquet File Formats
  • Loading Data through Different Sources

Scheduling and Partitioning in Apache Spark

  • Concept of Scheduling and Partitioning in Spark
  • Hash partition and range partition
  • Scheduling applications
  • Static partitioning and dynamic sharing
  • Concept of Fair scheduling
  • Map partition with index and Zip
  • High Availability
  • Single-node Recovery with Local File System and High Order Functions

Got a question for us? Please mention it in the comments section and we will get back to you.

 

 
