How to Install Apache Spark on your system

Last updated on May 30, 2022
Shakuntala Deskmukh


Apache Spark – Installation

Spark is a sub-project of Hadoop, so it is best installed on a Linux-based system. The following steps show how to install Apache Spark.

Step 1: Verifying Java Installation

Java is a prerequisite for installing Spark. Run the following command to check your Java version.

$ java -version

If Java is already installed on your system, you should see a response similar to the following:

java version "1.7.0_71"

Java(TM) SE Runtime Environment (build 1.7.0_71-b13)

Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If you do not have Java installed on your system, install it before proceeding to the next step.
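
For example, on a Debian- or Ubuntu-based system, one way to install Java is through the package manager. Treat this as an illustrative sketch; package names vary by distribution and by OpenJDK version:

$ sudo apt-get update
$ sudo apt-get install -y openjdk-8-jdk
$ java -version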

Step 2: Verifying Scala Installation

You need the Scala language to work with Spark, so verify your Scala installation using the following command.

$ scala -version

If Scala is already installed on your system, you should see a response similar to the following:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

If you do not have Scala installed on your system, proceed to the next step to install it.

Step 3: Downloading Scala

Download the latest version of Scala from the Scala download page. This tutorial uses scala-2.11.6. After downloading, you will find the Scala tar file in your download folder.
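
If you prefer the command line, you can fetch the archive directly with wget. The URL below follows the pattern used for archived Scala releases, but it is an assumption here; verify the exact link on the Scala download page:

$ wget https://downloads.lightbend.com/scala/2.11.6/scala-2.11.6.tgz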

Step 4: Installing Scala

Follow the steps given below to install Scala.

Extract the Scala tar file

Type the following command to extract the Scala tar file.

$ tar xvf scala-2.11.6.tgz

Move Scala software files

Use the following commands to move the Scala files to their target directory (/usr/local/scala).

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit

Set PATH for Scala

Use the following command to set the PATH for Scala.

$ export PATH=$PATH:/usr/local/scala/bin
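
Note that a PATH set this way only lasts for the current shell session. To make it persistent, you can append the same line to your ~/.bashrc and reload it (the tutorial uses the same approach for Spark in a later step):

$ echo 'export PATH=$PATH:/usr/local/scala/bin' >> ~/.bashrc
$ source ~/.bashrc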

Verifying Scala Installation

After installation, it is a good idea to verify it. Use the following command to verify the Scala installation.

$ scala -version

If Scala is installed correctly, you should see a response similar to the following:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Step 5: Downloading Apache Spark

Download the latest version of Spark from the Apache Spark download page. This tutorial uses spark-1.3.1-bin-hadoop2.6. After downloading it, you will find the Spark tar file in your download folder.
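
As with Scala, the archive can also be fetched from the command line. Older Spark releases are kept in the Apache archive; the URL below is illustrative and should be verified against the Spark downloads page:

$ wget https://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz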

Step 6: Installing Spark

Follow the steps given below to install Spark.

Extracting Spark tar

Use the following command to extract the Spark tar file.

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Moving Spark software files

Use the following commands to move the Spark files to their target directory (/usr/local/spark).

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit

Setting up the environment for Spark

Add the following line to your ~/.bashrc file. This adds the directory where the Spark binaries are located to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Use the following command to source the ~/.bashrc file.

$ source ~/.bashrc
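
Optionally, many setups also define SPARK_HOME so that other tools can locate the installation. This is a common convention rather than a requirement of this tutorial; if you use it, the two lines below in ~/.bashrc replace the single PATH line above:

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin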

Step 7: Verifying the Spark Installation

Run the following command to open the Spark shell.

$ spark-shell

If Spark is installed successfully, you will see output similar to the following.

Spark assembly has been built with Hive, including Datanucleus jars on classpath

Using Spark’s default log4j profile: org/apache/spark/log4j-defaults.properties

15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop

15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop

15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;

ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)

15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server

15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)

Type in expressions to have them evaluated.

Spark context available as sc

scala>
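
As a quick smoke test, you can evaluate a small computation in the shell. The snippet below is an illustrative example, not part of the original output; it uses only core RDD operations (parallelize, filter, count) that exist in all Spark 1.x releases:

scala> val data = sc.parallelize(1 to 100)
scala> data.filter(_ % 2 == 0).count()
res1: Long = 50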

So, this brings us to the end of the blog. This Tecklearn 'How to Install Apache Spark on your system' guide covers commonly asked questions if you are looking for a job in Apache Spark and Scala or as a Big Data Developer. If you wish to learn Apache Spark and Scala and build a career in the Big Data Hadoop domain, then check out our interactive Apache Spark and Scala Training, which comes with 24/7 support to guide you throughout your learning period. Please find the link for course details:

Apache Spark and Scala Certification

Apache Spark and Scala Training

About the Course

Tecklearn's Spark training lets you master real-time data processing using Spark Streaming, Spark SQL, Spark RDDs and the Spark machine learning libraries (Spark MLlib). This Spark certification training helps you master the essential skills of the Apache Spark open-source framework and the Scala programming language, including Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting. You will also understand the role of Spark in overcoming the limitations of MapReduce. Upon completion of this online training, you will have a solid understanding of, and hands-on experience with, Apache Spark.

Why Should you take Apache Spark and Scala Training?

• The average salary for an Apache Spark developer ranges from approximately $93,486 per year for a Developer to $128,313 per year for a Data Engineer. – Indeed.com

• Wells Fargo, Microsoft, Capital One, Apple, JPMorgan Chase and many other MNCs worldwide use Apache Spark across industries.

• Global Spark market revenue will grow to $4.2 billion by 2022 with a CAGR of 67%. – Marketanalysis.com

What you will Learn in this Course?

Introduction to Scala for Apache Spark

• What is Scala

• Why Scala for Spark

• Scala in other Frameworks

• Scala REPL

• Basic Scala Operations

• Variable Types in Scala

• Control Structures in Scala

• Loop, Functions and Procedures

• Collections in Scala

• Array Buffer, Map, Tuples, Lists

Functional Programming and OOPs Concepts in Scala

• Functional Programming

• Higher Order Functions

• Anonymous Functions

• Class in Scala

• Getters and Setters

• Custom Getters and Setters

• Constructors in Scala

• Singletons

• Extending a Class using Method Overriding

Introduction to Spark

• Introduction to Spark

• How Spark overcomes the drawbacks of MapReduce

• Concept of In Memory MapReduce

• Interactive operations on MapReduce

• Understanding Spark Stack

• HDFS Revision and Spark Hadoop YARN

• Overview of Spark and Why it is better than Hadoop

• Deployment of Spark without Hadoop

• Cloudera distribution and Spark history server

Basics of Spark

• Spark Installation guide

• Spark configuration and memory management

• Driver Memory Versus Executor Memory

• Working with Spark Shell

• Resilient distributed datasets (RDD)

• Functional programming in Spark and Understanding Architecture of Spark

Playing with Spark RDDs

• Challenges in Existing Computing Methods

• Probable Solution and How RDD Solves the Problem

• What is an RDD, its Operations, Transformations & Actions; Data Loading and Saving Through RDDs

• Key-Value Pair RDDs

• Other Pair RDDs and Two Pair RDDs

• RDD Lineage

• RDD Persistence

• Using RDD Concepts, Write a WordCount Program

• Concept of RDD Partitioning and How It Helps Achieve Parallelization

• Passing Functions to Spark

Writing and Deploying Spark Applications

• Creating a Spark application using Scala or Java

• Deploying a Spark application

• Scala built application

• Creating application using SBT

• Deploying application using Maven

• Web user interface of Spark application

• A real-world example of Spark and configuring of Spark

Parallel Processing

• Concept of Spark parallel processing

• Overview of Spark partitions

• File Based partitioning of RDDs

• Concept of HDFS and data locality

• Technique of parallel operations

• Comparing coalesce and Repartition and RDD actions

Machine Learning using Spark MLlib

• Why Machine Learning

• What is Machine Learning

• Applications of Machine Learning

• Face Detection: USE CASE

• Machine Learning Techniques

• Introduction to MLlib

• Features of MLlib and MLlib Tools

• Various ML algorithms supported by MLlib

Integrating Apache Flume and Apache Kafka

• Why Kafka, what is Kafka and Kafka architecture

• Kafka workflow and Configuring Kafka cluster

• Basic operations and Kafka monitoring tools

• Integrating Apache Flume and Apache Kafka

Apache Spark Streaming

• Why Streaming is Necessary

• What is Spark Streaming

• Spark Streaming Features

• Spark Streaming Workflow

• Streaming Context and DStreams

• Transformations on DStreams

• Windowed Operators and Why They Are Useful

• Important Windowed Operators

• Slice, Window and ReduceByWindow Operators

• Stateful Operators

Improving Spark Performance

• Learning about accumulators

• The common performance issues and troubleshooting the performance problems

DataFrames and Spark SQL

• Need for Spark SQL

• What is Spark SQL

• Spark SQL Architecture

• SQL Context in Spark SQL

• User Defined Functions

• Data Frames and Datasets

• Interoperating with RDDs

• JSON and Parquet File Formats

• Loading Data through Different Sources

Scheduling and Partitioning in Apache Spark

• Concept of Scheduling and Partitioning in Spark

• Hash partition and range partition

• Scheduling applications

• Static partitioning and dynamic sharing

• Concept of Fair scheduling

• Map partition with index and Zip

• High Availability

• Single-node Recovery with Local File System and High Order Functions

Got a question for us? Please mention it in the comments section and we will get back to you.
