Deep dive into Advanced programming in Spark

Last updated on May 30 2022
Shakuntala Deskmukh

Advanced Spark Programming

Spark provides two types of shared variables: broadcast variables and accumulators.

  • Broadcast variables − used to efficiently distribute large values.
  • Accumulators − used to aggregate information from a particular collection.

Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage.

The data broadcast this way is cached in serialized form and is deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code given below shows this −

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

Output

broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster, so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast, in order to ensure that all nodes get the same value of the broadcast variable.
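As a minimal sketch of the pattern described above, the snippet below broadcasts a small lookup table and reads it inside a map through the value method. The table contents and the names countryNames, countryNamesBc and codes are hypothetical, chosen only for illustration, and the local session is an assumption so the snippet can run on its own; in spark-shell, getOrCreate() should simply reuse the existing session.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical lookup table: country code -> country name.
val countryNames = Map("IN" -> "India", "US" -> "United States", "DE" -> "Germany")

// Wrap the table in a broadcast variable so it is sent to each executor once.
val countryNamesBc = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("IN", "DE", "US", "FR"))

// Tasks read the table through .value instead of capturing countryNames directly;
// codes missing from the table fall back to "unknown".
val resolved = codes
  .map(code => countryNamesBc.value.getOrElse(code, "unknown"))
  .collect()

resolved.foreach(println)
```

Referring to countryNamesBc.value inside the closure, rather than to countryNames itself, is what keeps the table from being serialized along with every task.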

Accumulators

Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark’s UI. This can be useful for understanding the progress of running stages (NOTE − this is not yet supported in Python).

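An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator’s value, using its value method. Note that SparkContext.accumulator belongs to the Spark 1.x API; it has been deprecated since Spark 2.0 in favor of methods such as SparkContext.longAccumulator.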

The code given below shows an accumulator being used to add up the elements of an array −

scala> val accum = sc.accumulator(0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

If you want to see the output of the above code, use the following command −

scala> accum.value

Output

res2: Int = 10
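The same idea can be sketched with the accumulator API available from Spark 2.0 onward, where SparkContext.longAccumulator creates a named counter. This is only an illustration: the accumulator name blankLines and the sample data are hypothetical, and the local session is an assumption so the snippet is self-contained.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("accumulator-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Named accumulator; the name (hypothetical here) is what shows up in the Spark UI.
val blankLines = sc.longAccumulator("blankLines")

// Hypothetical input: five lines, three of them empty.
val lines = sc.parallelize(Seq("spark", "", "scala", "", ""))

// Tasks only add to the accumulator; they never read it.
// The update runs inside an action (foreach), not a lazy transformation.
lines.foreach(line => if (line.isEmpty) blankLines.add(1))

// Only the driver reads the aggregated value.
println(blankLines.value)
```

Updating the accumulator inside an action such as foreach, rather than inside a transformation such as map, avoids double counting when a task or stage is re-executed.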

Numeric RDD Operations

Spark allows you to perform different operations on numeric data, using one of the predefined API methods. Spark’s numeric operations are implemented with a streaming algorithm that allows building the model one element at a time.

These operations are computed and returned as a StatCounter object by calling the stats() method.

The following is a list of numeric methods available in StatCounter.

S.No  Method − Meaning
1  count() − Number of elements in the RDD.
2  mean() − Average of the elements in the RDD.
3  sum() − Total value of the elements in the RDD.
4  max() − Maximum value among all elements in the RDD.
5  min() − Minimum value among all elements in the RDD.
6  variance() − Variance of the elements.
7  stdev() − Standard deviation.

If you would like to use only one of these methods, you can call the corresponding method directly on the RDD.
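A short sketch of these numeric operations on an RDD of doubles follows. The sample values and the name nums are hypothetical, and the local session is an assumption so the snippet runs on its own. Note that variance here is the population variance; StatCounter also offers sampleVariance for the sample version.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("numeric-rdd-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical numeric data; the numeric methods are available on RDD[Double].
val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))

// stats() computes all the summary statistics in a single pass
// and returns them as a StatCounter.
val summary = nums.stats()

println(s"count    = ${summary.count}")
println(s"mean     = ${summary.mean}")
println(s"sum      = ${summary.sum}")
println(s"max      = ${summary.max}")
println(s"min      = ${summary.min}")
println(s"variance = ${summary.variance}") // population variance
println(s"stdev    = ${summary.stdev}")

// When only one statistic is needed, call it directly on the RDD.
val directMean = nums.mean()
println(s"mean (direct) = $directMean")
```

Calling stats() once is preferable when several statistics are needed, since each direct call such as mean() or sum() triggers its own pass over the data.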

So, this brings us to the end of the blog. This Tecklearn ‘Deep dive into Advanced programming in Spark’ blog helps you with commonly asked questions if you are looking for a job as an Apache Spark, Scala and Big Data developer. If you wish to learn Apache Spark and Scala and build a career in the Big Data Hadoop domain, then check out our interactive Apache Spark and Scala Training, which comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/apache-spark-and-scala-certification/
