Introduction to Apache Storm and Core Concepts of Apache Storm

Last updated on May 30 2022
Lalit Kolgaonkar


Apache Storm – Introduction

What is Apache Storm?

Apache Storm is a distributed real-time big data processing system. Storm is designed to process vast amounts of data in a fault-tolerant, horizontally scalable manner. It is a streaming data framework capable of very high ingestion rates. Although Storm is stateless, it manages the distributed environment and cluster state via Apache ZooKeeper. It is simple to use, and you can execute all kinds of manipulations on real-time data in parallel.

Apache Storm continues to be a leader in real-time data analytics. Storm is easy to set up and operate, and it guarantees that every message will be processed through the topology at least once.

Apache Storm vs Hadoop

Basically, both the Hadoop and Storm frameworks are used for analyzing big data. They complement each other but differ in some aspects. Apache Storm does all its operations except persistence, while Hadoop is good at everything else but lags in real-time computation. The following table compares the attributes of Storm and Hadoop.

  • Processing model: Storm does real-time stream processing; Hadoop does batch processing.
  • State: Storm is stateless; Hadoop is stateful.
  • Architecture: Both use a master/slave architecture. Storm's master node is called Nimbus and its slaves are Supervisors, coordinated via ZooKeeper; Hadoop's master node is the JobTracker and its slave nodes are TaskTrackers, with or without ZooKeeper-based coordination.
  • Throughput: A Storm streaming process can handle tens of thousands of messages per second on a cluster; the Hadoop Distributed File System (HDFS) uses the MapReduce framework to process vast amounts of data, which takes minutes or hours.
  • Lifetime: A Storm topology runs until shut down by the user or an unexpected unrecoverable failure; MapReduce jobs are executed in sequential order and eventually complete.
  • Fault tolerance: Both are distributed and fault-tolerant. If Nimbus or a Supervisor dies, restarting it resumes from where it stopped, so nothing is affected; if Hadoop's JobTracker dies, all running jobs are lost.

Use-Cases of Apache Storm

Apache Storm is very well known for real-time big data stream processing. For this reason, many businesses use Storm as an integral part of their systems. Some notable examples are as follows −

Twitter − Twitter uses Apache Storm for its range of “Publisher Analytics” products, which process every tweet and click on the Twitter platform. Apache Storm is deeply integrated with Twitter's infrastructure.

NaviSite − NaviSite uses Storm for its event log monitoring/auditing system. Every log generated in the system passes through Storm, which checks each message against a configured set of regular expressions; if there is a match, that particular message is saved to the database.
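The regex-filter step of such a pipeline can be sketched in a few lines of Python. This is only a toy stand-in for the bolt logic, and the patterns and log lines are made up for illustration:

```python
import re

# Hypothetical set of patterns an operator might configure.
ALERT_PATTERNS = [
    re.compile(r"ERROR"),
    re.compile(r"login failed for user \w+"),
]

def filter_log(line, patterns=ALERT_PATTERNS):
    """Return the line if any configured pattern matches, else None.

    In a Storm bolt, a matching line would be emitted and persisted
    to a database instead of returned."""
    for pattern in patterns:
        if pattern.search(line):
            return line
    return None

matched = [l for l in [
    "INFO service started",
    "ERROR disk full on /dev/sda1",
    "WARN login failed for user alice",
] if filter_log(l)]
print(matched)  # the ERROR line and the failed-login line
```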

Wego − Wego is a travel metasearch engine based in Singapore. Travel-related data arrives from many sources all over the world with different timings. Storm helps Wego search real-time data, resolve concurrency issues, and find the best match for the end user.

Apache Storm Benefits

Here is a list of the benefits that Apache Storm offers −

  • Storm is open source, robust, and user-friendly. It can be used in small companies as well as large corporations.
  • Storm is fault-tolerant, flexible, reliable, and supports any programming language.
  • It allows real-time stream processing.
  • Storm is remarkably fast because of its enormous data-processing power.
  • Storm can keep up its performance even under increasing load by adding resources linearly; it is highly scalable.
  • Storm performs data refreshes and end-to-end delivery responses in seconds or minutes, depending on the problem; it has very low latency.
  • Storm has operational intelligence.
  • Storm provides guaranteed processing even if any of the connected nodes in the cluster die or messages are lost.

Apache Storm – Core Concepts

Apache Storm reads a raw stream of real-time data at one end, passes it through a sequence of small processing units, and outputs the processed/useful information at the other end.

The following diagram depicts the core concept of Apache Storm.

[Diagram: Apache Storm]

Let us now have a closer look at the components of Apache Storm −

Component − Description

  • Tuple − The main data structure in Storm: a list of ordered elements. By default, a tuple supports all data types. Generally, it is modelled as a set of comma-separated values and passed to a Storm cluster.
  • Stream − An unordered sequence of tuples.
  • Spouts − The source of a stream. Generally, Storm accepts input data from raw data sources like the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc. Otherwise, you can write spouts to read data from your own data sources. “ISpout” is the core interface for implementing spouts; some of the specific interfaces and base classes are IRichSpout, BaseRichSpout, KafkaSpout, etc.
  • Bolts − Logical processing units. Spouts pass data to bolts, and bolts process it and produce a new output stream. Bolts can perform filtering, aggregation, joining, and interaction with data sources and databases. A bolt receives data and emits to one or more other bolts. “IBolt” is the core interface for implementing bolts; some of the common interfaces and base classes are IRichBolt, IBasicBolt, etc.

Let’s take a real-time example of “Twitter Analysis” and see how it can be modelled in Apache Storm. The following diagram depicts the structure.

[Diagram: Twitter Analysis]

The input for the “Twitter Analysis” comes from the Twitter Streaming API. The spout reads users' tweets via the Twitter Streaming API and outputs them as a stream of tuples. A single tuple from the spout contains a Twitter username and one tweet as comma-separated values. This stream of tuples is then forwarded to the bolt, and the bolt splits each tweet into individual words, calculates the word counts, and persists the information to a configured datasource. Now, we can easily get the result by querying the datasource.
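To make this data flow concrete, here is a minimal, self-contained Python sketch of the same spout → bolt pipeline. Plain functions stand in for Storm's spout and bolt classes, and the tweets are made up:

```python
from collections import Counter

def tweet_spout():
    """Stands in for a spout: emits (username, tweet) tuples.
    A real spout would pull from the Twitter Streaming API."""
    yield ("alice", "storm makes streams easy")
    yield ("bob", "streams of tuples flow through storm")

def split_bolt(tuples):
    """First bolt: split each tweet into individual words."""
    for _user, tweet in tuples:
        for word in tweet.split():
            yield word

def count_bolt(words):
    """Second bolt: maintain word counts (the 'datasource' here
    is just an in-memory Counter)."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(tweet_spout()))
print(counts["storm"])  # prints 2: "storm" appears in both tweets
```

Querying the datasource in the article's description corresponds here to simply reading entries out of the Counter.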

Topology

Spouts and bolts are connected together, and together they form a topology. Real-time application logic is specified inside a Storm topology. In simple words, a topology is a directed graph where the vertices are computations and the edges are streams of data.

A simple topology starts with spouts. A spout emits data to one or more bolts. A bolt represents the node in the topology with the smallest processing logic, and the output of a bolt can be emitted into another bolt as input.

Storm keeps a topology running until you kill it. Apache Storm's main job is to run topologies, and it can run any number of topologies at a given time.
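Since a topology is just a directed graph of spouts and bolts, the wiring can be modelled as an adjacency list. This is a toy stand-in for Storm's real topology wiring, with made-up component names:

```python
# Each component lists the components it emits to.
topology = {
    "tweet-spout": ["split-bolt"],
    "split-bolt": ["count-bolt"],
    "count-bolt": [],  # terminal bolt: persists to a datasource
}

def downstream(component, graph):
    """All components reachable from `component`, i.e. the path a
    tuple emitted there can take through the topology."""
    seen, stack = [], list(graph[component])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(graph[node])
    return seen

print(downstream("tweet-spout", topology))
# prints ['split-bolt', 'count-bolt']
```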

Tasks

Now you have a basic idea of spouts and bolts. They are the smallest logical units of a topology, and a topology is built using a single spout and an array of bolts. They should be executed properly in a particular order for the topology to run successfully. The execution of each and every spout and bolt by Storm is called a “task”. In simple words, a task is either the execution of a spout or of a bolt. At a given time, each spout and bolt can have multiple instances running in multiple separate threads.

Workers

A topology runs in a distributed manner on multiple worker nodes. Storm spreads the tasks evenly across all the worker nodes. The worker node's role is to listen for jobs and start or stop processes whenever a new job arrives.

Stream Grouping

The stream of data flows from spouts to bolts or from one bolt to another bolt. Stream grouping controls how the tuples are routed in the topology and helps us understand the flow of tuples in the topology. There are four built-in groupings, as explained below.

Shuffle Grouping

In shuffle grouping, an equal number of tuples is distributed randomly across all of the workers executing the bolts. The following diagram depicts the structure.

[Diagram: Shuffle Grouping]
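Shuffle grouping can be sketched as random assignment of tuples to bolt tasks, which evens out over a large stream. This is a toy simulation, not Storm's actual implementation:

```python
import random

def shuffle_grouping(tuples, num_tasks, rng=None):
    """Assign each tuple to a random bolt task; over many tuples
    the load becomes roughly equal across tasks."""
    rng = rng or random.Random(42)  # seeded for repeatability
    tasks = [[] for _ in range(num_tasks)]
    for t in tuples:
        tasks[rng.randrange(num_tasks)].append(t)
    return tasks

tasks = shuffle_grouping(range(10_000), num_tasks=4)
print([len(t) for t in tasks])  # each task gets roughly 2,500 tuples
```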

Field Grouping

In field grouping, tuples with the same values for the grouping field are grouped together and routed to the same worker executing the bolts, while tuples with other values go to other workers. For example, if the stream is grouped by the field “word”, then tuples with the same string, “Hello”, will always move to the same worker. The following diagram shows how field grouping works.

[Diagram: Field Grouping]
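Field grouping is commonly implemented by hashing the grouping field, so every tuple with the same field value lands on the same task. The sketch below illustrates the idea only; Storm's actual hashing scheme differs:

```python
def fields_grouping(tuples, field_index, num_tasks):
    """Route each tuple to a task chosen by hashing the grouping
    field. Identical field values always map to the same task."""
    tasks = [[] for _ in range(num_tasks)]
    for t in tuples:
        # hash() is Python's built-in; a hash that is stable across
        # processes (e.g. zlib.crc32) would be used in practice.
        tasks[hash(t[field_index]) % num_tasks].append(t)
    return tasks

stream = [("Hello", 1), ("World", 1), ("Hello", 2)]
tasks = fields_grouping(stream, field_index=0, num_tasks=3)
# Both ("Hello", ...) tuples end up in the same task list.
```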

Global Grouping

All the streams can be grouped and forwarded to one bolt. This grouping sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest ID).

[Diagram: Global Grouping]

All Grouping

All grouping sends one copy of every tuple to all instances of the receiving bolt. This kind of grouping is used to send signals to bolts. All grouping is useful for join operations.

[Diagram: All Grouping]
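Global grouping and all grouping are the two extremes of routing, and both can be sketched in a few lines. In this toy model, task 0 stands in for "the task with the lowest ID":

```python
def global_grouping(tuples, num_tasks):
    """Send every tuple to the single lowest-ID task."""
    tasks = [[] for _ in range(num_tasks)]
    tasks[0].extend(tuples)  # task 0 = lowest ID
    return tasks

def all_grouping(tuples, num_tasks):
    """Send a copy of every tuple to every task, e.g. to broadcast
    a signal or share a small join-side table with all instances."""
    return [list(tuples) for _ in range(num_tasks)]

g = global_grouping(["a", "b"], num_tasks=3)
a = all_grouping(["a", "b"], num_tasks=3)
print(g)  # [['a', 'b'], [], []]
print(a)  # [['a', 'b'], ['a', 'b'], ['a', 'b']]
```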

So, this brings us to the end of the blog. This Tecklearn “Introduction to Apache Storm and Core Concepts of Apache Storm” blog helps you with commonly asked questions if you are looking for a job in Apache Storm and the Big Data domain.

If you wish to learn Apache Storm and build a career in Apache Storm or the Big Data domain, then check out our interactive Apache Storm Training, which comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/apache-strom-training/

Apache Storm Training

About the Course

Tecklearn's Apache Storm training will give you a working knowledge of the open-source computational engine, Apache Storm. You will be able to do distributed real-time data processing and come up with valuable insights. You will learn about the deployment and development of Apache Storm applications in the real world for handling Big Data and implementing various analytical tools for powerful enterprise-grade solutions. Upon completion of this online training, you will hold a solid understanding of, and hands-on experience with, Apache Storm.

Why Should you take Apache Storm Training?

  • The average pay of an Apache Storm professional stands at $90,167 P.A. – Indeed.com
  • Groupon, Twitter, and many other companies use Apache Storm for business purposes like real-time analytics and micro-batch processing.
  • Apache Storm is a free and open-source, distributed real-time computation system for processing fast, large streams of data.

What you will Learn in this Course?

Introduction to Apache Storm

  • Apache Storm
  • Apache Storm Data Model

Architecture of Storm

  • Apache Storm Architecture
  • Hadoop distributed computing
  • Apache Storm features

Installation and Configuration

  • Pre-requisites for Installation
  • Installation and Configuration

Storm UI

  • Zookeeper
  • Storm UI

Storm Topology Patterns

Got a question for us? Please mention it in the comments section and we will get back to you.
