Top Apache Storm Interview Questions and Answers

Last updated on Feb 18 2022
Avinash M

Explain what is Apache Storm? What are the components of Storm?

Apache Storm is an open-source, distributed real-time computation system used for real-time big data analytics. Unlike Hadoop's batch processing, Apache Storm is built for real-time processing and can be used with any programming language.

The components of Apache Storm include:

  • Nimbus: It works like Hadoop's JobTracker. It distributes code across the cluster, uploads computations for execution, allocates workers across the cluster, monitors computations, and reallocates workers as needed.
  • Zookeeper: It is used as a mediator for communication within the Storm cluster.
  • Supervisor: It interacts with Nimbus through Zookeeper and, depending on the signals received from Nimbus, executes the process.

Why is Apache Storm the first choice for real-time processing?

  • Easy to operate: Operating Storm is quite easy.
  • Real fast: It can process up to one million 100-byte messages per second per node.
  • Fault tolerant: It detects faults automatically and restarts the affected workers.
  • Reliable: It guarantees that each unit of data will be processed at least once or exactly once.
  • Scalable: It runs across a cluster of machines.

Explain how data streams flow in Apache Storm?

In Apache Storm, data streams flow through three components: Spout, Bolt, and Tuple (a wiring sketch follows the list).

  • Spout: A spout is a source of data in Storm.
  • Bolt: A bolt processes the data.
  • Tuple: Data is passed around as tuples.
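As a rough sketch of the wiring (SentenceSpout and WordCountBolt are hypothetical user-defined classes, shown only to illustrate how these pieces connect):

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new SentenceSpout(), 1);   // spout: the source of tuples
builder.setBolt("counter", new WordCountBolt(), 2)       // bolt: processes the tuples
       .shuffleGrouping("sentences");                    // tuples flow from spout to bolt
StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());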

Compare Spark & Storm

Criteria | Spark | Storm
Data operation | Data at rest | Data in motion
Parallel computation | Task parallel | Data parallel
Latency | A few seconds | Sub-second
Deploying the application | Using Scala, Java or Python | Using the Java API


Does Apache act as a Proxy server?

Yes, it can also act as a proxy, using the mod_proxy module. This module implements a proxy, gateway, or cache for Apache. It provides proxying capability for AJP13 (Apache JServ Protocol version 1.3), FTP, CONNECT (for SSL), HTTP/0.9, HTTP/1.0, and (since Apache 1.3.23) HTTP/1.1. The module can be configured to connect to other proxy modules for these and other protocols.

How many distinct layers are there in Storm's codebase?

There are three distinct layers to Storm's codebase.
First: Storm was designed from the very beginning to be compatible with multiple languages. Nimbus is a Thrift service and topologies are defined as Thrift structures. The usage of Thrift allows Storm to be used from any language.
Second: All of Storm's interfaces are specified as Java interfaces. So even though there's a lot of Clojure in Storm's implementation, all usage must go through the Java API. This means that every feature of Storm is always available via Java.
Third: Storm's implementation is largely in Clojure. Line-wise, Storm is about half Java code, half Clojure code. But Clojure is much more expressive, so in reality the great majority of the implementation logic is in Clojure.

When do you call the cleanup method?

The cleanup method is called when a Bolt is being shutdown and should cleanup any resources that were opened. There’s no guarantee that this method will be called on the cluster: For instance, if the machine the task is running on blows up, there’s no way to invoke the method.

The cleanup method is intended for when you run topologies in local mode (where a Storm cluster is simulated in process) and you want to be able to run and kill many topologies without suffering any resource leaks.

How can we kill a topology?

To kill a topology, simply run:
storm kill {stormname}
Give the same name to storm kill as you used when submitting the topology.
Storm won’t kill the topology immediately. Instead, it deactivates all the spouts so that they don’t emit any more tuples, and then Storm waits Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS seconds before destroying all the workers. This gives the topology enough time to complete any tuples it was processing when it got killed.
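The storm CLI also accepts an optional -w flag on kill to override that wait time; for example (topology name hypothetical):

storm kill my-topology -w 10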

What are the common configurations in Apache Storm?

There are a variety of configurations you can set per topology; every configuration also has a default value defined in defaults.yaml in the Storm codebase. The ones prefixed with "TOPOLOGY" can be overridden on a topology-specific basis (the others are cluster configurations and cannot be overridden). Here are some common ones that are set for a topology, with a code sketch after the list:

  1. TOPOLOGY_WORKERS: This sets the number of worker processes to use to execute the topology. For example, if you set this to 25, there will be 25 Java processes across the cluster executing all the tasks. If you had a combined parallelism of 150 across all components in the topology, each worker process will have 6 tasks running within it as threads.
  2. TOPOLOGY_ACKER_EXECUTORS: This sets the number of executors that will track tuple trees and detect when a spout tuple has been fully processed. By not setting this variable or setting it as null, Storm will set the number of acker executors to be equal to the number of workers configured for this topology. If this variable is set to 0, then Storm will immediately ack tuples as soon as they come off the spout, effectively disabling reliability.
  3. TOPOLOGY_MAX_SPOUT_PENDING: This sets the maximum number of spout tuples that can be pending on a single spout task at once (pending means the tuple has not been acked or failed yet). It is highly recommended you set this config to prevent queue explosion.
  4. TOPOLOGY_MESSAGE_TIMEOUT_SECS: This is the maximum amount of time a spout tuple has to be fully completed before it is considered failed. This value defaults to 30 seconds, which is sufficient for most topologies.
  5. TOPOLOGY_SERIALIZATIONS: You can register more serializers with Storm using this config so that you can use custom types within tuples.
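As a sketch, most of these can be set through convenience methods on a Config object before submitting (the builder and topology name are assumed to exist):

Config conf = new Config();
conf.setNumWorkers(25);          // TOPOLOGY_WORKERS
conf.setNumAckers(4);            // TOPOLOGY_ACKER_EXECUTORS
conf.setMaxSpoutPending(1000);   // TOPOLOGY_MAX_SPOUT_PENDING
conf.setMessageTimeoutSecs(60);  // TOPOLOGY_MESSAGE_TIMEOUT_SECS
StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());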

While installing, why does Apache have three config files – srm.conf, access.conf and httpd.conf?

The first two are remnants from the NCSA times, and generally you should be fine if you delete the first two, and stick with httpd.conf.

  • srm.conf: This is the default file for the ResourceConfig directive in httpd.conf. It is processed after httpd.conf but before access.conf.
  • access.conf: This is the default file for the AccessConfig directive in httpd.conf. It is processed after httpd.conf and srm.conf.
  • httpd.conf: The httpd.conf file is well-commented and mostly self-explanatory.

How can we check httpd.conf for consistency and errors?

We can check the syntax of the httpd configuration file using the following command:

httpd -S

This command will dump out a description of how Apache parsed the configuration file. Careful examination of the IP addresses and server names may help uncover configuration mistakes.
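If you only want a syntax check, the server also supports a test flag; both of these forms ship with Apache httpd:

httpd -t
apachectl configtest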

Explain when to use field grouping in Storm? Is there any time-out or limit to known field values?

Field grouping in Storm uses a mod hash function to decide which task a tuple is sent to, ensuring that tuples with the same field values are always processed by the same task. For that, you don't require any cache. So, there is no time-out or limit to known field values.

The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the “user-id” field, tuples with the same “user-id” will always go to the same task, but tuples with different “user-id”‘s may go to different tasks.
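In topology code, such a grouping is declared like this (component and class names are hypothetical):

builder.setBolt("per-user-stats", new UserStatsBolt(), 8)
       .fieldsGrouping("events", new Fields("user-id"));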

What is mod_vhost_alias?

This module creates dynamically configured virtual hosts, by allowing the IP address and/or the Host: header of the HTTP request to be used as part of the path name to determine what files to serve. This allows for easy use of a huge number of virtual hosts with similar configurations.

Does Apache include a search engine?

Yes, Apache provides a search facility; you can search for a report name using the "Search title" option.

Explain how you can streamline log files using Apache Storm?

To read from log files, you can configure your spout to emit one tuple per line as it reads the log. The output can then be assigned to a bolt for analysis, as sketched below.
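A minimal sketch of such a spout, assuming a hypothetical log path and class name (the Map parameter of open may be generic, depending on your Storm version):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class LogLineSpout extends BaseRichSpout {
    private transient BufferedReader reader;
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            reader = new BufferedReader(new FileReader("/var/log/app.log")); // hypothetical path
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            String line = reader.readLine();
            if (line != null) {
                collector.emit(new Values(line)); // one tuple per log line
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}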

Mention how a Storm application can be beneficial in financial services?

In financial services, Storm can be helpful in preventing:

  • Securities fraud:
  1. Perform real-time anomaly detection on known patterns of activities and use learned patterns from prior modeling and simulations.
  2. Correlate transaction data with other streams (chat, email, etc.) in a cost-effective parallel processing environment.
  3. Reduce query time from hours to minutes on large volumes of data.
  4. Build a single platform for operational applications and analytics that reduces total cost of ownership (TCO).
  • Order routing: Order routing is the process by which an order goes from the end user to an exchange. An order may go directly to the exchange from the customer, or it may go first to a broker who then routes the order to the exchange.
  • Pricing: Pricing is the process whereby a business sets the price at which it will sell its products and services, and may be part of the business's marketing plan.
  • Compliance violations: Compliance means conforming to a rule, such as a specification, policy, standard, or law. Regulatory compliance describes the goal organizations aspire to achieve in ensuring that they are aware of and take steps to comply with relevant laws and regulations. Any failure to do so is a compliance violation.

What is ServerType directive in Apache Server?

It defines whether Apache should spawn itself as a child process (standalone) or keep everything in a single process (inetd). Keeping it inetd conserves resources.
The ServerType directive is included in Apache 1.3 for backward compatibility with older UNIX-based versions of Apache. By default, Apache is set to standalone, which means Apache runs as a separate application on the server. The ServerType directive is not available in Apache 2.0.

In which folder are Java applications stored in Apache?

Java applications are not stored in Apache; Apache can only be connected to another Java webapp hosting server using the mod_jk connector. mod_jk is a replacement for the older mod_jserv. It is a completely new Tomcat-Apache plug-in that handles the communication between Tomcat and Apache. It exists for several reasons:

  • mod_jserv was too complex. Because it was ported from Apache/JServ, it brought with it lots of JServ-specific bits that aren't needed by Apache.
  • mod_jserv supported only Apache. Tomcat supports many web servers through a compatibility layer named the jk library. Supporting two different modes of work became problematic in terms of support, documentation, and bug fixes; mod_jk should fix that.
  • The layered approach provided by the jk library makes it easier to support both Apache 1.3.x and Apache 2.x.
  • Better support for SSL. mod_jserv couldn't reliably identify whether a request was made via HTTP or HTTPS; mod_jk can, using the newer Ajpv13 protocol.


Explain what streams and stream groupings are in Apache Storm?

In Apache Storm, a stream is an unbounded sequence (group) of tuples, while a stream grouping determines how a stream should be partitioned among a bolt's tasks.

List out the different stream groupings in Apache Storm?

  • Shuffle grouping
  • Fields grouping
  • Partial Key grouping
  • Global grouping
  • All grouping
  • None grouping
  • Direct grouping
  • Local or shuffle grouping

Mention what is the difference between Apache Kafka and Apache Storm?

  • Apache Kafka: It is a distributed and robust messaging system that can handle huge amounts of data and allows passage of messages from one end-point to another.
  • Apache Storm: It is a real-time message processing system, and you can edit or manipulate data in real time. Apache Storm pulls the data from Kafka and applies the required manipulation.

Explain how to write output into a file using Storm?

When reading a file in a spout, create the FileReader object in the open() method, since that is when the reader is initialized on the worker node, and then use that object in the nextTuple() method. For writing output to a file, the same pattern applies in a bolt, as sketched below.
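A sketch of such a file-writing bolt, with hypothetical class and path names (note that the writer is created in prepare() and closed in the best-effort cleanup() discussed earlier):

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class FileWriterBolt extends BaseRichBolt {
    private transient PrintWriter out; // not serializable, so it is created in prepare()
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            out = new PrintWriter(new FileWriter("/tmp/storm-output.txt", true)); // hypothetical path
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        out.println(tuple.getString(0)); // write the first field of each tuple
        collector.ack(tuple);
    }

    @Override
    public void cleanup() {
        out.close(); // best effort; not guaranteed to run on a real cluster
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: no output fields
    }
}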

Explain when using field grouping in Storm, is there any time-out or limit to known field values?

Field grouping in Storm uses a mod hash function to decide which task a tuple is sent to, ensuring that tuples with the same field values are always processed by the same task. For that, you don't require any cache. So, there is no time-out or limit to known field values.

Apache Storm Vs Kafka

Comparison | Storm | Kafka
Data source | Kafka and other DBs | FB and Twitter
Data storage | Exchanges data between input and output streams | EXT4/XFS file system
Dependency | Independent | Depends on Zookeeper
Inventor | Twitter | LinkedIn
Language support | All languages supported | Supports all, but Java is recommended
Latency | Millisecond latency | Depends on the data source; generally less than 1-2 seconds
Primary use | Stream processing | Message broker
Stream processing | Micro-batch processing | Small-batch processing
Type | Real-time message processing | Distributed messaging system

What is the latest version of Apache Storm?

Storm 1.2.1, released in February 2018, was the latest release when this list was compiled; check the Apache Storm site for the current version.

What programming languages are supported to work with Apache Storm?

No specific language is required; Apache Storm is flexible enough to work with any programming language.

What are components of Apache Storm?

Nimbus, Zookeeper and Supervisor are the components of Apache Storm.

What is Nimbus used for?

Nimbus is also known as the Master Node. Nimbus tracks the jobs of workers: it distributes code among the workers and allocates workers across the available cluster. If any worker needs more resources, Nimbus provides them.
Note: Nimbus is similar to the JobTracker in Hadoop.

What is Zookeeper?

Zookeeper enables communication among the Storm cluster nodes. Since Zookeeper is concerned only with coordination and not with message passing, it does not carry much workload.

Components of data stream flow in Storm?

Data streams flow through three components:

  • Tuple – the data that flows
  • Spout – the source of data
  • Bolt – processes the tuples

How can log files be streamlined?

First, configure a spout to read the log file and emit one line at a time, then analyse each line by assigning it to a bolt.

What are the types of stream groupings in Apache Storm?

All, none, local, global, fields, shuffle, and direct groupings are available in Apache Storm.

What is Topology_Message_Timeout_secs used for?

It specifies the time allowed for processing a message emitted by a spout; if the message is not processed within that time, it is considered failed.

How do you use Apache as a proxy server?

Using the mod_proxy module, the Apache HTTP server can be used as a proxy server as well.

What is ZeroMQ?

While working with Storm topologies, ZeroMQ helps tasks communicate with each other.

What are the Java elements supported by Apache?

There is no built-in Java support in the Apache HTTP server; Java web applications are hosted on a separate server such as Tomcat and connected via mod_jk (see above).

What are the ASP (Active Server Pages) elements supported by Apache?

There is no built-in ASP support in the Apache HTTP server; see the ASP-related projects described later in this list.

What is combiner aggregator?

A combiner aggregator is used to combine a set of tuples into a single field (see the detailed answer later in this list).

What are “spouts” and “bolts” in Storm?

A. Apache Storm uses custom-created "spouts" and "bolts" to define information sources and manipulations to allow batch, distributed processing of streaming data.

What is directed acyclic graph (DAG) in Storm?

A. Storm application is designed as a “topology” in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline.

What are the main components in Storm architecture?

A. The Apache Storm cluster comprises two main components: nodes and components.

What is TopologyBuilder class?

A. java.lang.Object -> org.apache.storm.topology.TopologyBuilder

public class TopologyBuilder
extends Object

TopologyBuilder exposes the Java API for specifying a topology for Storm to execute. Topologies are Thrift structures in the end, but since the Thrift API is so verbose, TopologyBuilder greatly eases the process of creating topologies.

The template for creating and submitting a topology looks something like:

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("1", new TestWordSpout(true), 5);
builder.setSpout("2", new TestWordSpout(true), 3);
builder.setBolt("3", new TestWordCounter(), 3)
       .fieldsGrouping("1", new Fields("word"))
       .fieldsGrouping("2", new Fields("word"));
builder.setBolt("4", new TestGlobalCount())
       .globalGrouping("1");

Map<String, Object> conf = new HashMap<>();
conf.put(Config.TOPOLOGY_WORKERS, 4);

StormSubmitter.submitTopology("mytopology", conf, builder.createTopology());

How can you Kill a topology in Storm?

A. To kill a topology, simply run:

storm kill {stormname}

What are Streams?

A. The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream’s tuples.

What tuples contain in Storm?

A. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.

What is Kryo?

A. Storm uses Kryo for serialization. Kryo is a flexible and fast serialization library that produces small serializations.
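Registering a custom type on the topology Config looks like this (MyPoint and MyPointSerializer are hypothetical classes):

Config conf = new Config();
conf.registerSerialization(MyPoint.class);                          // use Kryo's default FieldSerializer
conf.registerSerialization(MyPoint.class, MyPointSerializer.class); // or a custom Kryo serializer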

What are Spouts?

A. A spout is a source of streams in a topology. Generally, spouts will read tuples from an external source and emit them into the topology (e.g. a Kestrel queue or the Twitter API).

What are reliable or unreliable Spouts?

A. Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.
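The difference shows up in how the spout emits and in the callbacks it implements; a sketch in fragments (message-id bookkeeping is left to the spout author):

// Reliable: attach a message id so Storm can report the outcome back.
collector.emit(new Values(line), msgId);

// Unreliable: no message id, so the tuple is forgotten once emitted.
collector.emit(new Values(line));

// Callbacks Storm invokes on the spout for reliable emits:
@Override public void ack(Object msgId)  { /* drop the pending message */ }
@Override public void fail(Object msgId) { /* re-queue it for replay */ }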

What are Bolts?

A. All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.

Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts.

What is Stream grouping?

A. A stream grouping defines how that stream should be partitioned among the bolt’s tasks.

What are the built-in stream groups in Storm?

A. There are eight built-in stream groupings in Storm (a wiring sketch follows this list); they are:

Shuffle grouping: Tuples are randomly distributed across the bolt’s tasks in a way such that each bolt is guaranteed to get an equal number of tuples.

Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the “user-id” field, tuples with the same “user-id” will always go to the same task, but tuples with different “user-id”‘s may go to different tasks.

Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but tuples are load-balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. The original paper on this grouping gives a good explanation of how it works and the advantages it provides.

All grouping: The stream is replicated across all the bolt’s tasks. Use this grouping with care.

Global grouping: The entire stream goes to a single one of the bolt’s tasks. Specifically, it goes to the task with the lowest id.

None grouping: This grouping specifies that you don’t care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually, though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).

Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams.

Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
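A sketch of how a few of these groupings are declared when building a topology (component and bolt class names are hypothetical):

builder.setBolt("a", new BoltA(), 4).shuffleGrouping("spout");                       // random, even spread
builder.setBolt("b", new BoltB(), 4).fieldsGrouping("spout", new Fields("user-id")); // partition by field
builder.setBolt("c", new BoltC()).globalGrouping("a");                               // everything to one task
builder.setBolt("d", new BoltD(), 4).localOrShuffleGrouping("b");                    // prefer in-process tasks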

What are Tasks?

A. Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.

Where are default configurations stored?

A. Every configuration has a default value defined in defaults.yaml in the Storm codebase.

What happens when a worker dies?

A. When a worker dies, the supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus, Nimbus will reschedule the worker.

What happens when a node dies?

A. The tasks assigned to that machine will time-out and Nimbus will reassign those tasks to other machines.

Is Nimbus a single point of failure?

A. If you lose the Nimbus node, the workers will still continue to function. Additionally, supervisors will continue to restart workers if they die. However, without Nimbus, workers won’t be reassigned to other machines when necessary (like if you lose a worker machine).

How does Storm guarantee data processing?

A. Storm provides mechanisms to guarantee data processing even if nodes die or messages are lost.

What makes a running topology: worker processes, executors and tasks?

A. Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

  • Worker processes
  • Executors (threads)
  • Tasks

A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.

An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).

A task performs the actual data processing — each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time.
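Roughly, these map to code as follows (a sketch with hypothetical component names):

Config conf = new Config();
conf.setNumWorkers(2);                        // 2 worker processes (JVMs)

builder.setBolt("green", new GreenBolt(), 2)  // parallelism hint: 2 executors (threads)
       .setNumTasks(4)                        // 4 tasks in total, i.e. 2 per executor
       .shuffleGrouping("blue");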

What are some of the best ways to get a worker to mysteriously and bafflingly die?

A. Check the following:

  • Do you have write access to the log directory?
  • Are you blowing out your heap?
  • Are all the right libraries installed on all of the workers?
  • Is the zookeeper hostname still set to localhost?
  • Did you supply a correct, unique hostname (one that resolves back to the machine) to each worker, and put it in the storm conf file?
  • Have you opened firewall/security-group permissions bidirectionally among (a) all the workers, (b) the storm master, and (c) zookeeper? Also, from the workers to any kafka/kestrel/database/etc. that your topology accesses? Use netcat to poke the appropriate ports and be sure.

Can a Trident topology have multiple streams?

A. Can a Trident topology work like a workflow with conditional paths (if-else)? E.g., a spout (S1) connects to a bolt (B0) which, based on certain values in the incoming tuple, routes them to either bolt (B1) or bolt (B2), but not both.

A Trident “each” operator returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:

Stream s = topology.each(…).groupBy(…).aggregate(…)
Stream branch1 = s.each(…, FilterA)
Stream branch2 = s.each(…, FilterB)
You can join streams with join, merge or multiReduce.

Which components are used for the stream flow of data?

For the streaming of data flow, three components are used:

  • Bolt: Bolts represent the processing logic unit in Storm. One can utilize bolts to do any kind of processing, such as filtering, aggregating, joining, interacting with data stores, and talking to external systems. Bolts can also emit tuples (data messages) for subsequent bolts to process. Additionally, bolts are responsible for acknowledging the processing of tuples after they are done processing.
  • Spout: Spouts represent the source of data in Storm. You can write spouts to read data from sources such as databases, distributed file systems, and messaging frameworks. Spouts can broadly be classified into the following:
    Reliable – These spouts have the capability to replay the tuples (a tuple is a unit of data in a data stream). This helps applications achieve an "at least once" message processing semantic, since in case of failure tuples can be replayed and processed again. Spouts that fetch data from messaging frameworks are generally reliable, as these frameworks provide a mechanism to replay the messages.
    Unreliable – These spouts don't have the capability to replay the tuples. Once a tuple is emitted, it cannot be replayed, irrespective of whether it was processed successfully or not. These spouts follow an "at most once" message processing semantic.
  • Tuple: The tuple is the main data structure in Storm. A tuple is a named list of values, where each value can be of any type. Tuples are dynamically typed: the types of the fields do not need to be declared. Tuples have helper methods like getInteger and getString to get field values without having to cast the result (a short accessor sketch follows this list). Storm needs to know how to serialize all the values in a tuple. By default, Storm knows how to serialize the primitive types, strings, and byte arrays. If you want to use another type, you'll need to implement and register a serializer for that type.
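The accessor sketch, with hypothetical field names:

String word = tuple.getStringByField("word");      // typed helper, no explicit cast
Integer count = tuple.getIntegerByField("count");  // same idea for Integer fields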

What is the use of Zookeeper in Storm?

Storm uses Zookeeper for coordinating the cluster. Zookeeper is not used for message passing, so the load that Storm places on Zookeeper is quite low. Single-node Zookeeper clusters should be sufficient for most cases, but if you want failover or are deploying large Storm clusters, you may want larger Zookeeper clusters. Instructions for deploying Zookeeper are in the ZooKeeper documentation. A few notes about Zookeeper deployment:

  1. It's critical that you run Zookeeper under supervision, since Zookeeper is fail-fast and will exit the process if it encounters any error case. See the ZooKeeper documentation for more details.
  2. It's critical that you set up a cron to compact Zookeeper's data and transaction logs. The Zookeeper daemon does not do this on its own, and if you don't set up a cron, Zookeeper will quickly run out of disk space.

What is ZeroMQ?

ZeroMQ is “a library which extends the standard socket interfaces with features traditionally provided by specialized messaging middleware products”. Storm relies on ZeroMQ primarily for task-to-task communication in running Storm topologies.

What does it mean for a message to be "fully processed"?

A tuple coming off a spout can trigger thousands of tuples to be created based on it. Consider, for example, the streaming word count topology:

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("sentences", new KestrelSpout("kestrel.backtype.com",
        22133,
        "sentence_queue",
        new StringScheme()));

builder.setBolt("split", new SplitSentence(), 10)
       .shuffleGrouping("sentences");

builder.setBolt("count", new WordCount(), 20)
       .fieldsGrouping("split", new Fields("word"));

This topology reads sentences off a Kestrel queue, splits the sentences into their constituent words, and then emits for each word the number of times it has seen that word before. A tuple coming off the spout triggers many tuples being created based on it: a tuple for each word in the sentence and a tuple for the updated count for each word.
Storm considers a tuple coming off a spout "fully processed" when the tuple tree has been exhausted and every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a specified timeout. This timeout can be configured on a topology-specific basis using the Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS configuration and defaults to 30 seconds.
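In bolt code, this tuple tree is built by anchoring each emitted tuple to the input tuple and then acking the input. A sketch of the split bolt's execute method (the collector field is assumed):

@Override
public void execute(Tuple input) {
    for (String word : input.getString(0).split(" ")) {
        collector.emit(input, new Values(word)); // anchored emit: the word tuple joins the tree
    }
    collector.ack(input); // marks this node of the tree as complete
}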

What is combinerAggregator?

A CombinerAggregator is used to combine a set of tuples into a single field. It has the following signature:

public interface CombinerAggregator<T> extends Serializable {
    T init(TridentTuple tuple);
    T combine(T val1, T val2);
    T zero();
}

Storm calls the init() method with each tuple, and then repeatedly calls the combine() method until the partition is processed. The values passed into the combine() method are partial aggregations, the result of combining the values returned by calls to init().
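For instance, a simple count can be written this way (this mirrors the Count aggregator that ships with Trident):

public class Count implements CombinerAggregator<Long> {
    public Long init(TridentTuple tuple) {
        return 1L;                  // each tuple contributes 1 to the count
    }
    public Long combine(Long val1, Long val2) {
        return val1 + val2;         // partial counts are summed
    }
    public Long zero() {
        return 0L;                  // identity value for empty partitions
    }
}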

Is it necessary to kill the topology while updating the running topology?

Yes, to update a running topology, the only option currently is to kill the current topology and resubmit a new one. A planned feature is to implement a Storm swap command that swaps a running topology with a new one, ensuring minimal downtime and no chance of both topologies processing tuples at the same time.

How can Storm UI be used in a topology?

Storm UI is used in monitoring the topology. The Storm UI provides information about errors happening in tasks and fine-grained stats on the throughput and latency performance of each component of each running topology.

Why does Apache not include SSL?

SSL (Secure Socket Layer) data transport requires encryption, and many governments have restrictions upon the import, export, and use of encryption technology. If Apache included SSL in the base package, its distribution would involve all sorts of legal and bureaucratic issues, and it would no longer be freely available. Also, some of the technology required to talk to current clients using SSL is patented by RSA Data Security, which restricts its use without a license.

Does Apache include any sort of database integration?

Apache is a Web (HTTP) server, not an application server. The base package does not include any such functionality. The PHP project and the mod_perl project allow you to work with databases from within the Apache environment.

Is running Apache as root a security risk?

No. The root process opens port 80 but never listens to it, so no user will actually enter the site with root rights. If you kill the root process, the child processes will disappear as well.

What is MultiViews?

MultiViews search is enabled by the MultiViews option. It is the general name given to the Apache server's ability to provide language-specific document variants in response to a request. This is documented quite thoroughly in the content negotiation description page, and Apache Week also carried an article on the subject. Apache chooses the best match to the client's requirements and returns that document.

Can we use Active Server Pages (ASP) with Apache?

The Apache Web Server package does not include ASP support.
However, a number of projects provide ASP or ASP-like functionality for Apache. Some of these are:

  • Apache::ASP: Apache::ASP provides an Active Server Pages port to the Apache Web Server with Perl scripting only, and enables development of dynamic web applications with session management and embedded Perl code. There are also many powerful extensions, including XML taglibs, XSLT rendering, and new events not originally part of the ASP API.
  • mod_mono: It is an Apache 2.0/2.2/2.4.3 module that provides ASP.NET support for the web's favorite server, Apache. It is hosted inside Apache. Depending on your configuration, the Apache box could be one or a dozen separate processes; all of these processes will send their ASP.NET requests to the mod-mono-server process. The mod-mono-server process in turn can host multiple independent applications. It does this by using Application Domains to isolate the applications from each other, while using a single Mono virtual machine.

Explain what is Topology_Message_Timeout_secs in Apache Storm?

It is the maximum amount of time allotted to the topology to fully process a message released by a spout. If the message is not acknowledged in the given time frame, Apache Storm will fail the message on the spout.

Mention the difference between Apache Kafka and Apache Storm?

  • Apache Kafka: It is a distributed and robust messaging system that can handle huge amounts of data and allows passage of messages from one end-point to another. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.
  • Apache Storm: It is a real-time message processing system, and you can edit or manipulate data in real time. Storm pulls the data from Kafka and applies the required manipulation. It makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use.

Mention what is the difference between Apache HBase and Storm?

Apache Storm:

  • It provides data processing in real time
  • It processes the data but does not store it
  • You streamline your data so it is processed in real time, so that alerts and actions can be raised if needed

Apache HBase:

  • It offers low-latency reads of processed data for querying later
  • It stores the data but does not process it



Explain how message is fully processed in Apache Storm?

By calling the nextTuple method on the spout, Storm requests a tuple from the spout. The spout uses the SpoutOutputCollector provided in the open method to emit a tuple to one of its output streams. While emitting a tuple, the spout assigns a "message id" that will be used to recognize the tuple later.

After that, the tuple gets sent to consuming bolts, and Storm takes charge of tracking the tree of messages that is produced. Once Storm is confident that a tuple has been processed thoroughly, it calls the ack method on the originating spout task with the message id that the spout gave to Storm.

What is the use of Supervisor?

The Supervisor takes signals from Nimbus through Zookeeper and executes the process. Supervisors are also known as worker nodes.

What are the features of Apache Storm?

  • Reliable – All data is guaranteed to be processed.
  • Scalable – Clustered execution provides scalability through parallel computation.
  • Robust – Storm restarts workers when there is an error/fault, providing successful, uninterrupted execution of the other workers in the node.
  • Easy to operate – Standard configurations make it easy to deploy and use.
  • Quick – Each node can process one million 100-byte messages per second.

Command to kill a Storm topology.

storm kill tecklearn_topology

where tecklearn_topology is the name of the topology.

Why is Apache not provided with SSL?

To avoid the legal and bureaucratic issues around encryption restrictions (described above), Apache does not include SSL in the base package.

Apache has a search engine. True or False?

True. It can be used to search data by title (see the earlier answer).

What are the components of a running topology?

Three elements collectively define the execution of tasks in a topology:

  • Worker processes
  • Executors
  • Tasks

Worker processes run executors that belong to one or more components (i.e., spouts or bolts).
An executor is a thread spawned by a worker process that executes one or more tasks.
Tasks perform the actual data processing; an executor may run more than one task, so the total number of tasks can exceed the number of executors.

What are Nodes?

A. There are two types of nodes in Storm: the Master Node and the Worker Node.

The Master Node runs a daemon called Nimbus, which assigns tasks to machines and monitors their performance.

The Worker Node runs a daemon called the Supervisor, which executes the tasks assigned to its node and operates them as needed.

What are the Components of Storm?

A. Storm has three critical components: Topology, Stream, and Spout. A topology is a network made of streams and spouts.

The stream is an unbounded pipeline of tuples, and the spout is the source of the data streams; it converts the data into tuples and sends them to the bolts to be processed.

What are Storm Topologies?

A. The logic for a real-time application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings.

What happens when Storm kill a topology?

A. Storm won’t kill the topology immediately. Instead, it deactivates all the spouts so that they don’t emit any more tuples, and then Storm waits Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS seconds before destroying all the workers. This gives the topology enough time to complete any tuples it was processing when it got killed.

How can you update a running topology?

A. To update a running topology, the only option currently is to kill the current topology and resubmit a new one.

What does the storm swap command do?

A. A planned feature is to implement a storm swap command that swaps a running topology with a new one, ensuring minimal downtime and no chance of both topologies processing tuples at the same time.

How can you monitor topologies?

A. The best place to monitor a topology is using the Storm UI. The Storm UI provides information about errors happening in tasks and fine-grained stats on the throughput and latency performance of each component of each running topology.

What are Workers?

A. Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology.

How many types of built-in schedulers are there in Storm?

A. Storm now has 4 kinds of built-in schedulers:

  • DefaultScheduler,
  • IsolationScheduler,
  • MultitenantScheduler,
  • ResourceAwareScheduler.

What happens when Nimbus or Supervisor daemons die?

A. The Nimbus and Supervisor daemons are designed to be fail-fast (process self-destructs whenever any unexpected situation is encountered) and stateless (all state is kept in Zookeeper or on disk).

The Nimbus and Supervisor daemons must be run under supervision using a tool like daemontools or monit. So, if the Nimbus or Supervisor daemons die, they restart as though nothing happened.

Most notably, no worker processes are affected by the death of Nimbus or the Supervisors. This is in contrast to Hadoop, where if the JobTracker dies, all the running jobs are lost.

What rules of thumb can you give me for configuring Storm+Trident?

A. Keep the number of workers a multiple of the number of machines; parallelism a multiple of the number of workers; and the number of Kafka partitions a multiple of the spout parallelism.

  • Use one worker per topology per machine
  • Start with fewer, larger aggregators, one per machine with workers on it
  • Use the isolation scheduler
  • Use one acker per worker — 0.9 makes that the default, but earlier versions do not.
  • Enable GC logging; you should see very few major GCs if things are in reasonable shape.
  • Set the trident batch millis to about 50% of your typical end-to-end latency.

Start with a max spout pending that is for sure too small — one for trident, or the number of executors for storm — and increase it until you stop seeing changes in the flow. You’ll probably end up with something near 2*(throughput in recs/sec)*(end-to-end latency) (2x the Little’s law capacity).
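For example, with a hypothetical throughput of 5,000 records/sec and a typical end-to-end latency of 200 ms, that rule of thumb suggests roughly 2 × 5,000 × 0.2 = 2,000 as a ballpark max spout pending.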

Why am I getting a NotSerializableException/IllegalStateException when my topology is being started up?

A. Within the Storm lifecycle, the topology is instantiated and then serialized to byte format to be stored in ZooKeeper, prior to the topology being executed.

Within this step, if a spout or bolt within the topology has an initialized unserializable property, serialization will fail. If there is a need for a field that is unserializable, initialize it within the bolt or spout’s prepare method, which is run after the topology is delivered to the worker.
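A sketch of that pattern, with a JDBC connection standing in for any unserializable resource (the URL is hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class DbBolt extends BaseRichBolt {
    private transient Connection conn; // unserializable, so never initialized in the constructor
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            // created on the worker, after the topology has been deserialized
            conn = DriverManager.getConnection("jdbc:postgresql://db.example.com/stats");
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        collector.ack(tuple); // the real work using conn would go here
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }
}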

So, this brings us to the end of the Apache Storm interview questions blog. This Tecklearn 'Top Apache Storm Interview Questions and Answers' helps you with commonly asked questions if you are looking out for a job in Apache Storm or the Big Data domain. If you wish to learn Apache Storm and build a career in the Big Data domain, then check out our interactive Apache Storm Training, which comes with 24*7 support to guide you throughout your learning period.

https://www.tecklearn.com/course/apache-strom-training/

Apache Storm Training

About the Course

Tecklearn's Apache Storm training will give you a working knowledge of the open-source computational engine, Apache Storm. You will be able to do distributed real-time data processing and come up with valuable insights. You will learn about the deployment and development of Apache Storm applications in the real world for handling Big Data and implementing various analytical tools for powerful enterprise-grade solutions. Upon completion of this online training, you will hold a solid understanding of and hands-on experience with Apache Storm.

Why Should you take Apache Storm Training?

  • The average pay of an Apache Storm professional stands at $90,167 P.A. – Indeed.com
  • Groupon, Twitter and many other companies use Apache Storm for business purposes like real-time analytics and micro-batch processing.
  • Apache Storm is a free and open-source, distributed real-time computation system for processing fast, large streams of data.

What you will Learn in this Course?

Introduction to Apache Storm

  • Apache Storm
  • Apache Storm Data Model

Architecture of Storm

  • Apache Storm Architecture
  • Hadoop distributed computing
  • Apache Storm features

Installation and Configuration

  • Pre-requisites for Installation
  • Installation and Configuration

Storm UI

  • Zookeeper
  • Storm UI

Storm Topology Patterns

Got a question for us? Please mention it in the comments section and we will get back to you.

 
