Introduction to Apache Cassandra, History and Architecture

Last updated on May 30 2022
Lalit Kolgaonkar

Cassandra – Introduction

Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a type of NoSQL database. Let us first understand what a NoSQL database does.

NoSQL Database

A NoSQL database (sometimes read as "Not Only SQL") is a database that provides a mechanism to store and retrieve data modeled in ways other than the tabular relations used in relational databases. These databases are typically schema-free, support easy replication, have simple APIs, are often eventually consistent, and can handle huge amounts of data.

The primary objective of a NoSQL database is to have

  • simplicity of design,
  • horizontal scaling, and
  • finer control over availability.

NoSQL databases use different data structures from relational databases, which makes some operations faster. The suitability of a given NoSQL database depends on the problem it must solve.

NoSQL vs. Relational Database

The following table lists the points that differentiate a relational database from a NoSQL database.

Relational Database | NoSQL Database
Supports a powerful query language. | Usually supports a simpler query language.
Has a fixed schema. | Has no fixed schema.
Follows ACID (Atomicity, Consistency, Isolation, and Durability). | Is often only “eventually consistent”.
Supports transactions. | Typically offers limited or no transaction support.

Besides Cassandra, we have the following NoSQL databases that are quite popular −

  • Apache HBase − HBase is an open source, non-relational, distributed database modeled after Google’s Bigtable and written in Java. It is developed as part of the Apache Hadoop project and runs on top of HDFS, providing Bigtable-like capabilities for Hadoop.
  • MongoDB − MongoDB is a cross-platform, document-oriented database system that avoids the traditional table-based relational structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster.

What is Apache Cassandra?

Apache Cassandra is an open source, distributed, and decentralized storage system (database) for managing very large amounts of structured data spread out across the world. It provides a highly available service with no single point of failure.

Listed below are some of the notable points of Apache Cassandra −

  • It is scalable and fault-tolerant, and it offers tunable consistency.
  • It is a column-oriented (wide-column) database.
  • Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable.
  • Created at Facebook, it differs sharply from relational database management systems.
  • Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model.
  • Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, and Netflix.

Features of Cassandra

Cassandra has become so popular because of its outstanding technical features. Given below are some of the features of Cassandra:

  • Elastic scalability − Cassandra is highly scalable; it lets you add more hardware to accommodate more customers and more data as required.
  • Always on architecture − Cassandra has no single point of failure and it is continuously available for business-critical applications that cannot afford a failure.
  • Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore, it maintains a quick response time.
  • Flexible data storage − Cassandra accommodates structured, semi-structured, and unstructured data formats. It can dynamically accommodate changes to your data structures according to your needs.
  • Easy data distribution − Cassandra provides the flexibility to distribute data where you need by replicating data across multiple data centers.
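  • Transaction support − Cassandra does not offer full ACID (Atomicity, Consistency, Isolation, and Durability) transactions in the relational sense. It provides atomic, isolated, and durable writes at the partition level, together with tunable consistency and lightweight transactions.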
  • Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs very fast writes and can store hundreds of terabytes of data without sacrificing read efficiency.

History of Cassandra

Cassandra was initially developed at Facebook by two Indian engineers, Avinash Lakshman (one of the authors of Amazon’s Dynamo) and Prashant Malik, to power the Facebook inbox search feature.

The following points summarize the most important milestones in Cassandra’s history:

  • Cassandra was developed at Facebook by Avinash Lakshman and Prashant Malik.
  • It was developed for the Facebook inbox search feature.
  • It was open sourced by Facebook in July 2008.
  • It was accepted into the Apache Incubator in March 2009.
  • Cassandra has been a top-level Apache project since February 2010.

Cassandra Architecture

Cassandra was designed to handle big data workloads across multiple nodes without a single point of failure. It uses a peer-to-peer architecture across its nodes, and data is distributed among all the nodes in a cluster.

  • In Cassandra, each node is independent and at the same time interconnected to other nodes. All the nodes in a cluster play the same role.
  • Every node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
  • If one node fails, read and write requests can be served by other nodes in the network.
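
The points above can be illustrated with a small Python sketch. It is a toy model, not Cassandra's implementation: the Ring class, the node names, and the use of MD5 as the hash are all assumptions made for illustration (Cassandra has its own partitioners). It only shows how identical nodes can each work out which replicas hold a key, and how the remaining replicas can still serve requests when one node is down.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a ring position (MD5 is only a stand-in hash here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy cluster in which every node plays the same role."""

    def __init__(self, node_names):
        # Each node owns one position on the ring, derived from its name.
        self.positions = sorted((token(name), name) for name in node_names)
        self.tokens = [t for t, _ in self.positions]

    def replicas_for(self, key, replication_factor=3):
        """Walk clockwise from the key's position and take the next nodes."""
        size = len(self.positions)
        start = bisect.bisect_right(self.tokens, token(key)) % size
        count = min(replication_factor, size)
        return [self.positions[(start + i) % size][1] for i in range(count)]

    def live_replicas(self, key, down=(), replication_factor=3):
        """Replicas that can still answer when some nodes have failed."""
        owners = self.replicas_for(key, replication_factor)
        return [name for name in owners if name not in down]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas_for("user:42")
print(owners)
print(ring.live_replicas("user:42", down={owners[0]}))
```

Because every node can run the same lookup, any of them can accept a request and forward it to the replicas that hold the data.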

Data Replication in Cassandra

In Cassandra, one or more nodes in a cluster act as replicas for a given piece of data. If some of the nodes respond with an out-of-date value, Cassandra returns the most recent value to the client. After returning the most recent value, Cassandra performs a read repair in the background to update the stale values.

The following image shows a schematic view of how Cassandra uses data replication among the nodes in a cluster to ensure no single point of failure.

Figure: Data Replication
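
The "most recent value wins, then repair" behaviour described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the Cell class and read_with_repair function are hypothetical names, each replica is a plain dictionary, and the repair runs inline rather than in the background as it would in Cassandra.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    value: str
    timestamp: int  # write time; a higher number means a more recent write


def read_with_repair(replicas, key):
    """Return the newest value for key and overwrite stale replica copies.

    `replicas` maps a replica name to its local {key: Cell} store.
    """
    responses = {name: store.get(key) for name, store in replicas.items()}
    found = [cell for cell in responses.values() if cell is not None]
    if not found:
        return None, []
    newest = max(found, key=lambda cell: cell.timestamp)

    repaired = []
    for name, cell in responses.items():
        # A replica is stale if it has no copy or an older copy.
        if cell is None or cell.timestamp < newest.timestamp:
            replicas[name][key] = Cell(newest.value, newest.timestamp)
            repaired.append(name)
    return newest.value, repaired


replicas = {
    "n1": {"k": Cell("old", 1)},
    "n2": {"k": Cell("new", 2)},
    "n3": {},
}
print(read_with_repair(replicas, "k"))
```

After the call, all three toy replicas hold the value written at timestamp 2.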

Components of Cassandra

The main components of Cassandra are:

  • Node: A Cassandra node is a place where data is stored.
  • Data center: A data center is a collection of related nodes.
  • Cluster: A cluster is a component which contains one or more data centers.
  • Commit log: In Cassandra, the commit log is a crash-recovery mechanism. Every write operation is written to the commit log.
  • Mem-table: A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, a single column family will have multiple mem-tables.
  • SSTable: An SSTable is a disk file to which data is flushed from the mem-table when its contents reach a threshold value.
  • Bloom filter: Bloom filters are quick, probabilistic algorithms for testing whether an element is a member of a set. They act as a special kind of cache and are consulted on every read, so that Cassandra can skip SSTables that cannot contain the requested data.
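
To make the Bloom filter idea concrete, here is a minimal Python sketch. It is not Cassandra's implementation: the BloomFilter class, its sizes, and the use of SHA-256 are assumptions chosen for readability. The key property is that a "no" answer is always correct, while a "yes" answer only means the element might be present.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("row-1")
print(sstable_filter.might_contain("row-1"))
```

In the read path, a filter like this lets a node skip an SSTable whenever the filter reports that a key is definitely absent.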

Cassandra Query Language

Users access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers work with CQL either through cqlsh, an interactive prompt, or through separate application language drivers.

Clients can approach any of the nodes for their read and write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data.
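
As a rough illustration of driver-based access, the sketch below uses the DataStax Python driver (the cassandra-driver package). It assumes that the driver is installed and that a Cassandra node is reachable at 127.0.0.1; the keyspace demo, the table users, and the function run_demo are hypothetical names chosen for this example.

```python
import uuid

# Hypothetical keyspace and table names, chosen only for illustration.
CREATE_KEYSPACE = (
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
CREATE_TABLE = (
    "CREATE TABLE IF NOT EXISTS demo.users (user_id uuid PRIMARY KEY, name text)"
)
INSERT_USER = "INSERT INTO demo.users (user_id, name) VALUES (%s, %s)"
SELECT_USER = "SELECT name FROM demo.users WHERE user_id = %s"


def run_demo(contact_points=("127.0.0.1",)):
    """Connect to a node, which then coordinates the requests."""
    # Imported here so the statements above can be read without the driver.
    from cassandra.cluster import Cluster

    cluster = Cluster(list(contact_points))
    session = cluster.connect()
    try:
        session.execute(CREATE_KEYSPACE)
        session.execute(CREATE_TABLE)
        user_id = uuid.uuid4()
        session.execute(INSERT_USER, (user_id, "Ada"))
        row = session.execute(SELECT_USER, (user_id,)).one()
        return row.name
    finally:
        cluster.shutdown()
```

Calling run_demo() against a running node should create the keyspace and table, write one row, and read it back. The same CQL statements can also be typed directly into cqlsh, with literal values in place of the %s placeholders.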

Write Operations

Every write to a node is first recorded in that node’s commit log. The data is then stored in the mem-table. Whenever the mem-table is full, data is written to an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary data.

Figure: Write
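
The write path above can be modeled with a short Python sketch. It is a deliberately simplified, single-node toy: the Node class and its memtable_limit parameter are assumptions, everything is held in memory, and the commit log is never trimmed, whereas a real node writes these structures to disk and recycles old log segments.

```python
class Node:
    """Toy single-node write path: commit log, mem-table, SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record used for crash recovery
        self.memtable = {}     # in-memory writes, newest value per key
        self.sstables = []     # flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. record the write first
        self.memtable[key] = value             # 2. then update the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when it is full

    def flush(self):
        # Write the mem-table out as a sorted table and start a new one.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        # Merge all tables; values from later tables replace earlier ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))] if merged else []


node = Node(memtable_limit=2)
for key, value in [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4")]:
    node.write(key, value)
node.compact()
print(node.sstables)
```

The compact step plays the role of the periodic consolidation: the outdated value for key "a" is discarded and only the latest one survives.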

Read Operations

In read operations, Cassandra gets values from the mem-table and checks the bloom filter to find the appropriate SSTable that contains the required data.

There are three types of read requests that a coordinator sends to replicas.

  • Direct request
  • Digest request
  • Read repair request

The coordinator sends a direct request to one of the replicas. After that, the coordinator sends digest requests to the number of replicas specified by the consistency level and checks whether the returned data is up to date.

After that, the coordinator sends digest requests to all the remaining replicas. If any node returns an out-of-date value, a background read repair request updates that data. This process is called the read repair mechanism.

Figure: Read
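
The difference between a direct request and a digest request can be sketched as follows. This is an illustrative model only: the function names, the use of SHA-256 for digests, and the treatment of the consistency level as a plain integer are all assumptions. One replica returns the full data, the others return only a hash, and any replica whose hash differs becomes a candidate for read repair.

```python
import hashlib


def digest(cell):
    """Fingerprint of a replica's answer; much smaller than the data itself."""
    return hashlib.sha256(repr(cell).encode("utf-8")).hexdigest()


def coordinate_read(replicas, key, consistency_level=2):
    """Compare one full response against digests from the other replicas.

    `replicas` maps a replica name to its {key: (value, timestamp)} store.
    `consistency_level` is the number of replicas consulted before the
    coordinator answers (a plain integer in this sketch).
    Returns (data, mismatched_required, mismatched_remaining).
    """
    names = list(replicas)
    direct = names[0]
    required = names[1:consistency_level]
    remaining = names[consistency_level:]

    # Direct request: one replica returns the full data.
    data = replicas[direct].get(key)
    expected = digest(data)

    def mismatched(group):
        # Digest request: each replica returns only a hash of its data.
        return [n for n in group if digest(replicas[n].get(key)) != expected]

    return data, mismatched(required), mismatched(remaining)


replicas = {
    "n1": {"k": ("new", 2)},
    "n2": {"k": ("new", 2)},
    "n3": {"k": ("old", 1)},
}
print(coordinate_read(replicas, "k", consistency_level=2))
```

In this example the two replicas needed for the consistency level agree, so the client can be answered at once, while the third replica is flagged for a background read repair.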

So, this brings us to the end of this blog. This Tecklearn ‘Introduction to Apache Cassandra, History and Architecture’ blog helps you with commonly asked questions if you are looking for a job in the Cassandra and NoSQL database domain.

If you wish to learn Cassandra and build a career in the Cassandra or NoSQL database domain, then check out our interactive Apache Cassandra Training, which comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/apache-cassandra-training/

Apache Cassandra Training

About the Course

Take your career to the next level as a certified Apache Cassandra developer by acquiring all the skills through our hands-on training sessions. Tecklearn’s Apache Cassandra Certification Training is designed by professionals as per the industry requirements and demands. This Cassandra Certification Training helps you to master the concepts of Apache Cassandra including Cassandra Architecture, its features, Cassandra Data Model, and its Administration. Our Cassandra certification training course lets you master the high availability NoSQL distributed database.

Why Should you take Apache Cassandra Training?

  • The average salary of a Software Engineer with Apache Cassandra skill is $120,500 per year. – Payscale.com
  • Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets.
  • Apache Cassandra is one of the most widely used NoSQL databases. It offers features such as fault tolerance, scalability, flexible data storage, and efficient writes, which make it suitable for a wide range of purposes.

What you will Learn in this Course?

Introduction to Big Data, and Cassandra

  • What is Big Data
  • Limitations of RDBMS
  • NoSQL and its Characteristics
  • CAP Theorem
  • Basic concepts of Cassandra
  • Features of Cassandra

Cassandra Data model, Installation and setup

  • Installation of Cassandra
  • Key concepts and deployment of non-relational database, column-oriented database, Data Model – column, column family

Cassandra Architecture

  • Explain the Architecture of Cassandra
  • Different Layers of Cassandra Architecture
  • Partitioning and Snitches
  • Explain Vnodes and How Read and Write Path works
  • Understand Compaction, Anti-Entropy and Tombstone
  • Describe Repairs in Cassandra

Deep Dive into Cassandra Database

  • Describe Different Data Types Used in Cassandra
  • Explain Collection Types
  • Describe What are CRUD Operations
  • Implement Insert, Select, Update and Delete of various elements
  • Implement Various Functions Used in Cassandra
  • Describe Importance of Roles and Indexing

Backup & Restore and Performance Tuning

  • Learn backup and restore functionality and its importance
  • Create a snapshot using Nodetool utility
  • Restore a snapshot
  • Understand how to choose the right balance of the following resources: memory, CPU, disks, number of nodes, and network.
  • Understand all the logs created by Cassandra
  • Explain the purpose of different log files
  • Configure the log files
  • Learn about Performance Tuning
  • Integration with Spark and Kafka

Advanced Modelling

  • Rules of Cassandra data modelling
  • Modelling data around queries
  • Creating table for data queries

Deploying the IDE for Cassandra applications

  • Learning key drivers
  • Deploying the IDE for Cassandra applications and cluster connection
  • Data query implementation

Cassandra Administration

  • Understanding Node Tool Utility
  • Cluster management using Command Line Interface
  • Management and Monitoring using DataStax Ops Center

Cassandra API and Summarization

  • Cassandra client connectivity
  • Connection pool internals
  • Cassandra API
  • Features and concepts of Hector client
  • Thrift, Java code and Summarization
