Overview of How Cassandra Stores its data

Last updated on May 30 2022
Lalit Kolgaonkar

Cassandra – Data Model

The data model of Cassandra is significantly different from what we normally see in an RDBMS. This blog provides an overview of how Cassandra stores its data.

Cluster

A Cassandra database is distributed over several machines that operate together. The outermost container is known as the cluster. To handle failures, data is replicated across nodes, so if one node fails, another node holding a replica can serve the request. Cassandra arranges the nodes of a cluster in a ring and assigns data to them.
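
The ring arrangement can be illustrated with a small Python sketch. This is only a simplified model, not Cassandra's implementation: the names `Ring`, `token_for`, and `owner` are hypothetical, each node gets a single token derived from its name, and MD5 stands in for the hash function (Cassandra's default partitioner uses Murmur3).

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a position on the ring (MD5 is an illustrative stand-in)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy token ring: each node owns exactly one token (an assumption)."""

    def __init__(self, nodes):
        # Sort nodes by token so the list order matches the ring order.
        self._entries = sorted((token_for(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._entries]

    def owner(self, key: str) -> str:
        """Return the first node whose token is >= the key's token, wrapping around."""
        index = bisect.bisect_left(self._tokens, token_for(key)) % len(self._tokens)
        return self._entries[index][1]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", ring.owner(key))
```

The same key always maps to the same node, which is what lets any node in the cluster work out where a given row lives.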

Keyspace

A keyspace is the outermost container for data within a Cassandra cluster. The basic attributes of a keyspace in Cassandra are −

  • Replication factor − It is the number of machines in the cluster that will receive copies of the same data.
  • Replica placement strategy − It is the strategy used to place replicas in the ring. The available strategies are simple strategy (rack-unaware), old network topology strategy (rack-aware), and network topology strategy (datacenter-aware).
  • Column families − Keyspace is a container for a list of one or more column families. A column family, in turn, is a container of a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.

The syntax of creating a Keyspace is as follows −

CREATE KEYSPACE keyspace_name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

The following illustration shows a schematic view of a Keyspace.

[Figure: Keyspace]
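
To make the replication factor more concrete, here is a minimal Python sketch of how a simple (rack-unaware) strategy can choose replicas: start at the node that owns the data and walk clockwise around the ring. The function name `simple_strategy_replicas` and its parameters are hypothetical, and the node list is assumed to be already in ring order.

```python
def simple_strategy_replicas(ring_nodes, primary_index, replication_factor):
    """Pick replicas by walking clockwise from the primary node.

    ring_nodes is assumed to be listed in ring order; racks and data
    centers are ignored, as in a rack-unaware placement.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # A cluster cannot hold more distinct replicas than it has nodes.
    count = min(replication_factor, len(ring_nodes))
    return [ring_nodes[(primary_index + i) % len(ring_nodes)] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
print(simple_strategy_replicas(nodes, primary_index=3, replication_factor=3))
# -> ['node-d', 'node-e', 'node-a']
```

With a replication factor of 3, losing any single node still leaves two copies of the data on other nodes.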

Column Family

A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns. The following table lists the points that differentiate a column family from a table of relational databases.

Relational Table | Cassandra Column Family
A schema in a relational model is fixed. Once we define the columns of a table, every inserted row must fill all of them, at least with a null value. | In Cassandra, although the column families are defined, the columns are not. You can freely add any column to any column family at any time.
Relational tables define only columns, and the user fills in the table with values. | In Cassandra, a table contains columns, or can be defined as a super column family.

A Cassandra column family has the following attributes −

  • keys_cached − It represents the number of locations to keep cached per SSTable.
  • rows_cached − It represents the number of rows whose entire contents will be cached in memory.
  • preload_row_cache − It specifies whether you want to pre-populate the row cache.

Note − Unlike a relational table, whose schema is fixed, a column family's schema is not fixed: Cassandra does not force individual rows to have all the columns.

The following figure shows an example of a Cassandra column family.

[Figure: Column Family]
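
The idea that rows in one column family need not share the same columns can be sketched in Python with nested dictionaries. This is a conceptual model only; `insert` and `read_row` are hypothetical helpers, not Cassandra API calls. Note that in current CQL tables, columns are declared in the table schema, although individual rows may still leave them unset.

```python
# A column family modeled as: row key -> {column name -> value}.
users = {}


def insert(column_family, row_key, **columns):
    """Add or overwrite columns on one row; other rows are unaffected."""
    column_family.setdefault(row_key, {}).update(columns)


def read_row(column_family, row_key):
    """Return the row's columns ordered by column name."""
    return sorted(column_family.get(row_key, {}).items())


insert(users, "alice", email="alice@example.com", city="Pune")
insert(users, "bob", email="bob@example.com")
insert(users, "bob", phone="555-0100")  # a column that alice's row does not have

print(read_row(users, "alice"))  # -> [('city', 'Pune'), ('email', 'alice@example.com')]
print(read_row(users, "bob"))    # -> [('email', 'bob@example.com'), ('phone', '555-0100')]
```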

Column

A column is the basic data structure of Cassandra. It has three values, namely the key (or column name), the value, and a timestamp. Given below is the structure of a column.

[Figure: Column]
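
The three parts of a column, and one reason the timestamp matters, can be sketched as follows. Cassandra uses timestamps to decide which version of a column is the most recent (last write wins). The `Column` class and `reconcile` function here are hypothetical illustrations, and the way ties are handled is an arbitrary choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # assumed to be supplied by the writer, e.g. microseconds


def reconcile(a: Column, b: Column) -> Column:
    """Return the version with the newer timestamp (last write wins)."""
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    # On equal timestamps this sketch simply keeps the first argument.
    return a if a.timestamp >= b.timestamp else b


old = Column("email", "old@example.com", timestamp=1_000)
new = Column("email", "new@example.com", timestamp=2_000)
print(reconcile(old, new).value)  # -> new@example.com
```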

SuperColumn

A super column is a special kind of column, so it is also a key-value pair, but its value is a map of sub-columns.

Generally, column families are stored on disk in individual files. Therefore, to optimize performance, it is important to keep columns that you are likely to query together in the same column family, and a super column can be helpful here. Given below is the structure of a super column.

[Figure: SuperColumn]
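
As a rough illustration, a super column can be pictured as a name whose value is itself a map of sub-column names to values. The Python sketch below models that shape only; `get_sub_column` is a hypothetical helper, and super columns belong to Cassandra's older Thrift-era data model rather than to current CQL.

```python
# One row holding two super columns; each maps sub-column names to values.
row = {
    "address": {"city": "Pune", "zip": "411001"},
    "contact": {"email": "a@example.com"},
}


def get_sub_column(row, super_name, sub_name, default=None):
    """Look up a sub-column inside a super column, or return default."""
    return row.get(super_name, {}).get(sub_name, default)


print(get_sub_column(row, "address", "city"))   # -> Pune
print(get_sub_column(row, "contact", "phone"))  # -> None
```

Grouping related sub-columns under one name keeps values that are read together stored together.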

Data Models of Cassandra and RDBMS

The following table lists the points that differentiate the data model of Cassandra from that of an RDBMS.

RDBMS | Cassandra
RDBMS deals with structured data. | Cassandra deals with unstructured data.
It has a fixed schema. | Cassandra has a flexible schema.
In RDBMS, a table is an array of arrays. (ROW x COLUMN) | In Cassandra, a table is a list of “nested key-value pairs”. (ROW x COLUMN key x COLUMN value)
Database is the outermost container that contains data corresponding to an application. | Keyspace is the outermost container that contains data corresponding to an application.
Tables are the entities of a database. | Tables or column families are the entities of a keyspace.
Row is an individual record in RDBMS. | Row is a unit of replication in Cassandra.
Column represents the attributes of a relation. | Column is a unit of storage in Cassandra.
RDBMS supports the concepts of foreign keys, joins. | Relationships are represented using collections.

 

This brings us to the end of the blog. This Tecklearn ‘Overview of How Cassandra Stores its data’ helps you with commonly asked questions if you are looking for a job in the Cassandra and NoSQL database domain.

If you wish to learn Cassandra and build a career in the Cassandra or NoSQL database domain, then check out our interactive Apache Cassandra Training, which comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/apache-cassandra-training/

Apache Cassandra Training

About the Course

Take your career to the next level as a certified Apache Cassandra developer by acquiring all the skills through our hands-on training sessions. Tecklearn’s Apache Cassandra Certification Training is designed by professionals as per the industry requirements and demands. This Cassandra Certification Training helps you to master the concepts of Apache Cassandra including Cassandra Architecture, its features, Cassandra Data Model, and its Administration. Our Cassandra certification training course lets you master the high availability NoSQL distributed database.

Why Should you take Apache Cassandra Training?

  • The average salary of a Software Engineer with Apache Cassandra skill is $120,500 per year. – Payscale.com
  • Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets.
  • Apache Cassandra is one of the most widely used NoSQL databases. It offers features such as fault tolerance, scalability, flexible data storage, and efficient writes, which make it suitable for a wide range of purposes.

What you will Learn in this Course?

Introduction to Big Data, and Cassandra

  • What is Big Data
  • Limitations of RDBMS
  • NoSQL and its Characteristics
  • CAP Theorem
  • Basic concepts of Cassandra
  • Features of Cassandra

Cassandra Data model, Installation and setup

  • Installation of Cassandra
  • Key concepts and deployment of non-relational database, column-oriented database, Data Model – column, column family

Cassandra Architecture

  • Explain the Architecture of Cassandra
  • Different Layers of Cassandra Architecture
  • Partitioning and Snitches
  • Explain Vnodes and How Read and Write Path works
  • Understand Compaction, Anti-Entropy and Tombstone
  • Describe Repairs in Cassandra

Deep Dive into Cassandra Database

  • Describe Different Data Types Used in Cassandra
  • Explain Collection Types
  • Describe What are CRUD Operations
  • Implement Insert, Select, Update and Delete of various elements
  • Implement Various Functions Used in Cassandra
  • Describe Importance of Roles and Indexing

Backup & Restore and Performance Tuning

  • Learn backup and restore functionality and its importance
  • Create a snapshot using Nodetool utility
  • Restore a snapshot
  • Understand how to choose the right balance of the following resources: memory, CPU, disks, number of nodes, and network.
  • Understand all the logs created by Cassandra
  • Explain the purpose of different log files
  • Configure the log files
  • Learn about Performance Tuning
  • Integration with Spark and Kafka

Advanced Modelling

  • Rules of Cassandra data modelling
  • Modelling data around queries
  • Creating table for data queries

Deploying the IDE for Cassandra applications

  • Learning key drivers
  • Deploying the IDE for Cassandra applications and cluster connection
  • Data query implementation

Cassandra Administration

  • Understanding Node Tool Utility
  • Cluster management using Command Line Interface
  • Management and Monitoring using DataStax Ops Center

Cassandra API and Summarization

  • Cassandra client connectivity
  • Connection pool internals
  • Cassandra API
  • Features and concepts of Hector client
  • Thrift, JAVA code and Summarization

Got a question for us? Please mention it in the comments section and we will get back to you.
