Let's meet Apache Cassandra - Ashic Mahtab

What is it?

Cassandra is known for being the fastest database in the industry where write operations are concerned:

4x better in writes
2x better in reads
12x better in reads/updates

Distributed (makes it easier to spin up new servers on the fly, transparently)
Fault tolerant
Tunable consistency (NoSQL systems are usually described in terms of the CAP theorem: Consistency, Availability and Partition tolerance)

Self healing
Built-in replication
Query language: CQL (Cassandra Query Language) similar to SQL with some extensions

Ex1:

SELECT total_purchases FROM SALES
USING CONSISTENCY QUORUM
WHERE customer_id = 5

Ex2:

UPDATE SALES
USING CONSISTENCY ONE
SET total_purchases = 500000
WHERE customer_id = 4

Note: USING CONSISTENCY is CQL 2 syntax. In CQL 3 the clause was removed, and the consistency level is set by the client instead (for example with the CONSISTENCY command in cqlsh, or per statement in a driver).

Ecosystem: Cassandra is an Apache project with plenty of contributors. Other tools extend it, such as Solr for full-text search or Spark for analytics (Spark runs locally on each node and builds reports from the results).
Fast: Cassandra uses an LSM (log-structured merge-tree). Instead of waiting on the disk for every write, it writes into a dedicated in-memory structure and flushes it to disk later, avoiding delays.

System architecture

Cluster (or ring): a group of nodes that together store one data set, partitioned and replicated across them.
Node: single machine which runs Cassandra.
Datacenter: sometimes we want to group our nodes into a datacenter for geographical reasons. Imagine an e-commerce site with two datacenters: west coast and east coast. Customers on the east coast connect to the east coast datacenter, but in reality they have access to the whole cluster, which contains both datacenters. You can also define several datacenters on a single machine.
Keyspace: generally there is one per application. It resembles the schema concept from an RDBMS, but it doesn't stipulate any structure the way an ER model does. A keyspace contains column families, each of which can have a different number of columns or different columns. The only point in common with a schema is that it contains a number of "objects": tables in an RDBMS, column families or super columns here.
Table
Row
Column

Data Replication

Data stored in partitions
Partitions are replicated
Replication can be cross datacenter
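
To make the idea of partitions and replicas more concrete, here is a minimal Python sketch of how a partition key could be mapped to a set of replica nodes on a ring. It is only an illustration under assumptions: the `Ring` class, the node names and the use of SHA-256 as the hash are hypothetical, each node owns a single token, and replicas are simply the next nodes clockwise. Real Cassandra uses its own partitioner and replication strategies.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (SHA-256 here, purely for illustration)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: each node owns one token; replicas are the next nodes clockwise."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed the number of nodes")
        self.replication_factor = replication_factor
        # Sort the nodes by token so the ring can be walked in order.
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key):
        """Return the nodes that would hold the partition for this key."""
        # First node whose token is >= the key's token, wrapping around the ring.
        start = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("customer:5"))
```

With a replication factor of 3 on four nodes, every partition lives on three distinct nodes, so losing one node still leaves two copies.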

Reading and writing in Cassandra

Cassandra has a peer-to-peer, read/write-anywhere architecture, so any user can connect to any node in any data center and read or write the data they need, with all writes being partitioned and replicated automatically throughout the cluster.

Writes in Cassandra

Writes first go to a commit log for durability
Then they are written to a memtable in memory
Once the memtable becomes full, it is flushed to an SSTable (Sorted String table)
Writes are atomic at the row level; all columns are written or updated, or none are.
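
The write path above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the `ToyNode` class and the `memtable_limit` threshold (counted in rows) are hypothetical, and everything stays in memory instead of going to disk.

```python
class ToyNode:
    """Illustrative write path: commit log -> memtable -> SSTable flush."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical threshold, in rows
        self.commit_log = []  # append-only record of writes not yet flushed
        self.memtable = {}    # in-memory rows: row key -> columns
        self.sstables = []    # immutable lists of (row key, columns), sorted by key

    def write(self, row_key, columns):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((row_key, dict(columns)))
        # 2. Apply to the memtable, merging columns into any existing row.
        self.memtable.setdefault(row_key, {}).update(columns)
        # 3. Flush once the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified afterwards.
        self.sstables.append(sorted(self.memtable.items(), key=lambda kv: kv[0]))
        self.memtable = {}
        # Flushed writes no longer need replaying, so the log can be trimmed.
        self.commit_log.clear()

    def read(self, row_key):
        # Merge from the oldest SSTable to the memtable so later writes win.
        row = {}
        for table in self.sstables:
            for key, columns in table:
                if key == row_key:
                    row.update(columns)
        row.update(self.memtable.get(row_key, {}))
        return row or None


node = ToyNode(memtable_limit=2)
node.write(4, {"total_purchases": 500000})
node.write(5, {"total_purchases": 10})      # memtable is full: flushed to an SSTable
node.write(4, {"total_purchases": 700000})  # newer value sits in the memtable
print(node.read(4))
```

Because SSTables are never rewritten in place, a read has to merge the memtable with older SSTables, which is the price paid for cheap sequential writes.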

Tunable Data consistency

Choose between strong and eventual consistency (anything from all replicas to any single node responding), depending on the need
Can be done on a per-operation basis, and for both reads and writes
Handles multi-data-center operations
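
A common rule of thumb for telling strong from eventual consistency is that a read is guaranteed to see the latest write when the replicas contacted for the read must overlap the replicas that acknowledged the write, that is, when read acks + write acks > replication factor. The rule is not stated in the talk notes; the sketch below, with a hypothetical helper name, just checks it for a few combinations.

```python
def reads_see_latest_write(replication_factor, write_acks, read_acks):
    """True when any set of read replicas must overlap any set of write replicas."""
    if not (1 <= write_acks <= replication_factor
            and 1 <= read_acks <= replication_factor):
        raise ValueError("acks must be between 1 and the replication factor")
    return write_acks + read_acks > replication_factor


RF = 3  # assumed replication factor for the example
for name, w, r in [("ONE/ONE", 1, 1), ("QUORUM/QUORUM", 2, 2), ("ALL/ONE", 3, 1)]:
    kind = "strong" if reads_see_latest_write(RF, w, r) else "eventual"
    print(f"write/read {name}: {kind}")
```

Since the level is chosen per operation, the same application can mix fast, eventually consistent reads with strongly consistent ones where it matters.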

Selecting a Strategy for Writes

Any – a write must succeed on any available node
One – a write must succeed on any node responsible for that row (either primary or replica)
Quorum – a write must succeed on a quorum of replica nodes, determined by (replication_factor / 2) + 1, with the division rounded down
Local_Quorum - a write must succeed on a quorum of replica nodes in the same data center as the coordinator node
Each_Quorum - a write must succeed on a quorum of replica nodes in all data centers
All – a write must succeed on all replica nodes for a row key
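
As a worked example of the levels above, the following Python sketch counts how many acknowledgements a write needs at each level. The function names and the two-datacenter layout are assumptions for illustration; the sketch only returns a count and does not model which nodes have to answer.

```python
def quorum(replication_factor):
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_write_acks(level, replicas_per_dc, local_dc):
    """Acknowledgements needed for a write, given replicas per datacenter."""
    total = sum(replicas_per_dc.values())
    if level in ("ANY", "ONE"):
        # ANY accepts any available node; ONE needs a replica for that row.
        return 1
    if level == "QUORUM":
        return quorum(total)
    if level == "LOCAL_QUORUM":
        return quorum(replicas_per_dc[local_dc])
    if level == "EACH_QUORUM":
        # A quorum is needed in every datacenter, so the counts add up.
        return sum(quorum(rf) for rf in replicas_per_dc.values())
    if level == "ALL":
        return total
    raise ValueError(f"unknown consistency level: {level}")


replicas = {"east": 3, "west": 3}  # hypothetical layout: 3 replicas per datacenter
for level in ("ANY", "ONE", "QUORUM", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"):
    print(level, required_write_acks(level, replicas, local_dc="east"))
```

With three replicas in each of two datacenters, QUORUM needs 4 acknowledgements from anywhere in the cluster, while LOCAL_QUORUM needs only 2 from the coordinator's own datacenter, which is why it is attractive for multi-datacenter setups.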

Selecting a Strategy for Reads

One – reads from the closest node holding the data
Quorum – returns a result from a quorum of servers with the most recent timestamp for the data
Local_Quorum - returns a result from a quorum of servers with the most recent timestamp for the data in the same data center as the coordinator node
Each_Quorum - returns a result from a quorum of servers with the most recent timestamp in all data centers
All – returns a result from all replica nodes for a row key
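
The "most recent timestamp" rule for reads can be sketched as follows. This is a simplified model with hypothetical names (`Response`, `resolve_read`): the coordinator is assumed to wait for the first `required` replica responses and return the value carrying the newest timestamp among them.

```python
from typing import NamedTuple


class Response(NamedTuple):
    node: str
    value: object
    timestamp: int  # write timestamp stored alongside the value


def resolve_read(responses, required):
    """Return the newest value among the first `required` replica responses."""
    if len(responses) < required:
        raise RuntimeError("not enough replicas responded for this consistency level")
    considered = responses[:required]
    return max(considered, key=lambda r: r.timestamp).value


# Replicas in the order they answered; node-a has not yet received the latest write.
responses = [
    Response("node-a", 400000, timestamp=100),
    Response("node-b", 500000, timestamp=200),
    Response("node-c", 500000, timestamp=200),
]
print("ONE:   ", resolve_read(responses, required=1))
print("QUORUM:", resolve_read(responses, required=2))
```

In this example a read at ONE can return the stale value from the first replica to answer, whereas a QUORUM read compares two replicas and returns the newer one.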

Demo

First download Vagrant, which we use for our development environment, from

https://www.vagrantup.com/
Then, from the folder, run: vagrant up

Go to www.datastax.com. DataStax makes free smart start installers available for Cassandra that include:

The most up-to-date Cassandra version that is production quality
A version of DataStax OpsCenter, which is a visual, browser-based management tool for managing and monitoring Cassandra
Drivers and connectors for popular development languages
Sample database and application
Automatic configuration assistance for ensuring optimal performance and setup for either standalone or cluster implementations
http://www.datastax.com/download

Now we can download our first solution

https://github.com/heartysoft/vagrant-sparkcassandra
In the downloaded solution, copy the JDK v1.8 into the following folder:
...\vagrant-sparkcassandra\modules\jdk\files\oracle\jdk8\
git submodule init
git submodule update

In Visual Studio, add the Cassandra C# driver NuGet package.

References

https://en.wikipedia.org/wiki/Log-structured_merge-tree
http://es.slideshare.net/DataStax/understanding-data-consistency-in-apache-cassandra
