
Let's meet Apache Cassandra - Ashic Mahtab

What is it?

Cassandra is known for being one of the fastest databases in the industry where write operations are concerned:
  • 4x better in writes
  • 2x better in reads
  • 12x better in reads/updates


  • Distributed (makes it easy to spin up new servers on the fly, transparently)
  • Fault tolerant
  • Tunable consistency (NoSQL systems are usually described in terms of the CAP theorem: Consistency, Availability and Partition tolerance)
  • Self healing
  • Built-in replication
  • Query language: CQL (Cassandra Query Language) similar to SQL with some extensions
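    • Ex1:
      • SELECT total_purchases FROM SALES
      • USING CONSISTENCY QUORUM
      • WHERE customer_id = 5
    • Ex2:
      • UPDATE SALES
      • USING CONSISTENCY ONE
      • SET total_purchases = 500000
      • WHERE customer_id = 4
    • Note: the USING CONSISTENCY clause is CQL 2 syntax. CQL 3 removed it, and the consistency level is set per request by the client or driver instead.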
  • Clients: Cassandra is an Apache project with plenty of contributors. Other tools extend it, such as Solr for full-text search or Spark for analytics (Spark runs locally on each node and builds reports on the results).
  • Fast: storage is based on an LSM tree (log-structured merge-tree). Instead of waiting on the disk for every write, Cassandra buffers writes in an in-memory structure and flushes them to disk later, which avoids delays.

System architecture

  • Cluster (or ring): a group of nodes that together hold the data set
  • Node: single machine which runs Cassandra.
  • Data center: sometimes we want to group our nodes into data centers for geographical reasons. Imagine an e-commerce website with two data centers, one on the west coast and one on the east coast. Customers on the east coast connect to the east coast data center, but in reality they have access to the whole cluster, which contains both data centers. You can also define several data centers on a single machine.
  • Keyspace: generally there is one per application. It resembles the schema concept from an RDBMS, but it does not stipulate any structure the way an ER model does. A keyspace contains column families, each of which can have a different number of columns or different columns. The only point in common with a schema is that it contains a number of "objects": tables in an RDBMS, column families or super columns here.
  • Table
  • Row
  • Column

Data Replication

  • Data is stored in partitions
  • Partitions are replicated
  • Replication can be cross datacenter
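To make the idea of replicated partitions concrete, here is a minimal sketch of how a partition key could be mapped to replica nodes on a ring. It is a toy model, not Cassandra's actual placement code: the function names (token, replicas_for), the node names and the use of MD5 to derive node positions are all assumptions made for illustration. Real clusters use a configurable partitioner, virtual nodes and a replication strategy that can be data center aware.

```python
import hashlib
from bisect import bisect_right


def token(key: str) -> int:
    """Hash a string to a position on the ring (illustrative hash choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def replicas_for(partition_key: str, nodes: list, replication_factor: int) -> list:
    """Return the nodes that would hold copies of a partition in this toy ring.

    The first replica is the node whose token follows the key's token;
    the remaining replicas are the next nodes clockwise around the ring.
    """
    ring = sorted((token(name), name) for name in nodes)
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token(partition_key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical four-node cluster with three copies of every partition.
nodes = ["node-a", "node-b", "node-c", "node-d"]
owners = replicas_for("customer:5", nodes, replication_factor=3)
print(owners)
```

Every node can run the same calculation, which is why any node can work out where a partition lives without asking a central master.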


Reading and writing in Cassandra


Cassandra has a peer-to-peer, read/write-anywhere architecture, so any user can connect to any node in any data center and read or write the data they need, with all writes being partitioned and replicated for them automatically throughout the cluster.

Writes in Cassandra


Each write first goes to a commit log for durability
It is then written to a memtable in memory
Once the memtable becomes full, it is flushed to an SSTable (Sorted String table)
Writes are atomic at the row level; all columns are written or updated, or none are.
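The write path above can be sketched in a few lines. This is a toy in-memory model under stated assumptions, not how Cassandra is implemented: the class name ToyStore, the memtable_limit parameter and the use of Python lists for the commit log and SSTables are hypothetical, and real SSTables are immutable files on disk that are later merged by compaction.

```python
class ToyStore:
    """Toy model of the write path: commit log, then memtable, then SSTable."""

    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # in-memory buffer of recent writes
        self.sstables = []        # flushed tables, each sorted by key
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durability first
        self.memtable[key] = value             # 2. then the in-memory table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when the memtable is full

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Newest data wins: check the memtable, then SSTables newest first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


store = ToyStore(memtable_limit=2)
store.write("b", 2)
store.write("a", 1)   # memtable is now full, so it is flushed
store.write("a", 3)   # a newer value for "a" lands in the fresh memtable
print(store.sstables, store.memtable, store.read("a"))
```

Because a write only appends to a log and updates memory, it never has to wait for an in-place update on disk, which is where the write speed comes from.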

Tunable Data consistency


Choose between strong and eventual consistency (from all replicas responding down to any single node responding) depending on the need
Can be done on a per-operation basis, and for both reads and writes
Handles multi-data center operations
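A common rule of thumb for choosing between the two: a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written is greater than the replication factor, because the two sets must then overlap. The helper below is a minimal sketch of that check; the function name and parameters are made up for this example.

```python
def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
quorum = rf // 2 + 1  # 2 replicas out of 3

print(is_strongly_consistent(quorum, quorum, rf))  # quorum reads and writes
print(is_strongly_consistent(1, 1, rf))            # read one, write one
print(is_strongly_consistent(1, rf, rf))           # read one, write all
```

Since the level is chosen per operation, an application can use strong settings for the few queries that need them and cheaper, eventually consistent settings everywhere else.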

Selecting a Strategy for Writes


  • Any – a write must succeed on any available node
  • One – a write must succeed on any node responsible for that row (either primary or replica)
  • Quorum – a write must succeed on a quorum of replica nodes, determined by (replication_factor / 2) + 1, rounded down
  • Local_Quorum - a write must succeed on a quorum of replica nodes in the same data center as the coordinator node
  • Each_Quorum - a write must succeed on a quorum of replica nodes in all data centers
  • All – a write must succeed on all replica nodes for a row key 
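As a worked example of the levels above, the sketch below computes how many replica acknowledgements a write would need in a cluster with per-data-center replication factors. The function names and the rf_by_dc dictionary are assumptions for illustration; this is arithmetic only and does not talk to a cluster.

```python
def quorum(replication_factor: int) -> int:
    """(replication_factor / 2) + 1, using integer division."""
    return replication_factor // 2 + 1


def required_acks(level: str, rf_by_dc: dict, local_dc: str) -> int:
    """Total acknowledgements needed for a write at the given level."""
    total_rf = sum(rf_by_dc.values())
    if level in ("ANY", "ONE"):
        return 1
    if level == "QUORUM":
        return quorum(total_rf)                 # quorum over all replicas
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])       # quorum in the coordinator's DC
    if level == "EACH_QUORUM":
        return sum(quorum(rf) for rf in rf_by_dc.values())  # quorum in every DC
    if level == "ALL":
        return total_rf
    raise ValueError(f"unknown consistency level: {level}")


# Hypothetical cluster: three replicas in each of two data centers.
rf_by_dc = {"east": 3, "west": 3}
for level in ("ANY", "ONE", "QUORUM", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"):
    print(level, required_acks(level, rf_by_dc, local_dc="east"))
```

Note that Quorum and Each_Quorum can need the same total number of acknowledgements while differing in where they must come from: Each_Quorum insists on a quorum inside every data center.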


Selecting a Strategy for Reads


  • One – reads from the closest node holding the data
  • Quorum – returns a result from a quorum of servers with the most recent timestamp for the data
  • Local_Quorum - returns a result from a quorum of servers with the most recent timestamp for the data in the same data center as the coordinator node
  • Each_Quorum - returns a result from a quorum of servers with the most recent timestamp in all data centers
  • All – returns a result from all replica nodes for a row key
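The "most recent timestamp" rule for reads can be sketched as follows. This is a simplified model with hypothetical names (Response, reconcile): the coordinator is assumed to have collected replies from replicas already, and the reply with the newest timestamp wins.

```python
from typing import NamedTuple


class Response(NamedTuple):
    node: str
    value: str
    timestamp: int  # write time of the value held by that replica


def reconcile(responses: list, required: int) -> str:
    """Return the newest value among the first `required` replica responses."""
    if len(responses) < required:
        raise RuntimeError("not enough replicas responded for this level")
    considered = responses[:required]
    return max(considered, key=lambda r: r.timestamp).value


# Hypothetical replies for one row: node-b has not yet received the latest write.
replies = [
    Response("node-a", "total_purchases=500000", timestamp=200),
    Response("node-b", "total_purchases=120000", timestamp=100),
    Response("node-c", "total_purchases=500000", timestamp=200),
]

print(reconcile(replies, required=2))  # quorum read with replication factor 3
```

With a read at level One only a single reply is consulted, so a stale replica can answer on its own; consulting a quorum makes it far more likely that at least one reply carries the latest write.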


Demo

First install Vagrant, which runs our development environment:

  • https://www.vagrantup.com/
  • From the project folder, run: vagrant up

  • Go to www.datastax.com. DataStax makes free smart start installers available for Cassandra that include:
    • The most up-to-date Cassandra version that is production quality
    • A version of DataStax OpsCenter, which is a visual, browser-based management tool for managing and monitoring Cassandra
    • Drivers and connectors for popular development languages
    • Sample database and application
    • Automatic configuration assistance for ensuring optimal performance and setup for either standalone or cluster implementations
    • http://www.datastax.com/download

  • Now we can download our first solution
    • https://github.com/heartysoft/vagrant-sparkcassandra
    • In the downloaded solution, copy the JDK 1.8 download into the following folder:
    • ...\vagrant-sparkcassandra\modules\jdk\files\oracle\jdk8\
    • git submodule init
    • git submodule update
  • In Visual Studio, add the Cassandra C# driver package from NuGet
