Since I knew nothing about Apache Cassandra (besides its being a distributed database), I could probably replace the 8 in the title with any number. Here is a summary of a few things I learned at Cassandra Day today.
- What is Apache Cassandra – an open source distributed database donated to the Apache Software Foundation by Facebook (actually released first on Google Code and then given to the ASF). It is decentralized, meaning all nodes are born equal and there are no masters or slaves. It is a schema-full database; see here for more about the Myth of Schema-less databases. It supports replication, i.e. data redundancy, with a typical replication factor of 3 (set per keyspace) and multiple data centers (both physical and virtual), and you can control the consistency level (see next). It is therefore usually classed as AP in the CAP theorem. It uses CQL – Cassandra Query Language.
- Controlling consistency – with the data replicated on several nodes, one can read and write at three main consistency levels. The higher the consistency level, the higher the latency.
  - One – reading from or writing to one node is enough.
  - Quorum – the value was written to or read from a majority (more than half) of the relevant nodes. Latest wins: when retrieving the data from several nodes, the value with the latest timestamp counts.
  - All – the data must be written to or read from all the relevant nodes. Note that the All level is dangerous since it cancels out high availability: if one of the replicas is not available, we get no answer.
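To make the three levels a bit more concrete for myself, here is a minimal Python sketch of the arithmetic behind them. It is a toy model, not Cassandra code: the function names and the `ReplicaValue` class are my own inventions, and it assumes a quorum is the usual majority, `floor(RF / 2) + 1`.

```python
from dataclasses import dataclass
from typing import Sequence


def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(level: str, replication_factor: int) -> int:
    """How many replicas must respond for ONE, QUORUM, or ALL."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]


def can_answer(level: str, replication_factor: int, live_replicas: int) -> bool:
    """True if enough replicas are up to satisfy the consistency level."""
    return live_replicas >= required_acks(level, replication_factor)


@dataclass
class ReplicaValue:
    """One replica's answer (hypothetical structure for this sketch)."""
    value: str
    timestamp: int  # write timestamp, larger means newer


def latest_wins(responses: Sequence[ReplicaValue]) -> str:
    """Resolve differing replica answers: the newest timestamp wins."""
    return max(responses, key=lambda r: r.timestamp).value
```

With a replication factor of 3 and one replica down, `can_answer("QUORUM", 3, 2)` still holds while `can_answer("ALL", 3, 2)` does not, which is exactly why the All level gives up high availability.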
- Data Modelling is important – Isn’t that true of every database? Well, yes. But the issue here is again the trade-off. On one hand, I feel that data modelling is sometimes neglected in NoSQL databases, since we can just throw data in there and scale it. On the other hand, due to some limitations (see CQL next), data modelling for Cassandra is quite the opposite of best practices in an RDBMS. Know your queries in advance and build the schema (e.g. keyspace) accordingly.
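One way I picture "know your queries in advance" is one table per query, with the same data written to each. The Python sketch below only imitates that idea with dictionaries; the table names `users_by_id` and `users_by_city` and the `add_user` helper are hypothetical, not something shown at the event.

```python
from collections import defaultdict

# Hypothetical "tables": one per query the application needs to answer.
users_by_id = {}                   # query: look up a user by id
users_by_city = defaultdict(list)  # query: list the users in a city


def add_user(user_id, name, city):
    """Write the same user into every table that serves a query."""
    row = {"id": user_id, "name": name, "city": city}
    users_by_id[user_id] = row
    users_by_city[city].append(row)
```

The duplication is deliberate: each read then hits a single table keyed for that query, instead of combining tables at read time as one would in an RDBMS.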
- CQL – Cassandra Query Language. At the first talk they said “Yes, it is exactly like SQL”. Well, not exactly. First, there are no joins, which dramatically influences data modelling. Aggregation functionality is limited. The Update command always works, even if the record does not exist (this can be controlled). See here for more about CQL vs SQL.
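The "update always works" behaviour is easier to see in a toy model. The Python sketch below is my own illustration, not driver code: `ToyTable` and its `if_exists` flag are hypothetical and stand in for what I assume was meant by "can be controlled", namely a conditional update such as CQL's `UPDATE ... IF EXISTS`.

```python
class ToyTable:
    """In-memory stand-in for a table keyed by primary key (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def update(self, key, if_exists=False, **columns):
        """Apply an update and return True if it was applied.

        Without if_exists the update acts as an upsert and creates the row.
        With if_exists=True it is skipped when the row is missing.
        """
        if if_exists and key not in self.rows:
            return False
        self.rows.setdefault(key, {}).update(columns)
        return True
```

So `ToyTable().update("u1", name="Ann")` creates the row even though it never existed, while the same call with `if_exists=True` on an empty table changes nothing.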
- CQL containers – there are 3 types of containers (collections):
  - Set – a container of unique items, sorted by the element type's comparison order.
  - List – a container of items kept in insertion order.
  - Map – a key-value container sorted by the comparison order of the key type. This is kind of a hack which allows you to be a bit schema-less. Up to 64k items in a map.
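The ordering rules of the three containers can be mimicked in a few lines of Python. This is only a sketch of the semantics described above; the `as_cql_*` helper names are made up for illustration.

```python
def as_cql_set(items):
    """Unique items, returned in the element type's sort order."""
    return sorted(set(items))


def as_cql_list(items):
    """Items kept in insertion order, duplicates allowed."""
    return list(items)


def as_cql_map(pairs):
    """Key/value pairs ordered by key; a later write to a key wins."""
    return dict(sorted(dict(pairs).items()))
```

For the input `["b", "a", "b"]`, the set helper drops the duplicate and sorts the rest, while the list helper returns the items exactly as they were inserted.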
- Spark connector – connects Cassandra to one of the trendiest technologies.
- Solr integration – Cassandra is not a document database. However, sometimes users who choose Cassandra as their main solution have indexing or search needs. DataStax has a patch which also allows you to search the data in RAM (data which has not yet been written to the Lucene indices).
- How to go on from here? How can I learn more –
There is much more for me to learn about Apache Cassandra, and I learned more on this day than I covered here, but this is a short review.