Cassandra Report (Public - Cycle 3)

This document reports on the evaluation of Cassandra as a potential store for ZTF / LSST detections in place of storing those detections in a Relational Database. The report pertains to Cassandra only. A brief comparison with storing the detections on a distributed file store is mentioned, but any performance issues need to be understood in the light of the bare minimum VM resources used to create the Cassandra cluster.


Introduction

Relational Database Storage

Until now, all detections from ZTF have been stored in a relational database. The database has coped well since May 2018 with an average of ~100,000 detections per day, and the detections are very easily queried using SQL. The current database holds 166.7 million detections (candidates) and 11.6 million objects.

LSST will produce up to 10 million detections per night, and assuming 300 full clear nights per year, that would result in 3 billion detections per year, with a total of 30 billion over the 10 year survey, assuming each year’s data is kept in the same database. Relational databases are generally good at dealing with a few billion rows, but beyond that, backups and replication get very difficult and scalability becomes a problem. (The current ATLAS solution contains about 5.5 billion objects and 24.2 billion detections as of the end of April 2021, but it is very clunky to back up, requiring 6 hours of downtime each time the backup runs. The large tables are too big for replication, and would need to be converted to a different format to enable it, which would triple their current size.)

It could be argued that if we ONLY deal with 3 billion detections annually, these could possibly be stored in a relational database, which is frozen at the end of each survey year (and crossmatched for subsequent years). This would retain the ability to query in the most flexible manner possible. However, the system might be at the edge of its capabilities if the 3 billion detections/year turns out to be an underestimate and we need to scale up.

Storing Detections Outside the Relational Database

Hence, a decision has been made during this cycle that ZTF/LSST detections will NOT be stored in a relational database, and that only summary data will be stored there.

There are distributed SQL solutions, e.g. Qserv (based on MySQL, potentially used within LSST), PostgreSQL with Q3C spatial indexing (as used in Gaia) and Citus (a distributed PostgreSQL extension, now owned by Microsoft). Exploration of these technologies is beyond the scope of this document, though some initial experience of Qserv within LSST:UK indicates that it is useful for curating large survey data, but not necessarily well adapted to live updates.

Once the decision is made to store the detections outside the relational database, regardless of how the detections are stored (filesystem, distributed database, etc.), the data can no longer be joined easily with the object summary data (retained in the relational database), as used in Lasair version 2.0. This has two immediate implications:

  1. The detection data is no longer part of any SQL queries we offer to the user (unless simulated via an API). Web interface code and an API can be written that present the user with the feel of SQL, but in the background the actual filtering of the data is done completely differently (and may involve more work by the server than a simple SQL join). For example, “return to the user all objects that have ANY detection within n days of a GW event”. In the background the code performs a two-stage lookup: it grabs all the objects coincident with a GW contour, then collects all the detections for all the objects within that contour, and returns ONLY those objects with a detection within n days of that event (a sketch of this lookup is given after this list).

  2. Filtering of the detection data by arbitrary parameters becomes more difficult. While this can in principle be done with some non-relational database technologies (especially if the data is stored in a single location), the reason for choosing these technologies is generally that they allow the data to be stored in a distributed (partitioned) manner. The distributed nature of the data makes arbitrary lookups more difficult because the data is generally distributed by one or two properties, and once those properties are decided, lookups by any other property become harder (since they involve a “full table scan”). Scanning the data by arbitrary properties can create an unacceptable load on the cluster, since all the nodes need to be involved in retrieving the data. The data is unlikely to be indexed on the arbitrary parameter, and queries on it will result in timeouts and unnecessary network gossip as the nodes attempt to retrieve the data.
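As an illustration of the kind of server-side work described in point 1, the following is a minimal sketch of the two-stage lookup, written with the DataStax Python driver. The keyspace, table and column names (lasair, detections, objectId, mjd) are assumptions for the purpose of the sketch, not the actual schema, and the object IDs inside the GW contour are assumed to have already been fetched from the relational database.

    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args

    def objects_detected_near_event(object_ids, event_mjd, n_days):
        # Connect to the cluster; any node can act as the contact point.
        cluster = Cluster(['cassandra-node1'])          # node address (assumed)
        session = cluster.connect('lasair')             # keyspace name (assumed)

        # Stage 2: fetch the detections for every object inside the contour.
        # Each query hits a single partition (keyed by objectId), so no full
        # table scan is required.
        query = session.prepare("SELECT mjd FROM detections WHERE objectId = ?")
        results = execute_concurrent_with_args(
            session, query, [(oid,) for oid in object_ids], concurrency=50)

        # Keep only the objects with at least one detection within n_days
        # of the event.
        matched = []
        for oid, (success, rows) in zip(object_ids, results):
            if success and any(abs(row.mjd - event_mjd) <= n_days for row in rows):
                matched.append(oid)
        cluster.shutdown()
        return matched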


No-SQL Technologies

There are numerous “No-SQL” technologies. In the past we briefly explored Apple’s FoundationDB, but not in enough detail to decide whether or not this would be a useful technology. The following is a table containing a non-exhaustive list of possible technologies to explore. (NOTE - should add links to each tech.)

Key-Value Stores

Dynamo, Voldemort, Citrusleaf, Membase, Riak, Tokyo Cabinet, FoundationDB

“Bigtable” Clones

Google Bigtable, Cassandra, HBASE, Hypertable

Document Databases

CouchOne, MongoDB, Terrastore, OrientDB

Graph Databases

FlockDB, AllegroGraph, DEX, InfoGrid, Neo4J, Sones, DataStax Enterprise (built on top of Cassandra)

I could probably have explored more of the options above in less detail, but in the end decided to explore in anger just one technology. Perhaps some of the others were unfairly excluded. Some of my ATLAS colleagues briefly explored HBASE, but with no conclusions (though I recently discovered that this technology is being used by the Fink broker). Likewise FoundationDB. Some of my colleagues in the Zooniverse had explored MongoDB, but are now actually moving away from it, implying that it does not meet their immediate needs. I was concerned (perhaps unnecessarily) that Google Bigtable would tie us into the Google ecosystem.

Cassandra

I can’t remember how Cassandra got recommended. Major corporations are using it (e.g. Facebook, Netflix). I have since discovered that other teams within LSST are using Cassandra (e.g. Antares and ALeRCE brokers) with others in core LSST conducting serious evaluations.

Open Source

Cassandra is open source and free (though one can sign up for support, e.g. to DataStax Enterprise). This means that one is not tied into an ecosystem (e.g. Amazon, Google, Microsoft, Oracle, etc).

Built in Replication

One of the things we have really struggled with in relational databases is data replication and keeping those replicas up to date. Closely related is resilience. What happens when the database becomes unavailable? Cassandra comes with replication built in from the start.
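One practical consequence is that the replication strategy and factor are simply declared when a keyspace is created, and the cluster takes care of copying the data to the replicas from then on. The following is a minimal sketch using the Python driver; the contact point, keyspace and datacentre names are assumptions, not our actual configuration.

    from cassandra.cluster import Cluster

    cluster = Cluster(['cassandra-node1'])   # any node can be the contact point
    session = cluster.connect()

    # NetworkTopologyStrategy is datacentre/rack aware; a factor of 3 means
    # three replicas of every row are kept in 'datacenter1'.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS lasair
        WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3}
    """)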

No Unwieldy Control Nodes

All nodes are equal, kind of. (In fact there are some “Seed nodes” whose maintenance requires a slightly different approach.) There is no unwieldy Control Node or an unwieldy process to turn something into a Control Node. This means that adding and removing nodes is straightforward. The “gossip” protocol (yes, it really is called “gossip”) that runs between nodes determines how they discover each other. Little user involvement is required.

Cassandra Query Language

Cassandra comes with its own query language, CQL, that looks very like SQL. This turned out to be a mixed blessing. I liked how easily I could adapt code originally written to insert large amounts of data into a MySQL database into code that inserted lots of data into Cassandra. It also allowed me to introduce a spatial indexing layer (in the form of a WHERE clause) in a similar manner as previously written for MySQL (though the query language has some shortcomings that needed to be addressed before the layer would work).
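As a rough illustration of how similar the ingest code looks, the following sketch inserts detection rows with the DataStax Python driver; the parameterised INSERT is almost identical to its MySQL counterpart, only the driver and placeholder syntax change. The keyspace, table and column names are placeholders, not the real schema.

    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args

    cluster = Cluster(['cassandra-node1'])
    session = cluster.connect('lasair')                 # keyspace name (assumed)

    # Prepared statement; the '?' markers replace MySQL's '%s' placeholders.
    insert = session.prepare("""
        INSERT INTO detections (objectId, mjd, ra, decl, magpsf)
        VALUES (?, ?, ?, ?, ?)
    """)

    def ingest(detection_rows):
        # detection_rows is the same list of tuples that was previously fed
        # to the MySQL cursor's executemany().
        execute_concurrent_with_args(session, insert, detection_rows,
                                     concurrency=100)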

CQL is NOT necessarily something we would expose to the users. We would almost certainly write an intermediate layer to interpret an SQL-like query that the users might have built via the external interface or API.

Unlike some No-SQL technologies (e.g. MongoDB), there is no built-in spatial indexing technology in Cassandra; it must be grafted on. Fortunately we have experience of doing this with MySQL, and adapting it to Cassandra was made fairly easy by CQL. (See later.)
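To give a flavour of how the grafting works, the sketch below assumes each detection is stored with a precomputed spatial cell ID (HTM in our MySQL implementation) as part of its key, so that a cone search becomes a set of single-cell lookups over the cells covering the cone. cells_covering_cone is a hypothetical stand-in for the HTM library, and the table and column names are again illustrative only.

    from cassandra.concurrent import execute_concurrent_with_args

    def cone_search(session, ra, dec, radius_deg):
        # Hypothetical helper returning the spatial cell IDs covering the cone.
        cell_ids = cells_covering_cone(ra, dec, radius_deg)

        # One single-partition query per cell, run concurrently.
        query = session.prepare(
            "SELECT objectId, ra, decl, mjd FROM detections_by_cell WHERE htm16 = ?")
        results = execute_concurrent_with_args(
            session, query, [(cid,) for cid in cell_ids], concurrency=50)

        # The cells only approximate the cone, so a precise angular-separation
        # cut on (ra, decl) would still be applied here (omitted for brevity).
        return [row for success, rows in results if success for row in rows]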

In my opinion, CQL is great for developers, but it does give the impression that the data is more queryable than it actually is, hence I would not be keen to simply expose the CQL interface to the user community.

The query language also has some foibles that must be clearly understood. (Another reason for not exposing this to the general user community.) For example, table joins are not allowed, even within Cassandra itself. Aggregation queries are not generally allowed (max, min, group by, etc). Even the count(*) query, which IS allowed, is NOT recommended (since it requires a full table scan, and will generally time out when initiated on a large table). Range queries are difficult or not allowed (especially BETWEEN) and OR statements are generally not allowed (though there are ways around this).
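A few concrete examples of these restrictions, as they would be issued through the Python driver against the same hypothetical detections table used in the earlier sketches:

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-node1']).connect('lasair')

    # Fine: the query is restricted to a single partition by its key.
    rows = session.execute(
        "SELECT mjd, magpsf FROM detections WHERE objectId = %s",
        ('ZTF21abcdefg',))                      # placeholder object ID

    # Allowed but not recommended: count(*) touches every node and will
    # generally time out on a large table.
    total = session.execute("SELECT count(*) FROM detections")

    # Not accepted by CQL, so queries like these must be re-expressed or
    # handled by an intermediate layer:
    #   SELECT ... FROM detections JOIN objects ON ...        (no table joins)
    #   SELECT ... WHERE mjd BETWEEN 59000 AND 59010          (no BETWEEN)
    #   SELECT ... WHERE objectId = 'a' OR objectId = 'b'     (no OR)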

Overview of Terminology

The following table is a quick overview of terminology in Cassandra compared with relational databases. Cassandra is fundamentally a key-value store, and how the keys are specified determines how easily the data can be accessed.

Cassandra Data Model

Relational Data Model
