Cassandra Report (Public - Cycle 3)

This document reports on the evaluation of Cassandra as a potential store for ZTF / LSST detections in place of storing those detections in a Relational Database. The report pertains to Cassandra only. A brief comparison with storing the detections on a distributed file store is mentioned, but any performance issues need to be understood in the light of the bare minimum VM resources used to create the Cassandra cluster.


Introduction

Relational Database Storage

Until now, all detections from ZTF have been stored in a relational database. The database has coped well since May 2018 with an average of ~100,000 detections per day, and the detections are very easily queried using SQL. The current database holds 166.7 million detections (candidates) and 11.6 million objects.

LSST will produce up to 10 million detections per night, and assuming 300 full clear nights per year, that would result in 3 billion detections per year, with a total of 30 billion over the 10 year survey, assuming each year’s data is kept in the same database. Relational databases are generally good at dealing with a few billion rows, but beyond that, backups and replication get very difficult and scalability becomes a problem. (The current ATLAS solution contains about 5.5 billion objects and 24.2 billion detections as of the end of April 2021, but it is very clunky to back up, requiring 6 hours of downtime each time the backup runs. The large tables are too big for replication, and would need to be converted to a different format to enable it, which would triple their current size.)

It could be argued that if we ONLY deal with 3 billion detections annually, these could possibly be stored in a relational database, which is frozen at the end of each survey year (and crossmatched for subsequent years). This would retain the ability to query in the most flexible manner possible. However, the system might be at the edge of its capabilities if the 3 billion detections/year turns out to be an underestimate and we need to scale up.

Storing Detections Outside the Relational Database

Hence, a decision has been made during this cycle that ZTF/LSST detections will NOT be stored in a relational database, and that only summary data will be stored there.

There are distributed SQL solutions, e.g. Qserv (based on MySQL, potentially used within LSST), PostgreSQL with Q3C spatial indexing (as used in Gaia) and Citus (a distributed PostgreSQL extension, now owned by Microsoft). Exploration of these technologies is beyond the scope of this document, though some initial experience of Qserv within LSST:UK indicates that it is useful for curating large survey data, but not necessarily well adapted to live updates.

Once the decision is made to store the detections outside the relational database, regardless of how the detections are stored (filesystem, distributed database, etc.), the data can no longer be joined easily with the object summary data (retained in the relational database), as used in Lasair version 2.0. This has two immediate implications:

  1. The detection data is no longer part of any SQL queries we offer to the user (unless simulated via an API). Web interface code and an API can be written that present the user with the feel of SQL, but in the background the actual filtering of the data is done completely differently (and may involve more work by the server than a simple SQL join). For example, “return to the user all objects that have ANY detection within n days of a GW event”. In the background the code performs a two-stage lookup: it grabs all the objects coincident with a GW contour, then collects all the detections for all the objects within that contour, and returns ONLY those objects with a detection within n days of that event (a sketch of this lookup is given after this list).

  2. Filtering of the detection data by arbitrary parameters becomes more difficult. While this can in principle be done with some non-relational database technologies (especially if the data is stored in a single location), the reason for choosing these technologies is generally that they allow the data to be stored in a distributed (partitioned) manner. The distributed nature of the data makes arbitrary lookups more difficult because the data is generally distributed by one or two properties, and once those properties are decided, lookups by any other property become harder (since they involve a “full table scan”). Scanning the data by arbitrary properties can create an unacceptable load on the cluster, since all the nodes need to be involved in retrieving the data. The data is unlikely to be indexed on the arbitrary parameter, and queries on it will result in timeouts and unnecessary network gossip as the nodes attempt to retrieve the data.
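As an illustration of the kind of server-side work described in point 1, the following is a minimal sketch of the two-stage lookup, written with the DataStax Python driver. The keyspace, table and column names (lasair, detections, objectId, mjd) are assumptions for the purpose of the sketch, not the actual schema, and the object IDs inside the GW contour are assumed to have already been fetched from the relational database.

    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args

    def objects_detected_near_event(object_ids, event_mjd, n_days):
        # Connect to the cluster; any node can act as the contact point.
        cluster = Cluster(['cassandra-node1'])          # node address (assumed)
        session = cluster.connect('lasair')             # keyspace name (assumed)

        # Stage 2: fetch the detections for every object inside the contour.
        # Each query hits a single partition (keyed by objectId), so no full
        # table scan is required.
        query = session.prepare("SELECT mjd FROM detections WHERE objectId = ?")
        results = execute_concurrent_with_args(
            session, query, [(oid,) for oid in object_ids], concurrency=50)

        # Keep only the objects with at least one detection within n_days
        # of the event.
        matched = []
        for oid, (success, rows) in zip(object_ids, results):
            if success and any(abs(row.mjd - event_mjd) <= n_days for row in rows):
                matched.append(oid)
        cluster.shutdown()
        return matched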


No-SQL Technologies

There are numerous “No-SQL” technologies. In the past we briefly explored Apple’s FoundationDB, but not in enough detail to decide whether or not this would be a useful technology. The following is a table containing a non-exhaustive list of possible technologies to explore. (NOTE - should add links to each tech.)

Key-Value Stores

Dynamo, Voldemort, Citrusleaf, Membase, Riak, Tokyo Cabinet, FoundationDB

“Bigtable” Clones

Google Bigtable, Cassandra, HBASE, Hypertable

Document Databases

CouchOne, MongoDB, Terrastore, OrientDB

Graph Databases

FlockDB, AllegroGraph, DEX, InfoGrid, Neo4J, Sones, DataStax Enterprise (built on top of Cassandra)

I could probably have explored more of the options above in less detail, but in the end decided to explore in anger just one technology. Perhaps some of the others were unfairly excluded. Some of my ATLAS colleagues briefly explored HBASE, but with no conclusions (though I recently discovered that this technology is being used by the Fink broker). Likewise FoundationDB. Some of my colleagues in the Zooniverse had explored MongoDB, but are now actually moving away from it, implying that it does not meet their immediate needs. I was concerned (perhaps unnecessarily) that Google Bigtable would tie us into the Google ecosystem.

Cassandra

I can’t remember how Cassandra got recommended. Major corporations are using it (e.g. Facebook, Netflix). I have since discovered that other teams within LSST are using Cassandra (e.g. Antares and ALeRCE brokers) with others in core LSST conducting serious evaluations.

Open Source

Cassandra is open source and free (though one can sign up for support, e.g. to DataStax Enterprise). This means that one is not tied into an ecosystem (e.g. Amazon, Google, Microsoft, Oracle, etc).

Built in Replication

One of the things we have really struggled with in relational databases is data replication and keeping those replicas up to date. Closely related is resilience. What happens when the database becomes unavailable? Cassandra comes with replication built in from the start.
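One practical consequence is that the replication strategy and factor are simply declared when a keyspace is created, and the cluster takes care of copying the data to the replicas from then on. The following is a minimal sketch using the Python driver; the contact point, keyspace and datacentre names are assumptions, not our actual configuration.

    from cassandra.cluster import Cluster

    cluster = Cluster(['cassandra-node1'])   # any node can be the contact point
    session = cluster.connect()

    # NetworkTopologyStrategy is datacentre/rack aware; a factor of 3 means
    # three replicas of every row are kept in 'datacenter1'.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS lasair
        WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3}
    """)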

No Unwieldy Control Nodes

All nodes are equal, kind of. (In fact there are some “Seed nodes” whose maintenance requires a slightly different approach.) There is no unwieldy Control Node or an unwieldy process to turn something into a Control Node. This means that adding and removing nodes is straightforward. The “gossip” protocol (yes, it really is called “gossip”) that runs between nodes determines how they discover each other. Little user involvement is required.

Cassandra Query Language

Cassandra comes with its own query language, CQL, that looks very like SQL. This turned out to be a mixed blessing. I liked how easily I could adapt code originally written to insert large amounts of data into a MySQL database into code that inserted lots of data into Cassandra. It also allowed me to introduce a spatial indexing layer (in the form of a WHERE clause) in a similar manner as previously written for MySQL (though the query language has some shortcomings that needed to be addressed before the layer would work).
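As a rough illustration of how similar the ingest code looks, the following sketch inserts detection rows with the DataStax Python driver; the parameterised INSERT is almost identical to its MySQL counterpart, only the driver and placeholder syntax change. The keyspace, table and column names are placeholders, not the real schema.

    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args

    cluster = Cluster(['cassandra-node1'])
    session = cluster.connect('lasair')                 # keyspace name (assumed)

    # Prepared statement; the '?' markers replace MySQL's '%s' placeholders.
    insert = session.prepare("""
        INSERT INTO detections (objectId, mjd, ra, decl, magpsf)
        VALUES (?, ?, ?, ?, ?)
    """)

    def ingest(detection_rows):
        # detection_rows is the same list of tuples that was previously fed
        # to the MySQL cursor's executemany().
        execute_concurrent_with_args(session, insert, detection_rows,
                                     concurrency=100)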

CQL is NOT necessarily something we would expose to the users. We would almost certainly write an intermediate layer to interpret an SQL-like query that the users might have built via the external interface or API.

Unlike some No-SQL technologies (e.g. MongoDB), there is no built-in spatial indexing technology in Cassandra; it must be grafted on. Fortunately we have experience of doing this with MySQL, and adapting it to Cassandra was made fairly easy by CQL. (See later.)
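To give a flavour of how the grafting works, the sketch below assumes each detection is stored with a precomputed spatial cell ID (HTM in our MySQL implementation) as part of its key, so that a cone search becomes a set of single-cell lookups over the cells covering the cone. cells_covering_cone is a hypothetical stand-in for the HTM library, and the table and column names are again illustrative only.

    from cassandra.concurrent import execute_concurrent_with_args

    def cone_search(session, ra, dec, radius_deg):
        # Hypothetical helper returning the spatial cell IDs covering the cone.
        cell_ids = cells_covering_cone(ra, dec, radius_deg)

        # One single-partition query per cell, run concurrently.
        query = session.prepare(
            "SELECT objectId, ra, decl, mjd FROM detections_by_cell WHERE htm16 = ?")
        results = execute_concurrent_with_args(
            session, query, [(cid,) for cid in cell_ids], concurrency=50)

        # The cells only approximate the cone, so a precise angular-separation
        # cut on (ra, decl) would still be applied here (omitted for brevity).
        return [row for success, rows in results if success for row in rows]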

In my opinion, CQL is great for developers, but it does give the impression that the data is more queryable than it actually is, hence I would not be keen to simply expose the CQL interface to the user community.

The query language also has some foibles that must be clearly understood. (Another reason for not exposing this to the general user community.) For example, table joins are not allowed, even within Cassandra itself. Aggregation queries are not generally allowed (max, min, group by, etc). Even the count(*) query, which IS allowed, is NOT recommended (since it requires a full table scan, and will generally time out when initiated on a large table). Range queries are difficult or not allowed (especially BETWEEN) and OR statements are generally not allowed (though there are ways around this).
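A few concrete examples of these restrictions, as they would be issued through the Python driver against the same hypothetical detections table used in the earlier sketches:

    from cassandra.cluster import Cluster

    session = Cluster(['cassandra-node1']).connect('lasair')

    # Fine: the query is restricted to a single partition by its key.
    rows = session.execute(
        "SELECT mjd, magpsf FROM detections WHERE objectId = %s",
        ('ZTF21abcdefg',))                      # placeholder object ID

    # Allowed but not recommended: count(*) touches every node and will
    # generally time out on a large table.
    total = session.execute("SELECT count(*) FROM detections")

    # Not accepted by CQL, so queries like these must be re-expressed or
    # handled by an intermediate layer:
    #   SELECT ... FROM detections JOIN objects ON ...        (no table joins)
    #   SELECT ... WHERE mjd BETWEEN 59000 AND 59010          (no BETWEEN)
    #   SELECT ... WHERE objectId = 'a' OR objectId = 'b'     (no OR)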

Overview of Terminology

The following table is a quick overview of terminology in Cassandra compared with relational databases. Cassandra is fundamentally a key-value store, and how the keys are specified determines how easily the data can be accessed.

Cassandra Data Model

Relational Data Model
