Relational MySQL vs Cassandra NoSQL (and spatial indexing)

Cone Searching & HTM

Many of the queries we want to run relate to the position of objects on the sky. We use HTM (Hierarchical Triangular Mesh) to index the sky, and we use the HTM API to augment our SQL queries. In this document I will attempt to use HTM with Cassandra. This may of course be a bad match of technologies (see the issues below), and there may be other spatial indexing technologies that tie in better with Cassandra.

For our example data we will use the same 55-million-row subset of Gaia DR2 that I used for the SSD/HDD speed tests. For the time being we will use HDD as our data store, because this is more a feasibility study than a speed test. (We can easily switch the back end to SSD if necessary.)

I’ve built a small C++ program that uses the HTM API to produce an example query for a known object in the Gaia DR2 database table. The cone search radius is set at 5 arcsec. We are using HTM level 16, which is the finest-grained HTM level we currently use in our databases (triangles a few arcsec across). Note the OR statements in the query below.

./HTMCircle 16 85.13154994520691 37.936764657367384 5 tcs_cat_gaia_dr2 select * from tcs_cat_gaia_dr2 where htm16ID between 64759070916 and 64759070917 or htm16ID between 64759070919 and 64759070919 or htm16ID between 64759070936 and 64759070936 or htm16ID between 64759070938 and 64759070939 or htm16ID between 64759070960 and 64759070975 ;

If we increase the radius to 50 arcsec, the API produces the following output. This is relevant to the Cassandra implementation below!

./HTMCircle 16 85.13154994520691 37.936764657367384 50 tcs_cat_gaia_dr2 select * from tcs_cat_gaia_dr2 where htm16ID between 64758611968 and 64758612223 or htm16ID between 64758612352 and 64758612352 or htm16ID between 64758612864 and 64758612879 or htm16ID between 64758612896 and 64758612911 or htm16ID between 64758612916 and 64758612919 or htm16ID between 64759070720 and 64759071743 or htm16ID between 64759922688 and 64759922943 or htm16ID between 64759923264 and 64759923264 or htm16ID between 64759923520 and 64759923523 or htm16ID between 64759923526 and 64759923526 or htm16ID between 64759923534 and 64759923534 ;

SQL Databases

We currently use MySQL as our relational database.

MariaDB, a complete drop-in replacement for MySQL, is one of the main open-source forks of MySQL. (We’ll come back to MariaDB later.) No suitable spatial indexing system is built in, so we use the HTM API to augment our queries.

Another open source alternative is PostgreSQL (used by the Gaia team, along with a spatial indexing implementation called Q3C - Quad Tree Cube - written by Sergey Koposov). We should consult experts in PostgreSQL to get opinions about scalability and replication.

NoSQL Databases

Meanwhile, in the NoSQL world there are numerous alternatives: key-value stores (Dynamo, Voldemort, Citrusleaf, Membase, Riak, Tokyo Cabinet, FoundationDB), “bigtable” clones (Google Bigtable, Cassandra, HBase, Hypertable), document databases (CouchOne, MongoDB, Terrastore, OrientDB) and graph databases (FlockDB, AllegroGraph, DEX, InfoGrid, Neo4j, Sones).

I don’t have much experience with any of the NoSQL systems, except that I’ve started to explore Cassandra, our colleagues in ATLAS started to explore FoundationDB (but the experiments didn’t really go anywhere), and our Zooniverse colleagues have experience with MongoDB.

Cassandra keeps coming up in various meetings, and I like the SQL-like query language that can be used to query the data (with many caveats - e.g. no OR statements allowed).

Cassandra

Cassandra insists on a primary key, which can be made up of one or more columns of the data. In our case we can build the primary key from an HTM ID (e.g. HTM16ID) and the objectID, and hence use the HTM ID to search for an object. In Cassandra Query Language (CQL), range queries can only be done on a single range, so we can’t use multiple BETWEEN (>= and <=) clauses connected with OR statements; OR statements are not allowed at all. But we CAN use IN statements, so we could in principle expand out the above SQL statements to look like the following:

./HTMCircleAllIDs 16 85.13154994520691 37.936764657367384 5 tcs_cat_gaia_dr2 select * from tcs_cat_gaia_dr2 where htm16ID IN ( 64759070916,64759070917,64759070919,64759070936,64759070938, 64759070939,64759070960,64759070961,64759070962,64759070963, 64759070964,64759070965,64759070966,64759070967,64759070968, 64759070969,64759070970,64759070971,64759070972,64759070973, 64759070974,64759070975);

That’s 22 triangles. This actually appears to work OK. But now let’s expand out to 50 arcsec. Now things start to get messy.

That’s 1,580 triangles altogether. The query still works OK though. The obvious alternative is to have several different granularity levels (e.g. levels 10, 13 and 16) and vary which HTM column we use depending on the radius. Fine in principle, but unfortunately Cassandra will only allow us to do an IN query on the first part of the index. If we choose to use different levels to make up our primary key (e.g. HTM16ID, HTM13ID, HTM10ID, objectID), the IN query MIGHT work for HTM13ID (though it runs into “tombstone” issues if the number of triangles is large), but it definitely WON’T work for HTM10ID. Of course, we could use a coarser-grained triangle level, but that might mean having to parse tens to hundreds of results to exclude the catalogue objects outside our chosen radius. For example, the mean area of a level 10 HTM triangle is 63,734 square arcsec (about 0.005 square degrees). There are 2*4^(level+1) triangles in the sky, so at level 10 (about 8.4 million triangles) a billion-row all-sky catalogue will have roughly 119 objects per triangle on average.

One solution to this might be to store the table contents 3 times with the 3 different triangle levels, but, of course, that means the data has to be written 3 times on ingest! (Likewise all crossmatch catalogues would need to be written 3 times.) Note, however, that Cassandra positively encourages this practice.
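As a rough sketch of that pattern (table and column names here are purely illustrative, not a real schema), the same Gaia rows could be written to three column families, each partitioned on a different HTM level:

CREATE TABLE gaia_by_htm10 (htm10id bigint, source_id bigint, ra double, dec double, phot_g_mean_mag float, PRIMARY KEY (htm10id, source_id));
CREATE TABLE gaia_by_htm13 (htm13id bigint, source_id bigint, ra double, dec double, phot_g_mean_mag float, PRIMARY KEY (htm13id, source_id));
CREATE TABLE gaia_by_htm16 (htm16id bigint, source_id bigint, ra double, dec double, phot_g_mean_mag float, PRIMARY KEY (htm16id, source_id));

A cone search would then hit whichever table matches the triangle level appropriate to its radius, at the cost of tripling every write on ingest.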

Let’s take a closer look at Cassandra. The following information is based on this tutorial. Cassandra is a key-value store and doesn’t quite work like a normal relational database. For a start, it is NOT relational. “Tables” are groups (“families”) of columns, and Cassandra encourages you to flatten your schema down into giant groups of columns. Here’s some terminology:

Cassandra Data Model    Relational Data Model
--------------------    ---------------------
Keyspace                Database
Column Family           Table
Partition Key           Primary Key
Column Name/Key         Column Name
Column Value            Column Value

Note the “partition key” vs primary key terminology - we’ll come back to that. The partition key determines how the data is distributed across multiple nodes.

In Cassandra, replication is built in, and the replication factor determines how many nodes each piece of data is copied to. For example, objects with a partition key hash of 10 would be stored on Node 1. If the replication factor is 3, then the data is also replicated (clockwise around the ring) onto Nodes 2 and 3. Likewise, an object with a key hash of 83 would be stored on Node 4 and replicated onto Nodes 1 and 2.
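For reference, the replication factor is fixed when the keyspace is created. A minimal sketch (the keyspace name is just an example) would be:

CREATE KEYSPACE htm_test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

SimpleStrategy is adequate for a single-data-centre test installation; NetworkTopologyStrategy is the usual choice for anything production-like.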

My installation of Cassandra is just a single node on my laptop. However, it seems surprisingly robust, if not exactly quick. I’m still figuring out how to insert and read data multithreaded, so for the time being all access is single-threaded. I’m sure the times below can be improved. To do: test on a distributed system.

Partition Keys and Clustering Keys

Cassandra will NOT allow you to query any old column with any constraint. Unlike relational databases, you have to build the column families (tables) with the queries in mind, otherwise you can’t easily pull out the information you need. Let’s take a look at a row of the Gaia DR2 table. The unique identifier column is “source_id”, but this doesn’t mean much in terms of organising the data.

HTM16          source_id             ra          dec       phot_g_mean_mag
-----          ---------             --          ---       ---------------
54680902005    4309278608071975424   288.70392   9.99498   17.5

If the primary key is just the source_id, then we can only search by source_id. However, primary keys can be made up of more than one column. If we instead create the primary key as the group of HTM16 AND source_id, then we can cone search by pulling out the list of HTM16 IDs associated with an RA and Dec. The first part is the partition key, and the second part is a local clustering key.
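As a sketch of what that looks like in CQL (table name and column types are illustrative guesses, not the actual schema), the compound primary key is declared like this; the first column becomes the partition key and the remaining columns become clustering keys:

CREATE TABLE tcs_cat_gaia_dr2 (
    htm16id bigint,
    source_id bigint,
    ra double,
    dec double,
    phot_g_mean_mag float,
    PRIMARY KEY (htm16id, source_id)  -- htm16id = partition key, source_id = clustering key
);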

Remember, we can’t use ranges of HTM16 triangle IDs, but we CAN use the “IN” statement. This is fine for a small radius, but remember what happens if we open the radius up - even to something as innocuous as 50 arcsec? Exploding numbers of HTM16s (over 1500 triangles above)!

Inside Lasair (and Sherlock, ATLAS, Pan-STARRS, etc.) we can use more than one triangle level to mitigate the large lists (e.g. HTM10, HTM13 and HTM16). BUT we only have ONE partition key in Cassandra, so we need to choose. We can’t simply use (for example) HTM10, HTM13, HTM16 plus the source_id, because you can’t search on a “clustering key” on its own, nor when it comes after another clustering key that isn’t constrained.

A deeper look at HTM

We often forget the “H” (hierarchical) in HTM. HTM level 10 is a superset of HTM 13, which is a superset of HTM 16: each extra level splits a triangle into 4, so each level 10 triangle contains 64 level 13 triangles, and each of those contains 64 level 16 triangles. Equivalently, the level 13 ID is the level 16 ID with its last three base-4 digits (6 bits) removed, and the level 10 ID drops three more.

 

Level    Decimal        Binary                                 Base 4
-----    -------        ------                                 ------
HTM16    54680902005    110010111011001111000101100101110101   N02323033011211311
HTM13    854389093      110010111011001111000101100101         N02323033011211
HTM10    13349829       110010111011001111000101               N02323033011

It turns out that in Cassandra you can have multiple clustering columns, provided the order is preserved when doing a query. In the Cassandra Query Language (CQL) it is possible to have more than one column in an IN query.

In Cassandra, we can reorganise the above table using elements of the base-4 HTM IDs.

E.g.

HTM10          HTM13   HTM16   source_id             ra          dec       phot_g_mean_mag
-----          -----   -----   ---------             --          ---       ---------------
N02323033011   211     311     4309278608071975424   288.70392   9.99498   17.5

Now we can query using the HTM10 field on its own (the partition key), the HTM10 and HTM13 fields together (equivalent to HTM level 13), or the HTM10, HTM13 and HTM16 fields together (equivalent to HTM level 16), and of course the full set of fields plus the source_id.
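A sketch of such a table in CQL (again, names and types are illustrative); the HTM columns hold the base-4 pieces shown in the table above:

CREATE TABLE tcs_cat_gaia_dr2_htm (
    htm10 text,
    htm13 text,
    htm16 text,
    source_id bigint,
    ra double,
    dec double,
    phot_g_mean_mag float,
    PRIMARY KEY (htm10, htm13, htm16, source_id)  -- htm10 = partition key; htm13, htm16, source_id = clustering keys
);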

E.g. an HTM10 query (partition key only), an HTM13 query (HTM10 + HTM13) and an HTM16 query (HTM10 + HTM13 + HTM16); illustrative versions of each are sketched below.
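Using the values from the example row above, the three queries against the sketch table might look like this (note that constraining htm13 or htm16 without htm10 would be rejected, because the partition key must always be supplied):

SELECT * FROM tcs_cat_gaia_dr2_htm WHERE htm10 = 'N02323033011';
SELECT * FROM tcs_cat_gaia_dr2_htm WHERE htm10 = 'N02323033011' AND htm13 = '211';
SELECT * FROM tcs_cat_gaia_dr2_htm WHERE htm10 = 'N02323033011' AND htm13 = '211' AND htm16 = '311';

IN lists can be used in the same way on htm10 (and, provided the preceding columns are constrained, on htm13 and htm16) to cover all the triangles returned by the HTM API.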

 

Cassandra Loading and Querying Data

Remember our SSD vs HDD tests? I loaded the 55 million rows of Gaia DR2 data into Cassandra. It took exactly 25 hours (about 600 records per second). Note, though, that inserts were done single-threaded and not in “batch” mode.
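Single-threaded, non-batch loading amounts to issuing one INSERT per row, along these lines (using the illustrative schema sketched earlier and the example row above):

INSERT INTO tcs_cat_gaia_dr2 (htm16id, source_id, ra, dec, phot_g_mean_mag) VALUES (54680902005, 4309278608071975424, 288.70392, 9.99498, 17.5);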

I also pulled out 1,000,000 random rows based on the HTM16 ID triangles computed from the RA and Dec of the million records (1 hour and 50 minutes, or 6,600 seconds). This is about 3 times slower than the HDD timings for one million rows. BUT we are not yet using Cassandra to its full capacity, and this was done single-threaded.

I’m certain these numbers will improve massively if we build a properly distributed Cassandra system. (To be done.) I’ve seen various statements online indicating that, for example, a 15-node cluster can cope with up to 120,000 inserts per second.

Cassandra and Lightcurves

An alternative to spatial indexing would be to store the lightcurves in Cassandra, indexed by objectID and candidateID; a sketch is given after the example row below. This is actually easier to implement in Cassandra, but the caveats about how the data can be queried still apply.

ObjectID       CandidateID           ra            decl         magpsf    fid
--------       -----------           --            ----         ------    ---
ZTF20aauwhfa   1200382622615015029   218.1603106   31.6709366   18.3553   1
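A minimal sketch of such a lightcurve table (names and types are illustrative): objectID is the partition key, so a whole lightcurve lives in a single partition, and candidateID is the clustering key that orders the detections.

CREATE TABLE lightcurves (
    objectid text,
    candidateid bigint,
    ra double,
    decl double,
    magpsf float,
    fid int,
    PRIMARY KEY (objectid, candidateid)  -- objectid = partition key, candidateid = clustering key
);

SELECT * FROM lightcurves WHERE objectid = 'ZTF20aauwhfa';

Pulling out a full lightcurve is then a single-partition read.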

Conclusions & Further Work

Cassandra CAN be used for data loading and cone searching using the mechanisms above. Additionally, the HTM interface allows for other types of spatial search, not just cones; all we need are the triangle IDs.

Primary keys need to be chosen very carefully!

Extraction of data based on ranges of other columns requires that the data be copied into separate tables, with primary keys chosen accordingly.

This should be deployed as a group of machines on a properly distributed infrastructure (e.g. a group of OpenStack machines) to test both loading speed AND query speed.

Other Technologies to Consider

We’ve only just been made aware of a Qserv-like architecture for PostgreSQL - called CitusData. Gaia use PostgreSQL extensively along with Q3C spatial indexing, so it’s definitely worth considering.

 

If you require this document in an alternative format, please contact the LSST:UK Project Managers lusc_pm@mlist.is.ed.ac.uk or phone +44 131 651 3577