Page Comparison

Cone Searching & HTM

Many of the queries we want to do are related to the position of objects in the sky. We use HTM - Hierarchical Triangular Mesh - to index the sky and utilise the API to augment our SQL queries. In this document I will attempt to use HTM with Cassandra. This may of course be a bad match of technologies (see issues below), and there may be other spatial indexing technologies that tie in better with Cassandra.

...

Partition Keys and Clustering Keys

Cassandra will now NOT allow you to query any old column with any constraint. Unlike databases, you have to build the column families (tables) with the query in mind, otherwise you can’t easily pull out the information you need. Let’s take a look at a row of the Gaia DR2 table. The unique identifier column is “source_id”. But this doesn’t mean much in terms of organising the data.

...

	Decimal	Binary	Base4
HTM16	54680902005	110010111011001111000101100101110101	N02323033011211311
HTM13	854389093	110010111011001111000101100101	N02323033011211
HTM10	13349829	110010111011001111000101	N02323033011

...

Remember our SSD vs HDD tests? I loaded up the 55 million rows of Gaia DR2 data into Cassandra. It took exactly 25 hours! (About 600 records per second.) Note though that inserts were being done single threaded, and not in “batch” mode.

I also pulled out 1,000,000 random rows based on the HTM16 id triangles from the RA and Dec of the million records. (1 hour and 50 minutes !)- or 6,600 seconds.) This is of course about 3 time slower than the HDD timings for one million rows. BUT we are not yet using Cassandra to its full capacity, and this was done single-threaded.

I’m certain these numbers will massively improve if we build a properly distributed Cassandra system. (To be done.) I’ve seen various statements online indicating that (e.g.) a 15 node cluster can cope with up to 120,000 inserts per second.

Cassandra and Lightcurves

Alternatives to spatial indexing could be to store the lightcurve in Cassandra and index by objectID and candidateID. This is actually easier to implement in Cassandra, but the caveats of how the data is queried still apply.

ObjectID	CandidateID	ra	decl	magpsf	fid
ZTF20aauwhfa	1200382622615015029	218.1603106	31.6709366	18.3553	1

Conclusions & Further Work

...

This should be deployed as a group of machines into a properly distributed infrastructure (e.g. a group of openstack machines) to test both loading speed AND query speed.

Other Technologies to Consider

We’ve only just been made aware of a Qserv-like architecture for PostgreSQL - called CitusData. Gaia use PostgreSQL extensively along with Q3C spatial indexing, so it’s definitely worth considering.

Versions Compared

Old Version 9

New Version Current

Key

Cone Searching & HTM

Partition Keys and Clustering Keys

Cassandra and Lightcurves

Conclusions & Further Work

Other Technologies to Consider