Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Cone Searching & HTM

Many of the queries we want to do are related to the position of objects in the sky. We use HTM - Hierarchical Triangular Mesh - to index the sky and utilise the API to augment our SQL queries. In this document I will attempt to use HTM with Cassandra. This may of course be a bad match of technologies (see issues below), and there may be other spatial indexing technologies that tie in better with Cassandra.

...

Cassandra will now allow you to query any old column with any constraint. Unlike databases, you have to build the column families (tables) with the query in mind, otherwise you can’t easily pull out the information you need. Let’s take a look at a row of the Gaia DR2 table. The unique identifier column is “source_id”. But this doesn’t mean much in terms of organising the data.

HTM16

source_id

ra

dec

phot_g_mean_mag

54680902005

4309278608071975424

288.70392

9.99498

17.5

If the primary key is just the source_id, then we can only search by source_id. However, primary keys can be made of more than one column. If we now create the primary key as the group of HTM16 AND source_id, then we can now cone search by pulling out a list of HTM16s associated with an RA and Dec. The first part is the partition key, and the second part is a local clustering key.

Remember, we can’t use ranges of HTM16 triangle IDs, but we CAN use the “IN” statement. This is fine for a small radius, but remember what happens if we open the radius up - even to something as innocuous as 50 arcsec? Exploding numbers of HTM16s (over 1500 triangles above)!

Inside Lasair (and Sherlock, ATLAS, Pan-STARRS, etc) we can use more than one triangle level to mitigate the large lists (e.g. HTM10, HTM13 and HTM16). BUT - we only have ONE partition key in Cassandra, so we need to choose. We can’t use (for example) HTM10, HTM13, HTM16 plus the source_id. Why, because you can’t search on a “clustering key” on its own OR if it is secondary to another key.

A deeper look at HTM

We often forget the “H” (hierarchical) in HTM. HTM level 10 is a superset of HTM 13, which is a superset of HTM 16.

Decimal

Binary

Base4

HTM16

54680902005

110010111011001111000101100101110101

N02323033011211311

HTM13

854389093

110010111011001111000101100101

N02323033011211

HTM10

13349829

110010111011001111000101

N02323033011

Cassandra Loading and Querying Data

Remember our SSD vs HDD tests? I loaded up the 55 million rows of Gaia DR2 data into Cassandra. It took exactly 25 hours! (About 600 records per second.)

I also pulled out 1,000,000 random rows based on the HTM16 id triangles from the RA and Dec of the million records. (1 hour and 50 minutes!)

I’m certain these numbers will massively improve if we build a properly distributed Cassandra system. To be done.