Cone Searching & HTM
Many of the queries we want to do are related to the position of objects in the sky. We use HTM - Hierarchical Triangular Mesh - to index the sky and utilise the API to augment our SQL queries. In this document I will attempt to use HTM with Cassandra. This may of course be a bad match of technologies (see issues below), and there may be other spatial indexing technologies that tie in better with Cassandra.
...
Decimal | Binary | Base4 | |
---|---|---|---|
HTM16 | 54680902005 | 110010111011001111000101100101110101 | N02323033011211311 |
HTM13 | 854389093 | 110010111011001111000101100101 | N02323033011211 |
HTM10 | 13349829 | 110010111011001111000101 | N02323033011 |
It turns out that in Cassandra, you can have multiple clustering columns provided the order is preserved when doing a query. In the cassandra query language (CQL) it is possible to have more than one column in an IN query.
In Cassandra, we can reorganise the above table, using elements of the (e.g.) base 4 HTM.
E.g.
HTM10 | HTM13 | HTM16 | source_id | ra | dec | phot_g_mean_mag |
---|---|---|---|---|---|---|
N02323033011 | 211 | 311 | 4309278608071975424 | 288.70392 | 9.99498 | 17.5 |
Now, we can query using the HTM10 field on its own (the partition key), the HTM10 and HTM13 fields (= HTM level 13) and the HTM10 and HTM13 and HTM16 fields (= HTM level 16). (And the full set of fields + the source_id.)
E.g.
HTM10 query
Code Block |
---|
select * from tcs_cat_gaia_dr2 where htm10 IN ('S23023222101','S23023222132'); |
HTM13 query
Code Block |
---|
select * from tcs_cat_gaia_dr2 where htm10 IN ('S23023222101','S23023222132')
AND (htm13) IN
('000','001','003','000','002','003'); |
HTM16 query
Code Block |
---|
select * from tcs_cat_gaia_dr2 where htm10 IN ('S23023222101','S23023222132')
AND (htm13,htm16) IN
(('000','100'),('001','200'),
('003','200'),('000','200'),
('002','100'),('003','100')); |
Cassandra Loading and Querying Data
...
I’m certain these numbers will massively improve if we build a properly distributed Cassandra system. To be done.
Conclusions & Further Work
Cassandra CAN be used for data loading and cone searching using the mechanisms above. Additionally the HTM interface also allows for other types of spatial searches - not just cones. All we need are the triangle IDs.
Primary keys need to be chosen very carefully!
Extraction of data based on ranges of other columns require that the data be copied into separate tables and primary keys chosen accordingly.
This should be deployed as a group of machines into a properly distributed infrastructure (e.g. a group of openstack machines) to test both loading speed AND query speed.