...

In Lasair-ZTF, users can type in SQL to query the database tables that represent past alerts. This interface encourages innovation through its flexibility and wide scope. But these queries are built by non-experts, and can be inefficient, run for a very long time, or even fill up the temporary disk space. Once a user has built a static query, they can make it a streaming query that runs repeatedly as alerts are ingested, so that exciting science is immediately available to that user. It has been a challenge in Lasair-ZTF to unite the different semantics of static and streaming queries. We are investigating different dialects of SQL: MySQL, KSQL (Kafka), CQL (Cassandra), and ElasticSearch. Having different dialects for static and for streaming queries would add to that challenge. It may be necessary to control more closely the syntax of the user-generated queries, so that (a) selected attributes can only come from a pre-selected set, for example "mean_mag_g", "jd - 2400000.5 AS mjd", and (b) the constraint clauses can only come from a limited subset, for example "mean_mag_g < 19", "classification IN (SN, NT)". This approach follows the way VizieR tables are queried.
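One way to picture the restricted syntax described above is a small query builder that accepts only whitelisted attributes and constraint columns. This is a minimal sketch, not the Lasair implementation: the function `build_query`, the table name `objects`, the whitelist dictionaries, and the `%s` placeholder style are all assumptions made for illustration; only the attribute and constraint examples come from the text.

```python
# Hypothetical whitelists; the entries are the examples quoted in the text.
ALLOWED_SELECT = {
    "mean_mag_g": "mean_mag_g",
    "mjd": "jd - 2400000.5 AS mjd",
}
# Columns that may appear in a constraint, with the type each value is cast to.
ALLOWED_COLUMNS = {"mean_mag_g": float, "classification": str}
ALLOWED_OPERATORS = {"<", "<=", ">", ">=", "=", "IN"}


def build_query(select_names, constraints, table="objects"):
    """Return (sql, params) assembled only from whitelisted pieces.

    constraints is a list of (column, operator, value) tuples. Values are
    never pasted into the SQL text; they are returned separately so the
    database driver can bind them.
    """
    columns = []
    for name in select_names:
        if name not in ALLOWED_SELECT:
            raise ValueError(f"attribute not allowed: {name}")
        columns.append(ALLOWED_SELECT[name])
    if not columns:
        raise ValueError("at least one attribute must be selected")

    clauses, params = [], []
    for column, operator, value in constraints:
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"constraint column not allowed: {column}")
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"operator not allowed: {operator}")
        cast = ALLOWED_COLUMNS[column]
        if operator == "IN":
            values = [cast(v) for v in value]
            if not values:
                raise ValueError("IN needs at least one value")
            placeholders = ", ".join(["%s"] * len(values))
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(values)
        else:
            clauses.append(f"{column} {operator} %s")
            params.append(cast(value))

    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

Calling `build_query(["mean_mag_g", "mjd"], [("mean_mag_g", "<", 19), ("classification", "IN", ["SN", "NT"])])` yields a parameterised SELECT plus its bound values, while any attribute outside the whitelist raises `ValueError`. Because the same validated structure could in principle be rendered into more than one dialect, a builder of this kind might also ease the static versus streaming split, though that is untested here.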

...

The LSST data will be parsed into relational and binary categories. The former is a set of database tables with relational integrity, and the latter a filesystem or blob store of some kind. Traditional relational systems provide a wealth of access methods, optimisation mechanisms, and familiarity, while newer systems score better on scalability and speed than on convenience and utility. Sizing the database correctly is crucial to deciding which path to take. The database will have a set of "diaObject" records, each of which corresponds to multiple observations of a single star or galaxy; along with this "summary" is a set of "diaSource" records, one for each brightness measurement in the light curve. If these diaSource records are all represented in the relational database, it becomes much larger. An alternative is to describe each light curve by a few "features" that are more useful than the individual diaSource records; the feature vector will also be easier and more efficient to query (no JOIN), and the database can be smaller and faster.
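To make the feature-vector idea concrete, the sketch below reduces a list of diaSource-like records to a single flat record that could sit in the object table. It is only an illustration: the field names `mjd`, `band`, and `mag`, the choice of features, and the function `light_curve_features` are assumptions, not the Lasair or LSST schema.

```python
from statistics import mean


def light_curve_features(sources, bands=("g", "r")):
    """Summarise a light curve as one flat record of features.

    sources is a list of dicts with hypothetical keys "mjd", "band", "mag",
    one per brightness measurement.
    """
    features = {"ncand": len(sources)}
    for band in bands:
        mags = [s["mag"] for s in sources if s["band"] == band]
        # None marks a band with no detections.
        features[f"mean_mag_{band}"] = mean(mags) if mags else None
    features["last_mjd"] = max((s["mjd"] for s in sources), default=None)
    return features
```

A query such as "mean_mag_g < 19" can then be answered from the object table alone, without joining to the per-measurement records.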

The first section (A), on ZTF, is all ground truth from 21 months of running Lasair-ZTF. For each kind of database table, it gives the number of rows added (millions per year), the number of attributes in each record, and the resulting accumulation in gigabytes per year. Notice that the noncandidates use almost as much space as the candidates (i.e. detections), even though the noncandidate schema is so small.

The second section (B) takes the number of attributes in each record from the LSST data products definition (https://lse-163.lsst.io/), and for the millions/yr simply multiplies the ZTF figure by 50. Combined with the attributes per record, that gives a prediction of storage requirements.

A relational database with all the DIASources and DIAForcedSources would grow at 9.5 Tbyte per year (22 billion rows per year). If, however, the light curves are in a blob store, the relational database would contain just the last row of section (B), the objects with at least 3 detections, and it would grow at 0.5 Tbyte per year (0.1 billion rows per year).

                           millions/yr  attributes  gbytes/yr

(A) ZTF
ZTF candidates                      49         113         66
ZTF noncandidates                  394           4         60
ZTF objects                         10          37          4
ZTF objects ncand>=3                 2          37          1

(B) mult by 50 to get LSST numbers
LSST DIASources                   2450         111       3300
LSST DIAForcedSources            19700           8       6000
LSST DIAObjects                    500         396       2000
LSST DIAObjects ncand>=3           100         396        500

The attached data estimate database sizes by extrapolating from ZTF.
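The extrapolation can be written out as a short calculation. The sketch below assumes that storage scales linearly with both the number of rows and the number of attributes per record; that scaling rule is an inference from the tabulated numbers rather than something the text states, and the function and dictionary names are made up for illustration. Under that assumption it reproduces the LSST gbytes/yr figures to within rounding (under 10 percent).

```python
SCALE = 50  # LSST row rate taken as 50 times the ZTF rate, as in section (B)

# ZTF ground truth: (millions of rows per year, attributes, gbytes per year)
ZTF = {
    "candidates": (49, 113, 66),
    "noncandidates": (394, 4, 60),
    "objects": (10, 37, 4),
    "objects ncand>=3": (2, 37, 1),
}

# Corresponding LSST table and its attribute count, as listed in section (B).
LSST_ATTRIBUTES = {
    "candidates": ("DIASources", 111),
    "noncandidates": ("DIAForcedSources", 8),
    "objects": ("DIAObjects", 396),
    "objects ncand>=3": ("DIAObjects ncand>=3", 396),
}


def extrapolate(ztf_table):
    """Return (LSST table name, millions of rows/yr, estimated gbytes/yr)."""
    millions, attributes, gbytes = ZTF[ztf_table]
    lsst_name, lsst_attributes = LSST_ATTRIBUTES[ztf_table]
    lsst_millions = millions * SCALE
    # Assumed scaling: bytes grow with row count and with attributes per row.
    lsst_gbytes = gbytes * SCALE * lsst_attributes / attributes
    return lsst_name, lsst_millions, lsst_gbytes


if __name__ == "__main__":
    for table in ZTF:
        name, millions, gbytes = extrapolate(table)
        print(f"LSST {name:22s} {millions:6d} {gbytes:8.0f}")
```

For example, the noncandidates row gives 394 x 50 = 19700 million rows per year and 60 x 50 x 8 / 4 = 6000 gbytes per year, matching the DIAForcedSources row.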

Blob store

Images and perhaps light curves will be stored in a "blob store", in which each binary large object (blob) is identified by an identifier. This has been a file system in Lasair-ZTF. We need to be able to push in blobs fast when the ingest happens, and, for a researcher, to deliver a list of blobs for a list of identifiers. It should be manageable and scalable.
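The two operations named above, a fast write at ingest and a bulk read by identifier, suggest a very small interface. The following is a minimal sketch of a file-system-backed version, in the spirit of what Lasair-ZTF has used but not taken from it: the class name `FileBlobStore`, its methods, and the hashed directory layout are all assumptions.

```python
import hashlib
from pathlib import Path


class FileBlobStore:
    """Toy blob store: one file per blob under a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, identifier):
        # Hash the identifier so files spread evenly over subdirectories
        # instead of piling up in a single huge directory.
        digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
        return self.root / digest[:2] / digest[2:4] / digest

    def put(self, identifier, data):
        """Store the bytes in data under identifier, replacing any old blob."""
        path = self._path(identifier)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get_many(self, identifiers):
        """Return {identifier: bytes} for every identifier that is present."""
        found = {}
        for identifier in identifiers:
            path = self._path(identifier)
            if path.exists():
                found[identifier] = path.read_bytes()
        return found
```

Keeping the interface this narrow means the file system could later be swapped for an object store without changing the callers, which is one way to approach the manageability and scalability requirement.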

Hardware

Instead of the traditional spinning disk, a more expensive option is the solid-state disk (SSD). Sometimes an SSD is much faster, sometimes not. We will present evidence on where the SSD is worth its extra cost. Of course, if we are using cloud services, whether academic or commercial, it may not be possible to ask for VMs that have SSD resources attached, especially if we want a large number of VMs in a parallel cluster.

...