Timing Lasair ingestion with SSD

Here are four timings for the existing Lasair-ZTF ingestion system.

  • Events are received from the Kafka mirrormaker.

  • Code under test is ingestStreamThreaded running with 8 threads.

    • The processing consists of three stages (see the sketch after this list):

      • reading from Kafka

      • computing, and

      • inserting into the database.

  • Timings were done with the event topic of 23 Jan 2020.

  • The code ran on lsstukdata1, aka lasair-dev.roe.ac.uk.
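
For illustration, here is a minimal sketch of that three-stage structure. This is not the actual ingestStreamThreaded code: it assumes the confluent_kafka client, and the broker address, topic name, compute_features() and insert_into_database() functions are placeholders.

    # Minimal sketch of the three-stage ingestion worker (not the real ingestStreamThreaded).
    # The broker address, topic name, and the two placeholder functions are assumptions.
    import threading
    from confluent_kafka import Consumer

    def compute_features(avro_bytes):
        # Placeholder for the Avro decode + feature computation stage.
        return avro_bytes

    def insert_into_database(record):
        # Placeholder for the database insert stage.
        pass

    def worker(broker, topic):
        consumer = Consumer({
            'bootstrap.servers': broker,
            'group.id': 'ingest-timing-test',
            'auto.offset.reset': 'earliest',
        })
        consumer.subscribe([topic])
        try:
            while True:
                msg = consumer.poll(timeout=1.0)            # stage 1: read from Kafka
                if msg is None or msg.error():
                    continue
                record = compute_features(msg.value())      # stage 2: compute
                insert_into_database(record)                # stage 3: insert into database
        finally:
            consumer.close()

    # Run 8 worker threads, as in the timed configuration.
    threads = [threading.Thread(target=worker, args=('localhost:9092', 'ztf_20200123'))
               for _ in range(8)]
    for t in threads:
        t.start()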

Only receiving from Kafka

  • received 232205 events in 720 seconds = 3 ms per event

Ingestion and computing only

  • No database inserts

  • Handled 232205 events in 1303 seconds = 5.6 ms per event

Ingestion with spinning disk

  • all three ingestion processes

  • ingested 152205 events in 11124 seconds = 73 ms per event

Ingestion with solid state disk

  • all three ingestion processes

  • ingested 152205 events in 9771 seconds = 64 ms per event
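
As a quick sanity check, the per-event figures above are simply the elapsed wall-clock time divided by the number of events:

    # Recompute the per-event timings reported above (elapsed seconds / events, in ms).
    timings = {
        'Kafka read only':        (232205, 720),     # ~3.1 ms per event
        'Read + compute, no DB':  (232205, 1303),    # ~5.6 ms per event
        'Full ingestion, HDD':    (152205, 11124),   # ~73 ms per event
        'Full ingestion, SSD':    (152205, 9771),    # ~64 ms per event
    }
    for label, (events, seconds) in timings.items():
        print('%-22s %5.1f ms per event' % (label, 1000.0 * seconds / events))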

 

Pure Database Ingestion Tests - HDD vs SSD

Before I ran the database tests, I also installed a tool called bonnie++ to test the latency of raw I/O for both HDD and SSD. The results are tricky to read, but they indicate that for block sequential access the speeds are similar, while for random access SSD beats HDD by a large margin. I ran the tests on lasair-dev-db and also on two machines in Belfast (psdb2 and psdb3). Here are the results (bonnie++ latencies, in ms):

 

Machine / disk            random seeks (ms)   sequential output (ms)   sequential input (ms)
lasair-dev-db HDD         52                  1225                     1215
psdb3 HDD                 117                 326                      515
lasair-dev-db NVME-SSD    3                   40                       38
psdb2 PCI-SSD             7                   126                      11
psdb3 SATA-SSD            4                   227                      3

The headline is that for random seeks (the vast majority of the queries we’ll get) the SSD is about 17 times faster, and sequential input and output are about 30 times faster. Of course, this massive raw I/O speed increase will be slowed down by the interaction with the database process, so below are some actual database tests.

Database Speed Test (part 1 - Ingestion)

The following test completely bypasses the Kafka/Avro packet parsing and just reads 55,331,759 records from a CSV file and forces them into the database. The code (Python 2 at the moment) is currently in my PS1 git repo, which isn’t public yet but will be soon. The code uses the maximum number of cores available; in this case we have allocated 14 cores to lasair-dev-db, so we use all 14.
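
The loader itself is not public yet, so the following is only a rough sketch of the approach, assuming a MySQL/MariaDB backend via mysql.connector and a multiprocessing pool of 14 workers; the connection details, CSV filename, column names and batch size are placeholders, not the real PS1 code.

    # Rough sketch of a parallel CSV bulk loader (not the actual PS1 repo code).
    # Assumes mysql.connector; connection details, filename, columns and batch size
    # are placeholders.
    import csv
    import multiprocessing
    import mysql.connector

    DB = dict(host='lasair-dev-db', user='ingest', password='...', database='speedtest')
    BATCH = 10000

    def insert_chunk(rows):
        conn = mysql.connector.connect(**DB)
        cursor = conn.cursor()
        cursor.executemany(
            'INSERT INTO tcs_cat_gaia_dr2 (source_id, ra, decl) VALUES (%s, %s, %s)',
            rows)
        conn.commit()
        cursor.close()
        conn.close()

    def batches(filename, size=BATCH):
        # Yield the CSV in batches so the 55 million rows never sit in memory at once.
        with open(filename) as f:
            batch = []
            for row in csv.reader(f):
                batch.append(row)
                if len(batch) == size:
                    yield batch
                    batch = []
            if batch:
                yield batch

    if __name__ == '__main__':
        pool = multiprocessing.Pool(14)   # all 14 cores allocated to lasair-dev-db
        for _ in pool.imap_unordered(insert_chunk, batches('gaia_dr2.csv')):
            pass
        pool.close()
        pool.join()

The elapsed times for the full 55,331,759-row load on each disk are tabulated below.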

Disk type   Elapsed ingest time (minutes)   Ingestion speed (rows/sec)
HDD         168                             5,489
SSD         165                             5,589

Surprisingly, there was very little difference between HDD and SSD. This indicates that the main bottlenecks are the database processes themselves. Note, however, that this is for SEQUENTIAL WRITING; random reading, on the other hand, shows a much greater difference.

Database Speed Test (part 2 - Random Cone Searching)

For the random cone-searching test, it is CRITICAL that we randomly shuffle the input data before running the cone searches; the RESET CACHE command DOES NOT WORK! I did 10,000, 100,000 and 1 million cone searches of real objects from the 55-million-row Gaia source table loaded above. This ensures that we are testing not only the lookup, but also the ability of the database to return the results back to us. In each case I search for the nearest object, so a cone search of 10,000 random objects should return 10,000 objects from the database. The results are collated single-threaded, so this may slightly skew the overall turnaround time.

The commands run are shown below: first shuffle the data, then do the cone searches. The example is for the one-million cone-search run. The databases are called “speedtest” (SSD) and “old_speedtest” (HDD).

 

shuf gaia_dr2_objects_random.tst > gaia_dr2_objects_random_1000000.tst

(panstarrs) ken::lasair-dev-db { ~/gitrelease/ps1/code/utils/python }-> time python \
    manyConeSearchesTest.py ../../../config/config_old_speedtest.yaml \
    ~/gaia_dr2_objects_random_1000000.tst --table=tcs_cat_gaia_dr2 \
    --searchradius=0.1 --nprocesses=14
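
manyConeSearchesTest.py lives in the same private repo, so purely for illustration here is a rough sketch of the pattern it follows: pre-shuffled coordinates, a pool of 14 workers, and one nearest-object search per coordinate. The SQL uses a naive great-circle distance expression with placeholder column names (ra, decl, source_id); the real code presumably relies on the catalogue's spatial indexing, and the input is assumed to be whitespace-separated ra/dec pairs.

    # Rough sketch of a parallel cone-search driver (not the real manyConeSearchesTest.py).
    # Assumes mysql.connector; the naive angular-distance SQL and column names are
    # placeholders -- the real code presumably uses the catalogue's spatial indexing.
    import multiprocessing
    import mysql.connector

    DB = dict(host='lasair-dev-db', user='reader', password='...', database='speedtest')
    RADIUS_DEG = 0.1

    _conn = None

    def init_worker():
        # One database connection per worker process.
        global _conn
        _conn = mysql.connector.connect(**DB)

    def nearest_object(coord):
        ra, dec = float(coord[0]), float(coord[1])
        cursor = _conn.cursor()
        cursor.execute(
            'SELECT source_id,'
            ' DEGREES(ACOS(LEAST(1.0,'
            '   SIN(RADIANS(decl)) * SIN(RADIANS(%s)) +'
            '   COS(RADIANS(decl)) * COS(RADIANS(%s)) * COS(RADIANS(ra - %s))))) AS sep'
            ' FROM tcs_cat_gaia_dr2'
            ' ORDER BY sep LIMIT 1',
            (dec, dec, ra))
        row = cursor.fetchone()
        cursor.close()
        if row is not None and row[1] <= RADIUS_DEG:
            return row
        return None

    if __name__ == '__main__':
        with open('gaia_dr2_objects_random_1000000.tst') as f:
            coords = [line.split() for line in f if line.strip()]   # pre-shuffled ra/dec pairs
        pool = multiprocessing.Pool(14, initializer=init_worker)
        results = pool.map(nearest_object, coords)                  # collated in the parent process
        pool.close()
        pool.join()
        print('%d objects returned' % sum(r is not None for r in results))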

In the following table, the searches were run 5 times (except for the million row search). The figure reported is the mean of the last 3 runs.

Number of Cone Searches   HDD (s)                 SSD (s)
10,000                    44                      7
100,000                   365 *first run only     68
1,000,000                 1682 *first run only    676 *first run only

The result is that the SSD is about 6 times faster than the HDD at returning the results of random cone searches. The million-cone-search result may have been skewed by the result collation or by memory consumption (to be checked and repeated), but even there the SSD is about 2.5 times faster.
