Timing Lasair ingestion with SSD
Here are four timings for the existing Lasair-ZTF ingestion system. Events are received from the Kafka MirrorMaker. The code under test is ingestStreamThreaded, running with 8 threads. The process consists of three stages:

- reading from Kafka,
- computing, and
- inserting into the database.

Timings were done with the event topic of 23 Jan 2020. The code ran on lsstukdata1, aka lasair-dev.roe.ac.uk.
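The three stages run concurrently, handing events along through queues. A minimal sketch of that structure, assuming the confluent_kafka client (the function names, topic name, and configuration here are illustrative, not the actual ingestStreamThreaded code):

```python
import queue
import threading

from confluent_kafka import Consumer  # assumption: Kafka client used by the ingester

N_THREADS = 8
raw_q = queue.Queue(maxsize=1000)   # Kafka packets awaiting processing
db_q = queue.Queue(maxsize=1000)    # processed records awaiting insertion

def process(packet):
    return packet   # placeholder for the real alert computation

def save(record):
    pass            # placeholder for the real database insert

def read_from_kafka():
    """Stage 1: pull alert packets off the Kafka mirror."""
    consumer = Consumer({'bootstrap.servers': 'localhost:9092',
                         'group.id': 'ingest-test'})
    consumer.subscribe(['ztf_20200123_programid1'])  # hypothetical topic name
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is not None and msg.error() is None:
            raw_q.put(msg.value())

def compute():
    """Stage 2: CPU-bound processing of each packet."""
    while True:
        db_q.put(process(raw_q.get()))

def insert_into_db():
    """Stage 3: write processed records to the database."""
    while True:
        save(db_q.get())

threads = ([threading.Thread(target=read_from_kafka, daemon=True)] +
           [threading.Thread(target=compute, daemon=True) for _ in range(N_THREADS)] +
           [threading.Thread(target=insert_into_db, daemon=True)])
for t in threads:
    t.start()
for t in threads:
    t.join()   # run until interrupted
```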
| Configuration | Events | Elapsed time (s) | Time per event (ms) |
|---|---|---|---|
| Receiving from Kafka only | 232,205 | 720 | 3.1 |
| Receiving and computing, no database inserts | 232,205 | 1,303 | 5.6 |
| Full ingestion (all three stages), spinning disk | 152,205 | 11,124 | 73 |
| Full ingestion (all three stages), solid state disk | 152,205 | 9,771 | 64 |
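The per-event figures are just the elapsed time divided by the event count:

```python
# Per-event times from the table above: 1000 * seconds / events, in ms
for label, seconds, events in [("Kafka only", 720, 232205),
                               ("Kafka + compute", 1303, 232205),
                               ("Full ingestion, HDD", 11124, 152205),
                               ("Full ingestion, SSD", 9771, 152205)]:
    print(f"{label}: {1000 * seconds / events:.1f} ms/event")
```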
Pure Database Ingestion Tests - HDD vs SSD
Before I ran the database tests, I also installed a tool called bonnie++ to test the latency of raw I/O for both HDD and SSD. The results are tricky to read, but indicate that for block sequential access, the speeds are similar, but for random access, SSD beats HDD by a large margin. I ran the tests on lasair-dev-db and also on 2 machines in Belfast (psdb2 and psdb3). Here are the results:
| Machine / disk | Random seeks (ms) | Sequential output (ms) | Sequential input (ms) |
|---|---|---|---|
| lasair-dev-db HDD | 52 | 1225 | 1215 |
| psdb3 HDD | 117 | 326 | 515 |
| lasair-dev-db NVMe SSD | 3 | 40 | 38 |
| psdb2 PCI SSD | 7 | 126 | 11 |
| psdb3 SATA SSD | 4 | 227 | 3 |
The headline is that for random seeks (the vast majority of the queries we’ll get) the SSD is about 17 times faster than the HDD, and for sequential input and output it is about 30 times faster. Of course, this massive raw I/O speed increase will be diluted by the interaction with the database process, so below are some raw database tests.
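The random-seek effect is easy to reproduce without bonnie++. A toy illustration (the path is a placeholder, and the file must be much larger than RAM, otherwise the page cache hides the disk entirely):

```python
import os
import random
import time

def random_read_latency_ms(path, block=4096, n=1000):
    """Mean time to read n randomly placed blocks from an existing large file."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    start = time.time()
    for _ in range(n):
        os.lseek(fd, random.randrange(0, size - block), os.SEEK_SET)
        os.read(fd, block)
    os.close(fd)
    return 1000 * (time.time() - start) / n

print(random_read_latency_ms("/data/bigfile"))  # hypothetical test file
```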
Database Speed Test (part 1 - Ingestion)
The following test completely bypasses the Kafka/Avro packet parsing, and just reads 55,331,759 records from a CSV file and forces them into the database. The code (Python 2 at the moment) is currently in my PS1 git repo, which isn’t public yet, but will be soon. The code uses the maximum number of cores available; in this case we have allocated 14 cores to lasair-dev-db, so we use all 14.
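The shape of that test is roughly as follows. This is a sketch only, assuming a MySQL backend with hypothetical column names, credentials, and file name, not the actual PS1 repo code:

```python
import csv
from multiprocessing import Pool

import mysql.connector  # assumption: MySQL/MariaDB backend

NPROCESSES = 14   # one worker per allocated core
CHUNK = 10000     # rows per executemany() batch

def insert_chunk(rows):
    """Each worker opens its own connection and bulk-inserts one chunk."""
    conn = mysql.connector.connect(user="ingest", password="...",
                                   database="speedtest")  # credentials hypothetical
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO tcs_cat_gaia_dr2 (id, ra, decl) VALUES (%s, %s, %s)",
        rows)  # column names are assumptions
    conn.commit()
    conn.close()

def chunks(reader, size):
    """Group CSV rows into lists of `size` rows."""
    buf = []
    for row in reader:
        buf.append(row)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

if __name__ == "__main__":
    with open("gaia_source.csv") as f:  # hypothetical file name
        with Pool(NPROCESSES) as pool:
            # imap_unordered streams chunks to workers without loading the
            # whole 55M-row file into memory first
            for _ in pool.imap_unordered(insert_chunk, chunks(csv.reader(f), CHUNK)):
                pass
```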
| Disk type | Elapsed ingest time (minutes) | Ingestion speed (rows/s) |
|---|---|---|
| HDD | 168 | 5,489 |
| SSD | 165 | 5,589 |
Surprisingly, there was very little difference between HDD and SSD, which indicates that the main bottleneck is the database process itself. But note that this is for SEQUENTIAL WRITING; random reading, on the other hand, shows a much greater difference.
Database Speed Test (part 2 - Random Cone Searching)
For the random cone searching test, it is CRITICAL that we randomly shuffle the input data before running the cone searches; the RESET CACHE command DOES NOT WORK! I did 10,000, 100,000 and 1 million cone searches of real objects from the 55-million-row Gaia source table loaded above. This ensures that we are testing not only the lookup, but also the ability of the database to return the results to us. In each case I search for the nearest object, so for a run of 10,000 random cone searches I should get 10,000 objects back from the database. The results are collated single-threaded, so this may slightly skew the overall turnaround time.

The command run is the following: first shuffle the data, then do the cone searches. The example below is for the million cone search run. The databases are called “speedtest” (SSD) and “old_speedtest” (HDD).
```
shuf gaia_dr2_objects_random.tst > gaia_dr2_objects_random_1000000.tst

(panstarrs) ken::lasair-dev-db { ~/gitrelease/ps1/code/utils/python }-> time python \
    manyConeSearchesTest.py ../../../config/config_old_speedtest.yaml \
    ~/gaia_dr2_objects_random_1000000.tst --table=tcs_cat_gaia_dr2 \
    --searchradius=0.1 --nprocesses=14
```
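For reference, a single nearest-object cone search can be expressed in plain SQL as a great-circle separation. This is an illustrative sketch only: the column names are assumptions, and the real manyConeSearchesTest.py avoids the full table scan this implies by using a spatial index:

```python
import mysql.connector  # assumption: MySQL backend

# Spherical law of cosines, clamped into the valid ACOS domain. Scanning
# the whole table makes this far slower than the indexed production code.
NEAREST_SQL = """
    SELECT id, ra, decl,
           DEGREES(ACOS(GREATEST(-1.0, LEAST(1.0,
               SIN(RADIANS(%s)) * SIN(RADIANS(decl)) +
               COS(RADIANS(%s)) * COS(RADIANS(decl)) * COS(RADIANS(ra - %s))
           )))) AS sep
    FROM tcs_cat_gaia_dr2
    ORDER BY sep
    LIMIT 1
"""

def nearest_object(cursor, ra, dec, radius_deg=0.1):
    """Return the nearest catalogue object within radius_deg, or None."""
    cursor.execute(NEAREST_SQL, (dec, dec, ra))
    row = cursor.fetchone()
    return row if row and row[-1] <= radius_deg else None
```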
In the following table, the searches were run 5 times, and the figure reported is the mean of the last 3 runs; entries marked “first run only” are from a single run.
| Number of cone searches | HDD (s) | SSD (s) |
|---|---|---|
| 10,000 | 44 | 7 |
| 100,000 | 365 (first run only) | 68 |
| 1,000,000 | 1682 (first run only) | 676 (first run only) |
The result is that the SSD is about 6 times faster than the HDD at returning the results of random cone searches. The million cone search result may have been skewed by the result collation or by memory consumption (to be checked and repeated), but even there the SSD is nearly three times faster.
If you require this document in an alternative format, please contact the LSST:UK Project Managers lusc_pm@mlist.is.ed.ac.uk or phone +44 131 651 3577