Timing Lasair ingestion with SSD
Here are four timings for the existing Lasair-ZTF ingestion system. Events are received from the Kafka MirrorMaker. The code under test is ingestStreamThreaded, running with 8 threads. The process consists of three stages:

- reading from Kafka,
- computing, and
- inserting into the database.

Timings were done with the event topic of 23 Jan 2020. The code ran on lsstukdata1, aka lasair-dev.roe.ac.uk.
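The three stages run concurrently, handing events along through queues. A minimal sketch of that structure, assuming the confluent_kafka client (the function names, topic name, and configuration here are illustrative, not the actual ingestStreamThreaded code):

```python
import queue
import threading

from confluent_kafka import Consumer  # assumption: Kafka client used by the ingester

N_THREADS = 8
raw_q = queue.Queue(maxsize=1000)   # Kafka packets awaiting processing
db_q = queue.Queue(maxsize=1000)    # processed records awaiting insertion

def process(packet):
    return packet   # placeholder for the real alert computation

def save(record):
    pass            # placeholder for the real database insert

def read_from_kafka():
    """Stage 1: pull alert packets off the Kafka mirror."""
    consumer = Consumer({'bootstrap.servers': 'localhost:9092',
                         'group.id': 'ingest-test'})
    consumer.subscribe(['ztf_20200123_programid1'])  # hypothetical topic name
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is not None and msg.error() is None:
            raw_q.put(msg.value())

def compute():
    """Stage 2: CPU-bound processing of each packet."""
    while True:
        db_q.put(process(raw_q.get()))

def insert_into_db():
    """Stage 3: write processed records to the database."""
    while True:
        save(db_q.get())

threads = ([threading.Thread(target=read_from_kafka, daemon=True)] +
           [threading.Thread(target=compute, daemon=True) for _ in range(N_THREADS)] +
           [threading.Thread(target=insert_into_db, daemon=True)])
for t in threads:
    t.start()
for t in threads:
    t.join()   # run until interrupted
```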
| Configuration | Events | Elapsed time (s) | Time per event (ms) |
|---|---|---|---|
| Receiving from Kafka only | 232,205 | 720 | 3.1 |
| Receiving and computing, no database inserts | 232,205 | 1,303 | 5.6 |
| Full ingestion (all three stages), spinning disk | 152,205 | 11,124 | 73 |
| Full ingestion (all three stages), solid state disk | 152,205 | 9,771 | 64 |
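The per-event figures are just the elapsed time divided by the event count:

```python
# Per-event times from the table above: 1000 * seconds / events, in ms
for label, seconds, events in [("Kafka only", 720, 232205),
                               ("Kafka + compute", 1303, 232205),
                               ("Full ingestion, HDD", 11124, 152205),
                               ("Full ingestion, SSD", 9771, 152205)]:
    print(f"{label}: {1000 * seconds / events:.1f} ms/event")
```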
Pure Database Ingestion Tests - HDD vs SSD
Before I ran the database tests, I also installed a tool called bonnie++ to test the latency of raw I/O for both HDD and SSD. The results are tricky to read, but indicate that for block sequential access, the speeds are similar, but for random access, SSD beats HDD by a large margin. I ran the tests on lasair-dev-db and also on 2 machines in Belfast (psdb2 and psdb3). Here are the results:
| Machine / disk | Random seeks (ms) | Sequential output (ms) | Sequential input (ms) |
|---|---|---|---|
| lasair-dev-db HDD | 52 | 1225 | 1215 |
| psdb3 HDD | 117 | 326 | 515 |
| lasair-dev-db NVMe SSD | 3 | 40 | 38 |
| psdb2 PCI SSD | 7 | 126 | 11 |
| psdb3 SATA SSD | 4 | 227 | 3 |
The headline is that for random seeks (the vast majority of the queries we’ll get) the SSD is about 17 times faster than the HDD, and for sequential input and output it is about 30 times faster. Of course, this massive raw I/O speed increase will be diluted by the interaction with the database process, so below are some raw database tests.
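The random-seek effect is easy to reproduce without bonnie++. A toy illustration (the path is a placeholder, and the file must be much larger than RAM, otherwise the page cache hides the disk entirely):

```python
import os
import random
import time

def random_read_latency_ms(path, block=4096, n=1000):
    """Mean time to read n randomly placed blocks from an existing large file."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    start = time.time()
    for _ in range(n):
        os.lseek(fd, random.randrange(0, size - block), os.SEEK_SET)
        os.read(fd, block)
    os.close(fd)
    return 1000 * (time.time() - start) / n

print(random_read_latency_ms("/data/bigfile"))  # hypothetical test file
```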
Database Speed Test (part 1 - Ingestion)
The following test completely bypasses the Kafka/Avro packet parsing, and just reads 55,331,759 records from a CSV file and forces them into the database. The code (Python 2 at the moment) is currently in my PS1 git repo, which isn’t public yet, but will be soon. The code uses the maximum number of cores available; in this case we have allocated 14 cores to lasair-dev-db, so we use all 14.
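The shape of that test is roughly as follows. This is a sketch only, assuming a MySQL backend with hypothetical column names, credentials, and file name, not the actual PS1 repo code:

```python
import csv
from multiprocessing import Pool

import mysql.connector  # assumption: MySQL/MariaDB backend

NPROCESSES = 14   # one worker per allocated core
CHUNK = 10000     # rows per executemany() batch

def insert_chunk(rows):
    """Each worker opens its own connection and bulk-inserts one chunk."""
    conn = mysql.connector.connect(user="ingest", password="...",
                                   database="speedtest")  # credentials hypothetical
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO tcs_cat_gaia_dr2 (id, ra, decl) VALUES (%s, %s, %s)",
        rows)  # column names are assumptions
    conn.commit()
    conn.close()

def chunks(reader, size):
    """Group CSV rows into lists of `size` rows."""
    buf = []
    for row in reader:
        buf.append(row)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

if __name__ == "__main__":
    with open("gaia_source.csv") as f:  # hypothetical file name
        with Pool(NPROCESSES) as pool:
            # imap_unordered streams chunks to workers without loading the
            # whole 55M-row file into memory first
            for _ in pool.imap_unordered(insert_chunk, chunks(csv.reader(f), CHUNK)):
                pass
```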
| Disk type | Elapsed ingest time (minutes) | Ingestion speed (rows/s) |
|---|---|---|
| HDD | 168 | 5,489 |
| SSD | 165 | 5,589 |
Surprisingly, there was very little difference between HDD and SSD, which indicates that the main bottleneck is the database process itself. But note that this is for SEQUENTIAL WRITING; random reading, on the other hand, shows a much greater difference.
Database Speed Test (part 2 - Random Cone Searching)
For the random cone searching test, it is CRITICAL that we randomly shuffle the input data before running the cone searches; the RESET CACHE command DOES NOT WORK! I did 10,000, 100,000 and 1 million cone searches of real objects from the 55-million-row Gaia source table loaded above. This ensures that we are testing not only the lookup, but also the ability of the database to return the results to us. In each case I search for the nearest object, so for a run of 10,000 random cone searches I should get 10,000 objects back from the database. The results are collated single-threaded, so this may slightly skew the overall turnaround time.

The command run is the following: first shuffle the data, then do the cone searches. The example below is for the million cone search run. The databases are called “speedtest” (SSD) and “old_speedtest” (HDD).
```
shuf gaia_dr2_objects_random.tst > gaia_dr2_objects_random_1000000.tst

(panstarrs) ken::lasair-dev-db { ~/gitrelease/ps1/code/utils/python }-> time python \
    manyConeSearchesTest.py ../../../config/config_old_speedtest.yaml \
    ~/gaia_dr2_objects_random_1000000.tst --table=tcs_cat_gaia_dr2 \
    --searchradius=0.1 --nprocesses=14
```
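For reference, a single nearest-object cone search can be expressed in plain SQL as a great-circle separation. This is an illustrative sketch only: the column names are assumptions, and the real manyConeSearchesTest.py avoids the full table scan this implies by using a spatial index:

```python
import mysql.connector  # assumption: MySQL backend

# Spherical law of cosines, clamped into the valid ACOS domain. Scanning
# the whole table makes this far slower than the indexed production code.
NEAREST_SQL = """
    SELECT id, ra, decl,
           DEGREES(ACOS(GREATEST(-1.0, LEAST(1.0,
               SIN(RADIANS(%s)) * SIN(RADIANS(decl)) +
               COS(RADIANS(%s)) * COS(RADIANS(decl)) * COS(RADIANS(ra - %s))
           )))) AS sep
    FROM tcs_cat_gaia_dr2
    ORDER BY sep
    LIMIT 1
"""

def nearest_object(cursor, ra, dec, radius_deg=0.1):
    """Return the nearest catalogue object within radius_deg, or None."""
    cursor.execute(NEAREST_SQL, (dec, dec, ra))
    row = cursor.fetchone()
    return row if row and row[-1] <= radius_deg else None
```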
In the following table, the searches were run 5 times, and the figure reported is the mean of the last 3 runs; entries marked “first run only” are from a single run.
| Number of cone searches | HDD (s) | SSD (s) |
|---|---|---|
| 10,000 | 44 | 7 |
| 100,000 | 365 (first run only) | 68 |
| 1,000,000 | 1682 (first run only) | 676 (first run only) |
The result is that the SSD is about 6 times faster than the HDD at returning the results of random cone searches. The million cone search result may have been skewed by the result collation or by memory consumption (to be checked and repeated), but even there the SSD is nearly three times faster.
If you require this document in an alternative format, please contact the LSST:UK Project Managers lusc_pm@mlist.is.ed.ac.uk or phone +44 131 651 3577