Time and venue: Wednesday 18th May 2020, via Zoom

Attendees: Andy L, Roy W, Stephen S, Bob M, Meg S, Ken S, Matt N, Dave Y, Britton S, Mark H, Stelios V, Michael F, Ally H, Dave R, Gareth F, Terry S, Cam Allen (Zooniverse), Nigel Hambley

Apologies:

Notes from discussion

DMR found some code for simulated alerts - https://github.com/lsst-sims/sims_alertsim

Roy Williams, Lasair today: what it does, how it works – demo

The Jupyter Notebook platform will be integrated into the LSST Science Platform (LSP).

Comments on the accumulated size of the data and the robustness of the service.

  • ZTF has, to date, observed 85 million candidates and 2 million objects.

  • We believe LSST will be ~50x larger than ZTF.

Sources of unreliability include:

  • Disk filling up.

  • Network issue to ROE.

  • Lasair development undertaken on a development system.

  • A third system, running on IRIS OpenStack, is the target for the future platform.

Reliability is strongly linked to the level of flexibility we give to users [RDW]:

  • For example, freeform SQL gives users a lot of scope to create resource-sapping queries.

  • Mike R has significant experience of controlling users' scope for generating SQL, using, for example, limits on the maximum number of records and maximum query times (see the sketch after this list).

  • Risk of malicious or ill-informed query generation.

  • Andy L noted the balance of being conservative versus ambitious in technology choices.
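
A minimal sketch of the kind of guardrails Mike R described, assuming a MySQL backend reached via mysql.connector; the limit values and the simplistic SELECT-only check are illustrative, not Lasair's actual implementation:

    import mysql.connector

    MAX_ROWS = 10_000      # cap on records returned (illustrative value)
    MAX_EXEC_MS = 30_000   # cap on run time via MySQL's optimizer hint

    def run_user_query(conn, user_sql):
        # Allow only SELECT statements, then bolt on row and time limits.
        stripped = user_sql.strip().rstrip(";")
        if not stripped.lower().startswith("select"):
            raise ValueError("only SELECT statements are permitted")
        guarded = (
            f"SELECT /*+ MAX_EXECUTION_TIME({MAX_EXEC_MS}) */ * "
            f"FROM ({stripped}) AS user_q LIMIT {MAX_ROWS}"
        )
        cur = conn.cursor()
        cur.execute(guarded)
        return cur.fetchall()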

Bob M asked if experiments were needed to test the scalability of the watchlist function.

  • Roy noted there is no hard limit on watchlist size, so there is scope for causing resource starvation. Further, watchlist queries run frequently.

  • Ken S noted the option to scale-test watchlists using a long list of variable stars (see the sketch after this list).

  • Roy W believes Sherlock is a better place in which to manage large, community-interest watchlists.

  • Bob M noted the need to work out the watchlist size that would trigger consideration of moving it to Sherlock.
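
A rough sketch of the scale test Ken S suggested, crossmatching a long watchlist against a night of alerts with astropy; the randomly generated positions, list sizes and 2-arcsecond match radius are stand-in assumptions:

    import numpy as np
    import astropy.units as u
    from astropy.coordinates import SkyCoord

    rng = np.random.default_rng(42)
    n_watch, n_alerts = 1_000_000, 10_000   # long watchlist vs one night

    watch = SkyCoord(ra=rng.uniform(0, 360, n_watch) * u.deg,
                     dec=rng.uniform(-90, 90, n_watch) * u.deg)
    alerts = SkyCoord(ra=rng.uniform(0, 360, n_alerts) * u.deg,
                      dec=rng.uniform(-90, 90, n_alerts) * u.deg)

    # For each alert, find the nearest watchlist entry; keep matches < 2".
    idx, sep, _ = alerts.match_to_catalog_sky(watch)
    hits = sep < 2 * u.arcsec
    print(f"{hits.sum()} alerts matched the watchlist")

Timing this for increasing watchlist sizes would indicate where frequently run watchlist queries start to starve other work.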

Sherlock: what it does, how it works (Dave Young)

  • Cross-match against NED-D is throttled and cached to avoid Sherlock being black-listed by NED-D.

  • Aim to reduce the cross-match to integer-based search fields to speed up queries (see the sketch at the end of this section).

  • Roy noted that, by the end of the first year, the depth of the LSST survey will make most other surveys less relevant. How does this impact the role of Sherlock?

    • Dave Y noted that, for SNe, NED was critical for classification. LSST would help to eliminate ‘background fog’. For QUB, this will be a critical use case.

    • Roy W noted that Lasair will likely serve other applications, as well as SNe.

    • Stephen S noted that the LSST catalogue (inc. photometric redshifts) will be a huge help for objects brighter than mag. 22, though it will not help astronomers to identify those objects for which spectroscopy is interesting.

    • Data releases will also be critical if we choose to separate stellar sources from extragalactic sources.

      • Bob asked if there was information in the alert regarding stellar vs. extragalactic sources.

  • Gareth F noted the conflict between tuning for particular applications and supporting different use cases. Would it be worthwhile to produce confidence information for classification results?

    • Dave Y noted that Sherlock produces a list of all sources with which the transient has probably been associated. The top-level result from Sherlock is the ‘tip of the iceberg’. He also agrees that providing a one-off confidence value would be useful.

    • Visualisation of Sherlock classifications would make it more immediate for users to identify the source that has been crossmatched and the algorithm that has been most successful.

  • Dave M asked if there was a case for allowing power users to tune the Sherlock classifier to individual use cases.

    • Dave Y suspects users would want to write their own algorithms, but that we should simply expose the underlying algorithms to give power users the information they need.

  • Andy L noted that post-ingestion data mining was a different problem from immediate real-time classification, which is the focus for Lasair.

  • Andy L asked where the Sherlock catalogues are stored and where the processes run.

    • The Sherlock instance and catalogues for Lasair are running at ROE.

    • This instance tracks the master version, which runs at QUB.

  • Matt N noted recent issues with Sherlock, related to offsets not being correctly computed.

    • Dave Y confirms that this was fixed (around 12 months ago), though the fix hadn't been copied to the ROE instance.

    • New transients, since 12th March 2020, should be fine.
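
As an illustration of the integer-based search fields mentioned above, one common scheme maps RA/Dec to HEALPix pixel numbers so that a crossmatch becomes an integer-indexed lookup. This sketch (using healpy, with an assumed NSIDE) shows the principle; the indexing scheme Sherlock actually adopts may differ in detail:

    import numpy as np
    import healpy as hp

    NSIDE = 1024  # ~3.4 arcmin pixels; chosen to suit the match radius

    def radec_to_pixel(ra_deg, dec_deg):
        # healpy expects colatitude theta and longitude phi, in radians
        theta = np.radians(90.0 - dec_deg)
        phi = np.radians(ra_deg)
        return hp.ang2pix(NSIDE, theta, phi)

    # Store this integer alongside each catalogue row; a candidate match
    # then only scans rows sharing (or neighbouring) the same pixel id.
    print(radec_to_pixel(150.1, 2.2))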

Lasair planning process (Andy Lawrence)

  • George asked whether collaboration with other alert-broker providers had been considered.

    • Andy noted some discussion, but no firm conclusions to date.

    • Roy noted the three largest projects are ALeRCE, ANTARES, and Lasair. We are trying to collaborate with these three, plus there may be an option to work with the French effort.

    • Andy L observed, at the Community Broker meeting, that not all parties were interested in providing a full service. Some of these may wish to join one of the big three.

  • Andy noted an open question about how Lasair:ZTF and Lasair:LSST would coexist.

  • Stephen S believes technology choices need to be made within the next twelve months.

  • Dave M noted the original idea of providing a black-box filter to a community broker, to be run on the user's behalf.

    • Andy L noted the idea of providing a containerised platform, which others could take, was dropped when funding for Phase B was cut.

    • Roy noted that ANTARES is planning to support mini-brokers, taking Python code and incorporating it into their service.

  • Dave Y noted online platforms, such as Digital Ocean, which have a community who blog about how to use the platform. Possibly, Lasair could provide blogs/tutorials/code snippets to help others extend our services.

    • Roy asks if this is addressed by Jupyter Notebook.

    • Dave Y noted that Jupyter Notebook is great but limited.

    • Roy noted the open question of how to break out of the notebook.

Roy Williams, Technology Challenges for Lasair

  • SQL is not a universally understood technology

  • Semantic differences between static queries and streaming queries, e.g. related to the ordering of events/query results.

  • How do we provide a consistent filter expression approach for static vs. active queries?

    • The Strasbourg team (CDS) has published a web-based form to help people form their queries.

    • Meg S is not convinced that the CDS form is a good way to go. Perhaps better training on how to build good SQL queries would be preferable.

    • Bob noted that, for WFAU archives, it is not common to receive poor queries. Also, it was generally easy to flag queries likely to be badly formed.

    • Mark K noted the ZTF database has to support ingest alongside user queries.

    • Meg S noted that it was possible to limit what could be done in a SQL query. She also noted her comment related to user experience rather than query performance.

  • Simple queries are unlikely to be enough when dealing with billions of rows of data.

    • Data mining techniques will be needed

  • Use of storage approaches – databases vs. blob store, and which data should go where?

  • Ingestion of new data (LSST alerts) is on the critical path for the service.

  • Hardware technology choices and implications

  • What will we do for user data?

  • Service resilience has not, up until now, been considered seriously.

  • Andy L noted the need to make a decision about query-language support.

Stephen S, LSST public and private; galaxies, stars and rocks

  • Bob asked where the 12-month history of an object originated.

    • Stephen confirmed it is included in the first detection of an object.

  • Dave Y asked whether it would be worthwhile augmenting Alert Sim with a light Kafka stream for the forced sources, to save other users having to query the Prompt Products database.

    • Stephen S believed, for that to be attractive to other groups, we would need to do this very quickly.

  • The transient alert stream should focus on fluxes rather than magnitudes for detected sources (a conversion sketch appears at the end of this section).

  • For visits in the Galactic plane, the alert stream is likely to be dominated by stellar sources. We need to work out how to deal with this.

    • For example, this suggests we need a data rate of 200 Mb/s to the UK DAC.

  • There is potential to significantly reduce the database size if we only store DIAObjects in the database and put sources into blob storage.

  • Meg S noted that the solar system population is small. Would it be a significant problem to include solar system objects in the database and remove only the stellar objects?

  • The focus of Lasair is on extra-galactic transients.

  • Andy L proposes to engage Science Working Group members to determine what is/isn't useful for handling stellar objects, to ensure we capture stellar transients and outbursts.

  • Meg S asked if we had a summary of what other Community Brokers are planning to do.

    • Stephen does not believe this summary exists, but thinks it would be worthwhile to talk to others.

    • Ken will investigate what other community brokers plan to address in terms of science area.

  • Dave Y suggests that those interested in particular objects could use a watchlist to ensure the data is available in the fast database.

  • Dave M noted that having multiple databases is a common strategy for handling high-rate data flows, as demonstrated by social media providers.
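
A stream carrying fluxes rather than magnitudes costs users little, since the conversion is a one-liner; a minimal sketch, assuming fluxes arrive in microjanskys on the AB system:

    import math

    def flux_to_mag(flux_ujy):
        # AB magnitude from flux in microjanskys: m = 23.9 - 2.5 log10(f)
        return 23.9 - 2.5 * math.log10(flux_ujy)

    print(flux_to_mag(100.0))  # -> 18.9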

Ken Smith, HDD vs. SSD

  • Prompted by different experiences of using SSDs (instead of HDDs) to ingest data

  • Stephen asked what was meant by ingestion.

    • Ken noted it means reading data from a CSV file and inserting it into a database, in like-for-like tests (see the sketch after this list).

  • A difference between ZTF and LSST is that we do not need to associate sources to objects for LSST.

    • Both SSD and HDD were able to ingest 5k+ rows per second, equivalent to roughly 0.5 Bn rows per day.
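
A sketch of the like-for-like ingest test described above: read a CSV and bulk-insert into MySQL in batches, reporting rows per second. The table, columns and batch size are placeholders, not the actual test setup:

    import csv
    import time
    import mysql.connector

    def ingest(conn, csv_path, batch=1000):
        cur = conn.cursor()
        sql = ("INSERT INTO detections (objectId, ra, decl, mag) "
               "VALUES (%s, %s, %s, %s)")
        t0, n, rows = time.time(), 0, []
        with open(csv_path) as f:
            for row in csv.reader(f):
                rows.append(row)
                if len(rows) == batch:
                    cur.executemany(sql, rows)  # batched insert
                    n, rows = n + batch, []
            if rows:
                cur.executemany(sql, rows)
                n += len(rows)
        conn.commit()
        print(f"{n / (time.time() - t0):.0f} rows/s")

Running the same script against an SSD-backed and an HDD-backed database gives the like-for-like comparison.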

Roy Williams, Handling Lightcurves

  • Meg asked if the strategy is based around prioritising the material that needs to be handled urgently.

  • Stephen S asked whether the user is capable of dealing with the stream.

    • Stephen is worried the user can't get what they want from the Kafka stream (or from an email alert).

    • Roy noted the stream helps people identify the object id, so they can then look up more details in the database.

    • The alert is fairly rich for LSST, so monitoring the stream could significantly reduce the load on the database.

    • Alert is the first indication, for a user, that there is something interesting.

  • Dave Y reminded people of the blog posts and tutorials for helping people use a Kafka stream. For example, we could show people how to filter and convert the Kafka stream and let them develop from a basic template (see the sketch at the end of this section).

    • Meg noted that this is the approach that Zooniverse use.

  • George noted that typically a filter could produce hundreds or thousands of hits per night.

  • Gareth notes the intent to go with a modular approach to defining user workflows.

  • Gareth is concerned that Stephen’s presentation on forced photometry may prompt a revision to the model.

    • Stephen noted that, in the first 24 hours, there would potentially be no forced photometry. After 24 hours, people would want this information.
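
An illustrative filter template of the kind such a tutorial might provide: consume a Lasair-style Kafka topic, apply a simple cut, and print object ids for database follow-up. The broker address, topic name and schema fields are assumptions:

    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "lasair_example_topic",
        bootstrap_servers="kafka.example.org:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        auto_offset_reset="latest",
    )

    for msg in consumer:
        alert = msg.value
        # Keep only bright events; users would swap in their own cuts here.
        if alert.get("magpsf", 99) < 18.0:
            print(alert["objectId"])  # then query the database for details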

Roy Williams, Parallel Ingestion and Workflow

  • Dave M asked whether we should use Kafka for data transfers, as this is what it is designed to do (see the sketch at the end of this section).

    • Gareth agrees that is what we should do for new components.

    • Cam noted a potential risk of over-using Kafka consumers, suggesting instead the use of sinks/Lambda functions to eliminate the bookkeeping and resilience burden in Kafka consumers.

    • Cam also noted Pulsar as a competing technology to Kafka.

    • Dave Y asked if OpenStack includes components equivalent to AWS Lambda.

  • George B is concerned the database is a bottleneck.

    • Roy concurs and notes we are looking at strategies to minimise the traffic into the databases.

    • Dave M noted that Cassandra could be parallelised

  • Andy L noted Edinburgh Kafka meet-up, which discussed event-driven architectures.

    • Roy felt we didn’t want the full functionality of Kafka: we just wanted a data pipeline.

  • Dave M noted that a modular approach would help to accommodate necessary changes, such as for forced photometry.
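
If Kafka becomes the internal data bus, each pipeline stage reduces to a consume-transform-produce loop, which is also what makes the modular approach straightforward. A minimal sketch with kafka-python; topic names and the transformation are placeholders:

    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        "raw_alerts",
        bootstrap_servers="kafka.example.org:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="kafka.example.org:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for msg in consumer:
        alert = msg.value
        alert["annotated"] = True            # stand-in for real processing
        producer.send("annotated_alerts", alert)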

Gareth Williams, SQL Queries on Data Streams

  • KSQL differs from MySQL, but to a similar degree as other SQL variants differ from one another.

  • Cam noted the need for caution in creating a MySQL cache.

  • Cam asked if users would be creating queries to run on the stream. Is there a limit on how many of these we can support?

  • Cam would suggest favouring KSQL unless it proves unsuitable, especially if Kafka is the data bus (see the sketch after this list).

  • Dave M believes the web interface approach would allow people to use either, without extra work.
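
For concreteness, a hedged sketch of what favouring KSQL might look like: submitting a streaming filter to a ksqlDB server over its REST API. The server URL, stream name and columns are assumptions:

    import requests

    KSQL_SERVER = "http://ksqldb.example.org:8088"

    # A persistent streaming filter: a continuously updated derived stream.
    statement = """
      CREATE STREAM bright_alerts AS
        SELECT objectId, ra, decl, magpsf
        FROM alerts_stream
        WHERE magpsf < 18.0
        EMIT CHANGES;
    """

    resp = requests.post(
        f"{KSQL_SERVER}/ksql",
        json={"ksql": statement, "streamsProperties": {}},
    )
    resp.raise_for_status()
    print(resp.json())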

Ken Smith, Making a Super Sherlock

  • Dave M asked if it was worth considering Kafka for transferring RA/Dec data to Sherlock.

Ken Smith, Cassandra vs. MySQL

  • Citus Data has produced a distributed relational database based on PostgreSQL, similar to Qserv.

    • Cam noted that a fair portion of the code is open source.

    • He also suggested CockroachDB is potentially interesting.

  • Cam A noted that if query needs change, then your data model (in Cassandra) needs to change, and there is a risk this is not easy. This has tended to push people away from NoSQL and back to relational databases.

  • Cam A believes group-key indexing should be possible in relational database.

  • Dave Y clarified that the Partition Key is the first element of the group key, and asked whether that was a concern, given that the telescope scanning across the sky would typically lead to an imbalance in the load on the database for cross-matching.

  • Andy L asked what problem Cassandra is trying to solve:

    • Ken believes the intent of Cassandra is to distribute processing across multiple commodity nodes.

    • Ken noted that there is a replication problem, which is not solved.

    • Dave Y believes blob storage could help us tackle the scalability issues we have with MySQL (see the sketch at the end of the Storage Technologies section).

    • Ken suggests we could store light curves in Cassandra, using Object ID as the primary key (a sketch follows).
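
A minimal sketch of Ken's lightcurve-in-Cassandra suggestion: partition on objectId so that fetching a full lightcurve is a single-partition read, clustered by time within the partition. The keyspace, table and column names are illustrative:

    from cassandra.cluster import Cluster

    cluster = Cluster(["cassandra.example.org"])
    session = cluster.connect("lasair")

    session.execute("""
      CREATE TABLE IF NOT EXISTS lightcurves (
        objectid text,
        mjd double,
        fid int,
        flux double,
        fluxerr double,
        PRIMARY KEY (objectid, mjd)  -- partition on object, cluster by time
      )""")

    # Fetching a full lightcurve is then a single-partition read:
    rows = session.execute(
        "SELECT mjd, fid, flux, fluxerr FROM lightcurves "
        "WHERE objectid = %s",
        ("ZTF20abcdefg",))
    for r in rows:
        print(r.mjd, r.flux)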

Gareth F, Storage Technologies

  • George asked if the performance issue with overwriting a file was a problem, given we have a write-once, read-many workload.

    • Gareth wasn't sure. Lightcurves might need to be modified, though they were large, so the relative overhead was smaller.

    • Nigel asked if there was a risk of transferring the same information multiple times, for continuously varying objects that alert each time.

    • Stephen agreed this could be the case.

    • Nigel wondered if it would be possible to edit out the repeated data.

    • Gareth was concerned that de-duplicating that data would create an implicit serialisation, as found for ZTF.

    • Ken questioned whether this was the case, given that subsequent detections only contain the 30-day forced photometry, so it may not be such a big deal.
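
Tying together the blob-store suggestions above (Dave Y's MySQL-scalability point and the database-vs-blob question from the technology challenges): a minimal sketch that keeps only the DIAObject summary in SQL and parks the source history as a compressed, write-once JSON blob in S3-compatible object storage. boto3 and the endpoint/bucket names are assumptions:

    import gzip
    import json
    import boto3

    s3 = boto3.client("s3", endpoint_url="https://objectstore.example.org")

    def store_sources(object_id, sources):
        # Write the full source history as one write-once blob.
        blob = gzip.compress(json.dumps(sources).encode("utf-8"))
        s3.put_object(Bucket="lasair-lightcurves",
                      Key=f"sources/{object_id}.json.gz",
                      Body=blob)

    def load_sources(object_id):
        obj = s3.get_object(Bucket="lasair-lightcurves",
                            Key=f"sources/{object_id}.json.gz")
        return json.loads(gzip.decompress(obj["Body"].read()))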
