Attendees: Andy L, Roy W, Stephen S, Bob M, Meg S, Ken S, Matt N, Dave Y, Britton S, Mark H, Stelios V, Michael F, Ally H, Dave R, Gareth F, Terry S, Cam Allen (Zooniverse), Nigel Hambley

Apologies:

Notes from discussion

Stephen S, LSST public and private; galaxies, stars and rocks

  • Bob asked where the 12-month history of an object originated.

    • Stephen confirmed it was in the first detection of an object.

  • Dave Y asked whether it would be worthwhile augmenting Alert Sim with a light Kafka stream for the forced sources, to save other users having to query the Prompt Products database.

    • Stephen S believed that, for this to be attractive to other groups, we would need to do it very quickly.

  • The transient alert stream should focus on fluxes rather than magnitudes for detected sources.

  • For visits in the Galactic plane, the alert stream is likely to be dominated by stellar sources; we need to work out how to deal with this.

    • For example, this suggests we need a data rate of 200 Mb/s to the UK DAC.

  • Potential to significantly reduce database size if we only store DIAObjects in database and put sources into blob storage.

  • Meg S noted that the solar system population is small. Would it be a significant problem to include solar system objects in the database and remove only the stellar objects?

  • Focus of Lasair is on extra-galactic transients.

  • Andy L proposes to engage Science Working Group members to determine what is and isn't useful for handling stellar objects, to ensure we capture stellar transients and outbursts.

  • Meg S asked if we had a summary of what other Community Brokers are planning to do.

    • Stephen does not believe this summary exists, but thinks it would be worthwhile to talk to others.

    • Ken will investigate what other community brokers plan to address in terms of science area.

  • Dave Y suggests that those interested in particular objects could use a watchlist to ensure the data was available in the fast database.

  • Dave M noted that having multiple databases is a common strategy for handling high-rate data flows, as demonstrated by social media providers.
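Dave Y's suggested light Kafka stream for forced sources could look something like the sketch below, which only builds and serialises the messages; the field names and topic name are illustrative, not the LSST alert schema, and the actual publish step (commented out) would need a broker and a Kafka client library.

```python
import json

def make_forced_source_message(dia_object_id, midpoint_tai, psf_flux, psf_flux_err, band):
    """Build one forced-source record as it might appear on a light
    Kafka topic. Field names are illustrative, not the LSST schema."""
    return {
        "diaObjectId": dia_object_id,
        "midPointTai": midpoint_tai,   # exposure midpoint, MJD (TAI)
        "psFlux": psf_flux,            # flux rather than magnitude, per the discussion
        "psFluxErr": psf_flux_err,
        "band": band,
    }

def serialise(msg):
    # Compact JSON keeps the stream light; a production stream would
    # more likely use Avro with a schema registry.
    return json.dumps(msg, separators=(",", ":")).encode("utf-8")

# In production, something like (requires a broker and a Kafka client):
#   producer.produce("forced-sources", value=serialise(msg))
```

Consumers of such a stream would then avoid querying the Prompt Products database for every forced source.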

Ken Smith, HDD vs. SSD

  • Prompted by different experiences of using SSDs (instead of HDDs) to ingest data.

  • Stephen asked what was meant by ingestion.

    • Ken noted it meant reading data from a CSV file and inserting it into a database, as like-for-like tests.

  • Difference between ZTF and LSST is that we do not need to associate sources to objects for LSST.

    • Both SSD and HDD were able to ingest 5k+ rows per second, equivalent to 0.5 bn rows per day.
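A minimal version of the like-for-like test Ken describes is sketched below, using an in-memory SQLite database as a stand-in for MySQL (the real comparison swapped only the storage device underneath the database, so the shape of the test is the same).

```python
import csv, io, sqlite3, time

def ingest_rate(rows):
    """Time a CSV -> database bulk insert and return rows/second.
    SQLite here is a stand-in; the meeting's tests used MySQL on
    HDD vs. SSD with otherwise identical setups."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for r in rows:
        writer.writerow(r)
    buf.seek(0)  # rewind so the reader below sees the data

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE detections (objectId INTEGER, mjd REAL, flux REAL)")
    t0 = time.perf_counter()
    con.executemany("INSERT INTO detections VALUES (?, ?, ?)", csv.reader(buf))
    con.commit()
    elapsed = time.perf_counter() - t0
    n = con.execute("SELECT COUNT(*) FROM detections").fetchone()[0]
    return n / elapsed

# 5k+ rows/s sustained over a day is of order half a billion rows
rate = ingest_rate([(i, 59000.0 + i * 1e-5, 100.0 + i) for i in range(50_000)])
```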

Roy Williams, Handling Lightcurves

  • Meg asked if the strategy is based around prioritising data that needs to be handled urgently.

  • Stephen S asked whether the user is capable of dealing with the stream.

    • Stephen is worried the user can't get what they want from the Kafka stream (nor from an email alert).

    • Roy noted the stream helps people identify an object ID and then look up more details in the database.

    • The alert is fairly rich for LSST, so monitoring the stream could significantly reduce load on the database.

    • The alert is the first indication, for a user, that there is something interesting.

  • Dave Y reminded people of the blog posts and tutorials that help users work with a Kafka stream. For example, we could show people how to filter and convert the Kafka stream and let them develop the basic template further.

    • Meg noted that this is the approach that Zooniverse use.

  • George noted that typically a filter could produce hundreds or thousands of hits per night.

  • Gareth notes the intent to go with a modular approach to defining user workflows.

  • Gareth is concerned that Stephen’s presentation on forced photometry may prompt a revision to the model.

    • Stephen noted that, in the first 24 hours, there would potentially be no forced photometry. After 24 hours, people would want this information.
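The filter-template idea Dave Y and Roy describe could be sketched as below: the user supplies one small predicate, the template handles the stream, and the output is just the object IDs to look up in the database. The alert fields and thresholds here are illustrative; in production the messages would come from a Kafka consumer rather than a list.

```python
import json

def user_filter(alert):
    """Example user filter: keep bright, likely non-stellar sources.
    Field names and thresholds are illustrative only."""
    return alert.get("psFlux", 0.0) > 1000.0 and alert.get("sgscore", 1.0) < 0.5

def run_filter(raw_messages):
    # In production, raw_messages would be values polled from a Kafka
    # consumer; the filter emits only the objectIds of interest, and the
    # user then queries the database for full details.
    hits = []
    for raw in raw_messages:
        alert = json.loads(raw)
        if user_filter(alert):
            hits.append(alert["objectId"])
    return hits

stream = [
    json.dumps({"objectId": 1, "psFlux": 2500.0, "sgscore": 0.1}),
    json.dumps({"objectId": 2, "psFlux": 400.0, "sgscore": 0.1}),
    json.dumps({"objectId": 3, "psFlux": 3000.0, "sgscore": 0.9}),
]
run_filter(stream)  # → [1]
```

This matches George's point: a filter like this might yield hundreds or thousands of hits per night, each a cheap database lookup rather than a full-stream query.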

Roy Williams, Parallel Ingestion and Workflow

  • Dave M asked whether we use Kafka for data transfers, as this is what it is designed to do.

    • Gareth agrees for new components that is what we should do.

    • Cam noted a potential risk of overusing Kafka consumers; instead, sinks/Lambda functions could be used to eliminate the bookkeeping and resilience burden of Kafka consumers.

    • Cam also noted Pulsar as a competing technology to Kafka.

    • Dave Y asked if OpenStack includes components equivalent to AWS Lambda.

  • George B was concerned the database is a bottleneck.

    • Roy concurred and noted we are looking at strategies to minimise the traffic into the databases.

    • Dave M noted that Cassandra could be parallelised

  • Andy L noted Edinburgh Kafka meet-up, which discussed event-driven architectures.

    • Roy felt we didn’t want the full functionality of Kafka: we just wanted a data pipeline.

  • Dave M noted that modular approach would help to accommodate necessary changes, such as for forced photometry
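The modular, event-driven shape Dave M describes can be sketched with plain Python queues standing in for Kafka topics (or Lambda-style sinks); stage names and record fields are invented for illustration. The point is that stages only share queues, so a stage such as a future forced-photometry step can be added or swapped without touching the others.

```python
import queue
import threading

def stage(name, fn, inbox, outbox):
    """One modular pipeline stage: read from inbox, apply fn, write to
    outbox. A None item is a poison pill that shuts the stage down and
    is forwarded so downstream stages stop too."""
    def run():
        while True:
            item = inbox.get()
            if item is None:
                outbox.put(None)
                break
            outbox.put(fn(item))
    t = threading.Thread(target=run, name=name)
    t.start()
    return t

q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
t1 = stage("ingest", lambda a: {**a, "ingested": True}, q_in, q_mid)
t2 = stage("annotate", lambda a: {**a, "annotation": "SN-like"}, q_mid, q_out)

for oid in (101, 102):
    q_in.put({"objectId": oid})
q_in.put(None)  # poison pill: drain and stop the pipeline

results = []
while (item := q_out.get()) is not None:
    results.append(item)
t1.join(); t2.join()
```

With Kafka as the transport, each queue becomes a topic and each stage an independent consumer/producer, which is what makes parallel ingestion straightforward.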

Gareth Williams, SQL Queries on Data Streams

  • KSQL differs from MySQL specifically, but the differences are similar in scale to those between other SQL variants.

  • Cam noted the need for caution in creating a MySQL cache.

  • Cam asked if users would be creating queries to run on the stream. Is there a limit on how many of these we can support?

  • Cam would suggest favouring KSQL unless it proves unsuitable. Especially if Kafka is the Data Bus.

  • Dave M believes the web interface approach would allow people to use either, without extra work.

Ken Smith, Making a Super Sherlock

  • Dave M asked if it was worth considering Kafka for transferring RA/Dec data to Sherlock.

Ken Smith, Cassandra vs. MySQL

  • Citus Data has produced a distributed, relational database based on PostgreSQL. Similar to Qserv.

    • Cam noted that a fair portion of the code is open source.

    • Also suggested Cockroach DB is potentially interesting.

  • Cam A noted that if query needs change, then your data model (in Cassandra) needs to change, and there is a risk it is not easy to change it. This has tended to push people away from NoSQL and back to relational databases.

  • Cam A believes group-key indexing should be possible in relational database.

  • Dave Y clarified that the Partition Key was the first component of the group key. Is that a concern, given that the telescope scanning across the sky would typically lead to an imbalance in database load for cross-matching?

  • Andy L asked what the problem is that Cassandra is trying to solve:

    • Ken believes intent for Cassandra is to distribute processing across multiple commodity nodes.

    • Ken noted that there is a replication problem, which is not solved.

    • Dave Y believes blob storage could help us tackle the scalability issues we have with MySQL.

    • Ken suggests we could store light curves in Cassandra, using Object ID as the primary key.
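Ken's suggestion of storing light curves in Cassandra keyed on Object ID could be modelled roughly as below. The CQL and column names are illustrative, not a real Lasair schema, and the pure-Python `partition` function only mimics how Cassandra would lay the data out: one partition per object, clustered by time, so a whole light curve comes back from a single partition read.

```python
from collections import defaultdict

# Hypothetical CQL for the idea: objectId is the partition key,
# midPointTai the clustering key. Column names are illustrative.
LIGHTCURVE_TABLE = """
CREATE TABLE lightcurves (
    objectId    bigint,
    midPointTai double,
    psFlux      double,
    psFluxErr   double,
    PRIMARY KEY (objectId, midPointTai)
);
"""

def partition(points):
    """Mimic Cassandra's layout: group points by the partition key
    (objectId) and sort within each partition by the clustering key
    (midPointTai)."""
    parts = defaultdict(list)
    for p in points:
        parts[p["objectId"]].append(p)
    for rows in parts.values():
        rows.sort(key=lambda p: p["midPointTai"])
    return parts

pts = [
    {"objectId": 7, "midPointTai": 59002.1, "psFlux": 120.0},
    {"objectId": 7, "midPointTai": 59001.3, "psFlux": 110.0},
    {"objectId": 9, "midPointTai": 59001.5, "psFlux": 90.0},
]
lc = partition(pts)[7]  # one "partition read" → a time-ordered lightcurve
```

This also illustrates Cam's caveat: the layout is efficient only for queries that start from an objectId; a differently shaped query would need a different table.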

Gareth F, Storage Technologies

  • George asked if the performance issue with overwriting a file was a problem, given we have a write-once, read-many workload.

    • Gareth wasn't sure. Lightcurves might need to be modified, though they were large, so the relative overhead was lower.

    • Nigel asked if there was a risk of transferring the same information multiple times, for continuously varying objects that alerted each time.

    • Stephen agreed this could be the case.

    • Nigel wondered if it would be possible to edit out the repeated data.

    • Gareth was concerned that de-duplicating that data created an implicit serialisation, as found for ZTF.

    • Ken questioned whether this was the case, given that subsequent detections only contain the 30-day forced photometry, so it may not be such a big deal.

  • A question was raised about whether we need to store all difference images. It may be reasonable to keep only one image per object.

  • Meg believes hiding things behind Python is a good way to go; it is something astronomers are familiar with and are moving towards.

    • Gareth noted that Python could simply access an HTTP service.

    • Mark H believes the choice of technology will be transparent to the users.
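Gareth's point about Python simply wrapping an HTTP service can be sketched end to end with the standard library: a toy storage service behind a thin client function, so the astronomer calls `get_lightcurve` and never sees which backend (MySQL, Cassandra, blob store) is underneath. The endpoint path and payload shape are invented for illustration, not a real Lasair API.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy in-memory "backend"; in reality this would be whatever storage
# technology the service chooses to sit on.
LIGHTCURVES = {"42": [{"mjd": 59001.2, "psFlux": 150.0}]}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical route: /lightcurve/<objectId>
        object_id = self.path.rstrip("/").split("/")[-1]
        body = json.dumps(LIGHTCURVES.get(object_id, [])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

def get_lightcurve(object_id):
    """The thin Python wrapper a user would actually call."""
    url = f"http://127.0.0.1:{server.server_port}/lightcurve/{object_id}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

lc = get_lightcurve("42")
server.shutdown()
```

As Mark H says, swapping the storage technology behind `do_GET` would be invisible to anyone using `get_lightcurve`.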

Michael Fulton, Light Curve Classification and Features

  • Andy asked whether RAPID could be packaged up for users to run from a notebook, with the caveat that it does not necessarily classify events accurately.

    • Michael worries that traditional spectroscopic follow-up outperforms this.

    • Stephen believes that, if it is not reliable over the first ten days of a light curve, it cannot be used in the fast stream.

    • Meg asked if this was a training problem. The expectation is that ML techniques typically fail when applied to a different dataset.

    • Michael noted that RAPID had been trained for ZTF, so that was unlikely to be source of inaccuracy.

    • Meg noted that simulated ZTF data is different to real ZTF data.

      • Two questions: do you want a ML classifier? Do you want RAPID to be it?

Stelios V, LSST Science Platform

  • Dave Y asked if Nublado solves the issue of sharing notebooks. It seems nearly impossible to do so in JupyterHub.

  • Andy encourages us to make things as familiar as possible to users, meaning maximum harmonisation with the US approach – e.g. using their Jupyter service, TAP service, etc. Andy is less sure about Firefly, which seems a bit clunky.

  • Meg asked how this compared to the US DAC, given the rumour that they would move away from Firefly.

    • While there had previously been a pause on development, it looked as if Firefly was still the expected web interface.

Summary, Andy L

  • We should take a little time to reflect on the day's discussions, and then consider what comes next.

  • We could use Slack to further consider some of the topics, reconvene for another session (e.g. the Lasair telecon on the 24th), or define a smaller group to take forward key issues.

  • The decision was that key issues would be discussed at the next Lasair telecon, which would have two hours dedicated to it.

  • However, before this we must attempt to digest and summarise our notes, and produce a tightly focused decision list to be discussed at the telecon.

  • Andy would have the first go at making a wiki page with “decisions required”

  • All are welcome to continue making points in the Slack discussion

  • Decisions needed can be divided into:

    • definite decisions

    • actions on further experiments/tests

    • actions on more significant work, e.g. drawing up a Kafka-centred architecture.