...
Gareth Williams, SQL Queries on Data Streams
KSQL differs from MySQL specifically, but the level of variation is similar to that between other SQL variants.
Cam noted the need for caution in creating a MySQL cache.
Cam asked whether users would be creating queries to run on the stream, and whether there is a limit on how many of these we can support.
Cam would suggest favouring KSQL unless it proves unsuitable, especially if Kafka is the data bus.
Dave M believes the web interface approach would allow people to use either, without extra work.
Ken Smith, Making a Super Sherlock
Dave M asked if it was worth considering Kafka for transferring RA/Dec data to Sherlock.
Ken Smith, Cassandra vs. MySQL
Citus Data has produced a distributed, relational database based on PostgreSQL. Similar to Qserv.
Cam noted that a fair portion of the code is open source.
Also suggested that CockroachDB is potentially interesting.
Cam A noted that if query needs change, then your data model (in Cassandra) needs to change, and there is a risk it is not easy to change it. This has tended to push people away from NoSQL and back to relational databases.
Cam A believes group-key indexing should be possible in a relational database.
Dave Y clarified that the Partition Key was the first to be put into the group key. Is that a concern, given that a telescope scanning across the sky would typically lead to an imbalance in load on the database for cross-matching?
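To illustrate Dave Y's concern, here is a minimal Python sketch comparing two choices of partition key. It is not Cassandra's actual placement logic: `node_for` is a simplified stand-in for hash-based placement, and the cluster size, the 10-degree sky cells and the detection coordinates are all made-up assumptions.

```python
import hashlib
from collections import Counter

N_NODES = 4  # hypothetical cluster size


def node_for(partition_key: str) -> int:
    """Simplified stand-in for hash-based placement of a partition key."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_NODES


def sky_cell(ra_deg: float, dec_deg: float, cell_deg: float = 10.0) -> str:
    """Coarse sky cell label (hypothetical scheme, for illustration only)."""
    return f"{int(ra_deg // cell_deg)}_{int((dec_deg + 90.0) // cell_deg)}"


# Made-up detections from one pointing: RA 10-20 deg, Dec -35 to -30 deg.
detections = [
    (f"obj{i}", 10.0 + (i % 100) * 0.1, -35.0 + (i % 50) * 0.1)
    for i in range(1000)
]

# Partition key = sky cell: every detection in the pointing shares one key.
cell_load = Counter(node_for(sky_cell(ra, dec)) for _, ra, dec in detections)

# Partition key = object ID: keys differ, so the writes spread across nodes.
id_load = Counter(node_for(oid) for oid, _, _ in detections)

print("by sky cell :", dict(cell_load))
print("by object ID:", dict(id_load))
```

With a sky-based key, the whole pointing lands on a single node, whereas object-ID keys spread the same writes over all nodes. The trade-off is that a cone search then has to consult every node.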
Andy L asked what the problem is that Cassandra is trying to solve:
Ken believes the intent for Cassandra is to distribute processing across multiple commodity nodes.
Ken noted that there is a replication problem, which is not solved.
Dave Y believes blob storage could help us tackle the scalability issues we have with MySQL.
Ken suggests we could store light curves in Cassandra, using Object ID as the primary key.
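As a rough picture of Ken's suggestion, the sketch below models a light curve store keyed by Object ID using an in-memory Python dictionary. It is only a stand-in for a Cassandra table; the class name, the `Detection` fields and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Detection(NamedTuple):
    mjd: float   # time of observation
    band: str    # filter name
    mag: float   # measured magnitude


class LightCurveStore:
    """In-memory stand-in for a table whose primary key is the object ID."""

    def __init__(self) -> None:
        self._rows = defaultdict(list)

    def append(self, object_id: str, det: Detection) -> None:
        # A new detection is a single write addressed by object ID.
        self._rows[object_id].append(det)

    def light_curve(self, object_id: str) -> list:
        # One key lookup returns the whole light curve, ordered by time.
        return sorted(self._rows.get(object_id, []))


store = LightCurveStore()
store.append("objA", Detection(58001.5, "r", 18.2))
store.append("objA", Detection(58000.5, "g", 18.6))
store.append("objB", Detection(58000.7, "r", 20.1))

for det in store.light_curve("objA"):
    print(det)
```

Lookups by object ID are cheap in this layout. Any other access pattern, for example by sky position or by date, would need a different key or a second table, which is the risk Cam A raised.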
Gareth F, Storage Technologies
George asked if the performance issue with overwriting a file was a problem, given we have a write-once, read-many workload.
Gareth wasn’t sure. We might need to modify light curves, though they were large, so the overhead was less.
Nigel asked if there was a risk of transferring the same information multiple times, for continuously varying objects that alerted each time.
Stephen agreed this could be the case.
Nigel wondered if it would be possible to edit out the repeated data.
Gareth was concerned that de-duplicating that data created an implicit serialisation, as found for ZTF.
Ken queried whether this was the case, given that subsequent detections only contained the 30-day forced photometry, so it may not be such a big deal.
A question was raised about whether we need to store all difference images. It may be reasonable to keep only one image per object.
Meg believes hiding things behind Python is a good way to go. It is something that astronomers are familiar with and are moving towards using.
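The de-duplication idea, and the ordering dependency it introduces, can be sketched in Python as follows. The alert layout and the `det_id` field are hypothetical and do not reflect the real ZTF or LSST alert schema.

```python
def new_detections(stored_ids: set, alert: dict) -> list:
    """Return only the detections not yet stored for this object.

    The result depends on what earlier alerts have already written, so
    alerts for the same object must be handled one after another. That
    dependency is the implicit serialisation.
    """
    fresh = [d for d in alert["detections"] if d["det_id"] not in stored_ids]
    stored_ids.update(d["det_id"] for d in fresh)
    return fresh


# Two successive (made-up) alerts for one object; the second repeats history.
alert_1 = {"object_id": "objA",
           "detections": [{"det_id": 1, "mag": 18.6},
                          {"det_id": 2, "mag": 18.4}]}
alert_2 = {"object_id": "objA",
           "detections": [{"det_id": 1, "mag": 18.6},
                          {"det_id": 2, "mag": 18.4},
                          {"det_id": 3, "mag": 18.1}]}

seen = set()  # detection IDs already stored for objA
first = new_detections(seen, alert_1)
second = new_detections(seen, alert_2)
print(len(first), "stored from alert 1;", len(second), "stored from alert 2")
```

If the two alerts were processed concurrently against the same empty `seen` set, both would store detections 1 and 2, so avoiding duplicates requires per-object ordering or locking.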
Gareth noted that Python could simply access an HTTP service.
Mark H believes the choice of technology will be transparent to the users.
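Gareth's point about Python simply accessing an HTTP service could look something like the sketch below. The base URL, the `cone` endpoint and its parameters are placeholders rather than an existing Lasair API, and the `fetch` argument is only there so the function can be exercised without a network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://example.org/api"  # placeholder, not a real service


def build_url(endpoint: str, **params) -> str:
    """Compose the request URL for a (hypothetical) service endpoint."""
    return f"{BASE_URL}/{endpoint}?{urlencode(params)}"


def cone_search(ra: float, dec: float, radius_arcsec: float, fetch=None):
    """Return decoded JSON results from a cone search.

    The caller sees only a Python function; whichever storage technology
    sits behind the HTTP service stays hidden.
    """
    url = build_url("cone", ra=ra, dec=dec, radius=radius_arcsec)
    if fetch is None:
        with urlopen(url) as resp:  # real network call
            return json.load(resp)
    return json.loads(fetch(url))   # injected transport, e.g. for testing


# Offline usage with a fake transport that returns a canned JSON reply.
fake = lambda url: '[{"object_id": "objA", "sep_arcsec": 1.2}]'
print(cone_search(10.5, -30.0, 5, fetch=fake))
```

Because users only ever call `cone_search`, the backend could move between MySQL, Cassandra or file storage without changing their code.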
Michael Fulton, Light Curve Classification and Features
Andy asked whether RAPID could be packaged up for users to run from a notebook, with the caveat that it does not necessarily classify events accurately.
Michael worries that traditional spectroscopic follow-up outperforms this.
Stephen believes that, if it is not reliable over the first ten days of a light curve, it cannot be used in the fast stream.
Meg asked if this was a training problem. The expectation is that ML techniques typically fail when applied to a different dataset.
Michael noted that RAPID had been trained for ZTF, so that was unlikely to be the source of inaccuracy.
Meg noted that simulated ZTF data is different to real ZTF data.
Two questions: do you want an ML classifier? Do you want RAPID to be it?
Stelios V, LSST Science Platform
Dave Y asked if Nublado solves the issue with sharing notebooks. It would seem nearly impossible to do so in JupyterHub.
Andy encourages that we make things as familiar as possible to users, meaning maximum harmonisation with the US approach (e.g. using their Jupyter service, TAP service, etc.). Andy is less sure about Firefly, which seems a bit clunky.
Meg asked how this compared to US DAC, given rumour they would move away from DAC.
While there had previously been a pause on development, it looked as if Firefly was still the expected web interface.
Summary, Andy L
Should take a little time to reflect on the day's discussions, and then consider what next.
Could use Slack to further consider some of the topics, reconvene for another session (e.g. Lasair telecon on 24th), or define a smaller group to take forward key issues.
The decision was that key issues would be discussed at the next Lasair telecon, which would have two hours dedicated to it.
However, before this we must attempt to digest and summarise our notes, and produce a tightly focused decision list to be discussed at the telecon.
Andy would have the first go at making a wiki page with “decisions required”.
All are welcome to continue making points in the Slack discussion.
Decisions needed can be divided into:
definite decisions
actions on further experiments/tests
actions on more significant work, e.g. drawing up a Kafka-centred architecture.