Attendees: Andy L, Roy W, Stephen S, Bob M, Meg S, Ken S, Matt N, Dave Y, Britton S, Mark H, Stelios V, Michael F, Ally H, Dave R, Gareth F, Terry S, Cam Allen (Zooniverse), Nigel Hambley

Apologies:

Notes from discussion

Stephen S, LSST public and private; galaxies, stars and rocks

  • Bob asked where the 12-month history of an object originated.

    • Stephen confirmed it was in the first detection of an object.

  • Dave Y asked whether it would be worthwhile augmenting Alert Sim with a light Kafka stream for the forced sources, to save other users having to query the Prompt Products database.

    • Stephen S believed that, for this to be attractive to other groups, we would need to do it very quickly.

  • The transient alert stream should focus on fluxes rather than magnitudes for detected sources.

  • For visits in the Galactic plane, the alert stream is likely to be dominated by stellar sources; we need to work out how to deal with this.

    • For example, this suggests we need a data rate of 200 Mb/s to the UK DAC.

  • Potential to significantly reduce database size if we only store DIAObjects in database and put sources into blob storage.

  • Meg S noted that the solar system population is small. Would it be a significant problem to include solar system objects in the database and remove only the stellar objects?

  • Focus of Lasair is on extra-galactic transients.

  • Andy L proposes to engage Science Working Group members to determine what is and isn't useful for handling stellar objects, to ensure we capture stellar transients and outbursts.

  • Meg S asked if we had a summary of what other Community Brokers are planning to do.

    • Stephen does not believe this summary exists, but thinks it would be worthwhile to talk to others.

    • Ken will investigate what other community brokers plan to address in terms of science area.

  • Dave Y suggests that those interested in particular objects could use a watchlist to ensure the data was available in the fast database.

  • Dave M noted that having multiple databases is a common strategy for handling high-rate data flows, as demonstrated by social media providers.
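Dave Y's suggested light Kafka stream for forced sources could look something like the sketch below, which only builds and serialises the messages; the field names and topic name are illustrative, not the LSST alert schema, and the actual publish step (commented out) would need a broker and a Kafka client library.

```python
import json

def make_forced_source_message(dia_object_id, midpoint_tai, psf_flux, psf_flux_err, band):
    """Build one forced-source record as it might appear on a light
    Kafka topic. Field names are illustrative, not the LSST schema."""
    return {
        "diaObjectId": dia_object_id,
        "midPointTai": midpoint_tai,   # exposure midpoint, MJD (TAI)
        "psFlux": psf_flux,            # flux rather than magnitude, per the discussion
        "psFluxErr": psf_flux_err,
        "band": band,
    }

def serialise(msg):
    # Compact JSON keeps the stream light; a production stream would
    # more likely use Avro with a schema registry.
    return json.dumps(msg, separators=(",", ":")).encode("utf-8")

# In production, something like (requires a broker and a Kafka client):
#   producer.produce("forced-sources", value=serialise(msg))
```

Consumers of such a stream would then avoid querying the Prompt Products database for every forced source.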

Ken Smith, HDD vs. SSD

  • Prompted by different experiences of using SSDs (instead of HDDs) to ingest data.

  • Stephen asked what was meant by ingestion.

    • Ken noted it meant reading data from a CSV file and inserting it into a database, as like-for-like tests.

  • Difference between ZTF and LSST is that we do not need to associate sources to objects for LSST.

    • Both SSD and HDD were able to ingest 5k+ rows per second, equivalent to 0.5 bn rows per day.
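A minimal version of the like-for-like test Ken describes is sketched below, using an in-memory SQLite database as a stand-in for MySQL (the real comparison swapped only the storage device underneath the database, so the shape of the test is the same).

```python
import csv, io, sqlite3, time

def ingest_rate(rows):
    """Time a CSV -> database bulk insert and return rows/second.
    SQLite here is a stand-in; the meeting's tests used MySQL on
    HDD vs. SSD with otherwise identical setups."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for r in rows:
        writer.writerow(r)
    buf.seek(0)  # rewind so the reader below sees the data

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE detections (objectId INTEGER, mjd REAL, flux REAL)")
    t0 = time.perf_counter()
    con.executemany("INSERT INTO detections VALUES (?, ?, ?)", csv.reader(buf))
    con.commit()
    elapsed = time.perf_counter() - t0
    n = con.execute("SELECT COUNT(*) FROM detections").fetchone()[0]
    return n / elapsed

# 5k+ rows/s sustained over a day is of order half a billion rows
rate = ingest_rate([(i, 59000.0 + i * 1e-5, 100.0 + i) for i in range(50_000)])
```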

Roy Williams, Handling Lightcurves

  • Meg asked if the strategy is based around prioritising data that needs to be handled urgently.

  • Stephen S asked whether the user is capable of dealing with the stream.

    • Stephen is worried the user can't get what they want from the Kafka stream (nor from an email alert).

    • Roy noted the stream helps people identify an object ID and then look up more details in the database.

    • The alert is fairly rich for LSST, so monitoring the stream could significantly reduce load on the database.

    • The alert is the first indication, for a user, that there is something interesting.

  • Dave Y reminded people of the blog posts and tutorials that help users work with a Kafka stream. For example, we could show people how to filter and convert the Kafka stream and let them develop the basic template further.

    • Meg noted that this is the approach that Zooniverse use.

  • George noted that typically a filter could produce hundreds or thousands of hits per night.

  • Gareth notes the intent to go with a modular approach to defining user workflows.

  • Gareth is concerned that Stephen’s presentation on forced photometry may prompt a revision to the model.

    • Stephen noted that, in the first 24 hours, there would potentially be no forced photometry. After 24 hours, people would want this information.
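The filter-template idea Dave Y and Roy describe could be sketched as below: the user supplies one small predicate, the template handles the stream, and the output is just the object IDs to look up in the database. The alert fields and thresholds here are illustrative; in production the messages would come from a Kafka consumer rather than a list.

```python
import json

def user_filter(alert):
    """Example user filter: keep bright, likely non-stellar sources.
    Field names and thresholds are illustrative only."""
    return alert.get("psFlux", 0.0) > 1000.0 and alert.get("sgscore", 1.0) < 0.5

def run_filter(raw_messages):
    # In production, raw_messages would be values polled from a Kafka
    # consumer; the filter emits only the objectIds of interest, and the
    # user then queries the database for full details.
    hits = []
    for raw in raw_messages:
        alert = json.loads(raw)
        if user_filter(alert):
            hits.append(alert["objectId"])
    return hits

stream = [
    json.dumps({"objectId": 1, "psFlux": 2500.0, "sgscore": 0.1}),
    json.dumps({"objectId": 2, "psFlux": 400.0, "sgscore": 0.1}),
    json.dumps({"objectId": 3, "psFlux": 3000.0, "sgscore": 0.9}),
]
run_filter(stream)  # → [1]
```

This matches George's point: a filter like this might yield hundreds or thousands of hits per night, each a cheap database lookup rather than a full-stream query.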

Roy Williams, Parallel Ingestion and Workflow

  • Dave M asked whether we use Kafka for data transfers, as this is what it is designed to do.

    • Gareth agrees for new components that is what we should do.

    • Cam noted a potential risk of overusing Kafka consumers; instead, sinks/Lambda functions could be used to eliminate the bookkeeping and resilience burden of Kafka consumers.

    • Cam also noted Pulsar as a competing technology to Kafka.

    • Dave Y asked if OpenStack includes components equivalent to AWS Lambda.

  • George B was concerned the database is a bottleneck.

    • Roy concurred and noted we are looking at strategies to minimise the traffic into the databases.

    • Dave M noted that Cassandra could be parallelised

  • Andy L noted Edinburgh Kafka meet-up, which discussed event-driven architectures.

    • Roy felt we didn’t want the full functionality of Kafka: we just wanted a data pipeline.

  • Dave M noted that modular approach would help to accommodate necessary changes, such as for forced photometry
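The modular, event-driven shape Dave M describes can be sketched with plain Python queues standing in for Kafka topics (or Lambda-style sinks); stage names and record fields are invented for illustration. The point is that stages only share queues, so a stage such as a future forced-photometry step can be added or swapped without touching the others.

```python
import queue
import threading

def stage(name, fn, inbox, outbox):
    """One modular pipeline stage: read from inbox, apply fn, write to
    outbox. A None item is a poison pill that shuts the stage down and
    is forwarded so downstream stages stop too."""
    def run():
        while True:
            item = inbox.get()
            if item is None:
                outbox.put(None)
                break
            outbox.put(fn(item))
    t = threading.Thread(target=run, name=name)
    t.start()
    return t

q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
t1 = stage("ingest", lambda a: {**a, "ingested": True}, q_in, q_mid)
t2 = stage("annotate", lambda a: {**a, "annotation": "SN-like"}, q_mid, q_out)

for oid in (101, 102):
    q_in.put({"objectId": oid})
q_in.put(None)  # poison pill: drain and stop the pipeline

results = []
while (item := q_out.get()) is not None:
    results.append(item)
t1.join(); t2.join()
```

With Kafka as the transport, each queue becomes a topic and each stage an independent consumer/producer, which is what makes parallel ingestion straightforward.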

Gareth Williams, SQL Queries on Data Streams

  • KSQL differs from MySQL specifically, but the differences are similar in scale to those between other SQL variants.

  • Cam noted the need for caution in creating a MySQL cache.

  • Cam asked if users would be creating queries to run on the stream. Is there a limit on how many of these we can support?

  • Cam would suggest favouring KSQL unless it proves unsuitable. Especially if Kafka is the Data Bus.

  • Dave M believes the web interface approach would allow people to use either, without extra work.

Ken Smith, Making a Super Sherlock

  • Dave M asked if it was worth considering Kafka for transferring RA/Dec data to Sherlock.

Ken Smith, Cassandra vs. MySQL

  • Citus Data has produced a distributed, relational database based on PostgreSQL. Similar to Qserv.

    • Cam noted that a fair portion of the code is open source.

    • Also suggested Cockroach DB is potentially interesting.

  • Cam A noted that if query needs change, then your data model (in Cassandra) needs to change, and there is a risk it is not easy to change it. This has tended to push people away from NoSQL and back to relational databases.

  • Cam A believes group-key indexing should be possible in relational database.

  • Dave Y clarified that the Partition Key was the first component of the group key. Is that a concern, given that the telescope scanning across the sky would typically lead to an imbalance in database load for cross-matching?

  • Andy L asked what the problem is that Cassandra is trying to solve:

    • Ken believes intent for Cassandra is to distribute processing across multiple commodity nodes.

    • Ken noted that there is a replication problem, which is not solved.

    • Dave Y believes blob storage could help us tackle the scalability issues we have with MySQL.

    • Ken suggests we could store light curves in Cassandra, using Object ID as the primary key.
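Ken's suggestion of storing light curves in Cassandra keyed on Object ID could be modelled roughly as below. The CQL and column names are illustrative, not a real Lasair schema, and the pure-Python `partition` function only mimics how Cassandra would lay the data out: one partition per object, clustered by time, so a whole light curve comes back from a single partition read.

```python
from collections import defaultdict

# Hypothetical CQL for the idea: objectId is the partition key,
# midPointTai the clustering key. Column names are illustrative.
LIGHTCURVE_TABLE = """
CREATE TABLE lightcurves (
    objectId    bigint,
    midPointTai double,
    psFlux      double,
    psFluxErr   double,
    PRIMARY KEY (objectId, midPointTai)
);
"""

def partition(points):
    """Mimic Cassandra's layout: group points by the partition key
    (objectId) and sort within each partition by the clustering key
    (midPointTai)."""
    parts = defaultdict(list)
    for p in points:
        parts[p["objectId"]].append(p)
    for rows in parts.values():
        rows.sort(key=lambda p: p["midPointTai"])
    return parts

pts = [
    {"objectId": 7, "midPointTai": 59002.1, "psFlux": 120.0},
    {"objectId": 7, "midPointTai": 59001.3, "psFlux": 110.0},
    {"objectId": 9, "midPointTai": 59001.5, "psFlux": 90.0},
]
lc = partition(pts)[7]  # one "partition read" → a time-ordered lightcurve
```

This also illustrates Cam's caveat: the layout is efficient only for queries that start from an objectId; a differently shaped query would need a different table.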

Gareth F, Storage Technologies

  • George asked if the performance issue with overwriting a file was a problem, given we have a write-once, read-many workload.

    • Gareth wasn't sure. Lightcurves might need to be modified, though they were large, so the relative overhead was lower.

    • Nigel asked if there was a risk of transferring the same information multiple times, for continuously varying objects that alerted each time.

    • Stephen agreed this could be the case.

    • Nigel wondered if it would be possible to edit out the repeated data.

    • Gareth was concerned that de-duplicating that data created an implicit serialisation, as found for ZTF.

    • Ken questioned whether this was the case, given that subsequent detections only contain the 30-day forced photometry, so it may not be such a big deal.

  • A question was raised about whether we need to store all difference images. It may be reasonable to keep only one image per object.

  • Meg believes hiding things behind Python is a good way to go; it is something astronomers are familiar with and are moving towards.

    • Gareth noted that Python could simply access an HTTP service.

    • Mark H believes the choice of technology will be transparent to the users.
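Gareth's point about Python simply wrapping an HTTP service can be sketched end to end with the standard library: a toy storage service behind a thin client function, so the astronomer calls `get_lightcurve` and never sees which backend (MySQL, Cassandra, blob store) is underneath. The endpoint path and payload shape are invented for illustration, not a real Lasair API.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy in-memory "backend"; in reality this would be whatever storage
# technology the service chooses to sit on.
LIGHTCURVES = {"42": [{"mjd": 59001.2, "psFlux": 150.0}]}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical route: /lightcurve/<objectId>
        object_id = self.path.rstrip("/").split("/")[-1]
        body = json.dumps(LIGHTCURVES.get(object_id, [])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

def get_lightcurve(object_id):
    """The thin Python wrapper a user would actually call."""
    url = f"http://127.0.0.1:{server.server_port}/lightcurve/{object_id}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

lc = get_lightcurve("42")
server.shutdown()
```

As Mark H says, swapping the storage technology behind `do_GET` would be invisible to anyone using `get_lightcurve`.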

Michael Fulton, Light Curve Classification and Features

  • Andy asked whether RAPID could be packaged up for users to run from a notebook, with the caveat that it does not necessarily classify events accurately.

    • Michael worries that traditional spectroscopic follow-up outperforms this.

    • Stephen believes that, if it is not reliable over the first ten days of a light curve, it cannot be used in the fast stream.

    • Meg asked if this was a training problem. The expectation is that ML techniques typically fail when applied to a different dataset.

    • Michael noted that RAPID had been trained for ZTF, so that was unlikely to be source of inaccuracy.

    • Meg noted that simulated ZTF data is different to real ZTF data.

      • Two questions: do you want a ML classifier? Do you want RAPID to be it?

Stelios V, LSST Science Platform

  • Dave Y asked if Nublado solves the issue of sharing notebooks. It seems nearly impossible to do so in JupyterHub.

  • Andy encourages us to make things as familiar as possible to users, meaning maximum harmonisation with the US approach – e.g. using their Jupyter service, TAP service, etc. Andy is less sure about Firefly, which seems a bit clunky.

  • Meg asked how this compared to the US DAC, given the rumour that they would move away from Firefly.

    • While there had previously been a pause on development, it looked as if Firefly was still the expected web interface.

Summary, Andy L

  • We should take a little time to reflect on the day's discussions, and then consider what comes next.

  • We could use Slack to further consider some of the topics, reconvene for another session (e.g. the Lasair telecon on the 24th), or define a smaller group to take forward key issues.

  • The decision was that key issues would be discussed at the next Lasair telecon, which would have two hours dedicated to it.

  • However, before this we must attempt to digest and summarise our notes, and produce a tightly focused decision list to be discussed at the telecon.

  • Andy would have the first go at making a wiki page with “decisions required”

  • All are welcome to continue making points in the Slack discussion

  • Decisions needed can be divided into:

    • definite decisions

    • actions on further experiments/tests

    • actions on more significant work, e.g. drawing up a Kafka-centred architecture.