Why are we, at Soroco, inspired by astronomers - the OG computer hackers? [Part II]

Rohan Murty

Diving deeper into ZTF

Of all the projects we have come across in astronomy, we see the strongest parallel between the Zwicky Transient Facility (ZTF) and Scout. ZTF is, in essence, Scout for the night sky; or Scout is ZTF for the enterprise. Both systems span multiple areas of computing and, at their heart, solve a similar problem: how do you find faint patterns in noisy observational data at scale?

ZTF is an automated system of telescopes at Palomar/Caltech that finds transients (such as gamma-ray bursts, comets, etc.) and generates roughly 4 TB of data per night (assuming 100 observational nights in a year, that is about 400 TB per year). ZTF consists of a base platform that collects, cleans, and stores the data. The data is then processed through a series of successive pipelines that refine it and find patterns. The processed data, rich with possibilities, is then extended to address multiple astroinformatics questions.
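As a rough mental model of that staged flow, here is a minimal sketch in Python. The stage names, data shapes, and quality cut are illustrative assumptions, not ZTF's actual modules:

```python
# Hypothetical staged pipeline: each stage refines the previous stage's output.
from typing import Callable, Dict, List

def ingest(raw_exposures: List[Dict]) -> List[Dict]:
    """Collect raw exposures from the telescope into the base platform."""
    return raw_exposures

def clean(exposures: List[Dict]) -> List[Dict]:
    """Calibrate and drop frames with obvious instrumental problems."""
    return [e for e in exposures if e.get("quality", 0.0) > 0.5]

def detect_candidates(exposures: List[Dict]) -> List[Dict]:
    """Compare against reference imagery and emit candidate 'alerts'."""
    return [{"alert_id": i, **e} for i, e in enumerate(exposures)]

def run_pipeline(raw_exposures: List[Dict]) -> List[Dict]:
    stages: List[Callable[[List[Dict]], List[Dict]]] = [ingest, clean, detect_candidates]
    data = raw_exposures
    for stage in stages:
        data = stage(data)
    return data

print(run_pipeline([{"quality": 0.9}, {"quality": 0.2}]))
# -> [{'alert_id': 0, 'quality': 0.9}]
```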

At its heart, ZTF is meant to find new patterns and compare them against previously known discoveries to ascertain whether a newly found pattern is valid. Conceptually, this is an example of what ZTF does:

Source: ZTF
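Under the hood, this kind of comparison is essentially difference imaging: subtract a reference image of the same patch of sky from the new exposure, and anything left above the noise floor is a candidate transient. A toy sketch with NumPy (the thresholding below is a simplified stand-in, not ZTF's actual image-subtraction algorithm):

```python
import numpy as np

def find_transient_candidates(new_image: np.ndarray,
                              reference_image: np.ndarray,
                              n_sigma: float = 5.0) -> np.ndarray:
    """Return pixel coordinates where the new image brightens significantly
    relative to the reference (a toy stand-in for real image subtraction)."""
    difference = new_image - reference_image
    noise = np.std(difference)
    # Keep only pixels that deviate by more than n_sigma from the reference.
    candidate_mask = difference > n_sigma * noise
    return np.argwhere(candidate_mask)

# Example: a 'new' frame identical to the reference except one bright pixel.
rng = np.random.default_rng(0)
reference = rng.normal(100.0, 1.0, size=(64, 64))
new = reference.copy()
new[32, 32] += 50.0                                   # injected transient
print(find_transient_candidates(new, reference))      # -> [[32 32]]
```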

Once a pattern has been discovered, ZTF classifies the new pattern, or 'alert', into bins such as 'variable star' or 'false detection'. Here is a snapshot of how ZTF classifies light curves, i.e., observations. Think of a light curve as a particular hash or signature of an astronomical phenomenon. Here is an example of a light curve.

These light curves are classified using a combination of machine learning and deep learning. Here is a schematic of how ZTF classifies light curves.

Classification uses supervised learning algorithms, framing the problem as an optimization that minimizes the gap between a prediction and the ground-truth observation. But why use learning algorithms here at all? Besides being voluminous, light-curve data tends to be unevenly sampled, incomplete, and affected by biases (presumably from the instruments). Standard time-series analysis may therefore prove insufficient, and this is precisely where learning algorithms shine: a whole body of prior work has demonstrated that they do well on this class of problems.
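To make the idea concrete, here is a minimal, hypothetical sketch of feature-based light-curve classification with scikit-learn. The features, labels, and synthetic data are illustrative stand-ins; ZTF's real classifiers use far richer feature sets and models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def light_curve_features(times: np.ndarray, mags: np.ndarray) -> np.ndarray:
    """Summary statistics that do not require evenly spaced samples."""
    return np.array([
        np.ptp(mags),                # amplitude (max - min brightness)
        np.median(mags),             # typical brightness
        np.std(mags),                # scatter
        np.median(np.diff(times)),   # typical sampling gap
    ])

# Toy training set: label 1 = "variable star", label 0 = "non-variable".
rng = np.random.default_rng(42)
X, y = [], []
for label in (0, 1):
    for _ in range(200):
        t = np.sort(rng.uniform(0, 100, size=50))     # unevenly sampled times
        mags = rng.normal(18.0, 0.05, size=50)
        if label:
            mags = mags + np.sin(2 * np.pi * t / 7.0)  # periodic variability
        X.append(light_curve_features(t, mags))
        y.append(label)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Classify a new, unevenly sampled light curve.
t_new = np.sort(rng.uniform(0, 100, size=50))
m_new = rng.normal(18.0, 0.05, size=50) + np.sin(2 * np.pi * t_new / 7.0)
print(clf.predict([light_curve_features(t_new, m_new)]))   # -> [1]
```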

Once a pattern is classified, ZTF can run several different pipelines to further validate the specific bin the event has been classified into. For example, DeepStreaks is a component of the ZTF pipeline used to identify streaking near-Earth objects (NEOs), such as comets. Here is a high-level decision tree, with sample results, for how DeepStreaks decides whether a candidate pattern is a plausible NEO, a non-NEO event, or noise.

Source: Matthew Graham, ZTF & Caltech
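Conceptually, that decision tree amounts to combining scores from several groups of classifiers and only promoting a candidate when all of them agree. A hedged sketch (the group names and thresholds below are illustrative placeholders, not the actual DeepStreaks configuration):

```python
from typing import Dict

# Hypothetical classifier groups and cut-offs:
#   "rb" - real vs. bogus detection
#   "sl" - streak-like vs. not
#   "kd" - keep vs. drop after further vetting
THRESHOLDS: Dict[str, float] = {"rb": 0.5, "sl": 0.5, "kd": 0.5}

def triage(scores: Dict[str, float]) -> str:
    """Promote a candidate only if every classifier group clears its threshold."""
    if all(scores[name] >= cut for name, cut in THRESHOLDS.items()):
        return "plausible NEO"
    if scores["rb"] >= THRESHOLDS["rb"]:
        return "real but non-NEO event"
    return "noise / bogus detection"

print(triage({"rb": 0.9, "sl": 0.8, "kd": 0.7}))   # -> plausible NEO
print(triage({"rb": 0.9, "sl": 0.2, "kd": 0.7}))   # -> real but non-NEO event
```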

Finally, all of these components come together in Tails, the world's first deep-learning-based framework to assist in the discovery of comets. Tails is built on top of the base data-gathering platform.

Source: Tails: Chasing Comets with the Zwicky Transient Facility and Deep Learning, Dmitry A. Duev, NeurIPS 2020

The Tails architecture, which employs an EfficientDet-D0-based network
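As a sketch of what the inference step over image cutouts might look like (the `model.predict` interface, output format, and threshold here are hypothetical placeholders, not the actual Tails API):

```python
import numpy as np

def detect_comets(cutouts: np.ndarray, model, score_threshold: float = 0.6):
    """Run a hypothetical EfficientDet-style detector over image cutouts and
    keep detections whose comet-likeness score clears the threshold."""
    detections = []
    for i, cutout in enumerate(cutouts):
        # Assumed interface: model.predict returns (boxes, scores) per cutout.
        boxes, scores = model.predict(cutout[np.newaxis, ...])
        for box, score in zip(boxes, scores):
            if score >= score_threshold:
                detections.append({"cutout": i, "box": box, "score": float(score)})
    return detections
```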

Tails has been online since August 2020 and produces between 10 and 20 NEO candidates each night. Let us place this achievement in historical context. Since the first Homo sapiens, humans have looked up at the night sky and wondered about our place in the universe. That very act has been a source of inspiration for religion, art, science, literature, and pretty much everything humankind has done. More specifically, cave art from 40,000 years ago reveals that the ancients tracked astronomical phenomena such as comet strikes and planetary shifts. What we see today with ZTF is an example of how this very old pursuit of humankind has now largely been automated through advances in contemporary computing.

Fritz software platform

All of these advances have culminated in the ZTF team open-sourcing their underlying extensible data platform, Fritz. In many regards, Fritz and the entire ZTF effort echo the architecture, thinking, and design behind how we at Soroco are building the Scout platform.

The point is this: through the lens of ZTF alone, we can see the incredible range of expertise that the ZTF team of astronomers and engineers has had to develop to do its scientific work. Signal processing, computer vision, deep learning, machine learning, clustering algorithms, infrastructure, storage, databases, API design, parallel processing, networking, and operating systems; plus architecture, system design, and system integration on top of all that. Whew! That is literally an entire undergraduate computer science curriculum's worth of skillsets rolled into one team!

Think about that. When was the last time you came across a software product or project, built by a small team, that spanned so many different areas of computing? At Soroco, whenever we are confronted with technical challenges, we remind ourselves of what these ninja teams in astronomy accomplish; that humbles us and spurs us on.

If you enjoy reading this article and want to work on similar problems, apply here and come work with us!

Reflections

Some computer science purists may argue that much of this is about applying technology rather than building 'new' technology. We view that distinction as an irrelevant barrier. What astronomers have shown us, time and time again, is a focus on achieving the end outcome using computation and solving any and every problem that comes their way. It is precisely this confluence of different skills, technologies spanning the stack, and collaboration between physics and computer science that births new systems and advances what software can do. In several cases these teams may have applied existing algorithms and technologies, but they still had to figure out how to integrate disparate components, which components to pick, and how to handle scale, performance, latency, and accuracy. And in some instances they have had to solve hard computing problems on their own, without waiting for computer scientists to solve and publish them.

Therefore, astronomers have had no choice but to mature into excellent computer scientists and engineers themselves. They have had to design, engineer, and solve their way to actually doing their science. We believe that, in terms of skillsets, astronomers often represent a superset of many computer scientists and certainly of most computer engineers. The same is likely true of several physicists (see our friend Jacob Bourjaily's fascinating work on the computational efficiency of Feynman diagrams) and computational biologists, among others. And that, really, is the point of all this. At Soroco we are always humbled by the complexity, scale, and difficulty of the problems these scientists are solving. The more we build our own platforms, the more we come to appreciate the grit, depth, and diversity of expertise of these scientists. Hence, our own approach to recruiting has, from the very beginning, looked beyond just computer scientists. We value computer scientists very much, but we also value astronomers, biologists, and physicists (by the way, interested in solving similar hard problems with us? PhD or not, apply here!). We see them as incredibly versatile yet practical engineers who understand the trade-offs that need to be made when building production systems.

It must be evident by now that this is how we see all astronomers.

When we think of building teams and recruiting talent, we consider scientists to be first-class engineers as well. So, if you are a scientist dealing with lots of data and computation, please email us! We'd love to work with you. And if you are not a scientist but want to solve and engineer the kinds of scale problems that astronomers solve, then get in touch with us! We'd love to chat.

Crafted with ❤️ by Rohan

Appendix (for even more fun reading)

In many ways our entire company was inspired by Shri Kulkarni's vision for computational astronomy. Shri, now an advisor to Soroco, is the George Ellery Hale Professor of Astronomy at Caltech. In 2013 a couple of us attended Shri's colloquium talk at the MIT physics department, where he outlined the Palomar Transient Factory (PTF) project, ZTF's predecessor. He showed how he and a group of collaborators were using computer science and electrical engineering to accelerate the scale and pace of astronomical discovery. It was this pivotal talk, which Shri subsequently re-titled "automating the discovery of the universe", that led to the formation of Soroco, because we kept asking ourselves: surely no problem in the enterprise can be harder than automating one of humankind's oldest obsessions (i.e., with the night sky)? Through Shri I have gained a deeper appreciation for the strengths of astronomers and their evolution into first-class computer scientists.

But I first understood the point Shri was making back in the mid-2000s, when I had an opportunity to meet the legendary Jim Gray. At the time, as a first-year graduate student in computer science, I excitedly described my work on TCP variants (which were in vogue in 2005) and on shipping large volumes of data across long distances (via high bandwidth-delay-product links). I hoped to impress Jim with the problem I was working on. But Jim was not easily impressed. He argued that even the fastest link with the best TCP variant could not satiate astronomy's needs in terms of scale. His point was simply this: astronomy is, in many regards, the final frontier for data and scale.

We had a fascinating discussion on the scale of data and the challenges astronomers face, which, according to Jim, were nothing like what we computer scientists faced at the time. He pointed to the Sloan Digital Sky Survey (SDSS) as an illustration of the scale and complexity of data astronomers deal with every day. As an example, Jim noted that even the fastest network link at the time would not suffice for the top 10 telescopes to ship the volume of data they generated each day. Hence astronomers were shipping hard drives through the traditional postal system (à la a SIGCOMM paper on the postal system I had read as an undergraduate, from Randy Wang and co. at Princeton). He then described the computing challenges in storing large volumes of data, processing them, and so on. Most of these challenges did not have clear solutions, and yet astronomers kept making progress on their own, without necessarily relying on computer science to move first. Astronomers just built and hacked their way through these problems. Hence, IMO, astronomers are the OG hackers!
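Jim's argument is easy to redo as back-of-envelope arithmetic. The data volume and link speeds below are illustrative assumptions (reusing the ZTF-scale figure from earlier, not numbers from that conversation):

```python
# How long does it take to move one night's worth of data (~4 TB) over a link?
data_bits = 4.0e12 * 8            # 4 TB expressed in bits

for name, gbps in [("100 Mbps", 0.1), ("1 Gbps", 1.0), ("10 Gbps", 10.0)]:
    hours = data_bits / (gbps * 1e9) / 3600
    print(f"{name:>8}: {hours:5.1f} hours of sustained transfer")

# A courier with a box of hard drives takes roughly a day regardless of volume,
# so shipping disks wins as volumes grow; hence the postal system.
```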

If you enjoy reading this article and want to work on similar problems, apply here and come work with us!

