Introduction
Modern astronomical surveys are growing ever larger. In the quest to answer the fundamental cosmological questions (What is dark matter? What is dark energy? How fast is the Universe expanding, and why?), ever more detailed and extensive surveys are needed. This results in very large amounts of data that need to be organized, analyzed, and learned from.
A look at several recent and currently planned optical astronomical surveys shows that they are always at the edge of current technical capabilities. For example, the Sloan Digital Sky Survey (SDSS), which began its data releases in 2003, contains 1.2 billion objects and 150 GB of data in its latest Data Release 14. The European Space Agency's Gaia Data Release 2 (DR2) contains about 1.8 billion objects (about 400 GB). LSST (starting in 2022), which will observe close to 20 billion objects about 1,000 times each on average, is projected to amass up to 80 PB of data over its planned 10-year mission.
Astronomical surveys typically result in astronomical catalogs: databases in which rows correspond to the observed objects (stars and galaxies) and columns contain their measured properties (brightness, position, shape, etc.). Handling catalogs of this scale requires different approaches than those offered by typical relational databases.
In collaboration with the Astronomy Department of the University of Washington, we developed a system for handling and cross-matching large astronomical catalogs called Astronomical Extensions for Spark (AXS; documentation and source code).
A fast algorithm for cross-matching astronomical catalogs
One of the operations astronomers analyzing catalog data perform most often is catalog cross-matching: finding pairs of measurements from two (or more) catalogs that correspond to the same object. This is usually done by spatially joining two catalogs based on objects' RA (right ascension) and Dec (declination) equatorial coordinates: objects less than some chosen distance apart (e.g., 2 arc-seconds) are considered a match.
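To illustrate the match condition, here is a plain-Python sketch (these are hypothetical helpers, not part of AXS; the haversine formula is one common, numerically stable way to compute the great-circle separation at small angles):

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation between two (RA, Dec) points, in arc-seconds.

    Inputs are in degrees. Uses the haversine formula, which stays
    numerically stable for the tiny separations typical of cross-matching.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra, d_dec = ra2 - ra1, dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600.0

def is_match(ra1, dec1, ra2, dec2, radius_arcsec=2.0):
    """True if the two positions lie within the match radius."""
    return angular_sep_arcsec(ra1, dec1, ra2, dec2) <= radius_arcsec
```

Note that a difference in RA corresponds to a smaller angular distance at high declinations (the cos(dec) factor above), which a naive coordinate comparison would miss.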
Different approaches have traditionally been used for performing such spatial joins. The basic question is how to optimally organize the data to enable quick searches and joins. HEALPix (Hierarchical Equal Area isoLatitude Pixelization of the sphere; Gorski et al. 1999) and HTM (Hierarchical Triangular Mesh) are the two indexing schemes traditionally used most often. They organize the sky into meshes of varying granularity. Another approach, popularized by the late Jim Gray (Gray et al. 2002 and Gray et al. 2007), is the zones algorithm, which divides the sky into horizontal stripes called zones.
We found the zones approach to be the most appropriate for modern, distributed, shared-nothing architectures and massive parallelization. In AXS, data is held in Parquet files, a distributed (spread over many machines) and columnar (data is organized by columns, not rows, for quicker handling) file format with basic indexing capabilities (parts of files, and whole files, can be skipped when searching for data). Furthermore, data inside AXS catalogs (Parquet files) is bucketed by zone (meaning that rows belonging to different zones are physically separated into different files) and sorted by zone and RA coordinate. Hence, any join operation can be split into several independent sub-joins involving only pairs of files containing the same zones, and these sub-joins can be easily parallelized. They are independent in part because data near zone borders is duplicated (during the data preparation process) into the neighboring zone, so no data movement between zones is needed during the joins.
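The zone bookkeeping can be sketched in plain Python as follows (an illustration, not AXS code; the zone height here is an assumption, and the actual value and duplication logic in AXS may differ):

```python
import math

# Assumed zone height for illustration: one arc-minute per zone.
ZONE_HEIGHT_DEG = 1.0 / 60.0

def zone_of(dec_deg, zone_height=ZONE_HEIGHT_DEG):
    """Map a declination (degrees, -90..+90) to an integer zone id."""
    return int(math.floor((dec_deg + 90.0) / zone_height))

def zones_for_row(dec_deg, epsilon_deg, zone_height=ZONE_HEIGHT_DEG):
    """Zones this row should be written to during data preparation:
    its own zone, plus a duplicate in a neighboring zone when the row
    lies within epsilon (the match radius) of a zone border."""
    z = zone_of(dec_deg, zone_height)
    zones = [z]
    offset = (dec_deg + 90.0) % zone_height
    if offset < epsilon_deg:
        zones.append(z - 1)  # close to the lower border
    if zone_height - offset < epsilon_deg:
        zones.append(z + 1)  # close to the upper border
    return zones
```

Because border rows are duplicated this way up front, a join on equal zone ids never needs to look across zone boundaries at query time.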
AXS is based on Apache Spark, a general distributed data processing engine. We implemented an epsilon-join (Silva et al. 2010) algorithm in the Spark SQL module to enable the required fast spatial joins over bucketed Parquet files. In an epsilon join, a query similar to this one:
SELECT * FROM cat1 c1 JOIN cat2 c2
  ON c1.zone = c2.zone
  AND c1.ra BETWEEN c2.ra - e AND c2.ra + e
  AND distance(c1.ra, c1.dec, c2.ra, c2.dec) <= e
results in an operation where, for each row from c1, the distance function is evaluated only for the rows of c2 inside a moving window defined by the BETWEEN condition.
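Outside of Spark, the moving-window idea can be sketched in plain Python (a simplified single-zone illustration, not the AXS implementation; both inputs are assumed sorted by RA, and RA wrap-around at 0/360 degrees is ignored):

```python
def epsilon_join(zone1, zone2, eps, distance):
    """Join two lists of (ra, dec) rows from the same zone, both sorted by RA.

    For each row of zone1, only zone2 rows whose RA lies within +/- eps
    form a candidate window; the (more expensive) distance check runs
    only on those candidates. Runs in roughly O(n + m + matches) time
    instead of the O(n * m) of a naive cross join.
    """
    matches = []
    start = 0  # left edge of the moving window into zone2
    for ra1, dec1 in zone1:
        # advance the window past zone2 rows that can no longer match
        while start < len(zone2) and zone2[start][0] < ra1 - eps:
            start += 1
        i = start
        while i < len(zone2) and zone2[i][0] <= ra1 + eps:
            ra2, dec2 = zone2[i]
            if distance(ra1, dec1, ra2, dec2) <= eps:
                matches.append(((ra1, dec1), (ra2, dec2)))
            i += 1
    return matches
```

Because the window's left edge only ever moves forward, each zone2 row is skipped over at most once across the whole join.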
Cross-matching results
We measured AXS' cross-matching performance on a single large machine. We cross-matched the SDSS catalog (800 million rows), the Gaia DR2 catalog (1.8 billion rows), the AllWISE catalog (800 million rows), and the ZTF catalog (3.3 billion rows) in different combinations, in warm-cache (data partially cached in memory) and cold-cache (all of the data read from disk) scenarios. The results show that, using our zones-based method, AXS is able to cross-match Gaia DR2 with AllWISE, yielding 340 million rows, in only 35 seconds (warm cache) or 226 seconds (cold cache), using 24 processes. The Gaia-SDSS cross-match, yielding 227 million rows, is performed in 25 s (warm cache) or 136 s (cold cache). The full performance testing results will be published in a future paper (in preparation).
AXS beyond cross-matching
AXS provides several other functions for making astronomers' lives easier: histogram and histogram2d functions for binning data based on arbitrary functions; region and cone functions for querying parts of the sky; array_positions and array_select SQL functions for handling light-curve data; add_increment function for adding data from new observations to the existing catalogs.
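As a plain-Python illustration of the kind of function-based binning the histogram function provides (a sketch under assumed semantics, not the AXS API):

```python
from collections import Counter

def histogram(rows, value_fn, bin_width):
    """Bin rows by the value of an arbitrary function applied to each row.

    Returns a dict mapping each bin's lower edge to the count of rows
    falling into that bin.
    """
    counts = Counter(int(value_fn(row) // bin_width) for row in rows)
    return {b * bin_width: n for b, n in sorted(counts.items())}
```

For example, binning a catalog by a magnitude column would pass `value_fn=lambda row: row['mag']`; any derived quantity (a color index, a distance estimate) works the same way, which is the point of binning on an arbitrary function rather than a fixed column.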
The underlying Spark engine is rich in data-processing functionality, and we believe the combination offers an excellent choice for astronomical data processing and scientific analysis.