S4.4 - PZ Tables as Federated Datasets

Introduction

The Legacy Survey of Space and Time (LSST) will provide photometric measurements for billions of objects during its ten years of operations. Most foreseen LSST science cases rely on photometric redshift (photo-z) estimates for these objects.

The LSST Project's Data Management (DM) department plans to provide at least one, possibly more, photo-z estimates for each object as part of each data release. Given the large and diverse scope of science enabled by the LSST data, no single photo-z method is expected to satisfy the requirements of the whole community.

This contribution consists of offering photo-z tables as federated datasets for each data release, computed with a photo-z method different from the official estimates (yet to be defined by the DM team), thus expanding the scope of the science supported by the data releases.

The infrastructure required to produce, store, and deliver the photo-z tables will be provided by the Brazilian IDAC. The software development necessary to produce these tables, including the optimization and refactoring of the DES photo-z pipelines to run at LSST scale and the production of new pipelines to cover all data-flow steps, is accounted for as directable software development effort.

Software development

The initial software development plan consisted of refactoring the Photo-z Compute pipeline from the DES Science Portal to ensure scalability at LSST data volumes. The new pipeline would keep only the concepts of data preparation, parallelization, easy access to metadata, and provenance control; the LSST scale imposes the adoption of a technology completely different from that used in the DES Portal.

Preliminary tests using Parsl to handle the parallelization, with the photo-z code LePHARE, were carried out in 2021-2022. In mid-2022, the development team started an investigation to evaluate the possibility of reusing the infrastructure of RAIL (an open-source code developed by LSST DESC) to support the production of photo-z tables.
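
For context, a minimal sketch of the kind of Parsl-based fan-out that was tested, assuming LePHARE is driven from the command line over pre-split catalog chunks. The file paths and the exact zphota invocation are placeholders, not the actual pipeline code:

    import parsl
    from parsl import bash_app
    from parsl.config import Config
    from parsl.executors import HighThroughputExecutor

    # Local executor config for illustration; on the HPC cluster this would
    # use a batch-scheduler provider instead.
    parsl.load(Config(executors=[HighThroughputExecutor(label="htex", max_workers=8)]))

    @bash_app
    def run_lephare(catalog, out, stdout=parsl.AUTO_LOGNAME, stderr=parsl.AUTO_LOGNAME):
        # Placeholder command line; the real invocation depends on the LePHARE setup.
        return f"zphota -c lephare.para -CAT_IN {catalog} -CAT_OUT {out}"

    # One task per catalog chunk; Parsl schedules them across the workers.
    futures = [run_lephare(f"chunks/cat_{i:05d}.in", f"pz/cat_{i:05d}.out")
               for i in range(100)]
    [f.result() for f in futures]  # block until all chunks finish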

An initial set of scripts to automate the preparation of input data and the execution of the RAIL estimation module in LIneA's HPC environment is available in the PZ Compute repository on GitHub. It initially supported two photo-z algorithms, BPZ and FlexZBoost, and was later expanded to include GPz, TPZ, and LePHARE.
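
The scripts themselves live in the repository; as a rough illustration of the per-file fan-out pattern (the directory layout and the pz-compute-estimate entry point below are hypothetical), the driver looks something like:

    import subprocess
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    def run_rail_estimate(input_file: Path, algorithm: str = "fzboost") -> Path:
        """Hypothetical wrapper: calls the RAIL-based estimation entry point
        on one pre-processed input file and returns the output path."""
        output = Path("output") / input_file.with_suffix(".hdf5").name
        subprocess.run(
            ["pz-compute-estimate", "--algorithm", algorithm,
             str(input_file), str(output)],
            check=True,
        )
        return output

    if __name__ == "__main__":
        inputs = sorted(Path("skinny").glob("*.parquet"))
        with ProcessPoolExecutor(max_workers=8) as pool:
            for out in pool.map(run_rail_estimate, inputs):
                print("done:", out)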

Scalability tests on DP0.2 data yielded the first runtime and storage benchmarks. Detailed reports with the test results are available in two separate documents.

In summary, the main results are:

  • For 40 billion objects, around 267 thousand output files are expected.
  • Uncompressed output files, each storing PDFs sampled at 301 points, would take around 344 MiB each.
  • The full output for 40 billion objects would take around 88 TiB.
  • Pre-processed, 12-column input files with magnitude data would take around 14 MiB each, or 3.5 TiB for DR11 (40 billion objects).
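
These figures are mutually consistent; a quick back-of-the-envelope check:

    # Consistency check of the DP0.2 scalability numbers quoted above.
    n_objects = 40e9          # DR11-scale catalog
    n_files = 267_000         # expected number of output files
    total_out_tib = 88        # total estimated output volume

    per_file_mib = total_out_tib * 1024**2 / n_files   # ~345 MiB, matching ~344 MiB/file
    objects_per_file = n_objects / n_files             # ~150k objects per file
    total_in_tib = 14 * n_files / 1024**2              # ~3.6 TiB of input, matching ~3.5 TiB

    print(f"{per_file_mib:.0f} MiB/file, {objects_per_file:,.0f} objects/file, "
          f"{total_in_tib:.1f} TiB input")
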
Algorithm        Predicted Runtime for DR11
Pre-processing   24 min
FlexZBoost       7 hours
BPZ              9 hours
TPZ              9 hours
LePHARE          24 days

Roadmap for the development phase

  • Development of scripts to automate data integrity tests after the data transfer (a minimal example follows this list).
  • Development of scripts to automate the creation of skinny tables (lightweight input tables).
  • Development of Jupyter notebooks to perform PZ training and validation.
  • Development of scripts to automate the execution of RAIL with parallelization on the HPC cluster.
  • Development of the PZ Compute pipeline's code profiler and process monitoring.
  • Scalability tests using the IDAC's infrastructure (the Apolo computer cluster and the Santos Dumont supercomputer).
  • Integration tests - E2E pre-operations using the data previews DP0.2, DP1, and DP2.
  • Wrapping new photo-z codes into RAIL (if necessary).
  • Performance tests - comparing codes (using DP1 and DP2).
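
As an illustration of the first item, one simple way the post-transfer integrity test could work, assuming a checksum manifest produced at the source site (the manifest format here is an assumption):

    import hashlib
    from pathlib import Path

    def sha256sum(path: Path, chunk: int = 1 << 20) -> str:
        """Stream a file through SHA-256 without loading it into memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                digest.update(block)
        return digest.hexdigest()

    def verify_manifest(manifest: Path) -> list[str]:
        """Compare local files against '<sha256>  <path>' manifest lines;
        return the paths that fail the check."""
        failed = []
        for line in manifest.read_text().splitlines():
            expected, name = line.split(maxsplit=1)
            if sha256sum(Path(name)) != expected:
                failed.append(name)
        return failed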

Planning for operations

The figure below shows a flowchart representing the data flow within the Brazilian IDAC infrastructure. The numbered arrows refer to the sequence of processes that involve moving data through the IDAC components. The Roman numerals refer to processes involving data transformation.

Sequence of steps for PZ Compute

The steps involved in the production of PZ tables are organized into four stages:

Stage I - Data acquisition

  • Register the new data release (DR) on the IDAC data management system (initial metadata).
  • Start data transfer: download LSST Objects catalog files from the LSST Data Access Center (DAC) to LIneA's Data Transfer Node in the DMZ.
  • Save the Objects catalog files on Lustre T1.
  • Objects catalog validation - integrity and consistency checks, QA reports, and comparison of key numbers. Append the QA info to the DR metadata.
  • Apply data cleaning to create a "skinny table" as input for the photo-z pipelines (e.g., select columns, apply photometric corrections, round off extra decimal places; see the sketch after this list). Save the skinny table on Lustre T0 (fast I/O). Append the skinny table info to the DR metadata.
  • Download and store ancillary files (e.g., observing-conditions maps, SED templates, documents). Append the ancillary file info to the DR metadata.
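
For illustration, a minimal sketch of the skinny-table step, assuming Parquet inputs; the column names and the flat per-band corrections are hypothetical, as the real column list and corrections come from the DR schema:

    import pandas as pd

    # Hypothetical column names; the real list comes from the Objects catalog schema.
    BANDS = "ugrizy"
    KEEP = ["objectId"] + [f"mag_{b}" for b in BANDS] + [f"magerr_{b}" for b in BANDS]

    def make_skinny(src: str, dst: str, mag_offsets: dict[str, float] | None = None) -> None:
        """Select only the columns needed by the photo-z codes, apply simple
        per-band photometric corrections, and round off extra decimal places."""
        df = pd.read_parquet(src, columns=KEEP)
        for col, offset in (mag_offsets or {}).items():
            df[col] = df[col] - offset
        float_cols = df.select_dtypes("float").columns
        df[float_cols] = df[float_cols].round(4)  # 4 decimals are plenty for magnitudes
        df.to_parquet(dst, index=False)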

Stage II - Photo-z pre-processing

  • Execution of the Training Set Maker pipeline (cross-matching).
  • Characterization of the training set - QA report (Jupyter notebook).
  • Register (and upload) the training set(s) on the PZ Server.
  • Photo-z validation: execution of the RAIL inform, estimation, and evaluation modules. Report with photo-z quality metrics (Jupyter notebook; the standard point-estimate metrics are sketched after this list).
  • Obtain report approval from the Rubin DM Photo-z Coordinator.
  • Register (and upload) the PZ validation results on the PZ Server.
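
RAIL's evaluation module provides its own metric suite; for orientation only, the point-estimate metrics conventionally reported in this kind of validation look roughly like this:

    import numpy as np

    def point_metrics(z_phot: np.ndarray, z_spec: np.ndarray, cut: float = 0.15) -> dict:
        """Standard photo-z point-estimate metrics: median bias, sigma_NMAD,
        and the fraction of outliers with |dz| / (1 + z_spec) above `cut`."""
        dz = (z_phot - z_spec) / (1 + z_spec)
        bias = np.median(dz)
        sigma_nmad = 1.4826 * np.median(np.abs(dz - bias))
        outlier_frac = float(np.mean(np.abs(dz) > cut))
        return {"bias": bias, "sigma_nmad": sigma_nmad, "outlier_frac": outlier_frac}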

Stage III - Photo-z computing

  • Execution of the PZ Compute pipeline on the whole dataset.
  • Validation of the PZ Compute results - report with metadata, priors, configuration parameters, global statistics, and N(z) (global and in tomographic bins) (Jupyter notebook; an N(z) stacking sketch follows this list).
  • Register (no upload, only metadata) the PZ table on the PZ Server.
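
As a sketch of the N(z) part of this report, the simplest estimator stacks the per-object PDFs, here assumed to be sampled on a common redshift grid (e.g., the 301-point grid mentioned above):

    import numpy as np

    def stacked_nz(pdfs: np.ndarray, zgrid: np.ndarray, bin_mask: np.ndarray | None = None):
        """Estimate N(z) by summing per-object p(z) rows sampled on `zgrid`;
        `bin_mask` optionally restricts the stack to one tomographic bin."""
        stack = pdfs[bin_mask] if bin_mask is not None else pdfs
        nz = stack.sum(axis=0)
        return nz / np.trapz(nz, zgrid)  # normalize to unit area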

Stage IV - Photo-z post-processing

  • Data preparation for upload (compress files, prepare a package with reports and metadata; a packaging sketch follows this list). Move the PZ table to LIneA's Data Transfer Node in the DMZ.
  • Deliver the PZ tables (data transfer from the BR IDAC to the US DAC or Cloud).
  • Register on the RSP as a federated dataset. Make the table, stored at the USDF, available to RSP users.
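
For illustration, the packaging step in the first item could be as simple as the sketch below; the directory layout and archive format are assumptions, and in practice the compression scheme for tens of TiB of PDFs would need to be chosen carefully:

    import tarfile
    from pathlib import Path

    def package_for_transfer(pz_dir: str, reports_dir: str, out_tar: str) -> None:
        """Bundle the PZ table files and the reports/metadata package into one
        compressed archive before moving it to the Data Transfer Node."""
        with tarfile.open(out_tar, "w:gz") as tar:
            for directory in (Path(pz_dir), Path(reports_dir)):
                tar.add(directory, arcname=directory.name)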