Running the HiPS catalog pipeline step by step
This notebook walks through a minimal pipeline run with explanatory comments so you can adapt it to your own data. It assumes the package is installed in editable mode and that example data are available (see examples/data).
1. Import the pipeline helpers
We import the configuration helpers to load or inspect available presets and the main run_pipeline entry point.
[1]:
from hipscatalog_gen.config import load_config_from_dict, display_available_configs
from hipscatalog_gen.pipeline.main import run_pipeline
2. Inspect shipped configuration templates
Use display_available_configs() to print the configuration reference and example templates packaged with the project. This is useful for picking a starting point before customizing your own run.
[2]:
display_available_configs()
HiPS catalog pipeline configuration reference
=============================================
Top-level sections
------------------
input [required]
columns [required]
algorithm [required]
cluster [required]
output [required]
input
-----
paths [required] list[str]
Glob patterns for input files (Parquet/CSV/TSV/HATS).
format [optional, default="parquet"]
One of: "parquet", "csv", "tsv", "hats".
header [optional, default=True]
Whether CSV/TSV files include a header row.
ascii_format [optional, default=None]
Optional hint for ASCII input ("CSV" or "TSV").
columns
-------
ra [required] str
RA column name.
dec [required] str
DEC column name.
keep [optional, default=None] list[str] or null
Controls which columns are kept in the HiPS tiles:
- Not set / null (default):
Keep all input columns preserving original input order.
- Empty list []:
Keep only the essential set: RA, DEC, score deps, and mag/flux
(if mag_global), with RA/DEC first.
- Non-empty list:
Use the provided keep order when it already contains all essential
columns. Otherwise, prepend missing essential columns before the keep
order (with RA/DEC first if they are missing).
algorithm (block-based)
-----------------------
selection_mode [required]
"mag_global" | "score_global" | "score_density_hybrid".
level_limit [required] int
Maximum HiPS order (NorderL). Must be in [4, 11].
moc_order [optional, default=level_limit] int
HiPS order used for the MOC.
selection_defaults [optional] dict
Shared defaults for all modes. Recognized keys:
- hist_nbins (int, default 2048)
- adaptive_range ("complete" | "hist_peak", default "complete")
- order_desc (bool, default False)
- density_bias_n1/n2/n3 (float, default 1.0 for SDH)
mag_global block
^^^^^^^^^^^^^^^^
mag_global.mag_column [required if flux_column absent] str
mag_global.flux_column [required if mag_column absent] str
mag_global.mag_offset [required when flux_column is set] float
mag_global.mag_min/max [optional] float
mag_global.adaptive_range [optional, default=selection_defaults.adaptive_range or "complete"]
mag_global.hist_nbins [optional, default=selection_defaults.hist_nbins or 2048]
mag_global.k_1/k_2/k_3 [optional] int, "per active tile" aliases for n_*
mag_global.n_1/n_2/n_3 [optional] int (must be provided in order)
mag_global.order_desc [optional, default=selection_defaults.order_desc or False]
score_global block
^^^^^^^^^^^^^^^^^^
score_global.score_column [required] str (column or expression)
score_global.score_min/max [optional] float
score_global.adaptive_range [optional, default=selection_defaults.adaptive_range or "complete"]
score_global.hist_nbins [optional, default=selection_defaults.hist_nbins or 2048]
score_global.k_1/k_2/k_3 [optional] int, "per active tile" aliases for n_*
score_global.n_1/n_2/n_3 [optional] int (must be provided in order)
score_global.order_desc [optional, default=selection_defaults.order_desc or False]
score_density_hybrid block
^^^^^^^^^^^^^^^^^^^^^^^^^^
score_density_hybrid.score_column [required] str (column or expression)
score_density_hybrid.score_min/max [optional] float
score_density_hybrid.adaptive_range [optional, default=selection_defaults.adaptive_range or "complete"]
score_density_hybrid.hist_nbins [optional, default=selection_defaults.hist_nbins or 2048]
score_density_hybrid.density_up_to_depth [optional, default=4]
Max depth handled by stage-1 density pass (clamped to level_limit).
score_density_hybrid.k_1/k_2/k_3 [optional] int, "per active tile" aliases for n_*
score_density_hybrid.n_1/n_2/n_3 [optional] int (must be provided in order)
score_density_hybrid.density_bias_n1/n2/n3 [optional, default=selection_defaults.density_bias_n* or 1.0]
float in [0,1]
score_density_hybrid.order_desc [optional, default=selection_defaults.order_desc or False]
cluster
-------
mode [optional, default="local"]
Cluster mode: "local" or "slurm".
n_workers [optional, default=3] int
threads_per_worker [optional, default=1] int
memory_per_worker [optional, default="2GB"] str
slurm [optional, default=None] dict
low_memory_mode [optional, default=True] bool
DEPRECATED: kept only for backward compatibility and has no effect.
The pipeline now always uses:
- no DataFrame persistence of large intermediates
- avoiding early large computes whenever possible
diagnostics_mode [optional, default="global"]
"per_step" | "global" | "off".
output
------
out_dir [required] str
cat_name [required] str
target [optional, default="0 0"] str
creator_did [optional, default=None] str
obs_title [optional, default=None] str
overwrite [optional, default=False] bool
Examples
========
Minimal configuration (YAML)
----------------------------
input:
  paths: ["/path/to/catalog/*.parquet"]
columns:
  ra: "ra"
  dec: "dec"
algorithm:
  selection_mode: "mag_global"
  level_limit: 10
  mag_global:
    mag_column: "mag_r"
cluster: {}
output:
  out_dir: "/path/to/output"
  cat_name: "MyCatalog"
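The columns.keep rules described in the reference above can be sketched in plain Python. This is only an illustration of the documented behavior; resolve_keep and the essentials list are hypothetical names, not part of the package API:

```python
def resolve_keep(input_cols, keep, essentials):
    """Illustrative sketch of the columns.keep resolution rules.

    essentials lists RA, DEC, score deps, and mag/flux columns, RA/DEC first.
    Hypothetical helper, not the library's implementation.
    """
    if keep is None:
        # Not set / null: keep all input columns in their original order.
        return list(input_cols)
    if not keep:
        # Empty list: keep only the essential set, RA/DEC first.
        return list(essentials)
    # Non-empty list: prepend any missing essentials before the keep order.
    missing = [c for c in essentials if c not in keep]
    return missing + list(keep)


essentials = ["RA", "DEC", "MAG_AUTO_I_DERED"]
cols = ["RA", "DEC", "MAG_AUTO_I_DERED", "FLAGS", "EBV"]
print(resolve_keep(cols, None, essentials))       # all columns, original order
print(resolve_keep(cols, [], essentials))         # essentials only
print(resolve_keep(cols, ["FLAGS"], essentials))  # missing essentials prepended
```

Note that a non-empty keep list that already contains all essential columns is used verbatim, so its ordering is preserved.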
3. Build a configuration inline
Here we define a minimal configuration dictionary pointing to the sample DES DR2 parquet files. In real runs, adjust paths, output locations, and processing options as needed.
[3]:
cfg = {
    "input": {  # source parquet files (wildcards allowed)
        "paths": ["../../../hipscatalog_gen/tests/data/des_dr2_small_sample/*.parquet"],
    },
    "columns": {  # column mapping for sky coordinates
        "ra": "RA",
        "dec": "DEC",
    },
    "algorithm": {  # HEALPix and selection parameters
        "level_limit": 11,
        "selection_mode": "mag_global",
        "mag_global": {
            "mag_column": "MAG_AUTO_I_DERED",
        },
    },
    "cluster": {},  # use defaults for clustering
    "output": {  # where outputs and the catalog name are stored
        "out_dir": "../../../hipscatalog_gen/outputs/DES_DR2_small_sample",
        "cat_name": "DES_DR2_small_sample",
    },
}
# Validate and enrich the config using project defaults
cfg = load_config_from_dict(cfg)
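Conceptually, the enrichment step overlays the user-supplied values on the documented defaults (and validates required keys and value ranges along the way). A minimal sketch of that overlay idea, where merge_defaults is a hypothetical helper and not the package's actual code:

```python
def merge_defaults(user_cfg, defaults):
    """Recursively overlay user-supplied values on top of defaults.

    Hypothetical illustration of config enrichment; the real
    load_config_from_dict also validates required keys and ranges.
    """
    merged = dict(defaults)
    for key, value in user_cfg.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections such as "cluster" or "algorithm".
            merged[key] = merge_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged


cluster_defaults = {"mode": "local", "n_workers": 3, "threads_per_worker": 1}
print(merge_defaults({"n_workers": 8}, cluster_defaults))
```

With this view, an empty section such as `"cluster": {}` simply resolves to the defaults listed in the configuration reference.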
4. Execute the pipeline
With the configuration prepared, call run_pipeline. It will read inputs, process them according to the configuration, and write outputs under the configured root.
[4]:
run_pipeline(cfg)
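After a successful run, the catalog lands under the configured out_dir. Assuming the standard HiPS tile layout (Norder&lt;k&gt;/Dir&lt;d&gt;/Npix&lt;n&gt; files alongside a properties file), a quick sanity check on the output might look like this; summarize_hips_output is an illustrative helper, not part of hipscatalog_gen:

```python
from pathlib import Path


def summarize_hips_output(root):
    """Count tile files per HiPS order under a catalog output directory.

    Assumes the standard HiPS layout (Norder<k>/Dir<d>/Npix<n>.*);
    illustrative helper, not part of hipscatalog_gen.
    """
    counts = {}
    for order_dir in sorted(Path(root).glob("Norder*")):
        counts[order_dir.name] = sum(1 for _ in order_dir.glob("Dir*/Npix*"))
    return counts


# Example, using the out_dir from the configuration above:
# summarize_hips_output("../../../hipscatalog_gen/outputs/DES_DR2_small_sample")
```

An empty result usually means the run failed or the path does not point at the catalog root.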
Next steps
Swap the input paths to point to your own data or a different survey sample.
Explore the examples/configs templates for richer configurations and options.
Integrate this notebook in an automated workflow (e.g., scheduled jobs) to keep catalogs up to date.