Running the HIPS Catalog pipeline step by step

This notebook walks through a minimal pipeline run with explanatory comments so you can adapt it to your own data. It assumes the package is installed in editable mode and that example data are available (see examples/data).

1. Import the pipeline helpers

We import the configuration helpers, which load a configuration and describe the available options, together with the main run_pipeline entry point.

[1]:
from hipscatalog_gen.config import load_config_from_dict, display_available_configs
from hipscatalog_gen.pipeline.main import run_pipeline

2. Inspect the shipped configuration reference

Use display_available_configs() to print the full configuration reference bundled with the project: every section, its options and defaults, plus a minimal example. This is useful for picking a starting point before customizing your own run.

[2]:
display_available_configs()
HiPS catalog pipeline configuration reference
=============================================

Top-level sections
------------------
input      [required]
columns    [required]
algorithm  [required]
cluster    [required]
output     [required]

input
-----
paths         [required] list[str]
    Glob patterns for input files (Parquet/CSV/TSV/HATS).
format        [optional, default="parquet"]
    One of: "parquet", "csv", "tsv", "hats".
header        [optional, default=True]
    Whether CSV/TSV files include a header row.
ascii_format  [optional, default=None]
    Optional hint for ASCII input ("CSV" or "TSV").

columns
-------
ra    [required] str
    RA column name.
dec   [required] str
    DEC column name.
keep  [optional, default=None] list[str] or null
    Controls which columns are kept in the HiPS tiles:
      - Not set / null (default):
          Keep all input columns preserving original input order.
      - Empty list []:
          Keep only the essential set: RA, DEC, score deps, and mag/flux
          (if mag_global), with RA/DEC first.
      - Non-empty list:
          Use the provided keep order when it already contains all essential
          columns. Otherwise, prepend missing essential columns before the keep
          order (with RA/DEC first if they are missing).

algorithm (block-based)
-----------------------
selection_mode         [required]
    "mag_global" | "score_global" | "score_density_hybrid".
level_limit            [required] int
    Maximum HiPS order (NorderL). Must be in [4, 11].
moc_order              [optional, default=level_limit] int
    HiPS order used for the MOC.
selection_defaults     [optional] dict
    Shared defaults for all modes. Recognized keys:
      - hist_nbins        (int, default 2048)
      - adaptive_range    ("complete" | "hist_peak", default "complete")
      - order_desc        (bool, default False)
      - density_bias_n1/n2/n3 (float, default 1.0 for SDH)

mag_global block
^^^^^^^^^^^^^^^^
mag_global.mag_column        [required if flux_column absent] str
mag_global.flux_column       [required if mag_column absent] str
mag_global.mag_offset        [required when flux_column is set] float
mag_global.mag_min/max       [optional] float
mag_global.adaptive_range    [optional, default=selection_defaults.adaptive_range or "complete"]
mag_global.hist_nbins        [optional, default=selection_defaults.hist_nbins or 2048]
mag_global.k_1/k_2/k_3       [optional] int, "per active tile" aliases for n_*
mag_global.n_1/n_2/n_3       [optional] int (must be provided in order)
mag_global.order_desc        [optional, default=selection_defaults.order_desc or False]

score_global block
^^^^^^^^^^^^^^^^^^
score_global.score_column    [required] str (column or expression)
score_global.score_min/max   [optional] float
score_global.adaptive_range  [optional, default=selection_defaults.adaptive_range or "complete"]
score_global.hist_nbins      [optional, default=selection_defaults.hist_nbins or 2048]
score_global.k_1/k_2/k_3     [optional] int, "per active tile" aliases for n_*
score_global.n_1/n_2/n_3     [optional] int (must be provided in order)
score_global.order_desc      [optional, default=selection_defaults.order_desc or False]

score_density_hybrid block
^^^^^^^^^^^^^^^^^^^^^^^^^^
score_density_hybrid.score_column   [required] str (column or expression)
score_density_hybrid.score_min/max  [optional] float
score_density_hybrid.adaptive_range [optional, default=selection_defaults.adaptive_range or "complete"]
score_density_hybrid.hist_nbins     [optional, default=selection_defaults.hist_nbins or 2048]
score_density_hybrid.density_up_to_depth [optional, default=4]
    Max depth handled by stage-1 density pass (clamped to level_limit).
score_density_hybrid.k_1/k_2/k_3    [optional] int, "per active tile" aliases for n_*
score_density_hybrid.n_1/n_2/n_3    [optional] int (must be provided in order)
score_density_hybrid.density_bias_n1/n2/n3 [optional, default=selection_defaults.density_bias_n* or 1.0]
    float in [0,1]
score_density_hybrid.order_desc     [optional, default=selection_defaults.order_desc or False]

cluster
-------
mode                     [optional, default="local"]
    Cluster mode: "local" or "slurm".
n_workers                [optional, default=3] int
threads_per_worker       [optional, default=1] int
memory_per_worker        [optional, default="2GB"] str
slurm                    [optional, default=None] dict
low_memory_mode          [optional, default=True] bool
    DEPRECATED: kept only for backward compatibility and has no effect.
    The pipeline now always uses:
      - no DataFrame persistence of large intermediates
      - avoiding early large computes whenever possible
diagnostics_mode         [optional, default="global"]
    "per_step" | "global" | "off".

output
------
out_dir      [required] str
cat_name     [required] str
target       [optional, default="0 0"] str
creator_did  [optional, default=None] str
obs_title    [optional, default=None] str
overwrite    [optional, default=False] bool

Examples
========

Minimal configuration (YAML)
----------------------------

    input:
      paths: ["/path/to/catalog/*.parquet"]
    columns:
      ra: "ra"
      dec: "dec"
    algorithm:
      selection_mode: "mag_global"
      level_limit: 10
      mag_global:
        mag_column: "mag_r"
    cluster: {}
    output:
      out_dir: "/path/to/output"
      cat_name: "MyCatalog"
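The columns.keep resolution rules described above can be sketched in plain Python. This is a hypothetical helper for illustration only (resolve_keep is not part of the package); "essential" stands for RA, DEC, and the score/mag dependencies, with RA/DEC first.

```python
def resolve_keep(all_columns, essential, keep=None):
    """Sketch of the documented columns.keep resolution rules.

    all_columns: input columns in their original order.
    essential:   RA, DEC, and score/mag dependencies, RA/DEC first.
    keep:        the columns.keep setting (None, [], or a list).
    """
    if keep is None:
        # Not set / null: keep all input columns, preserving input order.
        return list(all_columns)
    if not keep:
        # Empty list: keep only the essential set, RA/DEC first.
        return list(essential)
    if all(c in keep for c in essential):
        # keep already contains every essential column: honor its order.
        return list(keep)
    # Otherwise prepend the missing essential columns before the keep order.
    missing = [c for c in essential if c not in keep]
    return missing + list(keep)
```

For example, with essentials ["RA", "DEC", "mag_r"], a keep of ["id", "flags"] resolves to ["RA", "DEC", "mag_r", "id", "flags"].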

3. Build a configuration inline

Here we define a minimal configuration dictionary pointing to the small DES DR2 Parquet sample shipped with the tests. In real runs, adjust paths, output locations, and processing options as needed.

[3]:
cfg = {
    "input": {  # source parquet files (wildcards allowed)
        "paths": ["../../../hipscatalog_gen/tests/data/des_dr2_small_sample/*.parquet"],
    },
    "columns": {  # column mapping for sky coordinates
        "ra": "RA",
        "dec": "DEC",
    },
    "algorithm": {  # HEALPix and selection parameters
        "level_limit": 11,
        "selection_mode": "mag_global",
        "mag_global": {
            "mag_column": "MAG_AUTO_I_DERED",
        },
    },
    "cluster": {},  # use defaults for clustering
    "output": {  # where outputs and catalog names are stored
        "out_dir": "../../../hipscatalog_gen/outputs/DES_DR2_small_sample",
        "cat_name": "DES_DR2_small_sample",
    },
}

# Validate and enrich the config using project defaults
cfg = load_config_from_dict(cfg)
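Since input.paths accepts glob patterns, a quick pre-flight check that each pattern matches at least one file can save a wasted run. This is a hypothetical stdlib-only sketch (check_input_paths is not a package function):

```python
import glob

def check_input_paths(patterns):
    """Verify that every input glob pattern matches at least one file.

    Returns a mapping of pattern -> sorted matched paths, or raises
    FileNotFoundError listing the patterns that matched nothing.
    """
    matched = {p: sorted(glob.glob(p)) for p in patterns}
    empty = [p for p, files in matched.items() if not files]
    if empty:
        raise FileNotFoundError(f"No files match input pattern(s): {empty}")
    return matched

# Example: check the patterns from the configuration dictionary above.
# check_input_paths(cfg["input"]["paths"])
```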

4. Execute the pipeline

With the configuration prepared, call run_pipeline. It reads the inputs, processes them according to the configuration, and writes the HiPS catalog under the configured output directory (output.out_dir).

[4]:
# The call is commented out so the notebook renders without a long run;
# uncomment to execute the pipeline end to end.
# run_pipeline(cfg)
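After a run completes, it can be handy to inspect what was written under the output directory. A minimal stdlib sketch (list_outputs is a hypothetical helper, not part of the package; the exact tile layout depends on the pipeline):

```python
from pathlib import Path

def list_outputs(out_dir):
    """List every file the pipeline wrote under out_dir, sorted by path.

    Returns an empty list when the directory does not exist (e.g. before
    the first run).
    """
    root = Path(out_dir)
    if not root.exists():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())

# Example, mirroring the config above:
# for f in list_outputs("../../../hipscatalog_gen/outputs/DES_DR2_small_sample"):
#     print(f)
```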

Next steps

  • Swap the input paths to point to your own data or a different survey sample.

  • Explore the examples/configs templates for richer configurations and options.

  • Integrate this notebook into an automated workflow (e.g., scheduled jobs) to keep catalogs up to date.