
MALCA: Multi-timescale ASAS-SN Light Curve Analysis

MALCA is a Bayesian event-detection pipeline for finding dimming and dipping events in ASAS-SN photometric light curves. It fits per-camera Gaussian process baselines, scores candidate events via marginal log-likelihood grids and leave-one-out posterior probabilities, and applies multi-stage quality filters to produce a catalog of dipper candidates. Post-detection modules add multi-wavelength characterization (Gaia, WISE, dust maps) and astrophysical classification.
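
The scoring idea can be illustrated with a toy model: fit a flat baseline, fit a windowed dip model, and convert the fit improvement into an approximate log Bayes factor. This is a deliberately simplified numpy sketch, not MALCA's implementation — the actual pipeline uses per-camera GP baselines and full marginal-likelihood grids:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated light curve: flat 13.2-mag baseline with a 0.3-mag dip
t = np.linspace(0.0, 200.0, 400)
yerr = np.full_like(t, 0.02)
y = 13.2 + rng.normal(0.0, 0.02, t.size)
in_dip = (t > 90.0) & (t < 110.0)
y[in_dip] += 0.3  # dimming = magnitudes increase

def chi2(resid, err):
    return np.sum((resid / err) ** 2)

# Null model: constant baseline at the median magnitude
baseline = np.median(y)
chi2_null = chi2(y - baseline, yerr)

# Dip model: baseline plus a boxcar of fitted depth inside the window
depth = np.median(y[in_dip]) - baseline
model = np.where(in_dip, baseline + depth, baseline)
chi2_dip = chi2(y - model, yerr)

# Approximate log Bayes factor for Gaussian likelihoods with fixed errors
log_bf = 0.5 * (chi2_null - chi2_dip)
```

A real event yields a large positive log_bf; pure noise hovers near zero, which is what the trigger thresholds exploit.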

Install

# Requires Python >= 3.9
git clone https://github.com/calderlen/malca.git && cd malca
pip install -e "."          # installs all runtime + test dependencies

Conda option:

conda env create -f environment.yml
conda activate malca

Input Files

  • Per-mag-bin directories: <lcsv2_root>/<mag_bin>/
    • Index CSVs: index*.csv with columns like asas_sn_id, ra_deg, dec_deg, pm_ra, pm_dec, ...
    • Light curves: lc<num>_cal/ folders containing <asas_sn_id>.dat2
  • Optional catalogs:
    • VSX crossmatch: input/vsx/asassn_x_vsx_matches_20250919_2252.csv (pre-crossmatched with columns: asas_sn_id, sep_arcsec, class)
    • Raw VSX: input/vsx/vsxcat.090525.csv (used by vsx/filter.py to generate crossmatch)
    • Note: Bright nearby star (BNS) filtering is handled upstream by ASAS-SN during LC generation
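
Using only the layout above, a minimal manifest builder can be sketched as follows (illustrative only — the real manifest.py also joins index-CSV coordinates, checks file existence, and parallelizes across workers):

```python
from pathlib import Path

import pandas as pd

def build_manifest(lcsv2_root: str, mag_bin: str) -> pd.DataFrame:
    """Map asas_sn_id -> .dat2 path for one magnitude bin."""
    rows = []
    for dat in Path(lcsv2_root, mag_bin).glob("lc*_cal/*.dat2"):
        rows.append({
            "asas_sn_id": dat.stem,   # filename is <asas_sn_id>.dat2
            "lc_dir": str(dat.parent),
            "dat_path": str(dat),
        })
    return pd.DataFrame(rows, columns=["asas_sn_id", "lc_dir", "dat_path"])
```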

Dependencies

  • Core + runtime modules: numpy, pandas, scipy, numba, astropy, celerite2, matplotlib, tqdm, pyarrow
  • Review + plotting: dash, dash-bootstrap-components, plotly
  • Characterization + catalog access: astroquery, dustmaps3d, pyvo, banyan-sigma, requests
  • ML utilities: lightgbm, joblib

Quick Start

# Build manifest (source_id → path index)
malca manifest --index-root /path/to/lcsv2 --lc-root /path/to/lcsv2 --mag-bin 13_13.5 --out output/manifest.parquet --workers 10

# Run event detection pipeline
malca pipeline --mag-bin 13_13.5 --workers 10 --lc-root /path/to/lcsv2 --index-root /path/to/lcsv2 --output output/results.parquet --min-mag-offset 0.1

# Validate results against known candidates (no raw data needed)
malca validate --results output/results.parquet

# Plot light curves
malca plot --input /path/to/lc123.dat2 --out-dir output/plots

# Apply quality filters
malca filter --input output/results.parquet --output output/filtered.parquet

# Multi-wavelength characterization (post-detection)
malca characterize --input output/filtered.parquet --output output/characterized.parquet --dust --starhorse input/starhorse/starhorse2021.parquet

# Get help for any command
malca --help
malca pipeline --help

Minimal split workflow (cluster -> home):

# On cluster: run upstream/raw-dependent steps and export transfer bundle
malca pipeline --stage cluster --mag-bin 13_13.5 --out-dir output/run_001 --export-bundle output/run_001_bundle.zip

# On home machine: import bundle and run downstream/catalog steps only
malca pipeline --stage home --out-dir output/run_001 --import-bundle ~/Downloads/run_001_bundle.zip

Pipeline Architecture

flowchart TB

    %% ── Data Sources ─────────────────────────────────────────
    subgraph sources["Data Sources"]
        RAW["ASAS-SN .dat2 Light Curves"]
        IDX["Index CSVs<br/>(per mag bin)"]
        SKY["SkyPatrol CSVs"]
        VSX_RAW["VSX Catalog"]
        GAIA_SRC["Gaia DR3"]
        SH_SRC["StarHorse Catalog"]
        DUST_SRC["3D Dust Maps<br/>(Wang+ 2025)"]
    end

    %% ── Data Preparation ─────────────────────────────────────
    subgraph prep["Data Preparation"]
        MAN["manifest.py<br/>Build source_id-to-path index"]
        MAN_OUT[("Manifest .parquet")]
        MAN --> MAN_OUT

        subgraph vsxtools["VSX Preprocessing (vsx/)"]
            VFILT["filter.py<br/>Clean variable classes"]
            VCROSS["crossmatch.py<br/>PM-corrected positional match"]
            VFILT --> VCROSS
        end
        VCROSS --> VSX_MATCH[("VSX Crossmatch")]
    end

    RAW --> MAN
    IDX --> MAN
    VSX_RAW --> VFILT

    %% ── Discovery Pipeline ───────────────────────────────────
    subgraph discovery["Discovery Pipeline (detect.py orchestrator)"]
        TAG["tag.py<br/>Sparse-LC, multi-camera,<br/>VSX quality tags"]
        EVENTS["events.py<br/>Bayesian detection, morphology fits,<br/>recurrence analysis, Bayes factors"]
        FILT["filter.py<br/>Evidence strength, run robustness,<br/>morphology, periodicity,<br/>Gaia RUWE/PM, periodic catalogs"]
        TAG --> EVENTS --> FILT
    end

    MAN_OUT --> TAG
    VSX_MATCH -.-> TAG
    GAIA_SRC -.-> FILT
    FILT --> CAND[("Candidates .parquet")]

    %% ── Post-Detection Characterization ──────────────────────
    subgraph postdet["Post-Detection"]
        CHAR["characterize.py<br/>Gaia astrometry/photometry, 3D dust,<br/>YSO classes, galactic coords,<br/>BANYAN, IPHAS, SFR, clusters, unWISE"]
        VET["vetting.py<br/>SIMBAD, Gaia variability/EB,<br/>ASAS-SN Var, ZTF, TNS, ALeRCE,<br/>eROSITA, ATLAS, NEOWISE"]
        CLASS["classify.py<br/>EB/CV/starspot/disk/YSO"]

        subgraph enrichgrp["Enrichment (enrich/)"]
            NEIGH["neighbor.py<br/>Gaia, 2MASS, AllWISE, VSX"]
            SPECTRA["spectra.py<br/>SDSS, LAMOST, GALAH, RAVE"]
        end

        CHAR --> VET --> CLASS --> enrichgrp
    end

    CAND --> CHAR
    GAIA_SRC -.-> CHAR
    DUST_SRC -.-> CHAR
    SH_SRC -.-> CHAR
    enrichgrp --> ENRICHED[("Enriched .parquet")]

    %% ── Visualization ────────────────────────────────────────
    PLOT["plot.py<br/>Light curve + event visualization"]
    CAND --> PLOT
    RAW -.-> PLOT
    SKY -.-> PLOT

    %% ── Review App ───────────────────────────────────────────
    subgraph reviewgrp["Review App (review/)"]
        STORE["store.py<br/>SQLite candidate DB"]
        APP["app.py<br/>Dash GUI: scoring, event classes,<br/>vetting cards, diagnostic plots"]
        RPIPE["pipeline.py<br/>Run missing stages on demand"]
        RMERGE["merge.py<br/>Merge review DBs"]
        RDIAG["diagnostic_plots.py<br/>CMD, Kiel, NEOWISE, Gaia epoch"]
        REXPLORE["explorer.py<br/>EDA + LC explorer"]
        STORE --> APP
        RPIPE -.-> APP
        RDIAG -.-> APP
    end

    CAND --> STORE
    ENRICHED -.-> STORE
    APP --> LABELS[("Labeled Reviews<br/>score + event_class")]

    %% ── Machine Learning ─────────────────────────────────────
    subgraph mlgrp["Machine Learning (ml/)"]
        FEAT["features.py<br/>107 curated features"]
        TRAIN["train.py<br/>LightGBM classifier"]
        PRED["predict.py<br/>Score new candidates"]
        FEAT --> TRAIN --> MODEL[("Model + schema")]
        MODEL --> PRED
    end

    LABELS -.-> TRAIN
    ENRICHED -.-> FEAT

    %% ── LTV Pipeline ─────────────────────────────────────────
    subgraph ltvpipe["LTV Pipeline - Long-Term Variability (ltv/)"]
        LTV_PIPE["pipeline.py<br/>Orchestrator"]
        LTV_CORE["core.py<br/>Season medians, linear/quad fits,<br/>slopes, Lomb-Scargle"]
        LTV_FILT["filter.py<br/>Slope, max diff, dec, PM cuts"]
        LTV_CROSS["crossmatch.py<br/>Gaia, VSX, OGLE, ZTF,<br/>Gaia Alerts, MilliQuas, SIMBAD"]
        LTV_STOCH["stochastic.py<br/>Structure function, IAR,<br/>MHPS, DRW"]
        LTV_NEO["neowise.py<br/>IRSA TAP IR light curves"]
        LTV_DUST["dust.py<br/>Dust excess flags"]
        LTV_CMD["cmd.py<br/>MIST grid, Bailer-Jones distances"]
        LTV_BUNDLE["bundle.py<br/>Package .dat2 files"]
        LTV_INGEST["review.py<br/>Ingest into review DB"]
        LTV_PIPE --> LTV_CORE --> LTV_FILT
        LTV_FILT --> LTV_CROSS --> LTV_STOCH
        LTV_STOCH --> LTV_NEO --> LTV_DUST --> LTV_CMD
        LTV_CMD --> LTV_BUNDLE --> LTV_INGEST
    end

    RAW --> LTV_PIPE
    IDX --> LTV_PIPE
    GAIA_SRC -.-> LTV_CROSS
    LTV_INGEST --> STORE

    %% ── Evaluation ───────────────────────────────────────────
    subgraph evalgrp["Evaluation (evaluation/)"]
        INJ["injection.py<br/>Synthetic dip injection-recovery"]
        DET_RATE["detection_rate.py<br/>Baseline detection rate"]
        VALID["validation.py<br/>Precision/recall vs known targets"]
        REPRO["reproduce.py<br/>Re-run detection on known objects"]
        ATTR["attrition.py<br/>Filter attrition summary"]
        FP_EVAL["false_positive.py<br/>FP contaminant benchmark"]
    end

    MAN_OUT -.-> INJ
    MAN_OUT -.-> DET_RATE
    MAN_OUT -.-> REPRO
    CAND -.-> VALID
    CAND -.-> REPRO
    CAND -.-> ATTR

    %% ── Core Libraries ───────────────────────────────────────
    subgraph corelibs["Core Libraries"]
        UTILS["utils.py<br/>LC I/O, cleaning, kernels"]
        LCIO["lightcurve_io.py<br/>.dat2 / .csv readers"]
        BASE["baseline.py<br/>GP + median baselines"]
        TRIG["triggering.py<br/>logBF / posterior trigger resolution"]
        SCORE_LIB["score.py<br/>Dip/jump/microlensing scoring"]
        STATS_LIB["stats.py<br/>Stetson, von Neumann, RoMS, LS"]
        PERIOD_LIB["periodogram.py<br/>Lomb-Scargle, PDM,<br/>Conditional Entropy"]
        PCA_LIB["pca.py<br/>Variability PCA"]
        FETCH_LIB["fetch.py<br/>SkyPatrol V1/V2 download"]
        GAIA_FETCH["gaia_fetch.py<br/>Bulk Gaia DR3 via AIP TAP"]
    end

    UTILS -.-> EVENTS
    BASE -.-> EVENTS
    TRIG -.-> EVENTS
    SCORE_LIB -.-> EVENTS
    STATS_LIB -.-> SCORE_LIB
    PERIOD_LIB -.-> FILT
    UTILS -.-> REPRO
    BASE -.-> REPRO

    %% ── Configuration ────────────────────────────────────────
    subgraph configgrp["Configuration (config/)"]
        direction LR
        CONF["config_paths, config_pipeline, config_filters,<br/>config_io, config_characterize, config_classify,<br/>config_ltv, config_stats, config_ml, config_vetting"]
    end

    %% ── CLI Entry Point ──────────────────────────────────────
    CLI["__main__.py — malca CLI<br/>manifest, pipeline, filter, tag, events, plot, characterize, classify,<br/>vetting, review, ml_train, ml_predict, injection, validate, reproduce,<br/>ltv-pipeline, ltv-core, ltv-build, ltv-ingest, attrition, stats, ..."]
    CLI -.-> discovery
    CLI -.-> postdet
    CLI -.-> reviewgrp
    CLI -.-> mlgrp
    CLI -.-> ltvpipe
    CLI -.-> evalgrp
    CLI -.-> PLOT

Key Components:

  • Discovery pipeline: manifest.py → tag.py → events.py → filter.py (orchestrated by detect.py)
  • Post-detection: characterize.py (Gaia, dust, YSO, galactic coords, auxiliary catalogs) → vetting.py (SIMBAD, ZTF, TNS, eROSITA, ALeRCE, ATLAS, NEOWISE, ...) → classify.py (EB/CV/starspot/disk/YSO) → enrich/ (neighbor catalogs, spectra availability)
  • LTV pipeline: ltv/pipeline.py → core.py → filter.py → crossmatch.py → stochastic.py → neowise.py → dust.py → cmd.py → bundle.py → review.py (ingest to review DB)
  • Review: review/app.py (Dash GUI with scoring, event classes, diagnostic plots, vetting cards) → labeled training set
  • ML: ml/features.py (107 curated features) → ml/train.py (LightGBM classifier) → ml/predict.py (score candidates)
  • Evaluation: injection.py (synthetic dips), detection_rate.py, validation.py, reproduce.py, attrition.py, false_positive.py
  • Core libraries: utils.py, lightcurve_io.py, baseline.py, triggering.py, score.py, stats.py, periodogram.py, pca.py, fetch.py, gaia_fetch.py
  • Configuration: 10 modules in config/ centralizing all pipeline parameters
  • CLI: Unified interface via malca [command] (__main__.py)

See docs/architecture.md for detailed documentation.

Usage Guide

Detection Pipeline

The full detection workflow has three core steps: build a manifest, run detection with batching/resume, then filter. A fourth, optional step tunes filter behavior from the pipeline command.

  1. Build a manifest (map IDs -> light-curve directories):

    malca manifest --index-root /path/to/lcsv2 --lc-root /path/to/lcsv2 --mag-bin 13_13.5 --out output/lc_manifest_13_13.5.parquet --workers 10
  2. Tag and run events in batches with resume support:

    malca pipeline --mag-bin 13_13.5 --workers 10 --min-time-span 100 --min-points-per-day 0.05 --min-cameras 2 --vsx-crossmatch input/vsx/asassn_x_vsx_matches_20250919_2252.csv --batch-size 2000 --lc-root /path/to/lcsv2 --index-root /path/to/lcsv2 --output output/lc_events_results_13_13.5.parquet --trigger-mode posterior_prob --baseline-func gp --min-mag-offset 0.1
    • The pipeline command builds/loads the manifest, runs tag checks, then calls events.py in batches.
    • Resume: if interrupted, skips already-processed paths using the checkpoint file.
    • VSX tags are saved to tags/vsx_tags/ and merged into results.
    • To disable VSX handling: --skip-vsx. To tag instead of filter: --vsx-mode tag.
  3. Filter events:

    malca filter --input output/lc_events_results_13_13.5.parquet --output output/lc_events_results_13_13.5_filtered.parquet
    
    # With custom thresholds
    malca filter --input results.parquet --output filtered.parquet --min-bayes-factor 20 --min-run-points 3 --apply-morphology
    • Implemented filters: posterior strength, run robustness, score, morphology, periodicity, Gaia RUWE, Gaia PM, multi-catalog periodic consensus
  4. Optional: tune filter behavior directly from malca pipeline / malca detect.

    # Keep pipeline defaults but disable score-based rejection
    malca pipeline --mag-bin 13_13.5 --skip-score-filter
    
    # Enable stricter optional validators
    malca pipeline --mag-bin 13_13.5 --apply-morphology --min-delta-bic 12 --apply-periodicity-validation --periodicity-n-bootstrap 2000 --gaia-reject --periodic-catalog-reject
    • Defaults in pipeline: evidence strength, run robustness, score, Gaia RUWE, Gaia PM, and periodic-catalog consensus validation are on; morphology and periodicity-validation are off.
    • Control flags now available in pipeline:
      • Evidence/run: --skip-evidence-strength, --allow-infinite-local-bf, --skip-run-robustness, --min-run-count, --filter-min-run-points, --filter-min-run-cameras
      • Morphology/score: --apply-morphology, --dip-morphology, --jump-morphology, --min-delta-bic, --skip-score-filter, --min-score
      • Validators: --apply-periodicity-validation (+ periodicity knobs), --skip-gaia-ruwe-validation|--gaia-reject, --skip-gaia-pm-validation|--gaia-pm-reject, --skip-periodic-catalog-validation|--periodic-catalog-reject

Detect options:

# logBF triggering (faster)
malca pipeline --mag-bin 13_13.5 --workers 8 --lc-root /path/to/lcsv2 --index-root /path/to/lcsv2 --output output/events_logbf.parquet --trigger-mode logbf --baseline-func gp_masked --min-mag-offset 0.1

# Multiple mag bins (writes one output per bin)
malca pipeline --mag-bin 12_12.5 12.5_13 13_13.5 --lc-root /path/to/lcsv2 --index-root /path/to/lcsv2 --output output/lc_events_results.parquet --trigger-mode logbf

Individual Commands

malca manifest

malca manifest --index-root <index_dir> --lc-root <lc_dir> --mag-bin 12_12.5 --out output/lc_manifest.parquet

malca events

Run event detection directly (without the pipeline orchestrator):

malca events --input /path/to/lc*_cal/*.dat2 --output output/results.parquet --workers 10

# With signal amplitude filtering (requires |event_mag - baseline_mag| > 0.1)
malca events --input /path/to/lc*_cal/*.dat2 --output output/results.parquet --workers 10 --min-mag-offset 0.1
  • Default Bayesian grid is 12x12. Change p-grid with --p-points.
  • Output includes per-event morphology fit parameters (best_amp, best_t0, best_alpha, best_tau, best_morph, delta_bic, width_param, symmetry_score) and recurrence statistics (is_single_event, inter_event_spacing_median/std, amplitude_consistency, duration_consistency) for both dips and jumps.
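
Downstream, the per-event columns listed above can be sliced directly with pandas. A minimal sketch (column names from the list above; the thresholds are illustrative, not pipeline defaults):

```python
import pandas as pd

def select_clean_dips(df: pd.DataFrame, min_delta_bic: float = 10.0) -> pd.DataFrame:
    """Keep single, morphologically well-fit events (illustrative cuts)."""
    return df[(df["is_single_event"]) & (df["delta_bic"] > min_delta_bic)]

# e.g. select_clean_dips(pd.read_parquet("output/results.parquet"))
```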

malca tag

malca tag --help
  • Expects columns asas_sn_id and path pointing to lc_dir.
  • VSX handling: default is tag (keeps all rows and attaches vsx_sep_arcsec/vsx_class). Use --vsx-mode filter only when you explicitly want VSX-based rejection.

malca filter

malca filter --input output/results.parquet --output output/results_filtered.parquet

malca plot

# Single file
malca plot --input /path/to/lc123.dat2 --out-dir output/plots --format png

# Multiple files (glob patterns supported)
malca plot --input input/skypatrol2/*.csv --out-dir output/plots --skip-events

# All files from events.py results
malca plot --events output/lc_events_results_13_13.5_filtered.parquet --out-dir output/plots

Note: Event scores are computed automatically during detection and included in the results table (dipper_score, dipper_n_dips, dipper_n_valid_dips columns).

Legacy batch plotting: malca old.plot_results_bayes /path/to/*.csv --results-csv output/lc_events_results_13_13.5.csv --out-dir output/plots

malca injection

# Full run
malca injection --workers 10

# Quick test with limited trials
malca injection --max-trials 1000 --workers 10

# Custom manifest and output directory
malca injection --manifest /path/to/manifest.parquet --out-dir output/injection

See Injection Testing output for the directory layout.

  • Injects synthetic dips with skew-normal profiles onto real observed light curves
  • Preserves real cadence, systematics, and noise characteristics
  • Supports resume for long-running parameter sweeps
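
The skew-normal injection idea can be sketched with scipy (illustrative only — injection.py's actual parameterization may differ; `alpha` here is the skewness of the profile):

```python
import numpy as np
from scipy.stats import skewnorm

def inject_dip(t, mag, t0, depth, scale, alpha=4.0):
    """Add a skew-normal dimming profile to observed magnitudes.

    Real cadence, systematics, and noise are preserved because the
    model dip is simply added to the measured values.
    """
    profile = skewnorm.pdf(t, alpha, loc=t0, scale=scale)
    profile /= profile.max()       # normalize so the peak depth == `depth`
    return mag + depth * profile   # dimming = magnitudes increase
```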

Python API:

from malca.evaluation.injection import (
    load_efficiency_cube,
    plot_efficiency_all,
    plot_efficiency_mag_slices,
    plot_efficiency_marginalized,
    plot_efficiency_threshold_contour,
    plot_efficiency_3d,
)

cube = load_efficiency_cube("output/injection/cubes/efficiency_cube.npz")
plot_efficiency_marginalized(cube, axis="mag", output_path="avg_over_mag.png")
plot_efficiency_threshold_contour(cube, threshold=0.5, output_path="depth_at_50pct.png")

malca reproduce

# Re-run detection on raw data (requires manifest and .dat2 files)
malca reproduce --manifest output/lc_manifest.parquet --candidates my_targets.csv --out-dir output/results_repro --workers 10

Note: Reproduction uses Bayesian detection.

malca validate

# Auto-discover and validate ALL results for LOO method
malca validate --method loo

# Auto-discover for Bayes Factor method
malca validate --method bf

# Filter to specific magnitude bin
malca validate --method loo --mag-bin 13_13.5

# Direct file specification
malca validate --results output/results.parquet

# Validate latest detect run output (output/runs/<timestamp>/results)
malca validate --latest-run

# Validate a specific detect run directory
malca validate --run-dir output/runs/20250119_1349

# With custom candidates
malca validate --method loo --candidates my_targets.csv -v

# Reproduce on built-in candidates using local SkyPatrol CSVs
malca validate --candidates brayden_candidates --skypatrol-dir input/skypatrol2 --method bf --workers 4

# Validate using a direct results file path
malca validate --results output/events_logbf.parquet

malca characterize

After detecting dipper candidates, characterize them using multi-wavelength data:

malca characterize --input output/filtered.parquet --output output/characterized.parquet --dust --starhorse input/starhorse/starhorse2021.parquet

Features:

  • Gaia DR3 Queries: Astrometry, astrophysics (Teff, logg, metallicity, distance), 2MASS/AllWISE photometry
  • 3D Dust Extinction: All-sky coverage via dustmaps3d (Wang et al. 2025, ~350MB)
  • YSO Classification: Koenig & Leisawitz (2014) IR color-color diagram with dust correction
  • Galactic Coordinates: Galactic longitude/latitude (l, b) from ra/dec
  • Galactic Population: Thin/thick disk classification using metallicity or StarHorse ages
  • StarHorse (if provided): Stellar ages, masses, distances from local catalog join
  • Auxiliary Catalog Crossmatches (Tzanidakis+2025):
    • BANYAN Σ: Young stellar association membership probabilities
    • IPHAS DR2: Hα emission detection for Galactic plane sources
    • Star-forming regions: Proximity check to known SFRs (Prisinzano+2022)
    • Open clusters: Cantat-Gaudin+2020 membership crossmatch
    • unWISE/unTimely: Mid-IR variability z-scores
  • Caching: Gaia results cached locally to speed up repeated analyses
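
The caching step can be sketched as a deterministic key over the query inputs, matching the gaia_results_{hash}.parquet naming shown in the output layout below (illustrative — characterize.py's actual hashing scheme may differ):

```python
import hashlib
from pathlib import Path

def gaia_cache_path(source_ids, cache_dir="output/gaia_cache"):
    """Deterministic cache file for a set of Gaia source IDs.

    Sorting makes the key order-independent, so repeated queries for
    the same sources hit the same cache file.
    """
    key = ",".join(sorted(map(str, source_ids)))
    digest = hashlib.md5(key.encode()).hexdigest()[:12]
    return Path(cache_dir) / f"gaia_results_{digest}.parquet"
```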

Setup:

# Dust maps auto-download on first use (~350MB)
# For StarHorse, download catalog manually:
# https://cdsarc.cds.unistra.fr/viz-bin/cat/I/354

Output columns:

  • source_id, ra, dec, parallax, distance_gspphot
  • tmass_j, tmass_h, tmass_k, unwise_w1, unwise_w2
  • A_v_3d, ebv_3d (3D dust extinction)
  • H_K, W1_W2, yso_class (Class I/II/Transition Disk/Main Sequence)
  • population (thin_disk/thick_disk from metallicity or age)
  • age50, mass50 (if StarHorse provided)
  • gal_l, gal_b (Galactic coordinates)
  • Auxiliary crossmatches (Tzanidakis+2025):
    • banyan_field_prob, banyan_best_assoc (BANYAN Σ membership)
    • iphas_r_ha, iphas_ha_excess (IPHAS Hα)
    • near_sfr, sfr_name (star-forming region proximity)
    • cluster_name, cluster_age_myr (open cluster membership)
    • unwise_w1_zscore, unwise_w2_zscore, unwise_w1_var (IR variability)

malca vetting

Run post-review vetting against external catalogs:

# Vet all candidates in a characterized parquet
malca vetting output/characterized.parquet -o output/vetted.parquet

# Skip slow modules
malca vetting output/characterized.parquet --no-simbad --no-alerce

# Only vet high-scoring candidates
malca vetting output/characterized.parquet --min-score 3.0

# With crash-resume checkpoint
malca vetting output/characterized.parquet --checkpoint output/vetting_checkpoint.parquet

Modules (all on by default, disable with --no-*):

  • SIMBAD: Object type, bibliography, cross-IDs
  • Gaia DR3 variability: Variable flag, classification, score
  • Gaia DR3 eclipsing binaries: Period, morphology, global ranking
  • Gaia epoch photometry: Availability, observation count, G-band range
  • ASAS-SN variables: Variable star catalog crossmatch
  • ZTF variables: Chen+ 2020 periodic variables (type, period, amplitude)
  • TNS: Transient Name Server (name, type, redshift, discovery date)
  • ALeRCE: ZTF broker classifications and stamp probabilities
  • eROSITA: X-ray detection, flux, separation
  • PM consistency: Proper motion agreement with host cluster
  • ATLAS (opt-in, --atlas-token): Forced photometry light curves
  • NEOWISE (opt-in, --neowise-lc): Full NEOWISE light curves

Pipeline default: vetting runs as part of malca pipeline; use --no-run-vetting to opt out.

Vetting is also available during import in the review GUI ("Vet on import" toggle). Results are cached per input file so re-imports skip already-vetted candidates.

malca classify

malca classify --input output/characterized.parquet --output output/classified.parquet

malca stats

malca stats /path/to/lc123.dat2

malca attrition

malca attrition --pre output/pre.parquet --post output/post.parquet

Candidate Review

# Launch Dash review GUI against an existing run bundle
malca review --plot-dir output/runs/YOUR_RUN/plots

# Standalone mode (no plot directory required)
malca review

Dash GUI features:

  • Native Plotly light-curve viewer with PNG fallback, camera filtering, and plot presets/overlays (raw points, dip/jump markers, residuals, phase-fold, diagnostics)
  • Confidence scoring (1-4) via number keys or clickable buttons
  • Event class labeling (single-select) with direct key shortcuts and clickable badges: dipper, microlensing, flare, yso, unknown_interesting, instrumental, other (toggle off to unclassified)
  • Collapsible candidate panels with metadata health, vetting banner, external follow-up cards, diagnostic plots, and run-config provenance
  • Sidebar queue controls: unreviewed/failed filters, grouped numeric/text/select filters, multi-column sort, open-existing jump, and native camera selection
  • Import/fetch workflows: import tables or raw LC files (optional characterize + vet on import), or fetch by ASAS-SN ID, Gaia DR3 ID, or coordinates
  • Per-candidate pipeline stage chips with "Run All Missing" / "Re-run Current", plus notes/followup/review-pass tracking and CSV/Parquet export

malca ml_train

Train a baseline classifier on reviewed labels:

malca ml_train --input output/review/reviewed.parquet --out-dir output/ml --cv-folds 5
  • Uses curated physics/context features from malca/ml/features.py
  • Trains a LightGBM classifier on labeled event_class values (dropping unclassified by default)
  • Saves model artifacts to output/ml/ (candidate_classifier.joblib, feature_schema.json, metrics.json)

malca ml_predict

Score candidates with a trained classifier:

malca ml_predict --model-dir output/ml --input output/review/reviewed.parquet --output output/review/scored.parquet
  • Loads candidate_classifier.joblib + feature_schema.json
  • Applies the same feature transforms used during training
  • Appends ml_predicted_class and ml_prob_<class> columns to the output table
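
The schema round-trip that keeps training and prediction consistent can be sketched with the stdlib plus pandas (illustrative — the real feature_schema.json written by ml/train.py may carry more metadata):

```python
import json
from pathlib import Path

import pandas as pd

def save_schema(feature_names, path):
    """Persist the training-time feature order."""
    Path(path).write_text(json.dumps({"features": list(feature_names)}))

def align_features(df: pd.DataFrame, path) -> pd.DataFrame:
    """Subset and reorder prediction-time columns to match the schema."""
    schema = json.loads(Path(path).read_text())["features"]
    missing = [c for c in schema if c not in df.columns]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return df[schema]
```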

malca vsx-filter

Build the cleaned ASAS-SN index and filtered VSX catalog:

malca vsx-filter --help
malca vsx-filter --vsx-file input/vsx/vsxcat.090525.csv --masked-dir /path/to/lcsv2_masked --output-dir input/vsx
malca vsx-filter --stamp 20260213_120000   # timestamped output filenames
  • Reads the raw fixed-width VSX catalog and filters out unwanted variability classes (eclipsing binaries, supernovae, AGN, etc.)
  • Concatenates masked ASAS-SN index CSVs from all magnitude bins
  • Outputs asassn_catalog.csv and vsx_cleaned.csv (or timestamped variants with --stamp)

malca vsx-crossmatch

Crossmatch ASAS-SN sources with VSX by position (with proper-motion correction):

malca vsx-crossmatch --help
malca vsx-crossmatch --asassn-csv input/vsx/asassn_catalog.csv --vsx-csv input/vsx/vsx_cleaned.csv
malca vsx-crossmatch --radius 5.0 --stamp 20260213_120000
  • Propagates ASAS-SN coordinates from epoch 2016.0 to 2000.0 using proper motions
  • Default match radius is 3 arcseconds
  • Outputs asassn_x_vsx_matches_{stamp}.csv to input/vsx/

Output Directory Structure

Integrated Pipeline

When running malca pipeline, the following directory structure is created for complete provenance tracking:

output/runs/20250121_143052/          # Timestamp-based run directory
├── run_params.json                   # Detection parameters (detect.py)
├── run_summary.json                  # Detection results stats (detect.py)
├── filter_log.json                   # Filtering parameters & stats (filter.py)
├── plot_log.json                     # Plotting parameters (plot.py)
├── run.log                           # Simple text log with paths
│
├── manifests/                        # Manifest files
│   └── lc_manifest_{mag_bin}.parquet
│
├── tags/                             # Tagging results
│   ├── lc_filtered_{mag_bin}.parquet
│   ├── lc_stats_checkpoint_{mag_bin}.parquet
│   ├── rejected_tag_{mag_bin}.csv
│   └── vsx_tags/
│       └── vsx_tags_{mag_bin}.csv
│
├── paths/                            # Input paths
│   └── filtered_paths_{mag_bin}.txt
│
├── results/                          # Detection results
│   ├── lc_events_results.parquet     # Raw detection output (includes dipper_score)
│   ├── lc_events_results_PROCESSED.txt  # Checkpoint log
│   ├── lc_events_results_filtered.parquet   # After filter.py
│   └── rejected_filter.csv           # Filter rejections
│
└── plots/                            # Visualizations (plot.py)
    ├── {source_id}_dips.png
    └── ...

Key Features:

  • JSON logs track full provenance: Every parameter and result is logged for reproducibility
  • Self-contained runs: Each timestamped directory contains everything needed to reproduce the analysis
  • Checkpoint support: Detection runs can be interrupted and resumed using *_PROCESSED.txt files
  • Rejection tracking: Both tagging and filter rejections are logged with reasons
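
The checkpoint pattern behind *_PROCESSED.txt can be approximated as follows (a sketch of the idea, assuming the checkpoint is a plain list of processed paths, one per line):

```python
from pathlib import Path

def pending_paths(all_paths, checkpoint_file):
    """Return paths not yet recorded in the checkpoint file."""
    ckpt = Path(checkpoint_file)
    done = set(ckpt.read_text().splitlines()) if ckpt.exists() else set()
    return [p for p in all_paths if p not in done]

def mark_done(path, checkpoint_file):
    """Append a processed path so an interrupted run can resume."""
    with open(checkpoint_file, "a") as fh:
        fh.write(path + "\n")
```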

JSON Log Contents:

  • run_params.json: All tagging and detection parameters (thresholds, workers, baseline settings)
  • run_summary.json: Manifest statistics, tag rejection breakdown, detection results
  • filter_log.json: Filter toggles, thresholds, input/output counts, rejection breakdown
  • plot_log.json: Plotting parameters, GP settings, number of plots generated

Note: Event scores (dipper_score, dipper_n_dips, dipper_n_valid_dips) are automatically computed during detection for significant events and included in the results table.

Standalone Module Outputs

Injection Testing

output/injection/                     # Default output directory
├── results/
│   ├── injection_results.parquet     # Trial-by-trial injection results
│   └── injection_results_PROCESSED.txt  # Checkpoint for resume
│
├── cubes/
│   └── efficiency_cube.npz           # 3D efficiency cube (depth × duration × mag)
│
└── plots/
    ├── mag_slices/                   # Per-magnitude 2D heatmaps
    │   ├── mag_12.0_efficiency.png
    │   ├── mag_13.0_efficiency.png
    │   └── ...
    ├── efficiency_marginalized_*.png  # Averaged over one axis
    ├── depth_at_*pct_efficiency.png   # Threshold contour maps
    └── efficiency_3d_volume.html      # Interactive 3D (if plotly installed)
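
The efficiency cube is a 3D array over (depth, duration, mag), so marginalization reduces to an axis mean. A minimal numpy sketch of what plot_efficiency_marginalized computes (the real cube in efficiency_cube.npz also carries axis grids):

```python
import numpy as np

def marginalize(cube: np.ndarray,
                axis_names=("depth", "duration", "mag"),
                axis: str = "mag") -> np.ndarray:
    """Average detection efficiency over one named axis of the cube."""
    return np.nanmean(cube, axis=axis_names.index(axis))
```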

Detection Rate

output/detection_rate/                # Default base directory
├── 20250121_143052/                  # Timestamped run directory
│   ├── run_params.json                # Full parameter dump
│   ├── results/
│   │   ├── detection_rate_results.parquet
│   │   ├── detection_rate_results_PROCESSED.txt  # Checkpoint
│   │   └── detection_summary.json     # Detection rate summary
│   └── plots/
│       ├── detection_rate_vs_mag.png
│       ├── detection_duration_dist.png
│       └── detection_depth_dist.png
│
├── 20250121_150318_custom_tag/       # Optional --run-tag appended
│   └── ...
│
└── latest -> 20250121_150318_custom_tag/  # Symlink to latest run

Multi-Wavelength Characterization

output/
├── characterized.parquet             # Single output file with added columns:
                                      #   - Gaia astrometry & photometry
                                      #   - 3D dust extinction (A_v_3d, ebv_3d)
                                      #   - YSO classification (yso_class)
                                      #   - Galactic population (thin_disk/thick_disk)
                                      #   - StarHorse ages/masses (if provided)
                                      #   - Auxiliary crossmatches (BANYAN Σ, IPHAS, etc.)
└── gaia_cache/                       # Gaia query cache (created when cache is used)
    └── gaia_results_{hash}.parquet

Dipper Classification

output/
└── classified.parquet                # Single output file with added columns:
                                      #   - P_eb, P_cv, P_starspot, P_disk
                                      #   - yso_class
                                      #   - a_circ_au, transit_prob
                                      #   - final_class (EB/CV/Starspot/Disk/YSO/Unknown)

Manifest Building

output/
└── lc_manifest_{mag_bin}.parquet     # Single parquet file with:
                                      #   - asas_sn_id
                                      #   - ra_deg, dec_deg
                                      #   - lc_dir (directory path)
                                      #   - dat_path (full .dat2 path)
                                      #   - dat_exists (bool)

Citation

If you use MALCA or any part of its codebase in published research, please cite this repository:

Lenhart, C. (2025). MALCA: Multi-timescale ASAS-SN Light Curve Analysis [Software].
https://github.com/calderlen/malca

License

This project is licensed under the GNU General Public License v3.0. See LICENSE for details.
