MALCA is a Bayesian event-detection pipeline for finding dimming and dipping events in ASAS-SN photometric light curves. It fits per-camera Gaussian process baselines, scores candidate events via marginal log-likelihood grids and leave-one-out posterior probabilities, and applies multi-stage quality filters to produce a catalog of dipper candidates. Post-detection modules add multi-wavelength characterization (Gaia, WISE, dust maps) and astrophysical classification.
# Requires Python >= 3.9
git clone https://github.com/calderlen/malca.git && cd malca
pip install -e "."  # installs all runtime + test dependencies

Conda option:

conda env create -f environment.yml
conda activate malca

Expected input data layout:

- Per-mag-bin directories: `<lcsv2_root>/<mag_bin>/`
- Index CSVs: `index*.csv` with columns like `asas_sn_id`, `ra_deg`, `dec_deg`, `pm_ra`, `pm_dec`, ...
- Light curves: `lc<num>_cal/` folders containing `<asas_sn_id>.dat2`
- Optional catalogs:
  - VSX crossmatch: `input/vsx/asassn_x_vsx_matches_20250919_2252.csv` (pre-crossmatched with columns: asas_sn_id, sep_arcsec, class)
  - Raw VSX: `input/vsx/vsxcat.090525.csv` (used by `vsx/filter.py` to generate the crossmatch)
- Note: Bright nearby star (BNS) filtering is handled upstream by ASAS-SN during LC generation
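As a quick orientation to this layout, the sketch below shows how the index CSV columns map to on-disk `.dat2` paths. The rows and the `lc1_cal` folder assignment are made-up illustrations (the real folder mapping comes from the manifest step); only the column names and the `<lcsv2_root>/<mag_bin>/lc<num>_cal/<asas_sn_id>.dat2` pattern come from the layout above.

```python
import pandas as pd

# Synthetic index rows mimicking the index*.csv columns described above
index = pd.DataFrame({
    "asas_sn_id": [8590494037, 8590520556],
    "ra_deg": [210.1, 211.9],
    "dec_deg": [-12.3, 5.7],
})

root, mag_bin = "/path/to/lcsv2", "13_13.5"  # hypothetical root
# Each light curve is stored as <asas_sn_id>.dat2 inside an lc<num>_cal/
# folder; "lc1_cal" here is a placeholder for the manifest's real mapping.
paths = [f"{root}/{mag_bin}/lc1_cal/{sid}.dat2" for sid in index["asas_sn_id"]]
print(paths[0])
```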
- Core + runtime modules: numpy, pandas, scipy, numba, astropy, celerite2, matplotlib, tqdm, pyarrow
- Review + plotting: dash, dash-bootstrap-components, plotly
- Characterization + catalog access: astroquery, dustmaps3d, pyvo, banyan-sigma, requests
- ML utilities: lightgbm, joblib
# Build manifest (source_id → path index)
malca manifest --index-root /path/to/lcsv2 --lc-root /path/to/lcsv2 --mag-bin 13_13.5 --out output/manifest.parquet --workers 10
# Run event detection pipeline
malca pipeline --mag-bin 13_13.5 --workers 10 --lc-root /path/to/lcsv2 --index-root /path/to/lcsv2 --output output/results.parquet --min-mag-offset 0.1
# Validate results against known candidates (no raw data needed)
malca validate --results output/results.parquet
# Plot light curves
malca plot --input /path/to/lc123.dat2 --out-dir output/plots
# Apply quality filters
malca filter --input output/results.parquet --output output/filtered.parquet
# Multi-wavelength characterization (post-detection)
malca characterize --input output/filtered.parquet --output output/characterized.parquet --dust --starhorse input/starhorse/starhorse2021.parquet
# Get help for any command
malca --help
malca pipeline --help

Minimal split workflow (cluster -> home):
# On cluster: run upstream/raw-dependent steps and export transfer bundle
malca pipeline --stage cluster --mag-bin 13_13.5 --out-dir output/run_001 --export-bundle output/run_001_bundle.zip
# On home machine: import bundle and run downstream/catalog steps only
malca pipeline --stage home --out-dir output/run_001 --import-bundle ~/Downloads/run_001_bundle.zip

flowchart TB
%% ── Data Sources ─────────────────────────────────────────
subgraph sources["Data Sources"]
RAW["ASAS-SN .dat2 Light Curves"]
IDX["Index CSVs<br/>(per mag bin)"]
SKY["SkyPatrol CSVs"]
VSX_RAW["VSX Catalog"]
GAIA_SRC["Gaia DR3"]
SH_SRC["StarHorse Catalog"]
DUST_SRC["3D Dust Maps<br/>(Wang+ 2025)"]
end
%% ── Data Preparation ─────────────────────────────────────
subgraph prep["Data Preparation"]
MAN["manifest.py<br/>Build source_id-to-path index"]
MAN_OUT[("Manifest .parquet")]
MAN --> MAN_OUT
subgraph vsxtools["VSX Preprocessing (vsx/)"]
VFILT["filter.py<br/>Clean variable classes"]
VCROSS["crossmatch.py<br/>PM-corrected positional match"]
VFILT --> VCROSS
end
VCROSS --> VSX_MATCH[("VSX Crossmatch")]
end
RAW --> MAN
IDX --> MAN
VSX_RAW --> VFILT
%% ── Discovery Pipeline ───────────────────────────────────
subgraph discovery["Discovery Pipeline (detect.py orchestrator)"]
TAG["tag.py<br/>Sparse-LC, multi-camera,<br/>VSX quality tags"]
EVENTS["events.py<br/>Bayesian detection, morphology fits,<br/>recurrence analysis, Bayes factors"]
FILT["filter.py<br/>Evidence strength, run robustness,<br/>morphology, periodicity,<br/>Gaia RUWE/PM, periodic catalogs"]
TAG --> EVENTS --> FILT
end
MAN_OUT --> TAG
VSX_MATCH -.-> TAG
GAIA_SRC -.-> FILT
FILT --> CAND[("Candidates .parquet")]
%% ── Post-Detection Characterization ──────────────────────
subgraph postdet["Post-Detection"]
CHAR["characterize.py<br/>Gaia astrometry/photometry, 3D dust,<br/>YSO classes, galactic coords,<br/>BANYAN, IPHAS, SFR, clusters, unWISE"]
VET["vetting.py<br/>SIMBAD, Gaia variability/EB,<br/>ASAS-SN Var, ZTF, TNS, ALeRCE,<br/>eROSITA, ATLAS, NEOWISE"]
CLASS["classify.py<br/>EB/CV/starspot/disk/YSO"]
subgraph enrichgrp["Enrichment (enrich/)"]
NEIGH["neighbor.py<br/>Gaia, 2MASS, AllWISE, VSX"]
SPECTRA["spectra.py<br/>SDSS, LAMOST, GALAH, RAVE"]
end
CHAR --> VET --> CLASS --> enrichgrp
end
CAND --> CHAR
GAIA_SRC -.-> CHAR
DUST_SRC -.-> CHAR
SH_SRC -.-> CHAR
enrichgrp --> ENRICHED[("Enriched .parquet")]
%% ── Visualization ────────────────────────────────────────
PLOT["plot.py<br/>Light curve + event visualization"]
CAND --> PLOT
RAW -.-> PLOT
SKY -.-> PLOT
%% ── Review App ───────────────────────────────────────────
subgraph reviewgrp["Review App (review/)"]
STORE["store.py<br/>SQLite candidate DB"]
APP["app.py<br/>Dash GUI: scoring, event classes,<br/>vetting cards, diagnostic plots"]
RPIPE["pipeline.py<br/>Run missing stages on demand"]
RMERGE["merge.py<br/>Merge review DBs"]
RDIAG["diagnostic_plots.py<br/>CMD, Kiel, NEOWISE, Gaia epoch"]
REXPLORE["explorer.py<br/>EDA + LC explorer"]
STORE --> APP
RPIPE -.-> APP
RDIAG -.-> APP
end
CAND --> STORE
ENRICHED -.-> STORE
APP --> LABELS[("Labeled Reviews<br/>score + event_class")]
%% ── Machine Learning ─────────────────────────────────────
subgraph mlgrp["Machine Learning (ml/)"]
FEAT["features.py<br/>107 curated features"]
TRAIN["train.py<br/>LightGBM classifier"]
PRED["predict.py<br/>Score new candidates"]
FEAT --> TRAIN --> MODEL[("Model + schema")]
MODEL --> PRED
end
LABELS -.-> TRAIN
ENRICHED -.-> FEAT
%% ── LTV Pipeline ─────────────────────────────────────────
subgraph ltvpipe["LTV Pipeline - Long-Term Variability (ltv/)"]
LTV_PIPE["pipeline.py<br/>Orchestrator"]
LTV_CORE["core.py<br/>Season medians, linear/quad fits,<br/>slopes, Lomb-Scargle"]
LTV_FILT["filter.py<br/>Slope, max diff, dec, PM cuts"]
LTV_CROSS["crossmatch.py<br/>Gaia, VSX, OGLE, ZTF,<br/>Gaia Alerts, MilliQuas, SIMBAD"]
LTV_STOCH["stochastic.py<br/>Structure function, IAR,<br/>MHPS, DRW"]
LTV_NEO["neowise.py<br/>IRSA TAP IR light curves"]
LTV_DUST["dust.py<br/>Dust excess flags"]
LTV_CMD["cmd.py<br/>MIST grid, Bailer-Jones distances"]
LTV_BUNDLE["bundle.py<br/>Package .dat2 files"]
LTV_INGEST["review.py<br/>Ingest into review DB"]
LTV_PIPE --> LTV_CORE --> LTV_FILT
LTV_FILT --> LTV_CROSS --> LTV_STOCH
LTV_STOCH --> LTV_NEO --> LTV_DUST --> LTV_CMD
LTV_CMD --> LTV_BUNDLE --> LTV_INGEST
end
RAW --> LTV_PIPE
IDX --> LTV_PIPE
GAIA_SRC -.-> LTV_CROSS
LTV_INGEST --> STORE
%% ── Evaluation ───────────────────────────────────────────
subgraph evalgrp["Evaluation (evaluation/)"]
INJ["injection.py<br/>Synthetic dip injection-recovery"]
DET_RATE["detection_rate.py<br/>Baseline detection rate"]
VALID["validation.py<br/>Precision/recall vs known targets"]
REPRO["reproduce.py<br/>Re-run detection on known objects"]
ATTR["attrition.py<br/>Filter attrition summary"]
FP_EVAL["false_positive.py<br/>FP contaminant benchmark"]
end
MAN_OUT -.-> INJ
MAN_OUT -.-> DET_RATE
MAN_OUT -.-> REPRO
CAND -.-> VALID
CAND -.-> REPRO
CAND -.-> ATTR
%% ── Core Libraries ───────────────────────────────────────
subgraph corelibs["Core Libraries"]
UTILS["utils.py<br/>LC I/O, cleaning, kernels"]
LCIO["lightcurve_io.py<br/>.dat2 / .csv readers"]
BASE["baseline.py<br/>GP + median baselines"]
TRIG["triggering.py<br/>logBF / posterior trigger resolution"]
SCORE_LIB["score.py<br/>Dip/jump/microlensing scoring"]
STATS_LIB["stats.py<br/>Stetson, von Neumann, RoMS, LS"]
PERIOD_LIB["periodogram.py<br/>Lomb-Scargle, PDM,<br/>Conditional Entropy"]
PCA_LIB["pca.py<br/>Variability PCA"]
FETCH_LIB["fetch.py<br/>SkyPatrol V1/V2 download"]
GAIA_FETCH["gaia_fetch.py<br/>Bulk Gaia DR3 via AIP TAP"]
end
UTILS -.-> EVENTS
BASE -.-> EVENTS
TRIG -.-> EVENTS
SCORE_LIB -.-> EVENTS
STATS_LIB -.-> SCORE_LIB
PERIOD_LIB -.-> FILT
UTILS -.-> REPRO
BASE -.-> REPRO
%% ── Configuration ────────────────────────────────────────
subgraph configgrp["Configuration (config/)"]
direction LR
CONF["config_paths, config_pipeline, config_filters,<br/>config_io, config_characterize, config_classify,<br/>config_ltv, config_stats, config_ml, config_vetting"]
end
%% ── CLI Entry Point ──────────────────────────────────────
CLI["__main__.py — malca CLI<br/>manifest, pipeline, filter, tag, events, plot, characterize, classify,<br/>vetting, review, ml_train, ml_predict, injection, validate, reproduce,<br/>ltv-pipeline, ltv-core, ltv-build, ltv-ingest, attrition, stats, ..."]
CLI -.-> discovery
CLI -.-> postdet
CLI -.-> reviewgrp
CLI -.-> mlgrp
CLI -.-> ltvpipe
CLI -.-> evalgrp
CLI -.-> PLOT
Key Components:
- Discovery pipeline: `manifest.py` → `tag.py` → `events.py` → `filter.py` (orchestrated by `detect.py`)
- Post-detection: `characterize.py` (Gaia, dust, YSO, galactic coords, auxiliary catalogs) → `vetting.py` (SIMBAD, ZTF, TNS, eROSITA, ALeRCE, ATLAS, NEOWISE, ...) → `classify.py` (EB/CV/starspot/disk/YSO) → `enrich/` (neighbor catalogs, spectra availability)
- LTV pipeline: `ltv/pipeline.py` → `core.py` → `filter.py` → `crossmatch.py` → `stochastic.py` → `neowise.py` → `dust.py` → `cmd.py` → `bundle.py` → `review.py` (ingest to review DB)
- Review: `review/app.py` (Dash GUI with scoring, event classes, diagnostic plots, vetting cards) → labeled training set
- ML: `ml/features.py` (107 curated features) → `ml/train.py` (LightGBM classifier) → `ml/predict.py` (score candidates)
- Evaluation: `injection.py` (synthetic dips), `detection_rate.py`, `validation.py`, `reproduce.py`, `attrition.py`, `false_positive.py`
- Core libraries: `utils.py`, `lightcurve_io.py`, `baseline.py`, `triggering.py`, `score.py`, `stats.py`, `periodogram.py`, `pca.py`, `fetch.py`, `gaia_fetch.py`
- Configuration: 10 modules in `config/` centralizing all pipeline parameters
- CLI: unified interface via `malca [command]` (`__main__.py`)
See docs/architecture.md for detailed documentation.
The full detection workflow has three steps: build a manifest, run detection with batching/resume, then filter.
1. Build a manifest (map IDs -> light-curve directories):

   malca manifest --index-root /path/to/lcsv2 --lc-root /path/to/lcsv2 --mag-bin 13_13.5 --out output/lc_manifest_13_13.5.parquet --workers 10

2. Tag and run events in batches with resume support:

   malca pipeline --mag-bin 13_13.5 --workers 10 --min-time-span 100 --min-points-per-day 0.05 --min-cameras 2 --vsx-crossmatch input/vsx/asassn_x_vsx_matches_20250919_2252.csv --batch-size 2000 --lc-root /path/to/lcsv2 --index-root /path/to/lcsv2 --output output/lc_events_results_13_13.5.parquet --trigger-mode posterior_prob --baseline-func gp --min-mag-offset 0.1

   - The pipeline command builds/loads the manifest, runs tag checks, then calls `events.py` in batches.
   - Resume: if interrupted, it skips already-processed paths using the checkpoint file.
   - VSX tags are saved to `tags/vsx_tags/` and merged into results.
   - To disable VSX handling: `--skip-vsx`. To tag instead of filter: `--vsx-mode tag`.

3. Filter events:

   malca filter --input output/lc_events_results_13_13.5.parquet --output output/lc_events_results_13_13.5_filtered.parquet
   # With custom thresholds
   malca filter --input results.parquet --output filtered.parquet --min-bayes-factor 20 --min-run-points 3 --apply-morphology

   - Implemented filters: posterior strength, run robustness, score, morphology, periodicity, Gaia RUWE, Gaia PM, multi-catalog periodic consensus
Optional: tune filter behavior directly from `malca pipeline` / `malca detect`:

# Keep pipeline defaults but disable score-based rejection
malca pipeline --mag-bin 13_13.5 --skip-score-filter
# Enable stricter optional validators
malca pipeline --mag-bin 13_13.5 --apply-morphology --min-delta-bic 12 --apply-periodicity-validation --periodicity-n-bootstrap 2000 --gaia-reject --periodic-catalog-reject

- Defaults in pipeline: evidence strength, run robustness, score, Gaia RUWE, Gaia PM, and periodic-catalog consensus validation are on; morphology and periodicity validation are off.
- Control flags available in pipeline:
  - Evidence/run: `--skip-evidence-strength`, `--allow-infinite-local-bf`, `--skip-run-robustness`, `--min-run-count`, `--filter-min-run-points`, `--filter-min-run-cameras`
  - Morphology/score: `--apply-morphology`, `--dip-morphology`, `--jump-morphology`, `--min-delta-bic`, `--skip-score-filter`, `--min-score`
  - Validators: `--apply-periodicity-validation` (+ periodicity knobs), `--skip-gaia-ruwe-validation` / `--gaia-reject`, `--skip-gaia-pm-validation` / `--gaia-pm-reject`, `--skip-periodic-catalog-validation` / `--periodic-catalog-reject`
Detect options:

# logBF triggering (faster)
malca pipeline --mag-bin 13_13.5 --workers 8 --lc-root /path/to/lcsv2 --index-root /path/to/lcsv2 --output output/events_logbf.parquet --trigger-mode logbf --baseline-func gp_masked --min-mag-offset 0.1
# Multiple mag bins (writes one output per bin)
malca pipeline --mag-bin 12_12.5 12.5_13 13_13.5 --lc-root /path/to/lcsv2 --index-root /path/to/lcsv2 --output output/lc_events_results.parquet --trigger-mode logbf

Build a manifest on its own:

malca manifest --index-root <index_dir> --lc-root <lc_dir> --mag-bin 12_12.5 --out output/lc_manifest.parquet

Run event detection directly (without the pipeline orchestrator):

malca events --input /path/to/lc*_cal/*.dat2 --output output/results.parquet --workers 10
# With signal amplitude filtering (requires |event_mag - baseline_mag| > 0.1)
malca events --input /path/to/lc*_cal/*.dat2 --output output/results.parquet --workers 10 --min-mag-offset 0.1

- Default Bayesian grid is 12x12. Change the p-grid with `--p-points`.
- Output includes per-event morphology fit parameters (`best_amp`, `best_t0`, `best_alpha`, `best_tau`, `best_morph`, `delta_bic`, `width_param`, `symmetry_score`) and recurrence statistics (`is_single_event`, `inter_event_spacing_median/std`, `amplitude_consistency`, `duration_consistency`) for both dips and jumps.
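A minimal sketch of working with those output columns downstream, using a synthetic results table (the rows and the `delta_bic > 10` cut are illustrative choices, not pipeline defaults; the column names are the ones listed above):

```python
import pandas as pd

# Synthetic events output mimicking a few of the documented columns
results = pd.DataFrame({
    "asas_sn_id": [1, 2, 3],
    "best_morph": ["skewnorm", "gaussian", "skewnorm"],
    "delta_bic": [25.0, 4.0, 18.0],
    "is_single_event": [True, True, False],
})

# Keep single events whose morphology fit clearly beats the null model
# (threshold of 10 is an arbitrary example value)
good = results[(results["delta_bic"] > 10) & results["is_single_event"]]
print(good["asas_sn_id"].tolist())
```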
malca tag --help

- Expects columns `asas_sn_id` and `path` pointing to lc_dir.
- VSX handling: default is `tag` (keeps all rows and attaches `vsx_sep_arcsec` / `vsx_class`). Use `--vsx-mode filter` only when you explicitly want VSX-based rejection.
malca filter --input output/results.parquet --output output/results_filtered.parquet

# Single file
malca plot --input /path/to/lc123.dat2 --out-dir output/plots --format png
# Multiple files (glob patterns supported)
malca plot --input input/skypatrol2/*.csv --out-dir output/plots --skip-events
# All files from events.py results
malca plot --events output/lc_events_results_13_13.5_filtered.parquet --out-dir output/plots

Note: Event scores are computed automatically during detection and included in the results table (`dipper_score`, `dipper_n_dips`, `dipper_n_valid_dips` columns).

Legacy batch plotting: malca old.plot_results_bayes /path/to/*.csv --results-csv output/lc_events_results_13_13.5.csv --out-dir output/plots
# Full run
malca injection --workers 10
# Quick test with limited trials
malca injection --max-trials 1000 --workers 10
# Custom manifest and output directory
malca injection --manifest /path/to/manifest.parquet --out-dir output/injection

See the Injection Testing output section for the directory layout.
- Injects synthetic dips with skew-normal profiles onto real observed light curves
- Preserves real cadence, systematics, and noise characteristics
- Supports resume for long-running parameter sweeps
Python API:
from malca.evaluation.injection import (
load_efficiency_cube,
plot_efficiency_all,
plot_efficiency_mag_slices,
plot_efficiency_marginalized,
plot_efficiency_threshold_contour,
plot_efficiency_3d,
)
cube = load_efficiency_cube("output/injection/cubes/efficiency_cube.npz")
plot_efficiency_marginalized(cube, axis="mag", output_path="avg_over_mag.png")
plot_efficiency_threshold_contour(cube, threshold=0.5, output_path="depth_at_50pct.png")

# Re-run detection on raw data (requires manifest and .dat2 files)
malca reproduce --manifest output/lc_manifest.parquet --candidates my_targets.csv --out-dir output/results_repro --workers 10

Note: Reproduction uses Bayesian detection.
# Auto-discover and validate ALL results for LOO method
malca validate --method loo
# Auto-discover for Bayes Factor method
malca validate --method bf
# Filter to specific magnitude bin
malca validate --method loo --mag-bin 13_13.5
# Direct file specification
malca validate --results output/results.parquet
# Validate latest detect run output (output/runs/<timestamp>/results)
malca validate --latest-run
# Validate a specific detect run directory
malca validate --run-dir output/runs/20250119_1349
# With custom candidates
malca validate --method loo --candidates my_targets.csv -v
# Reproduce on built-in candidates using local SkyPatrol CSVs
malca validate --candidates brayden_candidates --skypatrol-dir input/skypatrol2 --method bf --workers 4
# Validate using a direct results file path
malca validate --results output/events_logbf.parquet

After detecting dipper candidates, characterize them using multi-wavelength data:

malca characterize --input output/filtered.parquet --output output/characterized.parquet --dust --starhorse input/starhorse/starhorse2021.parquet

Features:
- Gaia DR3 Queries: Astrometry, astrophysics (Teff, logg, metallicity, distance), 2MASS/AllWISE photometry
- 3D Dust Extinction: All-sky coverage via `dustmaps3d` (Wang et al. 2025, ~350MB)
- YSO Classification: Koenig & Leisawitz (2014) IR color-color diagram with dust correction
- Galactic Coordinates: Galactic longitude/latitude (l, b) from ra/dec
- Galactic Population: Thin/thick disk classification using metallicity or StarHorse ages
- StarHorse (if provided): Stellar ages, masses, distances from local catalog join
- Auxiliary Catalog Crossmatches (Tzanidakis+2025):
- BANYAN Σ: Young stellar association membership probabilities
- IPHAS DR2: Hα emission detection for Galactic plane sources
- Star-forming regions: Proximity check to known SFRs (Prisinzano+2022)
- Open clusters: Cantat-Gaudin+2020 membership crossmatch
- unWISE/unTimely: Mid-IR variability z-scores
- Caching: Gaia results cached locally to speed up repeated analyses
Setup:
# Dust maps auto-download on first use (~350MB)
# For StarHorse, download catalog manually:
# https://cdsarc.cds.unistra.fr/viz-bin/cat/I/354

Output columns:
- `source_id`, `ra`, `dec`, `parallax`, `distance_gspphot`
- `tmass_j`, `tmass_h`, `tmass_k`, `unwise_w1`, `unwise_w2`
- `A_v_3d`, `ebv_3d` (3D dust extinction)
- `H_K`, `W1_W2`, `yso_class` (Class I/II/Transition Disk/Main Sequence)
- `population` (thin_disk/thick_disk from metallicity or age)
- `age50`, `mass50` (if StarHorse provided)
- `gal_l`, `gal_b` (Galactic coordinates)
- Auxiliary crossmatches (Tzanidakis+2025): `banyan_field_prob`, `banyan_best_assoc` (BANYAN Σ membership); `iphas_r_ha`, `iphas_ha_excess` (IPHAS Hα); `near_sfr`, `sfr_name` (star-forming region proximity); `cluster_name`, `cluster_age_myr` (open cluster membership); `unwise_w1_zscore`, `unwise_w2_zscore`, `unwise_w1_var` (IR variability)
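As an illustration of consuming these columns, the sketch below selects disk-bearing candidates from a synthetic characterized table. The rows are invented; the column names and `yso_class` labels are the ones documented above, and the selection itself is just an example query, not the pipeline's classification logic.

```python
import pandas as pd

# Synthetic characterized output with a subset of the documented columns
char = pd.DataFrame({
    "source_id": [101, 102, 103],
    "H_K": [0.9, 0.2, 0.6],
    "W1_W2": [0.5, 0.0, 0.3],
    "yso_class": ["Class II", "Main Sequence", "Transition Disk"],
    "ebv_3d": [0.45, 0.02, 0.20],
})

# Example query: keep sources with any disk-bearing YSO class
disks = char[char["yso_class"].isin(["Class I", "Class II", "Transition Disk"])]
print(sorted(disks["source_id"]))
```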
Run post-review vetting against external catalogs:
# Vet all candidates in a characterized parquet
malca vetting output/characterized.parquet -o output/vetted.parquet
# Skip slow modules
malca vetting output/characterized.parquet --no-simbad --no-alerce
# Only vet high-scoring candidates
malca vetting output/characterized.parquet --min-score 3.0
# With crash-resume checkpoint
malca vetting output/characterized.parquet --checkpoint output/vetting_checkpoint.parquet

Modules (all on by default, disable with --no-*):
- SIMBAD: Object type, bibliography, cross-IDs
- Gaia DR3 variability: Variable flag, classification, score
- Gaia DR3 eclipsing binaries: Period, morphology, global ranking
- Gaia epoch photometry: Availability, observation count, G-band range
- ASAS-SN variables: Variable star catalog crossmatch
- ZTF variables: Chen+ 2020 periodic variables (type, period, amplitude)
- TNS: Transient Name Server (name, type, redshift, discovery date)
- ALeRCE: ZTF broker classifications and stamp probabilities
- eROSITA: X-ray detection, flux, separation
- PM consistency: Proper motion agreement with host cluster
- ATLAS (opt-in, `--atlas-token`): Forced photometry light curves
- NEOWISE (opt-in, `--neowise-lc`): Full NEOWISE light curves
Pipeline default: vetting runs by default in malca pipeline; use --no-run-vetting to opt out.
Vetting is also available during import in the review GUI ("Vet on import" toggle). Results are cached per input file so re-imports skip already-vetted candidates.
malca classify --input output/characterized.parquet --output output/classified.parquet

malca stats /path/to/lc123.dat2

malca attrition --pre output/pre.parquet --post output/post.parquet

# Launch Dash review GUI against an existing run bundle
malca review --plot-dir output/runs/YOUR_RUN/plots
# Standalone mode (no plot directory required)
malca review

Dash GUI features:
- Native Plotly light-curve viewer with PNG fallback, camera filtering, and plot presets/overlays (raw points, dip/jump markers, residuals, phase-fold, diagnostics)
- Confidence scoring (1-4) via number keys or clickable buttons
- Event class labeling (single-select) with direct key shortcuts and clickable badges: `dipper`, `microlensing`, `flare`, `yso`, `unknown_interesting`, `instrumental`, `other` (toggle off to `unclassified`)
- Collapsible candidate panels with metadata health, vetting banner, external follow-up cards, diagnostic plots, and run-config provenance
- Sidebar queue controls: unreviewed/failed filters, grouped numeric/text/select filters, multi-column sort, open-existing jump, and native camera selection
- Import/fetch workflows: import tables or raw LC files (optional characterize + vet on import), or fetch by ASAS-SN ID, Gaia DR3 ID, or coordinates
- Per-candidate pipeline stage chips with "Run All Missing" / "Re-run Current", plus notes/followup/review-pass tracking and CSV/Parquet export
Train a baseline classifier on reviewed labels:
malca ml_train --input output/review/reviewed.parquet --out-dir output/ml --cv-folds 5

- Uses curated physics/context features from `malca/ml/features.py`
- Trains a LightGBM classifier on labeled `event_class` values (dropping `unclassified` by default)
- Saves model artifacts to `output/ml/` (`candidate_classifier.joblib`, `feature_schema.json`, `metrics.json`)
Score candidates with a trained classifier:
malca ml_predict --model-dir output/ml --input output/review/reviewed.parquet --output output/review/scored.parquet

- Loads `candidate_classifier.joblib` + `feature_schema.json`
- Applies the same feature transforms used during training
- Appends `ml_predicted_class` and `ml_prob_<class>` columns to the output table
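A short sketch of triaging the scored table, using synthetic rows. The `ml_predicted_class` and `ml_prob_<class>` column convention comes from the bullets above; the specific probabilities and the ranking query are illustrative only.

```python
import pandas as pd

# Synthetic ml_predict output following the documented column convention
scored = pd.DataFrame({
    "asas_sn_id": [1, 2, 3],
    "ml_predicted_class": ["dipper", "instrumental", "dipper"],
    "ml_prob_dipper": [0.91, 0.05, 0.62],
})

# Rank predicted dippers by classifier confidence for review priority
top = (scored[scored["ml_predicted_class"] == "dipper"]
       .sort_values("ml_prob_dipper", ascending=False))
print(top["asas_sn_id"].tolist())
```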
Build the cleaned ASAS-SN index and filtered VSX catalog:
malca vsx-filter --help
malca vsx-filter --vsx-file input/vsx/vsxcat.090525.csv --masked-dir /path/to/lcsv2_masked --output-dir input/vsx
malca vsx-filter --stamp 20260213_120000  # timestamped output filenames

- Reads the raw fixed-width VSX catalog and filters out unwanted variability classes (eclipsing binaries, supernovae, AGN, etc.)
- Concatenates masked ASAS-SN index CSVs from all magnitude bins
- Outputs `asassn_catalog.csv` and `vsx_cleaned.csv` (or timestamped variants with `--stamp`)
Crossmatch ASAS-SN sources with VSX by position (with proper-motion correction):
malca vsx-crossmatch --help
malca vsx-crossmatch --asassn-csv input/vsx/asassn_catalog.csv --vsx-csv input/vsx/vsx_cleaned.csv
malca vsx-crossmatch --radius 5.0 --stamp 20260213_120000

- Propagates ASAS-SN coordinates from epoch 2016.0 to 2000.0 using proper motions
- Default match radius is 3 arcseconds
- Outputs `asassn_x_vsx_matches_{stamp}.csv` to `input/vsx/`
When running malca pipeline, the following directory structure is created for complete provenance tracking:
output/runs/20250121_143052/ # Timestamp-based run directory
├── run_params.json # Detection parameters (detect.py)
├── run_summary.json # Detection results stats (detect.py)
├── filter_log.json # Filtering parameters & stats (filter.py)
├── plot_log.json # Plotting parameters (plot.py)
├── run.log # Simple text log with paths
│
├── manifests/ # Manifest files
│ └── lc_manifest_{mag_bin}.parquet
│
├── tags/ # Tagging results
│ ├── lc_filtered_{mag_bin}.parquet
│ ├── lc_stats_checkpoint_{mag_bin}.parquet
│ ├── rejected_tag_{mag_bin}.csv
│ └── vsx_tags/
│ └── vsx_tags_{mag_bin}.csv
│
├── paths/ # Input paths
│ └── filtered_paths_{mag_bin}.txt
│
├── results/ # Detection results
│ ├── lc_events_results.parquet # Raw detection output (includes dipper_score)
│ ├── lc_events_results_PROCESSED.txt # Checkpoint log
│ ├── lc_events_results_filtered.parquet # After filter.py
│ └── rejected_filter.csv # Filter rejections
│
└── plots/ # Visualizations (plot.py)
├── {source_id}_dips.png
└── ...
Key Features:
- JSON logs track full provenance: Every parameter and result is logged for reproducibility
- Self-contained runs: Each timestamped directory contains everything needed to reproduce the analysis
- Checkpoint support: Detection runs can be interrupted and resumed using `*_PROCESSED.txt` files
- Rejection tracking: Both tagging and filter rejections are logged with reasons
JSON Log Contents:
- `run_params.json`: All tagging and detection parameters (thresholds, workers, baseline settings)
- `run_summary.json`: Manifest statistics, tag rejection breakdown, detection results
- `filter_log.json`: Filter toggles, thresholds, input/output counts, rejection breakdown
- `plot_log.json`: Plotting parameters, GP settings, number of plots generated
Note: Event scores (dipper_score, dipper_n_dips, dipper_n_valid_dips) are automatically computed during detection for significant events and included in the results table.
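Because provenance lives in plain JSON, inspecting a run programmatically is a one-liner. A minimal sketch (the keys written here are placeholder examples, not the exact `run_params.json` schema; a temporary directory stands in for a real `output/runs/<timestamp>/`):

```python
import json
import os
import tempfile

# Stand-in run directory with a placeholder params file
run_dir = tempfile.mkdtemp()
params_path = os.path.join(run_dir, "run_params.json")
with open(params_path, "w") as f:
    json.dump({"mag_bin": "13_13.5", "workers": 10,
               "trigger_mode": "posterior_prob"}, f)

# Reading the provenance back for reproducibility checks
with open(params_path) as f:
    params = json.load(f)
print(params["mag_bin"], params["trigger_mode"])
```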
output/injection/ # Default output directory
├── results/
│ ├── injection_results.parquet # Trial-by-trial injection results
│ └── injection_results_PROCESSED.txt # Checkpoint for resume
│
├── cubes/
│ └── efficiency_cube.npz # 3D efficiency cube (depth × duration × mag)
│
└── plots/
├── mag_slices/ # Per-magnitude 2D heatmaps
│ ├── mag_12.0_efficiency.png
│ ├── mag_13.0_efficiency.png
│ └── ...
├── efficiency_marginalized_*.png # Averaged over one axis
├── depth_at_*pct_efficiency.png # Threshold contour maps
└── efficiency_3d_volume.html # Interactive 3D (if plotly installed)
output/detection_rate/ # Default base directory
├── 20250121_143052/ # Timestamped run directory
│ ├── run_params.json # Full parameter dump
│ ├── results/
│ │ ├── detection_rate_results.parquet
│ │ ├── detection_rate_results_PROCESSED.txt # Checkpoint
│ │ └── detection_summary.json # Detection rate summary
│ └── plots/
│ ├── detection_rate_vs_mag.png
│ ├── detection_duration_dist.png
│ └── detection_depth_dist.png
│
├── 20250121_150318_custom_tag/ # Optional --run-tag appended
│ └── ...
│
└── latest -> 20250121_150318_custom_tag/ # Symlink to latest run
output/
├── characterized.parquet # Single output file with added columns:
# - Gaia astrometry & photometry
# - 3D dust extinction (A_v_3d, ebv_3d)
# - YSO classification (yso_class)
# - Galactic population (thin_disk/thick_disk)
# - StarHorse ages/masses (if provided)
# - Auxiliary crossmatches (BANYAN Σ, IPHAS, etc.)
└── gaia_cache/ # Gaia query cache (created when cache is used)
└── gaia_results_{hash}.parquet
output/
└── classified.parquet # Single output file with added columns:
# - P_eb, P_cv, P_starspot, P_disk
# - yso_class
# - a_circ_au, transit_prob
# - final_class (EB/CV/Starspot/Disk/YSO/Unknown)
output/
└── lc_manifest_{mag_bin}.parquet # Single parquet file with:
# - asas_sn_id
# - ra_deg, dec_deg
# - lc_dir (directory path)
# - dat_path (full .dat2 path)
# - dat_exists (bool)
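A quick sanity check one might run on a manifest with these columns, sketched on synthetic rows (the rows are invented; the column names are the ones listed above):

```python
import pandas as pd

# Synthetic manifest mimicking the documented schema
manifest = pd.DataFrame({
    "asas_sn_id": [1, 2, 3],
    "lc_dir": ["lc1_cal", "lc1_cal", "lc2_cal"],
    "dat_path": ["lc1_cal/1.dat2", "lc1_cal/2.dat2", "lc2_cal/3.dat2"],
    "dat_exists": [True, True, False],
})

# Flag sources whose .dat2 file was not found on disk
missing = manifest.loc[~manifest["dat_exists"], "asas_sn_id"].tolist()
print(missing)
```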
If you use MALCA or any part of its codebase in published research, please cite this repository:
Lenhart, C. (2025). MALCA: Multi-timescale ASAS-SN Light Curve Analysis [Software].
https://github.com/calderlen/malca
This project is licensed under the GNU General Public License v3.0. See LICENSE for details.