The Software Gardening Almanack is an open-source handbook of applied guidance and tools for sustainable software development and maintenance.
The project has two primary components:
- The Almanack handbook: the content found here helps educate, demonstrate, and evolve the concepts of sustainable software development.
- The almanack package: a Python package which implements the concepts of the book to help improve software sustainability by generating organized metrics and running linting checks on repositories. The Python package may also be used as a pre-commit hook to check repositories for best practices.
Please see our pavilion section of the book for presentations and other related materials for the Almanack.
The handbook is available in two formats:
- Online (HTML): https://software-gardening.github.io/almanack/
- Offline (PDF): software-gardening-almanack.pdf
You can install the Almanack with the following:
# install from pypi
pip install almanack
# install directly from source
pip install git+https://github.com/software-gardening/almanack.git

Once installed, the Almanack can be used to analyze repositories for sustainable development practices.
The Almanack outputs metrics, which are defined through metrics.yml, as a Python dictionary (JSON-compatible) record structure.
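Because the output is JSON-compatible, it can be serialized with the standard library. The sketch below uses stand-in records; the keys `name` and `result` and their values are hypothetical placeholders for illustration and are not taken from metrics.yml.

```python
import json

# stand-in records shaped like a list of dictionaries; the keys and values
# here are hypothetical and do not come from metrics.yml
records = [
    {"name": "example-boolean-metric", "result": True},
    {"name": "example-count-metric", "result": 42},
]

# a JSON-compatible structure survives a round trip through json unchanged
serialized = json.dumps(records, indent=2)
restored = json.loads(serialized)

print(restored == records)
```

The same approach should apply to real Almanack output as long as every value is a JSON-compatible type.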
You can use the Almanack package as a command-line interface (CLI):
# generate a table of metrics based on a repository
almanack table path/to/repository
# perform linting-style checks on a repository
almanack check path/to/repository

We provide pre-commit hooks to enable you to run the Almanack as part of your automated checks for developing software projects.
Add the following to your .pre-commit-config.yaml in order to use the Almanack:
# include this in your .pre-commit-config.yaml
- repo: https://github.com/software-gardening/almanack
  rev: v0.1.1
  hooks:
    - id: almanack-check

You can also use the Almanack through a Python API.
For example:
import almanack
import pandas as pd
# gather the almanack table using the almanack repo as a reference
almanack_table = almanack.table("path/to/repository")
# show the almanack table as a Pandas DataFrame
pd.DataFrame(almanack_table)

Please see this example notebook which demonstrates using the Almanack package.
The almanack batch command runs the almanack check across many repositories in parallel and writes one parquet file (or one per batch) while optionally streaming progress to stdout.
# Run from a list (comma-separated) and write a single parquet
almanack batch results.parquet --repo_urls https://github.com/org/repo1,https://github.com/org/repo2 --max_workers 8
# Use threads (good for I/O-bound workloads) and split outputs per batch
almanack batch out_dir --repo_urls https://github.com/org/repo1,https://github.com/org/repo2 --executor thread --split_batches --batch_size 100
# Read repo URLs from a column in a provided parquet file
almanack batch results.parquet --parquet_path links.parquet --column github_link

Key options:
- --executor: process (default) or thread
- --batch_size: how many repos per batch (a small multiple of max_workers works well)
- --split_batches: an option to write one parquet file per batch into output_path (treated as a directory)
- --collect_dataframe: set to False to avoid returning a dataframe (only write to file)
- --show_repo_progress: shows progress per repository
- --show_batch_progress: shows progress per batch (sets of repos)
- --show_errors: emit any errors from the almanack processing
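To make the batch-size guidance concrete, the sketch below shows how a list of repository URLs divides into batches when the batch size is a small multiple of the worker count. The helper `chunk_repos` is hypothetical and only illustrates the arithmetic; it is not the Almanack's internal implementation.

```python
from typing import List


def chunk_repos(repo_urls: List[str], batch_size: int) -> List[List[str]]:
    """Split repo_urls into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        repo_urls[start : start + batch_size]
        for start in range(0, len(repo_urls), batch_size)
    ]


# hypothetical example: 10 repositories, 4 workers, batch size of 2 x max_workers
max_workers = 4
repos = [f"https://github.com/org/repo{i}" for i in range(1, 11)]
batches = chunk_repos(repos, batch_size=2 * max_workers)

# 10 repositories with a batch size of 8 gives one batch of 8 and one of 2
print([len(batch) for batch in batches])
```

With --split_batches, each such batch would correspond to one parquet file in the output directory.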
Python API example:
from concurrent.futures import ThreadPoolExecutor

from almanack import process_repositories_batch

repos = ["https://github.com/org/repo1", "https://github.com/org/repo2"]

# Single parquet
df = process_repositories_batch(
    repos,
    output_path="almanack_results.parquet",
    max_workers=8,
    executor_cls=ThreadPoolExecutor,  # threads are notebook-friendly / I/O-friendly
)

# Per-batch files, no in-memory DataFrame
process_repositories_batch(
    repos,
    output_path="batch_outputs",
    split_batches=True,
    collect_dataframe=False,
    batch_size=100,
    max_workers=16,
)

The Almanack uses GitHub’s API to gather certain metrics.
Anonymous API requests have extremely low rate limits; once a limit is hit, requests are throttled and batch jobs slow down.
Export a personal access token as GITHUB_TOKEN before running any CLI or Python workflows to raise the per-hour quota:
export GITHUB_TOKEN=ghp_yourtokenhere

Commands launched from the same shell automatically reuse the token, so your GitHub requests complete faster and more reliably.
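A script that launches long batch jobs can check for the token up front instead of discovering the low anonymous limit midway through a run. This is a minimal sketch using only the standard library; the helper name `github_token_is_set` is hypothetical and not part of the Almanack API.

```python
import os
from typing import Mapping, Optional


def github_token_is_set(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when GITHUB_TOKEN is present and non-empty in env."""
    # default to the current process environment when no mapping is given
    env = os.environ if env is None else env
    return bool(env.get("GITHUB_TOKEN", "").strip())


if not github_token_is_set():
    print("GITHUB_TOKEN is not set; GitHub API requests will be anonymous.")
```

The optional `env` argument exists only so the check can be exercised without changing the real environment.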
Please see our CONTRIBUTING.md document for more information on how to contribute to this project.
This work was supported by the Better Scientific Software Fellowship Program, a collaborative effort of the U.S. Department of Energy (DOE), Office of Advanced Scientific Research via ANL under Contract DE-AC02-06CH11357 and the National Nuclear Security Administration Advanced Simulation and Computing Program via LLNL under Contract DE-AC52-07NA27344; and by the National Science Foundation (NSF) via SHI under Grant No. 2327079.
