diff --git a/README.md b/README.md
index bc72d143..55f69ae6 100644
--- a/README.md
+++ b/README.md
@@ -6,105 +6,123 @@
[PyPI](https://pypi.python.org/pypi/foundry_ml)
[Tests](https://github.com/MLMI2-CSSI/foundry/actions/workflows/tests.yml)
-[](https://github.com/MLMI2-CSSI/foundry/actions/workflows/python-publish.yml)
[NSF Award 1931306](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1931306&HistoricalAwards=false)
[Documentation](https://ai-materials-and-chemistry.gitbook.io/foundry/)
+**Foundry-ML** simplifies access to machine learning-ready datasets in materials science and chemistry.
-Foundry-ML simplifies the discovery and usage of ML-ready datasets in materials science and chemistry providing a simple API to access even complex datasets.
-* Load ML-ready data with just a few lines of code
-* Work with datasets in local or cloud environments.
-* Publish your own datasets with Foundry to promote community usage
-* (in progress) Run published ML models without hassle
+- **Search & Load** - Find and use curated datasets with a few lines of code
+- **Understand** - Rich schemas describe what each field means
+- **Cite** - Automatic citation generation for publications
+- **Publish** - Share your datasets with the community
+- **AI-Ready** - MCP server for Claude and other AI assistants
-Learn more and see our available datasets on [Foundry-ML.org](https://foundry-ml.org/)
+## Quick Start
+```bash
+pip install foundry-ml
+```
+
+```python
+from foundry import Foundry
+
+# Connect
+f = Foundry()
+# Search
+results = f.search("band gap", limit=5)
-# Documentation
-Information on how to install and use Foundry is available in our documentation [here](https://ai-materials-and-chemistry.gitbook.io/foundry/v/docs/).
+# Load
+dataset = results.iloc[0].FoundryDataset
+X, y = dataset.get_as_dict()['train']
-DLHub documentation for model publication and running information can be found [here](https://dlhub-sdk.readthedocs.io/en/latest/servable-publication.html).
+# Understand
+schema = dataset.get_schema()
+print(schema['fields'])
-# Quick Start
-Install Foundry-ML via command line with:
-`pip install foundry_ml`
+# Cite
+print(dataset.get_citation())
+```
+
+## Cloud Environments
-You can use the following code to import and instantiate Foundry-ML, then load a dataset.
+For Google Colab or remote Jupyter:
```python
-from foundry import Foundry
-f = Foundry(index="mdf")
+f = Foundry(no_browser=True, no_local_server=True)
+```
+## CLI
-f = f.load("10.18126/e73h-3w6n", globus=True)
+```bash
+foundry search "band gap"
+foundry schema 10.18126/abc123
+foundry --help
```
-*NOTE*: If you run locally and don't want to install the [Globus Connect Personal endpoint](https://www.globus.org/globus-connect-personal), just set the `globus=False`.
-If running this code in a notebook, a table of metadata for the dataset will appear:
+## AI Agent Integration
-
+```bash
+foundry mcp install # Add to Claude Code
+```
-We can use the data with `f.load_data()` and specifying splits such as `train` for different segments of the dataset, then use matplotlib to visualize it.
+## Documentation
-```python
-res = f.load_data()
+- [Getting Started](https://ai-materials-and-chemistry.gitbook.io/foundry/quickstart)
+- [User Guide](https://ai-materials-and-chemistry.gitbook.io/foundry/)
+- [API Reference](https://ai-materials-and-chemistry.gitbook.io/foundry/api/foundry)
+- [Examples](./examples)
+
+## Features
+
+| Feature | Description |
+|---------|-------------|
+| Search | Find datasets by keyword, DOI, or browse catalog |
+| Load | Automatic download, caching, and format conversion |
+| PyTorch/TensorFlow | `dataset.get_as_torch()`, `dataset.get_as_tensorflow()` |
+| CLI | Terminal-based workflows |
+| MCP Server | AI assistant integration |
+| HuggingFace Export | Publish to HuggingFace Hub |
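+
+A short sketch of the framework accessors listed above (split names follow each dataset's schema):
+
+```python
+from foundry import Foundry
+
+f = Foundry()
+dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset
+
+torch_ds = dataset.get_as_torch(split='train')      # PyTorch-compatible dataset
+tf_ds = dataset.get_as_tensorflow(split='train')    # tf.data.Dataset
+```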
-imgs = res['train']['input']['imgs']
-desc = res['train']['input']['metadata']
-coords = res['train']['target']['coords']
+## Available Datasets
-n_images = 3
-offset = 150
-key_list = list(res['train']['input']['imgs'].keys())[0+offset:n_images+offset]
+Browse datasets at [Foundry-ML.org](https://foundry-ml.org/) or:
-fig, axs = plt.subplots(1, n_images, figsize=(20,20))
-for i in range(n_images):
- axs[i].imshow(imgs[key_list[i]])
- axs[i].scatter(coords[key_list[i]][:,0], coords[key_list[i]][:,1], s = 20, c = 'r', alpha=0.5)
+```python
+f = Foundry()
+f.list(limit=20) # See available datasets
```
-
-[See full examples](./examples)
+## How to Cite
-# How to Cite
-If you find Foundry-ML useful, please cite the following [paper](https://doi.org/10.21105/joss.05467)
+If you use Foundry-ML, please cite:
-```
+```bibtex
@article{Schmidt2024,
doi = {10.21105/joss.05467},
- url = {https://doi.org/10.21105/joss.05467},
- year = {2024}, publisher = {The Open Journal},
+ year = {2024},
+ publisher = {The Open Journal},
volume = {9},
number = {93},
pages = {5467},
- author = {Kj Schmidt and Aristana Scourtas and Logan Ward and Steve Wangen and Marcus Schwarting and Isaac Darling and Ethan Truelove and Aadit Ambadkar and Ribhav Bose and Zoa Katok and Jingrui Wei and Xiangguo Li and Ryan Jacobs and Lane Schultz and Doyeon Kim and Michael Ferris and Paul M. Voyles and Dane Morgan and Ian Foster and Ben Blaiszik},
- title = {Foundry-ML - Software and Services to Simplify Access to Machine Learning Datasets in Materials Science}, journal = {Journal of Open Source Software}
+ author = {Kj Schmidt and Aristana Scourtas and Logan Ward and others},
+ title = {Foundry-ML - Software and Services to Simplify Access to Machine Learning Datasets in Materials Science},
+ journal = {Journal of Open Source Software}
}
```
-# Contributing
-Foundry is an Open Source project and we encourage contributions from the community. To contribute, please fork from the `main` branch and open a Pull Request on the `main` branch. A member of our team will review your PR shortly.
+## Contributing
-## Developer notes
-In order to enforce consistency with external schemas for the metadata and datacite structures ([contained in the MDF data schema repository](https://github.com/materials-data-facility/data-schemas)) the `dc_model.py` and `project_model.py` pydantic data models (found in the `foundry/jsonschema_models` folder) were generated using the [datamodel-code-generator](https://github.com/koxudaxi/datamodel-code-generator/) tool. In order to ensure compliance with the flake8 linting, the `--use-annoted` flag was passed to ensure regex patterns in `dc_model.py` were specified using pydantic's `Annotated` type vs the soon to be deprecated `constr` type. The command used to run the datamodel-code-generator looks like:
-```
-datamodel-codegen --input dc.json --output dc_model.py --use-annotated
-```
+Foundry is open source. To contribute:
-# Primary Support
-This work was supported by the National Science Foundation under NSF Award Number: 1931306 "Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure".
+1. Fork from `main`
+2. Make your changes
+3. Open a Pull Request
-# Other Support
-Foundry-ML brings together many components in the materials data ecosystem. Including [MAST-ML](https://mastmldocs.readthedocs.io/en/latest/), the [Data and Learning Hub for Science](https://www.dlhub.org) (DLHub), and the [Materials Data Facility](https://materialsdatafacility.org) (MDF).
+See [CONTRIBUTING.md](docs/how-to-contribute/contributing.md) for details.
-## MAST-ML
-This work was supported by the National Science Foundation (NSF) SI2 award No. 1148011 and DMREF award number DMR-1332851
+## Support
-## The Data and Learning Hub for Science (DLHub)
-This material is based upon work supported by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357.
-https://www.dlhub.org
+This work was supported by the National Science Foundation under NSF Award Number: 1931306 "Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure".
-## The Materials Data Facility
-This work was performed under financial assistance award 70NANB14H012 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the [Center for Hierarchical Material Design (CHiMaD)](http://chimad.northwestern.edu). This work was performed under the following financial assistance award 70NANB19H005 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design (CHiMaD). This work was also supported by the National Science Foundation as part of the [Midwest Big Data Hub](http://midwestbigdatahub.org) under NSF Award Number: 1636950 "BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, and Disseminate".
-https://www.materialsdatafacility.org
+Foundry integrates with [Materials Data Facility](https://materialsdatafacility.org), [DLHub](https://www.dlhub.org), and [MAST-ML](https://mastmldocs.readthedocs.io/).
diff --git a/docs/README.md b/docs/README.md
index abcc4c54..06a5e00c 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,43 +1,87 @@
-# Getting started with Foundry
+# Introduction
-
+
-## What is Foundry?
+**Foundry-ML** is a Python library that simplifies access to machine learning-ready datasets in materials science and chemistry.
-Foundry is a Python package that simplifies the discovery and usage of machine-learning ready datasets in materials science and chemistry. Foundry provides software tools that make it easy to load these datasets and work with them in local or cloud environments. Further, Foundry provides a dataset specification, and defined curation flows, that allow users to create new datasets for the community to use through this same interface.
+## Features
-## Installation
+- **Search & Discover** - Find datasets by keyword or browse the catalog
+- **Rich Metadata** - Understand datasets before downloading with detailed schemas
+- **Easy Loading** - Get data in Python, PyTorch, or TensorFlow format
+- **Automatic Caching** - Fast subsequent access after first download
+- **Publishing** - Share your own datasets with the community
+- **AI Integration** - MCP server for AI assistant access
+- **CLI** - Terminal-based workflows
-Foundry can be installed on any operating system with Python with pip
+## Quick Example
-```text
-pip install foundry-ml
+```python
+from foundry import Foundry
+
+# Connect
+f = Foundry()
+
+# Search for datasets
+results = f.search("band gap", limit=5)
+
+# Load a dataset
+dataset = results.iloc[0].FoundryDataset
+X, y = dataset.get_as_dict()['train']
+
+# Get citation for your paper
+print(dataset.get_citation())
```
-### Globus
+## Installation
-Foundry uses the Globus platform for authentication, search, and to optimize some data transfer operations. Follow the steps below to get set up.
+```bash
+pip install foundry-ml
+```
-* [Create a free account.](https://app.globus.org) You can create a free account here with your institutional credentials or with free IDs \(GlobusID, Google, ORCID, etc\).
-* [Set up a Globus Connect Personal endpoint ](https://www.globus.org/globus-connect-personal)_**\(optional\)**_. While this step is optional, some Foundry capabilities will work more efficiently when using GCP.
+For cloud environments (Colab, remote Jupyter):
-## Project Support
+```python
+f = Foundry(no_browser=True, no_local_server=True)
+```
-This work was supported by the National Science Foundation under NSF Award Number: 1931306 "Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure".
+## What's Next?
-### Other Support
+
+
-Foundry brings together many components in the materials data ecosystem. Including MAST-ML, the Data and Learning Hub for Science \(DLHub\), and The Materials Data Facility \(MDF\).
+**Getting Started**
+- [Installation](installation.md)
+- [Quick Start](quickstart.md)
-#### MAST-ML
+
-This work was supported by the National Science Foundation \(NSF\) SI2 award No. 1148011 and DMREF award number DMR-1332851
+**User Guide**
+- [Searching](guide/searching.md)
+- [Loading Data](guide/loading-data.md)
+- [ML Frameworks](guide/ml-frameworks.md)
-#### The Data and Learning Hub for Science \(DLHub\)
+
-This material is based upon work supported by Laboratory Directed Research and Development \(LDRD\) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. [https://www.dlhub.org](https://www.dlhub.org)
+**Features**
+- [CLI](features/cli.md)
+- [MCP Server](features/mcp-server.md)
+- [HuggingFace](features/huggingface.md)
-#### The Materials Data Facility
+
+
-This work was performed under financial assistance award 70NANB14H012 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the [Center for Hierarchical Material Design \(CHiMaD\)](http://chimad.northwestern.edu). This work was performed under the following financial assistance award 70NANB19H005 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design \(CHiMaD\). This work was also supported by the National Science Foundation as part of the [Midwest Big Data Hub](http://midwestbigdatahub.org) under NSF Award Number: 1636950 "BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design \(IMaD\): Leverage, Innovate, and Disseminate". [https://www.materialsdatafacility.org](https://www.materialsdatafacility.org)
+## Project Support
+
+This work was supported by the National Science Foundation under NSF Award Number: 1931306 "Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure".
+Foundry brings together components from:
+- [Materials Data Facility (MDF)](https://materialsdatafacility.org)
+- [Data and Learning Hub for Science (DLHub)](https://www.dlhub.org)
+- [MAST-ML](https://mastmldocs.readthedocs.io/)
diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
index aac625cd..d7bb6a42 100644
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -1,14 +1,48 @@
-# Table of contents
+# Table of Contents
-* [Getting started with Foundry](README.md)
+## Getting Started
-## How to contribute
+* [Introduction](README.md)
+* [Installation](installation.md)
+* [Quick Start](quickstart.md)
-* [Contribution Process](how-to-contribute/contributing.md)
-* [Contributor Covenant](how-to-contribute/code_of_conduct.md)
+## User Guide
----
+* [Searching for Datasets](guide/searching.md)
+* [Loading Data](guide/loading-data.md)
+* [Using with ML Frameworks](guide/ml-frameworks.md)
+* [Dataset Schemas](guide/schemas.md)
-* [Sphinx Autogenerated documentation - markdown](sphinx-autogenerated-documentation.md)
-* [foundry package — Foundry\_test 1.1 documentation - HTML AUTOGENERATION](foundry-package-foundry_test-1.1-documentation-html-autogeneration.md)
+## Features
+* [Command Line Interface](features/cli.md)
+* [MCP Server (AI Agents)](features/mcp-server.md)
+* [HuggingFace Integration](features/huggingface.md)
+* [Error Handling](features/errors.md)
+
+## Concepts
+
+* [Overview](concepts/overview.md)
+* [Foundry Datasets](concepts/foundry-datasets.md)
+* [Data Packages](concepts/foundry-data-packages.md)
+
+## Publishing
+
+* [Publishing Datasets](publishing/publishing-datasets.md)
+* [Metadata Reference](publishing/metadata-reference.md)
+
+## Reference
+
+* [API Reference](api/foundry.md)
+* [CLI Reference](api/cli-reference.md)
+* [Configuration](api/configuration.md)
+
+## Community
+
+* [Contributing](how-to-contribute/contributing.md)
+* [Code of Conduct](how-to-contribute/code_of_conduct.md)
+
+## Support
+
+* [Troubleshooting](support/troubleshooting.md)
+* [FAQ](support/faq.md)
diff --git a/docs/concepts/overview.md b/docs/concepts/overview.md
index 19be629b..23a80ffc 100644
--- a/docs/concepts/overview.md
+++ b/docs/concepts/overview.md
@@ -1,11 +1,111 @@
# Overview
-TODO:
+Foundry-ML is a Python library that simplifies access to machine learning-ready datasets in materials science and chemistry. It provides a unified interface to discover, load, and use curated scientific datasets.
-* Change the code snippet in the image
-* Write the text :\)
+## What is Foundry?
-
+Foundry serves as a bridge between data producers (researchers who create datasets) and data consumers (researchers who use datasets for ML). It standardizes how datasets are:
+- **Discovered** - Search by keyword, browse catalogs, or get by DOI
+- **Described** - Rich metadata including field descriptions, units, and citations
+- **Delivered** - Automatic download, caching, and format conversion
+## Key Features
+### For Data Users
+
+```python
+from foundry import Foundry
+
+# Connect and search
+f = Foundry()
+results = f.search("band gap", limit=5)
+
+# Load a dataset
+dataset = results.iloc[0].FoundryDataset
+X, y = dataset.get_as_dict()['train']
+
+# Understand the data
+schema = dataset.get_schema()
+print(schema['fields']) # What columns exist and what they mean
+```
+
+### For AI Agents
+
+Foundry includes an MCP (Model Context Protocol) server that enables AI assistants like Claude to discover and use datasets programmatically:
+
+```bash
+foundry mcp install # Add to Claude Code
+```
+
+### For Data Publishers
+
+Share your datasets with the community using standardized metadata:
+
+```python
+f.publish(metadata, data_path="./my_data", source_id="my_dataset_v1")
+```
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ Your Code │
+├─────────────────────────────────────────────────────────────┤
+│ Foundry Python API │
+│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
+│ │ Search │ │ Load │ │ Schema │ │ Publish │ │
+│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
+├─────────────────────────────────────────────────────────────┤
+│ Data Transport │
+│ HTTPS (default) │ Globus (large files) │
+├─────────────────────────────────────────────────────────────┤
+│ Materials Data Facility │
+│ (Storage, Metadata, DOI Registration) │
+└─────────────────────────────────────────────────────────────┘
+```
+
+## Core Concepts
+
+### Datasets
+
+A Foundry dataset contains:
+- **Data files** - The actual data (JSON, CSV, HDF5, etc.)
+- **Schema** - Description of fields, types, and splits
+- **Metadata** - DataCite-compliant citation information
+
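+Citation metadata travels with the dataset object; a small sketch using accessors shown elsewhere in these docs (`get_citation()` in the Quick Start, `dc.creators` in the HuggingFace guide):
+
+```python
+from foundry import Foundry
+
+f = Foundry()
+dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset
+
+# Citation text for publications
+print(dataset.get_citation())
+
+# DataCite creators recorded in the dataset metadata
+print(dataset.dc.creators)
+```
+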
+### Splits
+
+Datasets are organized into splits (e.g., `train`, `test`, `validation`) with input/target pairs:
+
+```python
+data = dataset.get_as_dict()
+X_train, y_train = data['train']
+X_test, y_test = data['test']
+```
+
+### Keys (Fields)
+
+Each field in a dataset has:
+- **Name** - The column/field identifier
+- **Role** - `input` or `target`
+- **Description** - What the field represents
+- **Units** - Physical units (if applicable)
+
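+A quick way to inspect these attributes is through the schema (see the Dataset Schemas guide); a short sketch, continuing from the example above:
+
+```python
+schema = dataset.get_schema()
+
+for field in schema['fields']:
+    # Each field reports its name, role (input/target), description, and units
+    print(field['name'], field['role'], field['description'], field['units'])
+```
+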
+## Ecosystem Integration
+
+Foundry integrates with the broader ML ecosystem:
+
+| Integration | Purpose |
+|-------------|---------|
+| **PyTorch** | `dataset.get_as_torch()` |
+| **TensorFlow** | `dataset.get_as_tensorflow()` |
+| **HuggingFace Hub** | Export datasets for broader visibility |
+| **MCP Server** | AI agent access |
+| **CLI** | Terminal-based workflows |
+
+## Next Steps
+
+- [Installation](../installation.md) - Get Foundry installed
+- [Quick Start](../quickstart.md) - Load your first dataset in 5 minutes
+- [Searching for Datasets](../guide/searching.md) - Find the right data
diff --git a/docs/features/cli.md b/docs/features/cli.md
new file mode 100644
index 00000000..4ab11896
--- /dev/null
+++ b/docs/features/cli.md
@@ -0,0 +1,175 @@
+# Command Line Interface
+
+Foundry includes a CLI for terminal-based workflows.
+
+## Basic Usage
+
+```bash
+foundry --help
+```
+
+## Commands
+
+### Search Datasets
+
+```bash
+# Search by keyword
+foundry search "band gap"
+
+# Limit results
+foundry search "band gap" --limit 10
+
+# JSON output (for scripting)
+foundry search "band gap" --json
+```
+
+### Get Dataset Info
+
+```bash
+# Get info by DOI
+foundry get 10.18126/abc123
+
+# JSON output
+foundry get 10.18126/abc123 --json
+```
+
+### View Schema
+
+See what fields a dataset contains:
+
+```bash
+foundry schema 10.18126/abc123
+```
+
+Output:
+```
+Dataset: foundry_oqmd_band_gaps_v1.1
+Data Type: tabular
+
+Fields:
+ - composition (input): Chemical composition
+ - band_gap (target): Band gap value (eV)
+
+Splits:
+ - train
+ - test
+```
+
+### List All Datasets
+
+```bash
+# List available datasets
+foundry catalog
+
+# Limit results
+foundry catalog --limit 20
+
+# JSON output
+foundry catalog --json
+```
+
+### Check Publication Status
+
+```bash
+foundry status my_dataset_v1
+```
+
+### Version
+
+```bash
+foundry version
+```
+
+## HuggingFace Export
+
+Export a Foundry dataset to HuggingFace Hub:
+
+```bash
+foundry push-to-hf 10.18126/abc123 --repo your-username/dataset-name
+```
+
+Options:
+- `--repo` - HuggingFace repository ID (required)
+- `--token` - HuggingFace API token (or set HF_TOKEN env var)
+- `--private` - Create a private repository
+
+## MCP Server
+
+Start the MCP server for AI agent integration:
+
+```bash
+# Start server
+foundry mcp start
+
+# Install to Claude Code
+foundry mcp install
+```
+
+See [MCP Server](mcp-server.md) for details.
+
+## JSON Output
+
+Most commands support `--json` for machine-readable output:
+
+```bash
+# Pipe to jq for processing
+foundry catalog --json | jq '.[].name'
+
+# Save to file
+foundry search "crystal" --json > results.json
+```
+
+## Exit Codes
+
+| Code | Meaning |
+|------|---------|
+| 0 | Success |
+| 1 | Error (see message) |
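+
+These exit codes also make the CLI easy to drive from Python; a minimal sketch using the documented `search` command:
+
+```python
+import subprocess
+
+# Exit code 0 means success; 1 signals an error (see the table above)
+result = subprocess.run(
+    ["foundry", "search", "band gap", "--json"],
+    capture_output=True, text=True,
+)
+if result.returncode == 0:
+    print(result.stdout)
+else:
+    print("foundry failed:", result.stderr)
+```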
+
+## Environment Variables
+
+| Variable | Purpose |
+|----------|---------|
+| `HF_TOKEN` | HuggingFace API token |
+| `GLOBUS_TOKEN` | Globus authentication |
+
+## Examples
+
+### Find and Download a Dataset
+
+```bash
+# Search
+foundry search "formation energy" --limit 5
+
+# Get the DOI from results, then get details
+foundry schema 10.18126/xyz789
+
+# Use in Python
+python -c "
+from foundry import Foundry
+f = Foundry()
+ds = f.get_dataset('10.18126/xyz789')
+print(ds.get_as_dict().keys())
+"
+```
+
+### Export to HuggingFace
+
+```bash
+# Set token
+export HF_TOKEN=hf_xxxxx
+
+# Export
+foundry push-to-hf 10.18126/abc123 --repo materials-science/my-dataset
+```
+
+### Scripting with JSON
+
+```bash
+#!/bin/bash
+# Find all datasets with "band" in the name
+foundry search "band" --json | jq -r '.[].doi' | while read doi; do
+ echo "Processing: $doi"
+ foundry schema "$doi"
+done
+```
diff --git a/docs/features/errors.md b/docs/features/errors.md
new file mode 100644
index 00000000..a5a0392f
--- /dev/null
+++ b/docs/features/errors.md
@@ -0,0 +1,254 @@
+# Error Handling
+
+Foundry uses structured error classes that provide clear context for both humans and AI agents.
+
+## Error Structure
+
+All Foundry errors include:
+
+```python
+class FoundryError(Exception):
+ code: str # Machine-readable error code
+ message: str # Human-readable message
+ details: dict # Additional context
+ recovery_hint: str # How to fix the issue
+```
+
+## Error Types
+
+### DatasetNotFoundError
+
+Raised when a search or get operation returns no results.
+
+```python
+from foundry.errors import DatasetNotFoundError
+
+try:
+ dataset = f.get_dataset("nonexistent-doi")
+except DatasetNotFoundError as e:
+ print(e.code) # "DATASET_NOT_FOUND"
+ print(e.message) # "No dataset found matching..."
+ print(e.recovery_hint) # "Try a broader search term..."
+```
+
+### AuthenticationError
+
+Raised when authentication fails.
+
+```python
+from foundry.errors import AuthenticationError
+
+try:
+ f = Foundry(use_globus=True)
+except AuthenticationError as e:
+ print(e.code) # "AUTH_FAILED"
+ print(e.details) # {"service": "Globus"}
+ print(e.recovery_hint) # "Run Foundry(no_browser=False)..."
+```
+
+### DownloadError
+
+Raised when a file download fails.
+
+```python
+from foundry.errors import DownloadError
+
+try:
+ data = dataset.get_as_dict()
+except DownloadError as e:
+ print(e.code) # "DOWNLOAD_FAILED"
+ print(e.details) # {"url": "...", "reason": "..."}
+ print(e.recovery_hint) # "Check network connection..."
+```
+
+### DataLoadError
+
+Raised when data files cannot be parsed.
+
+```python
+from foundry.errors import DataLoadError
+
+try:
+ data = dataset.get_as_dict()
+except DataLoadError as e:
+ print(e.code) # "DATA_LOAD_FAILED"
+ print(e.details) # {"file_path": "...", "data_type": "..."}
+```
+
+### ValidationError
+
+Raised when metadata validation fails.
+
+```python
+from foundry.errors import ValidationError
+
+try:
+ f.publish(invalid_metadata, ...)
+except ValidationError as e:
+ print(e.code) # "VALIDATION_FAILED"
+ print(e.details) # {"field_name": "creators", "schema_type": "datacite"}
+```
+
+### PublishError
+
+Raised when dataset publication fails.
+
+```python
+from foundry.errors import PublishError
+
+try:
+ f.publish(metadata, data_path="./data", source_id="my_dataset")
+except PublishError as e:
+ print(e.code) # "PUBLISH_FAILED"
+ print(e.details) # {"source_id": "...", "status": "..."}
+```
+
+### CacheError
+
+Raised when local cache operations fail.
+
+```python
+from foundry.errors import CacheError
+
+try:
+ data = dataset.get_as_dict()
+except CacheError as e:
+ print(e.code) # "CACHE_ERROR"
+ print(e.details) # {"operation": "write", "cache_path": "..."}
+```
+
+### ConfigurationError
+
+Raised when configuration is invalid.
+
+```python
+from foundry.errors import ConfigurationError
+
+try:
+ f = Foundry(use_globus="maybe") # Invalid value
+except ConfigurationError as e:
+ print(e.code) # "CONFIG_ERROR"
+ print(e.details) # {"setting": "use_globus", "current_value": "maybe"}
+```
+
+## Error Codes Reference
+
+| Code | Error Class | Common Causes |
+|------|-------------|---------------|
+| `DATASET_NOT_FOUND` | DatasetNotFoundError | Invalid DOI, no search results |
+| `AUTH_FAILED` | AuthenticationError | Expired token, no credentials |
+| `DOWNLOAD_FAILED` | DownloadError | Network issues, URL not found |
+| `DATA_LOAD_FAILED` | DataLoadError | Corrupted file, wrong format |
+| `VALIDATION_FAILED` | ValidationError | Missing required fields |
+| `PUBLISH_FAILED` | PublishError | Server error, permission denied |
+| `CACHE_ERROR` | CacheError | Disk full, permission denied |
+| `CONFIG_ERROR` | ConfigurationError | Invalid parameter values |
+
+## Handling Errors
+
+### Basic Pattern
+
+```python
+from foundry import Foundry
+from foundry.errors import DatasetNotFoundError, DownloadError
+
+f = Foundry()
+
+try:
+ dataset = f.get_dataset("10.18126/abc123")
+ data = dataset.get_as_dict()
+except DatasetNotFoundError as e:
+ print(f"Dataset not found: {e.message}")
+ print(f"Try: {e.recovery_hint}")
+except DownloadError as e:
+ print(f"Download failed: {e.message}")
+ print(f"URL: {e.details.get('url')}")
+```
+
+### Catch All Foundry Errors
+
+```python
+from foundry.errors import FoundryError
+
+try:
+ # Your code
+ pass
+except FoundryError as e:
+ print(f"[{e.code}] {e.message}")
+ if e.recovery_hint:
+ print(f"Suggestion: {e.recovery_hint}")
+```
+
+### Serialization for APIs
+
+Errors can be serialized for JSON responses:
+
+```python
+from foundry.errors import DatasetNotFoundError
+import json
+
+error = DatasetNotFoundError("missing-dataset")
+error_dict = error.to_dict()
+
+print(json.dumps(error_dict, indent=2))
+# {
+# "code": "DATASET_NOT_FOUND",
+# "message": "No dataset found matching query: 'missing-dataset'",
+# "details": {"query": "missing-dataset", "search_type": "query"},
+# "recovery_hint": "Try a broader search term..."
+# }
+```
+
+## For AI Agents
+
+Structured errors are designed for programmatic handling:
+
+```python
+def handle_foundry_operation(operation):
+ try:
+ return operation()
+ except FoundryError as e:
+ return {
+ "success": False,
+ "error_code": e.code,
+ "message": e.message,
+ "recovery_action": e.recovery_hint
+ }
+```
+
+The `recovery_hint` field is particularly useful for agents to suggest next steps to users.
+
+## Custom Error Handling
+
+### Retry Logic
+
+```python
+import time
+from foundry.errors import DownloadError
+
+def download_with_retry(dataset, max_retries=3):
+ for attempt in range(max_retries):
+ try:
+ return dataset.get_as_dict()
+ except DownloadError as e:
+ if attempt < max_retries - 1:
+ print(f"Retry {attempt + 1}/{max_retries}...")
+ time.sleep(2 ** attempt) # Exponential backoff
+ else:
+ raise
+```
+
+### Fallback Strategies
+
+```python
+from foundry import Foundry
+from foundry.errors import DownloadError
+
+doi = "10.18126/abc123"
+
+try:
+    # Try HTTPS first (default)
+    f = Foundry()
+    dataset = f.get_dataset(doi)
+    data = dataset.get_as_dict()
+except DownloadError:
+    # Fall back to Globus and re-fetch the dataset with the new transport
+    f = Foundry(use_globus=True)
+    dataset = f.get_dataset(doi)
+    data = dataset.get_as_dict()
+```
diff --git a/docs/features/huggingface.md b/docs/features/huggingface.md
new file mode 100644
index 00000000..5f6fa6bb
--- /dev/null
+++ b/docs/features/huggingface.md
@@ -0,0 +1,235 @@
+# HuggingFace Integration
+
+Export Foundry datasets to HuggingFace Hub to increase visibility and enable discovery by the broader ML community.
+
+## Installation
+
+```bash
+pip install foundry-ml[huggingface]
+```
+
+## Quick Start
+
+### Python API
+
+```python
+from foundry import Foundry
+from foundry.integrations.huggingface import push_to_hub
+
+# Get a dataset
+f = Foundry()
+dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset
+
+# Export to HuggingFace Hub
+url = push_to_hub(
+ dataset,
+ repo_id="your-username/dataset-name",
+ token="hf_xxxxx" # Or set HF_TOKEN env var
+)
+print(f"Published at: {url}")
+```
+
+### CLI
+
+```bash
+# Set your HuggingFace token
+export HF_TOKEN=hf_xxxxx
+
+# Export a dataset
+foundry push-to-hf 10.18126/abc123 --repo your-username/dataset-name
+```
+
+## What Gets Created
+
+When you export a dataset, Foundry creates:
+
+### 1. Data Files
+
+The dataset is converted to HuggingFace's format (Parquet/Arrow) with all splits preserved:
+
+```
+dataset/
+ train/
+ data-00000.parquet
+ test/
+ data-00000.parquet
+```
+
+### 2. Dataset Card (README.md)
+
+A comprehensive README is auto-generated from the Foundry metadata:
+
+```markdown
+---
+license: cc-by-4.0
+tags:
+ - materials-science
+ - foundry-ml
+---
+
+# Band Gap Dataset
+
+Calculated band gaps for 50,000 materials...
+
+## Fields
+| Field | Role | Description | Units |
+|-------|------|-------------|-------|
+| composition | input | Chemical formula | - |
+| band_gap | target | DFT band gap | eV |
+
+## Citation
+@article{...}
+
+## Source
+Original DOI: 10.18126/abc123
+```
+
+### 3. Metadata
+
+HuggingFace-compatible metadata including:
+- License information
+- Task categories
+- Tags for discoverability
+- Size information
+
+## API Reference
+
+### push_to_hub
+
+```python
+def push_to_hub(
+ dataset, # FoundryDataset object
+ repo_id: str, # HF Hub repo (e.g., "org/name")
+ token: str = None, # HF API token
+ private: bool = False,
+ split: str = None # Specific split to export
+) -> str: # Returns URL
+```
+
+**Parameters:**
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `dataset` | FoundryDataset | Yes | Dataset from Foundry |
+| `repo_id` | str | Yes | HuggingFace repository ID |
+| `token` | str | No | API token (uses cached if not provided) |
+| `private` | bool | No | Create private repository |
+| `split` | str | No | Export specific split only |
+
+**Returns:** URL of the created dataset
+
+## Author Attribution
+
+**Important:** The authors listed on HuggingFace come from the original DataCite metadata, not from the account doing the push. This preserves proper scientific attribution.
+
+```python
+# The original creators from DataCite metadata
+authors = dataset.dc.creators
+# e.g., [{"creatorName": "Smith, John"}, {"creatorName": "Doe, Jane"}]
+```
+
+## Examples
+
+### Export All Splits
+
+```python
+from foundry import Foundry
+from foundry.integrations.huggingface import push_to_hub
+
+f = Foundry()
+ds = f.get_dataset("10.18126/abc123")
+
+url = push_to_hub(ds, "materials-science/band-gaps")
+```
+
+### Export Single Split
+
+```python
+url = push_to_hub(
+ ds,
+ "materials-science/band-gaps-train",
+ split="train"
+)
+```
+
+### Private Repository
+
+```python
+url = push_to_hub(
+ ds,
+ "my-org/private-dataset",
+ private=True
+)
+```
+
+### Using Environment Variable
+
+```bash
+export HF_TOKEN=hf_xxxxx
+```
+
+```python
+# Token is picked up automatically
+url = push_to_hub(ds, "org/name")
+```
+
+## CLI Options
+
+```bash
+foundry push-to-hf <DOI> --repo <REPO_ID> [options]
+
+Options:
+ --repo TEXT HuggingFace repository ID (required)
+ --token TEXT HuggingFace API token
+ --private Create private repository
+ --help Show this message
+```
+
+## Best Practices
+
+### Repository Naming
+
+Use descriptive, lowercase names with hyphens:
+- Good: `materials-science/oqmd-band-gaps`
+- Bad: `my_dataset_v1`
+
+### Organization
+
+Consider creating an organization for your lab/group:
+- `your-lab/dataset-1`
+- `your-lab/dataset-2`
+
+### Documentation
+
+The auto-generated README is a starting point. Consider adding:
+- More detailed description
+- Example usage code
+- Related papers
+- Acknowledgments
+
+## Troubleshooting
+
+### Authentication Failed
+
+```python
+# Ensure you're logged in
+from huggingface_hub import login
+login() # Interactive login
+
+# Or set token explicitly
+push_to_hub(ds, "org/name", token="hf_xxxxx")
+```
+
+### Repository Already Exists
+
+HuggingFace won't overwrite existing repos by default. To resolve this, you can:
+1. Use a different repo name
+2. Delete the existing repo first
+3. Use the HuggingFace web interface to update
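+
+If you control the existing repository and want to replace it (option 2 above), a sketch using `huggingface_hub` (assumes a dataset-type repo and a token with delete rights):
+
+```python
+from huggingface_hub import delete_repo
+
+# Remove the existing dataset repo, then run push_to_hub again
+delete_repo("your-username/dataset-name", repo_type="dataset", token="hf_xxxxx")
+```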
+
+### Large Datasets
+
+For very large datasets (>10GB), the upload may take time. Consider:
+- Exporting specific splits: `split="train"`
+- Using a stable internet connection
+- Running in a cloud environment
diff --git a/docs/features/mcp-server.md b/docs/features/mcp-server.md
new file mode 100644
index 00000000..d4e8e807
--- /dev/null
+++ b/docs/features/mcp-server.md
@@ -0,0 +1,218 @@
+# MCP Server (AI Agent Integration)
+
+Foundry includes an MCP (Model Context Protocol) server that enables AI assistants like Claude to discover and use materials science datasets.
+
+## What is MCP?
+
+MCP is a protocol that allows AI assistants to use external tools. With Foundry's MCP server, you can ask Claude:
+
+- "Find me a materials science dataset for band gap prediction"
+- "What fields are in the OQMD dataset?"
+- "Load the training data and show me the first few rows"
+
+## Quick Start
+
+### Install for Claude Code
+
+```bash
+foundry mcp install
+```
+
+This adds Foundry to your Claude Code configuration. Restart Claude Code to activate.
+
+### Manual Start
+
+For custom integrations:
+
+```bash
+foundry mcp start
+```
+
+## Available Tools
+
+The MCP server provides these tools to AI agents:
+
+### search_datasets
+
+Search for materials science datasets.
+
+**Parameters:**
+- `query` (string, required) - Search terms
+- `limit` (integer, optional) - Maximum results (default: 10)
+
+**Returns:** List of datasets with name, title, DOI, description
+
+**Example prompt:** "Search for datasets about crystal structures"
+
+### get_dataset_schema
+
+Get the schema of a dataset - what fields it contains.
+
+**Parameters:**
+- `doi` (string, required) - The dataset DOI
+
+**Returns:** Schema with splits, fields, data types, and descriptions
+
+**Example prompt:** "What fields are in dataset 10.18126/abc123?"
+
+### load_dataset
+
+Load a dataset and return its data with schema.
+
+**Parameters:**
+- `doi` (string, required) - The dataset DOI
+- `split` (string, optional) - Which split to load (default: all)
+
+**Returns:** Data with schema information and citation
+
+**Example prompt:** "Load the training data from the band gap dataset"
+
+### list_all_datasets
+
+List all available Foundry datasets.
+
+**Parameters:**
+- `limit` (integer, optional) - Maximum results (default: 100)
+
+**Returns:** Complete catalog of available datasets
+
+**Example prompt:** "What datasets are available in Foundry?"
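+
+For custom clients, the server speaks JSON-RPC 2.0 over stdio (see Technical Details below); a sketch of what a tool invocation could look like, with an illustrative request envelope following the MCP `tools/call` convention:
+
+```python
+import json
+
+# Illustrative JSON-RPC request a client would write to the server's stdin
+request = {
+    "jsonrpc": "2.0",
+    "id": 1,
+    "method": "tools/call",
+    "params": {
+        "name": "search_datasets",
+        "arguments": {"query": "band gap", "limit": 5},
+    },
+}
+print(json.dumps(request))
+```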
+
+## Configuration
+
+### Claude Code
+
+The `foundry mcp install` command adds this to your Claude Code config:
+
+```json
+{
+ "mcpServers": {
+ "foundry-ml": {
+ "command": "foundry",
+ "args": ["mcp", "start"]
+ }
+ }
+}
+```
+
+### Custom Integration
+
+For other MCP-compatible clients, the server uses stdio transport:
+
+```python
+from foundry.mcp.server import create_server
+
+config = create_server()
+# config contains server name, version, and tool definitions
+```
+
+## Example Conversations
+
+### Finding a Dataset
+
+**You:** Find me a dataset for predicting band gaps of materials
+
+**Claude:** I'll search for band gap datasets in Foundry.
+
+*Uses search_datasets tool*
+
+I found 5 datasets related to band gaps:
+1. **OQMD Band Gaps** (10.18126/abc) - 50,000 materials
+2. **AFLOW Band Gaps** (10.18126/def) - 30,000 materials
+...
+
+### Understanding a Dataset
+
+**You:** What's in the OQMD band gaps dataset?
+
+**Claude:** Let me get the schema for that dataset.
+
+*Uses get_dataset_schema tool*
+
+The OQMD Band Gaps dataset contains:
+- **Inputs:** composition (chemical formula), structure (crystal structure)
+- **Targets:** band_gap (eV)
+- **Splits:** train (80%), test (20%)
+
+### Loading Data
+
+**You:** Load the training data and show me some examples
+
+**Claude:** I'll load the training split.
+
+*Uses load_dataset tool*
+
+Here are the first 5 rows:
+| composition | band_gap |
+|-------------|----------|
+| Si | 1.12 |
+| GaAs | 1.42 |
+...
+
+## Troubleshooting
+
+### Server Not Starting
+
+Ensure Foundry is installed correctly:
+
+```bash
+pip install --upgrade foundry-ml
+foundry version
+```
+
+### Tools Not Available
+
+Restart Claude Code after installing:
+
+```bash
+foundry mcp install
+# Restart Claude Code
+```
+
+### Authentication Issues
+
+For datasets requiring authentication, ensure you've authenticated:
+
+```python
+from foundry import Foundry
+f = Foundry() # This triggers auth flow if needed
+```
+
+## Technical Details
+
+### Protocol
+
+The MCP server uses JSON-RPC 2.0 over stdio.
+
+### Server Info
+
+```python
+from foundry.mcp.server import create_server
+
+config = create_server()
+print(config)
+# {
+# "name": "foundry-ml",
+# "version": "1.1.0",
+# "tools": [...]
+# }
+```
+
+### Tool Definitions
+
+Each tool follows the MCP tool schema:
+
+```python
+{
+ "name": "search_datasets",
+ "description": "Search for materials science datasets...",
+ "inputSchema": {
+ "type": "object",
+ "properties": {
+ "query": {"type": "string", "description": "..."},
+ "limit": {"type": "integer", "default": 10}
+ },
+ "required": ["query"]
+ }
+}
+```
diff --git a/docs/guide/loading-data.md b/docs/guide/loading-data.md
new file mode 100644
index 00000000..351e86bd
--- /dev/null
+++ b/docs/guide/loading-data.md
@@ -0,0 +1,165 @@
+# Loading Data
+
+Once you've found a dataset, here's how to load and use it.
+
+## Basic Loading
+
+```python
+from foundry import Foundry
+
+f = Foundry()
+results = f.search("band gap", limit=1)
+dataset = results.iloc[0].FoundryDataset
+
+# Load all data
+data = dataset.get_as_dict()
+```
+
+## Understanding the Data Structure
+
+Most datasets have this structure:
+
+```python
+data = {
+ 'train': (X_train, y_train), # Inputs and targets
+ 'test': (X_test, y_test),
+}
+```
+
+Access training data:
+
+```python
+X_train, y_train = data['train']
+```
+
+## Loading Specific Splits
+
+```python
+# Load only training data
+train_data = dataset.get_as_dict(split='train')
+
+# Load only test data
+test_data = dataset.get_as_dict(split='test')
+```
+
+## Loading with Schema
+
+Get data and metadata together:
+
+```python
+result = dataset.get_as_dict(include_schema=True)
+
+data = result['data']
+schema = result['schema']
+
+print(f"Dataset: {schema['name']}")
+print(f"Fields: {schema['fields']}")
+```
+
+## Data Types
+
+### Tabular Data
+
+Most common format - dictionaries of arrays:
+
+```python
+X, y = data['train']
+
+# X might be:
+# {'composition': [...], 'structure': [...]}
+
+# y might be:
+# {'band_gap': [...]}
+```
+
+### Working with DataFrames
+
+```python
+import pandas as pd
+
+X, y = data['train']
+df = pd.DataFrame(X)
+df['target'] = list(y.values())[0]
+```
+
+## HDF5 Data
+
+For large datasets, use lazy loading:
+
+```python
+data = dataset.get_as_dict(as_hdf5=True)
+# Returns h5py objects that load on access
+```
+
+## Caching
+
+Data is cached locally after first download:
+
+```python
+# First call downloads
+data = dataset.get_as_dict() # Slow
+
+# Subsequent calls use cache
+data = dataset.get_as_dict() # Fast
+```
+
+### Custom Cache Location
+
+```python
+f = Foundry(local_cache_dir="/path/to/cache")
+```
+
+### Clear Cache
+
+```python
+f.delete_dataset_cache("dataset_name")
+```
+
+## Common Patterns
+
+### Train/Test Split
+
+```python
+data = dataset.get_as_dict()
+
+X_train, y_train = data['train']
+X_test, y_test = data['test']
+
+# Train model
+import pandas as pd
+from sklearn.ensemble import RandomForestRegressor
+
+model = RandomForestRegressor()
+model.fit(pd.DataFrame(X_train), list(y_train.values())[0])
+```
+
+### Single Target Column
+
+```python
+X, y = data['train']
+target_name = list(y.keys())[0] # Get first target
+target_values = y[target_name]
+```
+
+### Multiple Inputs
+
+```python
+X, y = data['train']
+
+# Combine inputs into DataFrame
+import pandas as pd
+df = pd.DataFrame(X)
+print(df.columns) # See all input features
+```
+
+## Error Handling
+
+```python
+from foundry.errors import DownloadError, DataLoadError
+
+try:
+ data = dataset.get_as_dict()
+except DownloadError as e:
+ print(f"Download failed: {e.message}")
+ print(f"Try: {e.recovery_hint}")
+except DataLoadError as e:
+ print(f"Could not load data: {e.message}")
+```
diff --git a/docs/guide/ml-frameworks.md b/docs/guide/ml-frameworks.md
new file mode 100644
index 00000000..9fbaa898
--- /dev/null
+++ b/docs/guide/ml-frameworks.md
@@ -0,0 +1,190 @@
+# Using with ML Frameworks
+
+Foundry integrates with popular ML frameworks.
+
+## PyTorch
+
+### Load as PyTorch Dataset
+
+```python
+# Get PyTorch-compatible dataset
+torch_dataset = dataset.get_as_torch(split='train')
+
+# Use with DataLoader
+from torch.utils.data import DataLoader
+
+loader = DataLoader(torch_dataset, batch_size=32, shuffle=True)
+
+for batch in loader:
+ inputs, targets = batch
+ # Train your model
+```
+
+### Full Training Example
+
+```python
+import torch
+import torch.nn as nn
+from torch.utils.data import DataLoader
+from foundry import Foundry
+
+# Load data
+f = Foundry()
+ds = f.search("band gap", limit=1).iloc[0].FoundryDataset
+
+train_dataset = ds.get_as_torch(split='train')
+test_dataset = ds.get_as_torch(split='test')
+
+train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
+test_loader = DataLoader(test_dataset, batch_size=32)
+
+# Define model
+input_size = 16  # placeholder: set to the number of input features in your dataset
+model = nn.Sequential(
+ nn.Linear(input_size, 64),
+ nn.ReLU(),
+ nn.Linear(64, 1)
+)
+
+# Train
+optimizer = torch.optim.Adam(model.parameters())
+criterion = nn.MSELoss()
+
+for epoch in range(10):
+ for inputs, targets in train_loader:
+ optimizer.zero_grad()
+ outputs = model(inputs)
+ loss = criterion(outputs, targets)
+ loss.backward()
+ optimizer.step()
+```
+
+## TensorFlow
+
+### Load as tf.data.Dataset
+
+```python
+# Get TensorFlow-compatible dataset
+tf_dataset = dataset.get_as_tensorflow(split='train')
+
+# Batch and prefetch
+tf_dataset = tf_dataset.batch(32).prefetch(1)
+
+# Use in training
+model.fit(tf_dataset, epochs=10)
+```
+
+### Full Training Example
+
+```python
+import tensorflow as tf
+from foundry import Foundry
+
+# Load data
+f = Foundry()
+ds = f.search("band gap", limit=1).iloc[0].FoundryDataset
+
+train_ds = ds.get_as_tensorflow(split='train')
+test_ds = ds.get_as_tensorflow(split='test')
+
+train_ds = train_ds.batch(32).prefetch(tf.data.AUTOTUNE)
+test_ds = test_ds.batch(32)
+
+# Define model
+model = tf.keras.Sequential([
+ tf.keras.layers.Dense(64, activation='relu'),
+ tf.keras.layers.Dense(1)
+])
+
+model.compile(
+ optimizer='adam',
+ loss='mse',
+ metrics=['mae']
+)
+
+# Train
+model.fit(train_ds, validation_data=test_ds, epochs=10)
+```
+
+## Scikit-learn
+
+Use the dictionary format:
+
+```python
+from sklearn.ensemble import RandomForestRegressor
+from sklearn.model_selection import cross_val_score
+import pandas as pd
+from foundry import Foundry
+
+# Load data
+f = Foundry()
+ds = f.search("band gap", limit=1).iloc[0].FoundryDataset
+data = ds.get_as_dict()
+
+X_train, y_train = data['train']
+X_test, y_test = data['test']
+
+# Convert to arrays
+X_train_df = pd.DataFrame(X_train)
+y_train_arr = list(y_train.values())[0]
+
+# Train
+model = RandomForestRegressor(n_estimators=100)
+model.fit(X_train_df, y_train_arr)
+
+# Evaluate
+X_test_df = pd.DataFrame(X_test)
+y_test_arr = list(y_test.values())[0]
+score = model.score(X_test_df, y_test_arr)
+print(f"R² score: {score:.3f}")
+```
+
+## Generic Python
+
+For any framework, use the dictionary format:
+
+```python
+data = dataset.get_as_dict()
+X, y = data['train']
+
+# X is a dict: {'feature1': [...], 'feature2': [...]}
+# y is a dict: {'target': [...]}
+
+# Convert as needed for your framework
+import numpy as np
+X_array = np.column_stack([X[k] for k in X.keys()])
+y_array = np.array(list(y.values())[0])
+```
+
+## Tips
+
+### Check Data Shape
+
+```python
+data = dataset.get_as_dict()
+X, y = data['train']
+
+print(f"Features: {list(X.keys())}")
+print(f"Targets: {list(y.keys())}")
+print(f"Samples: {len(list(X.values())[0])}")
+```
+
+### Handle Missing Values
+
+```python
+import pandas as pd
+
+X, y = data['train']
+df = pd.DataFrame(X)
+print(df.isnull().sum()) # Check for missing values
+df = df.fillna(0) # Or handle as appropriate
+```
+
+### Feature Engineering
+
+```python
+# Get schema to understand features
+schema = dataset.get_schema()
+
+for field in schema['fields']:
+ print(f"{field['name']}: {field['description']} ({field['units']})")
+```
diff --git a/docs/guide/schemas.md b/docs/guide/schemas.md
new file mode 100644
index 00000000..12a4319e
--- /dev/null
+++ b/docs/guide/schemas.md
@@ -0,0 +1,179 @@
+# Dataset Schemas
+
+Schemas describe what data a dataset contains, so you can understand a dataset before downloading it.
+
+## Getting the Schema
+
+```python
+from foundry import Foundry
+
+f = Foundry()
+dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset
+
+schema = dataset.get_schema()
+```
+
+## Schema Structure
+
+```python
+{
+ 'name': 'foundry_oqmd_band_gaps_v1.1',
+ 'title': 'OQMD Band Gaps Dataset',
+ 'doi': '10.18126/abc123',
+ 'description': 'Band gaps calculated using DFT...',
+ 'data_type': 'tabular',
+ 'fields': [...],
+ 'splits': [...]
+}
+```
+
+## Fields
+
+Fields describe each column/feature in the dataset:
+
+```python
+for field in schema['fields']:
+ print(f"Name: {field['name']}")
+ print(f"Role: {field['role']}") # 'input' or 'target'
+ print(f"Description: {field['description']}")
+ print(f"Units: {field['units']}")
+ print("---")
+```
+
+Example output:
+```
+Name: composition
+Role: input
+Description: Chemical composition formula
+Units: None
+---
+Name: band_gap
+Role: target
+Description: DFT-calculated band gap
+Units: eV
+---
+```
+
+## Splits
+
+Splits show how data is divided:
+
+```python
+for split in schema['splits']:
+ print(f"{split['name']}: {split.get('type', 'data')}")
+```
+
+Example output:
+```
+train: train
+test: test
+```
+
+## Data Types
+
+The `data_type` field indicates the format:
+
+| Type | Description |
+|------|-------------|
+| `tabular` | Rows and columns (most common) |
+| `hierarchical` | Nested/tree structure |
+| `image` | Image data |
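+
+Loading behavior can differ by format, so it can help to branch on `data_type` before loading; a small sketch continuing from the example above (whether lazy HDF5 loading applies depends on the dataset):
+
+```python
+schema = dataset.get_schema()
+
+if schema['data_type'] == 'tabular':
+    data = dataset.get_as_dict()
+else:
+    # Hierarchical/image datasets can be large; lazy HDF5 loading may help
+    data = dataset.get_as_dict(as_hdf5=True)
+```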
+
+## Using Schema Information
+
+### Filter by Field Role
+
+```python
+input_fields = [f for f in schema['fields'] if f['role'] == 'input']
+target_fields = [f for f in schema['fields'] if f['role'] == 'target']
+
+print(f"Inputs: {[f['name'] for f in input_fields]}")
+print(f"Targets: {[f['name'] for f in target_fields]}")
+```
+
+### Check Units
+
+```python
+for field in schema['fields']:
+ if field['units']:
+ print(f"{field['name']}: {field['units']}")
+```
+
+### Include Schema with Data
+
+```python
+result = dataset.get_as_dict(include_schema=True)
+
+data = result['data']
+schema = result['schema']
+
+# Now you have both together
+X, y = data['train']
+print(f"Loading {schema['name']}...")
+```
+
+## CLI Schema
+
+```bash
+foundry schema 10.18126/abc123
+```
+
+Output:
+```
+Dataset: foundry_oqmd_band_gaps_v1.1
+Title: OQMD Band Gaps Dataset
+DOI: 10.18126/abc123
+Data Type: tabular
+
+Fields:
+ [input ] composition: Chemical composition formula
+ [target] band_gap: DFT-calculated band gap (eV)
+
+Splits:
+ - train
+ - test
+```
+
+## Best Practices
+
+### Always Check Schema First
+
+```python
+# Before loading (no download)
+schema = dataset.get_schema()
+print(f"Fields: {len(schema['fields'])}")
+print(f"Splits: {[s['name'] for s in schema['splits']]}")
+
+# If it looks right, load
+data = dataset.get_as_dict()
+```
+
+### Validate Data Against Schema
+
+```python
+schema = dataset.get_schema()
+data = dataset.get_as_dict()
+
+X, y = data['train']
+
+input_names = [f['name'] for f in schema['fields'] if f['role'] == 'input']
+for name in input_names:
+ if name not in X:
+ print(f"Warning: {name} not in data")
+```
+
+### Document Your Usage
+
+```python
+schema = dataset.get_schema()
+print(f"""
+Using dataset: {schema['name']}
+DOI: {schema['doi']}
+
+Features used:
+{chr(10).join(f"- {f['name']}: {f['description']}" for f in schema['fields'] if f['role'] == 'input')}
+
+Target:
+{chr(10).join(f"- {f['name']}: {f['description']}" for f in schema['fields'] if f['role'] == 'target')}
+""")
+```
diff --git a/docs/guide/searching.md b/docs/guide/searching.md
new file mode 100644
index 00000000..a40016a1
--- /dev/null
+++ b/docs/guide/searching.md
@@ -0,0 +1,129 @@
+# Searching for Datasets
+
+Foundry provides multiple ways to discover datasets.
+
+## Keyword Search
+
+Search by topic, material, or property:
+
+```python
+from foundry import Foundry
+
+f = Foundry()
+
+# Search by keyword
+results = f.search("band gap")
+results = f.search("crystal structure")
+results = f.search("formation energy")
+```
+
+### Limit Results
+
+```python
+results = f.search("band gap", limit=5)
+```
+
+### JSON Output
+
+For programmatic access:
+
+```python
+results = f.search("band gap", as_json=True)
+
+for ds in results:
+ print(f"{ds['name']}: {ds['title']}")
+```
+
+## Browse the Catalog
+
+List all available datasets:
+
+```python
+# List datasets
+catalog = f.list(limit=20)
+
+# As JSON
+catalog = f.list(as_json=True)
+```
+
+## Get by DOI
+
+If you know the DOI:
+
+```python
+dataset = f.get_dataset("10.18126/abc123")
+```
+
+## Search Results
+
+Search returns a DataFrame with columns:
+
+| Column | Description |
+|--------|-------------|
+| `dataset_name` | Unique identifier |
+| `title` | Human-readable title |
+| `DOI` | Digital Object Identifier |
+| `year` | Publication year |
+| `FoundryDataset` | Dataset object for loading |
+
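+Because the results behave like a regular DataFrame, you can scan these columns directly; a short sketch continuing from the search above:
+
+```python
+for _, row in results.iterrows():
+    print(row['dataset_name'], row['DOI'], row['year'])
+```
+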
+## Accessing Datasets
+
+From search results:
+
+```python
+# By index
+dataset = results.iloc[0].FoundryDataset
+
+# By name
+dataset = results.get_dataset_by_name("foundry_oqmd_band_gaps_v1.1")
+
+# By DOI
+dataset = results.get_dataset_by_doi("10.18126/abc123")
+```
+
+## CLI Search
+
+```bash
+# Search from terminal
+foundry search "band gap"
+
+# Limit results
+foundry search "band gap" --limit 5
+
+# JSON output
+foundry search "band gap" --json
+```
+
+## Tips
+
+### Broad vs. Specific
+
+```python
+# Broad (more results)
+f.search("energy")
+
+# Specific (fewer, more relevant)
+f.search("formation energy DFT")
+```
+
+### Check What's Available
+
+```python
+# See all datasets first
+all_ds = f.list(limit=100)
+print(f"Total datasets: {len(all_ds)}")
+
+# Then search
+results = f.search("your topic")
+```
+
+### Inspect Before Loading
+
+```python
+# Check schema before downloading
+dataset = results.iloc[0].FoundryDataset
+schema = dataset.get_schema()
+
+print(f"Fields: {[f['name'] for f in schema['fields']]}")
+print(f"Splits: {[s['name'] for s in schema['splits']]}")
+```
diff --git a/docs/installation.md b/docs/installation.md
new file mode 100644
index 00000000..559edac9
--- /dev/null
+++ b/docs/installation.md
@@ -0,0 +1,117 @@
+# Installation
+
+## Requirements
+
+- Python 3.8 or higher
+- pip package manager
+
+## Basic Installation
+
+Install Foundry-ML from PyPI:
+
+```bash
+pip install foundry-ml
+```
+
+This installs the core package with HTTPS download support. No additional setup required.
+
+## Optional Dependencies
+
+### HuggingFace Integration
+
+To export datasets to HuggingFace Hub:
+
+```bash
+pip install foundry-ml[huggingface]
+```
+
+### All Optional Dependencies
+
+```bash
+pip install foundry-ml[all]
+```
+
+## Verify Installation
+
+```python
+from foundry import Foundry
+
+f = Foundry()
+print("Foundry installed successfully!")
+
+# Test search
+results = f.search("band gap", limit=1)
+print(f"Found {len(results)} datasets")
+```
+
+## Cloud Environments
+
+### Google Colab
+
+Foundry works in Colab without additional setup:
+
+```python
+!pip install foundry-ml
+
+from foundry import Foundry
+f = Foundry(no_browser=True, no_local_server=True)
+```
+
+### Jupyter Notebooks
+
+For Jupyter running on remote servers:
+
+```python
+from foundry import Foundry
+f = Foundry(no_browser=True, no_local_server=True)
+```
+
+## Globus Setup (Optional)
+
+For large dataset transfers, you can use Globus instead of HTTPS:
+
+1. Install [Globus Connect Personal](https://www.globus.org/globus-connect-personal)
+2. Start the Globus endpoint
+3. Enable Globus in Foundry:
+
+```python
+f = Foundry(use_globus=True)
+```
+
+**Note:** HTTPS is the default and works for most use cases. Only use Globus if you're transferring very large datasets (>10GB) or have institutional Globus endpoints.
+
+## Troubleshooting
+
+### Import Errors
+
+If you get import errors, ensure you have the latest version:
+
+```bash
+pip install --upgrade foundry-ml
+```
+
+### Network Issues
+
+Foundry requires internet access to search and download datasets. If behind a proxy:
+
+```python
+import os
+os.environ['HTTP_PROXY'] = 'http://proxy:port'
+os.environ['HTTPS_PROXY'] = 'http://proxy:port'
+
+from foundry import Foundry
+f = Foundry()
+```
+
+### Cache Location
+
+By default, datasets are cached in your home directory. To change:
+
+```python
+f = Foundry(local_cache_dir="/path/to/cache")
+```
+
+## Next Steps
+
+- [Quick Start](quickstart.md) - Load your first dataset
+- [CLI](features/cli.md) - Use Foundry from the command line
diff --git a/docs/quickstart.md b/docs/quickstart.md
new file mode 100644
index 00000000..433c56f4
--- /dev/null
+++ b/docs/quickstart.md
@@ -0,0 +1,111 @@
+# Quick Start
+
+Load your first materials science dataset in under 5 minutes.
+
+## 1. Install
+
+```bash
+pip install foundry-ml
+```
+
+## 2. Connect
+
+```python
+from foundry import Foundry
+
+f = Foundry()
+```
+
+For cloud environments (Colab, remote Jupyter):
+
+```python
+f = Foundry(no_browser=True, no_local_server=True)
+```
+
+## 3. Search
+
+Find datasets by keyword:
+
+```python
+results = f.search("band gap", limit=5)
+results
+```
+
+Output:
+```
+ dataset_name ...
+0 foundry_oqmd_band_gaps_v1.1 ...
+1 foundry_aflow_band_gaps_v1.1 ...
+2 foundry_experimental_band_gaps_v1.1 ...
+...
+```
+
+## 4. Load
+
+Get a dataset and load its data:
+
+```python
+# Get the first result
+dataset = results.iloc[0].FoundryDataset
+
+# Load training data
+data = dataset.get_as_dict()
+X_train, y_train = data['train']
+
+print(f"Samples: {len(X_train)}")
+```
+
+## 5. Understand
+
+Check what fields the dataset contains:
+
+```python
+schema = dataset.get_schema()
+
+print(f"Dataset: {schema['name']}")
+print(f"Fields:")
+for field in schema['fields']:
+ print(f" - {field['name']} ({field['role']})")
+```
+
+## 6. Cite
+
+Get the citation for publications:
+
+```python
+print(dataset.get_citation())
+```
+
+## Complete Example
+
+```python
+from foundry import Foundry
+
+# Connect
+f = Foundry()
+
+# Search
+results = f.search("band gap", limit=5)
+
+# Load
+dataset = results.iloc[0].FoundryDataset
+X, y = dataset.get_as_dict()['train']
+
+# Use with sklearn
+from sklearn.ensemble import RandomForestRegressor
+model = RandomForestRegressor()
+# model.fit(X, y) # Train your model
+
+# Cite
+print(dataset.get_citation())
+```
+
+## What's Next?
+
+| Task | Guide |
+|------|-------|
+| Search effectively | [Searching for Datasets](guide/searching.md) |
+| Load different formats | [Loading Data](guide/loading-data.md) |
+| Use with PyTorch/TensorFlow | [ML Frameworks](guide/ml-frameworks.md) |
+| Use from terminal | [CLI](features/cli.md) |
+| Publish your data | [Publishing Datasets](publishing/publishing-datasets.md) |
diff --git a/docs/support/faq.md b/docs/support/faq.md
new file mode 100644
index 00000000..4ef5c431
--- /dev/null
+++ b/docs/support/faq.md
@@ -0,0 +1,211 @@
+# Frequently Asked Questions
+
+## General
+
+### What is Foundry?
+
+Foundry-ML is a Python library for discovering and loading machine learning-ready datasets in materials science and chemistry. It provides standardized access to curated scientific datasets with rich metadata.
+
+### Is Foundry free?
+
+Yes. Foundry is open source and the datasets are freely available. Some datasets may have specific licenses - check the citation information for details.
+
+### Do I need to create an account?
+
+No account is required for basic usage with HTTPS download. Some features (like Globus transfers) may require authentication.
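+
+For example, the default HTTPS client needs no credentials, while Globus transfers trigger an authentication flow (a minimal sketch; both constructors appear elsewhere in these docs):
+
+```python
+from foundry import Foundry
+
+f = Foundry()                        # HTTPS downloads, no account needed
+f_globus = Foundry(use_globus=True)  # Globus transfers; may prompt for authentication
+```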
+
+## Installation
+
+### What Python version do I need?
+
+Python 3.8 or higher.
+
+### How do I install Foundry?
+
+```bash
+pip install foundry-ml
+```
+
+### I get import errors after installing
+
+Try upgrading:
+
+```bash
+pip install --upgrade foundry-ml
+```
+
+## Data Loading
+
+### Why is my first download slow?
+
+Data is downloaded on first access and cached locally. Subsequent loads are fast.
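+
+For instance (a minimal sketch; `dataset` is any `FoundryDataset` from a search):
+
+```python
+data = dataset.get_as_dict()  # first call: downloads the files and caches them locally
+data = dataset.get_as_dict()  # later calls: read straight from the local cache
+```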
+
+### Where is data cached?
+
+By default, in your home directory. To change:
+
+```python
+f = Foundry(local_cache_dir="/path/to/cache")
+```
+
+### How do I clear the cache?
+
+```python
+f.delete_dataset_cache("dataset_name")
+```
+
+### Can I use Foundry offline?
+
+You need internet to search and download datasets. Once cached, data loads locally.
+
+## Cloud Environments
+
+### How do I use Foundry in Google Colab?
+
+```python
+!pip install foundry-ml
+
+from foundry import Foundry
+f = Foundry(no_browser=True, no_local_server=True)
+```
+
+### Does it work with Jupyter on a remote server?
+
+Yes, use the same settings:
+
+```python
+f = Foundry(no_browser=True, no_local_server=True)
+```
+
+## Data Format
+
+### What format is the data in?
+
+Most datasets use a dictionary format:
+
+```python
+data = {
+ 'train': (X_dict, y_dict),
+ 'test': (X_dict, y_dict)
+}
+```
+
+### How do I get a pandas DataFrame?
+
+```python
+import pandas as pd
+
+X, y = data['train']
+df = pd.DataFrame(X)
+```
+
+### Does it work with PyTorch?
+
+Yes:
+
+```python
+torch_dataset = dataset.get_as_torch(split='train')
+```
+
+### Does it work with TensorFlow?
+
+Yes:
+
+```python
+tf_dataset = dataset.get_as_tensorflow(split='train')
+```
+
+## Publishing
+
+### How do I publish my own dataset?
+
+See [Publishing Datasets](../publishing/publishing-datasets.md) for the full workflow.
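+
+In outline, you pass DataCite-style metadata, a folder of data files, and a versioned `source_id` (a minimal sketch; the metadata contents and paths are placeholders):
+
+```python
+from foundry import Foundry
+
+f = Foundry()
+metadata = {}  # DataCite + Foundry fields; see the next question and the Metadata Reference
+result = f.publish(metadata, data_path="./my_data_folder", source_id="my_dataset_v1.0")
+print(f.check_status("my_dataset_v1.0"))
+```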
+
+### What metadata format is required?
+
+Foundry uses DataCite-compliant metadata. See [Metadata Reference](../publishing/metadata-reference.md).
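+
+A minimal sketch of the expected shape (values are placeholders, adapted from the publishing example notebook in this repo):
+
+```python
+metadata = {
+    "dc": {  # DataCite block
+        "titles": [{"title": "My Band Gap Dataset"}],
+        "creators": [{"creatorName": "Smith, John", "affiliation": "University of Example"}],
+        "publicationYear": 2024,
+        "publisher": "Foundry",
+        "resourceType": {"resourceType": "Dataset", "resourceTypeGeneral": "Dataset"},
+    },
+    "foundry": {  # Foundry schema block
+        "data_type": "tabular",
+        "keys": [
+            {"key": ["composition"], "type": "input", "description": "Chemical composition formula"},
+            {"key": ["band_gap"], "type": "target", "description": "Calculated band gap", "units": "eV"},
+        ],
+        "splits": [{"label": "train", "type": "train"}, {"label": "test", "type": "test"}],
+    },
+}
+```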
+
+### Can I update a published dataset?
+
+Create a new version with an updated source_id (e.g., `my_dataset_v2`).
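+
+In practice that means re-running the publish step with the new identifier (sketch; paths are placeholders):
+
+```python
+f.publish(metadata, data_path="./my_data_folder", source_id="my_dataset_v2")
+```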
+
+## Globus
+
+### Do I need Globus?
+
+No. HTTPS download is the default and works for most use cases.
+
+### When should I use Globus?
+
+For very large datasets (>10GB) or if you have institutional Globus endpoints.
+
+### How do I enable Globus?
+
+```python
+f = Foundry(use_globus=True)
+```
+
+You'll need [Globus Connect Personal](https://www.globus.org/globus-connect-personal) running.
+
+## AI Integration
+
+### How do I use Foundry with Claude?
+
+Install the MCP server:
+
+```bash
+foundry mcp install
+```
+
+Restart Claude Code. You can now ask Claude to find and load datasets.
+
+### What AI assistants are supported?
+
+Any MCP-compatible assistant. Currently tested with Claude Code.
+
+## HuggingFace
+
+### Can I export to HuggingFace Hub?
+
+Yes:
+
+```bash
+pip install foundry-ml[huggingface]
+foundry push-to-hf 10.18126/abc123 --repo your-username/dataset-name
+```
+
+### Who is listed as author on HuggingFace?
+
+The original dataset creators from the DataCite metadata, not the person pushing.
+
+## Troubleshooting
+
+### I get "Dataset not found"
+
+Check:
+1. The DOI is correct
+2. Try a broader search term
+3. List available datasets: `f.list()`
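+
+Programmatically, a failed lookup raises `DatasetNotFoundError`, which carries a recovery hint (a minimal sketch; the DOI is a placeholder):
+
+```python
+from foundry import Foundry
+from foundry.errors import DatasetNotFoundError
+
+f = Foundry()
+try:
+    dataset = f.get_dataset("10.18126/not-a-real-doi")
+except DatasetNotFoundError as e:
+    print(e.recovery_hint)  # suggests a broader search or f.list()
+```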
+
+### Download keeps failing
+
+Try:
+1. Check your internet connection
+2. Try again (transient errors)
+3. If using Globus, switch to HTTPS: `f = Foundry(use_globus=False)`
+
+### The data format is unexpected
+
+Check the schema first:
+
+```python
+schema = dataset.get_schema()
+print(schema['fields'])
+print(schema['splits'])
+```
+
+## More Help
+
+- [Documentation](https://ai-materials-and-chemistry.gitbook.io/foundry/)
+- [GitHub Issues](https://github.com/MLMI2-CSSI/foundry/issues)
+- [Troubleshooting](troubleshooting.md)
diff --git a/examples/00_hello_foundry/README.md b/examples/00_hello_foundry/README.md
new file mode 100644
index 00000000..192a339d
--- /dev/null
+++ b/examples/00_hello_foundry/README.md
@@ -0,0 +1,51 @@
+# Hello Foundry!
+
+This is the **beginner-friendly** example for Foundry-ML.
+
+No domain expertise required - just Python basics.
+
+## What You'll Learn
+
+1. How to connect to Foundry
+2. How to search for datasets
+3. How to load data into Python
+4. How to explore dataset schemas
+5. How to get proper citations
+
+## Quick Start
+
+```python
+from foundry import Foundry
+
+# Connect
+f = Foundry()
+
+# Search
+results = f.search("band gap", limit=5)
+
+# Load
+dataset = results.iloc[0].FoundryDataset
+data = dataset.get_as_dict()
+
+# Use
+X, y = data['train']
+```
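+
+To cover the last two items in "What You'll Learn", the same dataset object also exposes its schema and citation:
+
+```python
+# Explore the schema and get a citation
+print(dataset.get_schema()['fields'])
+print(dataset.get_citation())
+```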
+
+## Running in Google Colab
+
+For cloud environments, use:
+
+```python
+f = Foundry(no_browser=True, no_local_server=True)
+```
+
+## Next Steps
+
+After this example, check out:
+- `/examples/bandgap/` - Working with band gap datasets
+- `/examples/publishing/` - How to publish your own datasets
+
+## Need Help?
+
+- Documentation: https://github.com/MLMI2-CSSI/foundry
+- CLI help: `foundry --help`
diff --git a/examples/00_hello_foundry/hello_foundry.ipynb b/examples/00_hello_foundry/hello_foundry.ipynb
new file mode 100644
index 00000000..36b0b65e
--- /dev/null
+++ b/examples/00_hello_foundry/hello_foundry.ipynb
@@ -0,0 +1,371 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Hello Foundry! 🚀\n",
+ "\n",
+ "Welcome to Foundry-ML! This notebook will show you how to:\n",
+ "\n",
+ "1. **Search** for materials science datasets\n",
+ "2. **Load** a dataset into Python\n",
+ "3. **Explore** the data\n",
+ "\n",
+ "No domain expertise required - just Python basics!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 1: Install Foundry\n",
+ "\n",
+ "If you haven't already, install Foundry-ML:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Uncomment the line below to install\n",
+ "# !pip install foundry-ml"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 2: Import and Connect\n",
+ "\n",
+ "First, import Foundry and create a client. If you're running this in Google Colab or a cloud environment, use `no_browser=True`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "from foundry import Foundry\n\n# Create a Foundry client (uses HTTPS download by default)\n# For cloud environments (Colab, etc.), add: no_browser=True, no_local_server=True\nf = Foundry()"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 3: Search for Datasets\n",
+ "\n",
+ "Let's search for datasets. You can search by keyword - no need to know the exact name!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " dataset_name | \n",
+ " title | \n",
+ " year | \n",
+ " DOI | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " foundry_g4mp2_solvation_v1.2 | \n",
+ " DFT Estimates of Solvation Energy in Multiple ... | \n",
+ " root=2022 | \n",
+ " 10.18126/jos5-wj65 | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " foundry_assorted_computational_band_gaps_v1.1 | \n",
+ " Graph Network Based Deep Learning of Band Gaps... | \n",
+ " root=2021 | \n",
+ " 10.18126/7io9-1z9k | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " foundry_experimental_band_gaps_v1.1 | \n",
+ " Graph Network Based Deep Learning of Band Gaps... | \n",
+ " root=2021 | \n",
+ " 10.18126/wg3u-g8vu | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " foundry_aflow_band_gaps_v1.1 | \n",
+ " Graph Network Based Deep Learning of Band Gaps... | \n",
+ " root=2021 | \n",
+ " 10.18126/6fdy-bsam | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " foundry_oqmd_band_gaps_v1.1 | \n",
+ " Graph Network Based Deep Learning of Band Gaps... | \n",
+ " root=2021 | \n",
+ " 10.18126/w1ey-9y8b | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " dataset_name \\\n",
+ "0 foundry_g4mp2_solvation_v1.2 \n",
+ "1 foundry_assorted_computational_band_gaps_v1.1 \n",
+ "2 foundry_experimental_band_gaps_v1.1 \n",
+ "3 foundry_aflow_band_gaps_v1.1 \n",
+ "4 foundry_oqmd_band_gaps_v1.1 \n",
+ "\n",
+ " title year \\\n",
+ "0 DFT Estimates of Solvation Energy in Multiple ... root=2022 \n",
+ "1 Graph Network Based Deep Learning of Band Gaps... root=2021 \n",
+ "2 Graph Network Based Deep Learning of Band Gaps... root=2021 \n",
+ "3 Graph Network Based Deep Learning of Band Gaps... root=2021 \n",
+ "4 Graph Network Based Deep Learning of Band Gaps... root=2021 \n",
+ "\n",
+ " DOI FoundryDataset \n",
+ "0 10.18126/jos5-wj65 DFT Estimates of Solvation Energy in Multiple SolventsWard, Logan; Dandu, Naveen; Blaiszik, Ben; Narayanan, Badri; Assary, Rajeev S.; Redfern, Paul C.; Foster, Ian; Curtiss, Larry A.DOI: 10.18126/jos5-wj65
Dataset
| short_name | g4mp2_solvation |
|---|
| data_type | tabular |
|---|
| task_type | |
|---|
| domain | - materials science
- chemistry
|
|---|
| n_items | 130258.0 |
|---|
| splits | | type | train |
|---|
| path | g4mp2_data.json |
|---|
| label | train |
|---|
|
|---|
| keys | | key | type | filter | description | units | classes |
|---|
| input | | Input SMILES string | | | | input | | SMILES string after relaxation | | | | input | | InChi after generating coordinates with CORINA | | | | input | | InChi after relaxation | | | | input | | InChi after relaxation | XYZ coordinates after relaxation | | | input | | Atomic charges on each atom, as predicted from B3LYP | | | | input | | Rotational constant, A | GHz | | | input | | Rotational constant, B | GHz | | | input | | Rotational constant, C | GHz | | | input | | InChi after relaxation | | | | input | | Number of electrons | | | | input | | Number of non-hydrogen atoms | | | | input | | Number of atoms in molecule | | | | input | | Dipole moment | D | | | input | | Isotropic polarizability | a_0^3 | | | input | | Electronic spatial extant | a_0^2 | | | input | | Heat capacity at 298.15K | cal/mol-K | | | target | | G4MP2 Standard Enthalpy of Formation, 298K | kcal/mol | | | input | | B3LYP Band gap energy | Ha | | | input | | B3LYP Energy of HOMO | Ha | | | input | | B3LYP Energy of LUMO | Ha | | | input | | B3LYP Zero point vibrational energy | Ha | | | input | | B3LYP Internal energy at 0K | Ha | | | input | | B3LYP Internal energy at 298.15K | Ha | | | input | | B3LYP Enthalpy at 298.15K | Ha | | | input | | B3LYP atomization energy at 0K | Ha | | | input | | B3LYP Free energy at 298.15K | Ha | | | target | | G4MP2 Internal energy at 0K | Ha | | | target | | G4MP2 Internal energy at 298.15K | Ha | | | target | | G4MP2 Enthalpy at 298.15K | Ha | | | target | | G4MP2 Free eergy at 0K | Ha | | | target | | G4MP2 atomization energy at 0K | Ha | | | target | | Solvation energy, acetone | kcal/mol | | | target | | Solvation energy, acetonitrile | kcal/mol | | | target | | Solvation energy, dimethyl sulfoxide | kcal/mol | | | target | | Solvation energy, ethanol | kcal/mol | | | target | | Solvation energy, water | kcal/mol | |
|
|---|
"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Get the first dataset from our search results\n",
+ "dataset = results.iloc[0].FoundryDataset\n",
+ "\n",
+ "# Display dataset info\n",
+ "dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 5: Understand the Schema\n",
+ "\n",
+ "Before loading data, let's see what fields it contains:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Get the schema - what columns/fields are in this dataset?\nschema = dataset.get_schema()\n\nprint(f\"Dataset: {schema['name']}\")\nprint(f\"Data Type: {schema['data_type']}\")\nprint(f\"\\nSplits: {[s['name'] for s in schema['splits']]}\")\nprint(f\"\\nFields:\")\nfor field in schema['fields']:\n print(f\" - {field['name']} ({field['role']}): {field['description'] or 'No description'}\")"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 6: Load the Data\n",
+ "\n",
+ "Now let's load the actual data. Foundry handles downloading and caching automatically!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Processing records: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 3266.59it/s]\n",
+ "Transferring data: 0%| | 0/1 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Error: GC_NOT_CONNECTED - globus connect offline\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Load data as a dictionary\n",
+ "data = dataset.get_as_dict()\n",
+ "\n",
+ "# See what we got\n",
+ "print(f\"Data keys: {data.keys()}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# For ML datasets, data is typically split into inputs (X) and targets (y)\n",
+ "# Let's explore the training split\n",
+ "if 'train' in data:\n",
+ " train_data = data['train']\n",
+ " print(f\"Training data shape: {type(train_data)}\")\n",
+ " \n",
+ " # If it's a tuple of (inputs, targets)\n",
+ " if isinstance(train_data, tuple) and len(train_data) == 2:\n",
+ " X, y = train_data\n",
+ " print(f\"\\nInputs (X): {type(X)}\")\n",
+ " print(f\"Targets (y): {type(y)}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 7: Use with Your Favorite ML Framework\n",
+ "\n",
+ "Foundry datasets work seamlessly with PyTorch and TensorFlow!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# For PyTorch users:\n",
+ "# torch_dataset = dataset.get_as_torch(split='train')\n",
+ "# from torch.utils.data import DataLoader\n",
+ "# loader = DataLoader(torch_dataset, batch_size=32)\n",
+ "\n",
+ "# For TensorFlow users:\n",
+ "# tf_dataset = dataset.get_as_tensorflow(split='train')\n",
+ "# model.fit(tf_dataset)\n",
+ "\n",
+ "print(\"Foundry works with PyTorch and TensorFlow out of the box!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 8: Get the Citation\n",
+ "\n",
+ "When you use a dataset in research, cite it properly!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Get BibTeX citation\n",
+ "citation = dataset.get_citation()\n",
+ "print(citation)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🎉 That's It!\n",
+ "\n",
+ "You've just:\n",
+ "- Connected to Foundry\n",
+ "- Searched for datasets\n",
+ "- Loaded data into Python\n",
+ "- Explored the schema\n",
+ "- Got a proper citation\n",
+ "\n",
+ "### Next Steps\n",
+ "\n",
+ "- Explore more datasets with `f.list()`\n",
+ "- Check out other examples in the `/examples` folder\n",
+ "- Use the CLI: `foundry search \"your query\"`\n",
+ "- Read the docs: https://github.com/MLMI2-CSSI/foundry\n",
+ "\n",
+ "Happy researching! 🔬"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.1"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git a/examples/01_quickstart/quickstart.ipynb b/examples/01_quickstart/quickstart.ipynb
new file mode 100644
index 00000000..5a57eae9
--- /dev/null
+++ b/examples/01_quickstart/quickstart.ipynb
@@ -0,0 +1,167 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Foundry Quickstart\n",
+ "\n",
+ "**Time:** 5 minutes \n",
+ "**Prerequisites:** Python basics \n",
+ "**What you'll learn:** Search, load, and use a materials science dataset\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Installation\n",
+ "\n",
+ "```bash\n",
+ "pip install foundry-ml\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Connect to Foundry"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "from foundry import Foundry\n\n# Create a Foundry client (uses HTTPS download by default)\n# For Google Colab or cloud environments, add: no_browser=True, no_local_server=True\nf = Foundry()"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Search for Datasets\n",
+ "\n",
+ "Search by keyword - you don't need to know exact dataset names."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Search for band gap datasets (a key property in materials science)\n",
+ "results = f.search(\"band gap\", limit=5)\n",
+ "results"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Load a Dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Get the first result\n",
+ "dataset = results.iloc[0].FoundryDataset\n",
+ "\n",
+ "# Load the data\n",
+ "data = dataset.get_as_dict()\n",
+ "\n",
+ "# Most datasets have train/test splits with (inputs, targets)\n",
+ "print(f\"Available splits: {list(data.keys())}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Extract training data\n",
+ "X_train, y_train = data['train']\n",
+ "\n",
+ "print(f\"Input features: {type(X_train)}\")\n",
+ "print(f\"Targets: {type(y_train)}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. Use the Data\n",
+ "\n",
+ "Foundry data works with any ML framework."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Example: Simple sklearn model\n",
+ "from sklearn.ensemble import RandomForestRegressor\n",
+ "from sklearn.metrics import mean_absolute_error\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Convert to arrays if needed\n",
+ "if isinstance(X_train, dict):\n",
+ " X_train_df = pd.DataFrame(X_train)\n",
+ " y_train_arr = list(y_train.values())[0] # Get first target\n",
+ "else:\n",
+ " X_train_df = X_train\n",
+ " y_train_arr = y_train\n",
+ "\n",
+ "print(f\"Training samples: {len(X_train_df)}\")\n",
+ "print(f\"Features: {X_train_df.shape[1] if hasattr(X_train_df, 'shape') else 'N/A'}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Get Citation\n",
+ "\n",
+ "Always cite datasets you use in publications!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "citation = dataset.get_citation()\n",
+ "print(citation)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "## Summary\n\n```python\nfrom foundry import Foundry\n\nf = Foundry() # HTTPS download by default\nresults = f.search(\"band gap\")\ndataset = results.iloc[0].FoundryDataset\nX, y = dataset.get_as_dict()['train']\n```\n\n**Next:** See `02_working_with_data.ipynb` for advanced data handling."
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.9.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git a/examples/02_working_with_data/working_with_data.ipynb b/examples/02_working_with_data/working_with_data.ipynb
new file mode 100644
index 00000000..a20b5729
--- /dev/null
+++ b/examples/02_working_with_data/working_with_data.ipynb
@@ -0,0 +1,296 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Working with Foundry Data\n",
+ "\n",
+ "**Time:** 15 minutes \n",
+ "**Prerequisites:** Completed quickstart \n",
+ "**What you'll learn:**\n",
+ "- Understanding dataset schemas\n",
+ "- Loading specific splits\n",
+ "- Using data with PyTorch and TensorFlow\n",
+ "- Working with different data types\n",
+ "- JSON output for programmatic access\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "from foundry import Foundry\n\n# HTTPS download is now the default\nf = Foundry()"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Understanding Dataset Schemas\n",
+ "\n",
+ "Before loading data, understand what's in it using `get_schema()`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Get a dataset\nresults = f.search(\"band gap\", limit=1)\ndataset = results.iloc[0].FoundryDataset\n\n# Get the schema\nschema = dataset.get_schema()\n\nprint(f\"Dataset: {schema['name']}\")\nprint(f\"Title: {schema['title']}\")\nprint(f\"DOI: {schema['doi']}\")\nprint(f\"Data Type: {schema['data_type']}\")"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Examine fields (columns)\nprint(\"Fields:\")\nprint(\"-\" * 60)\nfor field in schema['fields']:\n role = field['role'] # 'input' or 'target'\n name = field['name']\n desc = field['description'] or 'No description'\n units = field['units'] or ''\n print(f\" [{role:6}] {name}: {desc} {f'({units})' if units else ''}\")"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Examine splits (train/test/validation)\nprint(\"Splits:\")\nprint(\"-\" * 60)\nfor split in schema['splits']:\n print(f\" - {split['name']}: {split.get('type', 'data')}\")"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Loading Specific Splits\n",
+ "\n",
+ "Load only the data you need."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Starting Download of: https://data.materialsdatafacility.org/foundry/foundry_g4mp2_solvation_v1.2/g4mp2_data.json\n",
+ "Downloading... 206.19 MBTraining data keys: dict_keys(['train'])\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Load only training data\n",
+ "train_data = dataset.get_as_dict(split='train')\n",
+ "print(f\"Training data keys: {train_data.keys() if isinstance(train_data, dict) else type(train_data)}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "All splits: ['train']\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Load all splits at once\n",
+ "all_data = dataset.get_as_dict()\n",
+ "print(f\"All splits: {list(all_data.keys())}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Loading with Schema Information\n",
+ "\n",
+ "Use `include_schema=True` to get data AND metadata together. This is especially useful for programmatic/agent workflows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Get data with schema attached\nresult = dataset.get_as_dict(include_schema=True)\n\nprint(f\"Result keys: {result.keys()}\")\nprint(f\"\\nSchema name: {result['schema']['name']}\")\nprint(f\"Data splits: {list(result['data'].keys())}\")"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. PyTorch Integration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Load as a PyTorch Dataset\n",
+ "try:\n",
+ " torch_dataset = dataset.get_as_torch(split='train')\n",
+ " \n",
+ " # Use with DataLoader\n",
+ " from torch.utils.data import DataLoader\n",
+ " loader = DataLoader(torch_dataset, batch_size=32, shuffle=True)\n",
+ " \n",
+ " # Get a batch\n",
+ " batch = next(iter(loader))\n",
+ " print(f\"Batch type: {type(batch)}\")\n",
+ " print(f\"Batch size: {len(batch[0]) if isinstance(batch, tuple) else batch.shape[0]}\")\n",
+ "except ImportError:\n",
+ " print(\"PyTorch not installed. Install with: pip install torch\")\n",
+ "except Exception as e:\n",
+ " print(f\"Could not load as PyTorch: {e}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. TensorFlow Integration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Load as a TensorFlow Dataset\n",
+ "try:\n",
+ " tf_dataset = dataset.get_as_tensorflow(split='train')\n",
+ " \n",
+ " # Batch and prefetch\n",
+ " tf_dataset = tf_dataset.batch(32).prefetch(1)\n",
+ " \n",
+ " # Get a batch\n",
+ " for batch in tf_dataset.take(1):\n",
+ " print(f\"Batch type: {type(batch)}\")\n",
+ "except ImportError:\n",
+ " print(\"TensorFlow not installed. Install with: pip install tensorflow\")\n",
+ "except Exception as e:\n",
+ " print(f\"Could not load as TensorFlow: {e}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6. JSON Output for Programmatic Access\n",
+ "\n",
+ "Use `as_json=True` for agent-friendly output (lists of dicts instead of DataFrames)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Search with JSON output\n# as_json=True returns a list of dicts instead of a DataFrame\nresults_json = f.search(\"band gap\", limit=3, as_json=True)\n\nprint(f\"Type: {type(results_json)}\")\nprint(f\"Number of results: {len(results_json)}\")\n\nfor ds in results_json:\n print(f\"\\n- {ds['name']}\")\n print(f\" Title: {ds['title']}\")\n print(f\" DOI: {ds['doi']}\")\n print(f\" Fields: {ds.get('fields', [])}\")"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# List all datasets as JSON\nimport json\n\nall_datasets = f.list(limit=5, as_json=True)\nprint(json.dumps(all_datasets[0], indent=2))"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 7. Browsing the Catalog"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# List all available datasets\n",
+ "catalog = f.list(limit=10)\n",
+ "catalog"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Get a specific dataset by DOI\n",
+ "# Replace with a real DOI from your search results\n",
+ "# dataset = f.get_dataset(\"10.18126/xyz\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 8. Working with HDF5 Data\n",
+ "\n",
+ "Some datasets use HDF5 format for large arrays."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Load data keeping HDF5 format (for very large datasets)\n",
+ "# data_hdf5 = dataset.get_as_dict(as_hdf5=True)\n",
+ "# This returns h5py objects that load lazily\n",
+ "print(\"Use as_hdf5=True for lazy loading of large datasets\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Summary\n",
+ "\n",
+ "| Method | Use Case |\n",
+ "|--------|----------|\n",
+ "| `get_schema()` | Understand dataset structure before loading |\n",
+ "| `get_as_dict()` | General purpose loading |\n",
+ "| `get_as_dict(split='train')` | Load specific split |\n",
+ "| `get_as_dict(include_schema=True)` | Data + metadata together |\n",
+ "| `get_as_torch()` | PyTorch DataLoader compatible |\n",
+ "| `get_as_tensorflow()` | tf.data.Dataset compatible |\n",
+ "| `f.search(as_json=True)` | Programmatic/agent access |\n",
+ "\n",
+ "**Next:** See `03_advanced_workflows.ipynb` for publishing, CLI, and agent integration."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.1"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git a/examples/03_advanced_workflows/advanced_workflows.ipynb b/examples/03_advanced_workflows/advanced_workflows.ipynb
new file mode 100644
index 00000000..fabe855a
--- /dev/null
+++ b/examples/03_advanced_workflows/advanced_workflows.ipynb
@@ -0,0 +1,368 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Advanced Foundry Workflows\n",
+ "\n",
+ "**Time:** 20 minutes \n",
+ "**Prerequisites:** Completed previous examples \n",
+ "**What you'll learn:**\n",
+ "- Publishing datasets to Foundry\n",
+ "- Exporting to HuggingFace Hub\n",
+ "- Using the CLI\n",
+ "- MCP server for AI agent integration\n",
+ "- Structured error handling\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Publishing Datasets\n",
+ "\n",
+ "Share your data with the materials science community."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "from foundry import Foundry\n\n# HTTPS download is now the default\nf = Foundry()"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1.1 Prepare Your Metadata\n",
+ "\n",
+ "Foundry uses DataCite metadata standard. Create a JSON file describing your dataset:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "metadata = {\n",
+ " \"dc\": {\n",
+ " \"titles\": [{\"title\": \"My Band Gap Dataset\"}],\n",
+ " \"creators\": [\n",
+ " {\"creatorName\": \"Smith, John\", \"affiliation\": \"University of Example\"}\n",
+ " ],\n",
+ " \"descriptions\": [{\n",
+ " \"description\": \"Band gap predictions for 1000 materials using DFT calculations.\",\n",
+ " \"descriptionType\": \"Abstract\"\n",
+ " }],\n",
+ " \"publicationYear\": 2024,\n",
+ " \"publisher\": \"Foundry\",\n",
+ " \"resourceType\": {\"resourceType\": \"Dataset\", \"resourceTypeGeneral\": \"Dataset\"}\n",
+ " },\n",
+ " \"foundry\": {\n",
+ " \"data_type\": \"tabular\",\n",
+ " \"keys\": [\n",
+ " {\n",
+ " \"key\": [\"composition\"],\n",
+ " \"type\": \"input\",\n",
+ " \"description\": \"Chemical composition formula\"\n",
+ " },\n",
+ " {\n",
+ " \"key\": [\"band_gap\"],\n",
+ " \"type\": \"target\",\n",
+ " \"description\": \"Calculated band gap\",\n",
+ " \"units\": \"eV\"\n",
+ " }\n",
+ " ],\n",
+ " \"splits\": [\n",
+ " {\"label\": \"train\", \"type\": \"train\"},\n",
+ " {\"label\": \"test\", \"type\": \"test\"}\n",
+ " ]\n",
+ " }\n",
+ "}\n",
+ "\n",
+ "print(\"Metadata structure ready!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1.2 Publish (requires Globus authentication)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# To publish:\n",
+ "# 1. Save metadata to foundry.json\n",
+ "# 2. Prepare data files in a folder\n",
+ "# 3. Run:\n",
+ "\n",
+ "# result = f.publish(\n",
+ "# metadata,\n",
+ "# data_path=\"./my_data_folder\",\n",
+ "# source_id=\"my_dataset_v1.0\"\n",
+ "# )\n",
+ "\n",
+ "print(\"See foundry.publish() documentation for full publishing workflow\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Check publication status\n",
+ "# status = f.check_status(\"my_dataset_v1.0\")\n",
+ "# print(status)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Exporting to HuggingFace Hub\n",
+ "\n",
+ "Make your Foundry dataset discoverable on HuggingFace Hub."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Install HuggingFace dependencies\n",
+ "# !pip install foundry-ml[huggingface]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Export a Foundry dataset to HuggingFace Hub\n",
+ "from foundry.integrations.huggingface import push_to_hub\n",
+ "\n",
+ "# Get a dataset\n",
+ "results = f.search(\"band gap\", limit=1)\n",
+ "dataset = results.iloc[0].FoundryDataset\n",
+ "\n",
+ "# Export to HF Hub (requires HF token)\n",
+ "# url = push_to_hub(\n",
+ "# dataset,\n",
+ "# repo_id=\"your-username/dataset-name\",\n",
+ "# token=\"hf_YOUR_TOKEN\", # or set HF_TOKEN env var\n",
+ "# private=False\n",
+ "# )\n",
+ "# print(f\"Published at: {url}\")\n",
+ "\n",
+ "print(\"HuggingFace export ready! Set your HF token to publish.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### What Gets Created on HuggingFace\n",
+ "\n",
+ "The export automatically creates:\n",
+ "- **Data files** in Parquet/Arrow format\n",
+ "- **Dataset Card (README.md)** with:\n",
+ " - Title, description from DataCite\n",
+ " - **Authors from original creators** (not the person pushing)\n",
+ " - DOI link to original source\n",
+ " - Field descriptions and units\n",
+ " - BibTeX citation\n",
+ " - Usage examples for both Foundry and HF"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Using the CLI\n",
+ "\n",
+ "Foundry includes a command-line interface for quick operations.\n",
+ "\n",
+ "```bash\n",
+ "# Search for datasets\n",
+ "foundry search \"band gap\"\n",
+ "\n",
+ "# Get dataset info\n",
+ "foundry get 10.18126/abc123\n",
+ "\n",
+ "# View schema\n",
+ "foundry schema 10.18126/abc123\n",
+ "\n",
+ "# List all datasets\n",
+ "foundry catalog --limit 10\n",
+ "\n",
+ "# JSON output for scripting\n",
+ "foundry catalog --json | jq '.[] | .name'\n",
+ "\n",
+ "# Export to HuggingFace\n",
+ "foundry push-to-hf 10.18126/abc123 --repo your-org/dataset-name\n",
+ "\n",
+ "# Check version\n",
+ "foundry version\n",
+ "\n",
+ "# Get help\n",
+ "foundry --help\n",
+ "foundry search --help\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Run CLI from notebook\n",
+ "!foundry --help"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. MCP Server for AI Agents\n",
+ "\n",
+ "Foundry includes an MCP (Model Context Protocol) server that allows AI agents like Claude Code to discover and use datasets.\n",
+ "\n",
+ "### 4.1 Install for Claude Code\n",
+ "\n",
+ "```bash\n",
+ "# Automatically configure Claude Code to use Foundry\n",
+ "foundry mcp install\n",
+ "```\n",
+ "\n",
+ "This adds Foundry to your Claude Code configuration, enabling commands like:\n",
+ "- \"Find me a materials science dataset for band gap prediction\"\n",
+ "- \"What fields are in dataset X?\"\n",
+ "- \"Load the training data from dataset Y\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 4.2 Available MCP Tools\n",
+ "\n",
+ "The MCP server exposes these tools to AI agents:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "from foundry.mcp.server import TOOLS, create_server\n\nprint(\"Available MCP Tools:\")\nprint(\"=\" * 50)\nfor tool in TOOLS:\n print(f\"\\n{tool['name']}\")\n print(f\" {tool['description'][:80]}...\")\n print(f\" Parameters: {list(tool['inputSchema']['properties'].keys())}\")"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# View full server configuration\nconfig = create_server()\nprint(f\"Server: {config['name']} v{config['version']}\")\nprint(f\"Tools: {len(config['tools'])}\")"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 4.3 Start MCP Server Manually\n",
+ "\n",
+ "```bash\n",
+ "# Start the server (for custom agent integrations)\n",
+ "foundry mcp start\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Structured Error Handling\n",
+ "\n",
+ "Foundry uses structured errors that provide clear context for both humans and AI agents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "from foundry.errors import (\n FoundryError,\n DatasetNotFoundError,\n AuthenticationError,\n DownloadError,\n)\n\n# Example: Handle a not-found error\ntry:\n raise DatasetNotFoundError(\"fake-doi-12345\")\nexcept DatasetNotFoundError as e:\n print(f\"Error Code: {e.code}\")\n print(f\"Message: {e.message}\")\n print(f\"Details: {e.details}\")\n print(f\"Recovery Hint: {e.recovery_hint}\")"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Errors can be serialized for API responses\nimport json\n\nerror = DatasetNotFoundError(\"test-query\")\nerror_dict = error.to_dict()\nprint(json.dumps(error_dict, indent=2))"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Error Types\n",
+ "\n",
+ "| Error Class | Code | When It's Raised |\n",
+ "|------------|------|------------------|\n",
+ "| `DatasetNotFoundError` | DATASET_NOT_FOUND | Search/get returns no results |\n",
+ "| `AuthenticationError` | AUTH_FAILED | Globus/service auth fails |\n",
+ "| `DownloadError` | DOWNLOAD_FAILED | File download fails |\n",
+ "| `DataLoadError` | DATA_LOAD_FAILED | Cannot parse data file |\n",
+ "| `ValidationError` | VALIDATION_FAILED | Metadata validation error |\n",
+ "| `PublishError` | PUBLISH_FAILED | Publishing workflow fails |\n",
+ "| `CacheError` | CACHE_ERROR | Local cache issue |\n",
+ "| `ConfigurationError` | CONFIG_ERROR | Invalid config setting |"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6. Complete Workflow Example\n",
+ "\n",
+ "Here's a complete workflow from discovery to model training:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "from foundry import Foundry\nfrom foundry.errors import DatasetNotFoundError\n\ndef train_band_gap_model():\n \"\"\"Complete workflow: discover -> understand -> load -> train.\"\"\"\n \n f = Foundry()\n \n # 1. Discover\n print(\"1. Searching for datasets...\")\n results = f.search(\"band gap\", limit=5, as_json=True)\n \n if not results:\n raise DatasetNotFoundError(\"band gap\")\n \n print(f\" Found {len(results)} datasets\")\n \n # 2. Understand\n print(\"\\n2. Getting dataset schema...\")\n dataset = f.list(limit=1).iloc[0].FoundryDataset\n schema = dataset.get_schema()\n \n print(f\" Dataset: {schema['name']}\")\n print(f\" Fields: {[f['name'] for f in schema['fields']]}\")\n print(f\" Splits: {[s['name'] for s in schema['splits']]}\")\n \n # 3. Load (with schema for context)\n print(\"\\n3. Loading data...\")\n result = dataset.get_as_dict(include_schema=True)\n \n data = result['data']\n print(f\" Loaded splits: {list(data.keys())}\")\n \n # 4. Train\n print(\"\\n4. Ready to train!\")\n if 'train' in data:\n X_train, y_train = data['train']\n print(f\" Training samples available\")\n \n # 5. Cite\n print(\"\\n5. Citation:\")\n print(dataset.get_citation())\n \n return dataset\n\n# Run it\ntry:\n ds = train_band_gap_model()\nexcept Exception as e:\n print(f\"Workflow failed: {e}\")"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "## Summary\n\n### Publishing\n```python\nf.publish(metadata, data_path=\"./data\", source_id=\"my_dataset_v1\")\nf.check_status(\"my_dataset_v1\")\n```\n\n### HuggingFace Export\n```python\nfrom foundry.integrations.huggingface import push_to_hub\npush_to_hub(dataset, \"org/name\", token=\"hf_xxx\")\n```\n\n### CLI\n```bash\nfoundry search \"query\"\nfoundry schema \nfoundry mcp install\n```\n\n### Error Handling\n```python\nfrom foundry.errors import DatasetNotFoundError\ntry:\n f.get_dataset(doi)\nexcept DatasetNotFoundError as e:\n print(e.recovery_hint)\n```\n\n### Configuration\n```python\n# Default: HTTPS download (no Globus needed)\nf = Foundry()\n\n# For cloud environments (Colab, etc.)\nf = Foundry(no_browser=True, no_local_server=True)\n\n# For Globus transfers (large datasets, institutional endpoints)\nf = Foundry(use_globus=True)\n```\n\n---\n\n**You've completed the Foundry tutorial!**\n\n- Documentation: https://github.com/MLMI2-CSSI/foundry\n- Issues: https://github.com/MLMI2-CSSI/foundry/issues"
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.9.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git a/examples/README.md b/examples/README.md
index cd6dfc27..fbed38ea 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,8 +1,53 @@
# Examples using Foundry
+
If you're wondering how to get started with Foundry or want to see it in action, you're in the right place!
-Each notebook walks through instantiating Foundry, loading data from Foundry, and working with the data in different ways. Some notebooks also use machine learning models with the data.
+## Tutorials (Start Here)
+
+| # | Name | Time | Description |
+|---|------|------|-------------|
+| 00 | [Hello Foundry](./00_hello_foundry/) | 5 min | Absolute basics - your first dataset |
+| 01 | [Quickstart](./01_quickstart/) | 5 min | Search, load, use in ML workflow |
+| 02 | [Working with Data](./02_working_with_data/) | 15 min | Schemas, splits, PyTorch/TensorFlow |
+| 03 | [Advanced Workflows](./03_advanced_workflows/) | 20 min | Publishing, HuggingFace, CLI, MCP |
+
+## Domain Examples
+
+Each folder contains a notebook and `requirements.txt` file. The notebooks can be run locally or in [Google Colab](https://colab.research.google.com/).
+
+| Example | Domain | Description |
+|---------|--------|-------------|
+| [bandgap](./bandgap/) | Materials | Band gap prediction |
+| [oqmd](./oqmd/) | Materials | Open Quantum Materials Database |
+| [zeolite](./zeolite/) | Chemistry | Zeolite structure analysis |
+| [dendrite-segmentation](./dendrite-segmentation/) | Imaging | Microscopy segmentation |
+| [atom-position-finding](./atom-position-finding/) | Imaging | Atom localization |
+
+## Quick Reference
+
+```python
+from foundry import Foundry
+
+f = Foundry() # HTTPS download by default
+results = f.search("band gap", limit=5)
+dataset = results.iloc[0].FoundryDataset
+X, y = dataset.get_as_dict()['train']
+```
+
+**Cloud environments (Colab, etc.):**
+```python
+f = Foundry(no_browser=True, no_local_server=True)
+```
+
+**For large datasets with Globus:**
+```python
+f = Foundry(use_globus=True) # Requires Globus Connect Personal
+```
-Each folder contains a notebook and `requirements.txt` file. The notebooks can be run locally (using the `requirements.txt`) or in [Google Colab](https://colab.research.google.com/).
+**CLI:**
+```bash
+foundry search "band gap"
+foundry schema <doi>
+```
-If you have any trouble with the notebooks, please check our [documentation](https://ai-materials-and-chemistry.gitbook.io/foundry/v/docs/) or create an issue on the repo.
+If you have any trouble, check our [documentation](https://ai-materials-and-chemistry.gitbook.io/foundry/v/docs/) or create an issue.
diff --git a/foundry/__init__.py b/foundry/__init__.py
index 78883ed9..5721a2c7 100644
--- a/foundry/__init__.py
+++ b/foundry/__init__.py
@@ -3,3 +3,14 @@
from . import https_download # noqa F401 (import unused)
from . import https_upload # noqa F401 (import unused)
from .foundry_dataset import FoundryDataset # noqa F401 (import unused)
+from .errors import ( # noqa F401 (import unused)
+ FoundryError,
+ DatasetNotFoundError,
+ AuthenticationError,
+ DownloadError,
+ DataLoadError,
+ ValidationError,
+ PublishError,
+ CacheError,
+ ConfigurationError,
+)
diff --git a/foundry/__main__.py b/foundry/__main__.py
new file mode 100644
index 00000000..ca3b1c73
--- /dev/null
+++ b/foundry/__main__.py
@@ -0,0 +1,419 @@
+"""Foundry CLI - Beautiful command-line interface for materials science datasets.
+
+Usage:
+ foundry search "bandgap" # Search datasets
+    foundry get <doi>             # Download a dataset
+    foundry list                  # List available datasets
+    foundry schema <doi>          # Show dataset schema
+    foundry status <source_id>    # Check publication status
+ foundry mcp start # Start MCP server for agent integration
+"""
+
+import json
+import sys
+from typing import Optional
+
+import typer
+from rich.console import Console
+from rich.table import Table
+from rich.panel import Panel
+from rich.progress import Progress, SpinnerColumn, TextColumn
+from rich import print as rprint
+
+app = typer.Typer(
+ name="foundry",
+ help="Foundry-ML: Materials science datasets for machine learning.",
+ add_completion=False,
+ no_args_is_help=True,
+)
+mcp_app = typer.Typer(help="MCP server commands for agent integration.")
+app.add_typer(mcp_app, name="mcp")
+
+console = Console()
+
+
+def get_foundry(no_browser: bool = True):
+ """Get a Foundry client instance."""
+ from foundry import Foundry
+ return Foundry(no_browser=no_browser, no_local_server=True)
+
+
+@app.command()
+def search(
+ query: str = typer.Argument(..., help="Search query (e.g., 'bandgap', 'crystal structure')"),
+ limit: int = typer.Option(10, "--limit", "-l", help="Maximum number of results"),
+ output_json: bool = typer.Option(False, "--json", "-j", help="Output as JSON"),
+):
+ """Search for datasets matching a query."""
+ with Progress(
+ SpinnerColumn(),
+ TextColumn("[progress.description]{task.description}"),
+ console=console,
+ transient=True,
+ ) as progress:
+ progress.add_task("Searching datasets...", total=None)
+ f = get_foundry()
+ results = f.search(query, limit=limit)
+
+ if len(results) == 0:
+ console.print(f"[yellow]No datasets found matching '{query}'[/yellow]")
+ raise typer.Exit(1)
+
+ if output_json:
+ # Output as JSON for programmatic use
+ output = []
+ for _, row in results.iterrows():
+ ds = row.FoundryDataset
+ output.append({
+ "name": ds.dataset_name,
+ "title": ds.dc.titles[0].title if ds.dc.titles else None,
+ "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None,
+ })
+ console.print(json.dumps(output, indent=2))
+ else:
+ # Pretty table output
+ table = Table(title=f"Search Results for '{query}'")
+ table.add_column("Name", style="cyan", no_wrap=True)
+ table.add_column("Title", style="green")
+ table.add_column("DOI", style="dim")
+
+ for _, row in results.iterrows():
+ ds = row.FoundryDataset
+ title = ds.dc.titles[0].title if ds.dc.titles else "N/A"
+ doi = str(ds.dc.identifier.identifier) if ds.dc.identifier else "N/A"
+ table.add_row(ds.dataset_name, title[:50] + "..." if len(title) > 50 else title, doi)
+
+ console.print(table)
+ console.print(f"\n[dim]Found {len(results)} dataset(s). Use 'foundry get ' to download.[/dim]")
+
+
+@app.command()
+def list_datasets(
+ limit: int = typer.Option(20, "--limit", "-l", help="Maximum number of results"),
+ output_json: bool = typer.Option(False, "--json", "-j", help="Output as JSON"),
+):
+ """List all available datasets."""
+ with Progress(
+ SpinnerColumn(),
+ TextColumn("[progress.description]{task.description}"),
+ console=console,
+ transient=True,
+ ) as progress:
+ progress.add_task("Fetching dataset list...", total=None)
+ f = get_foundry()
+ results = f.list(limit=limit)
+
+ if output_json:
+ output = []
+ for _, row in results.iterrows():
+ ds = row.FoundryDataset
+ output.append({
+ "name": ds.dataset_name,
+ "title": ds.dc.titles[0].title if ds.dc.titles else None,
+ "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None,
+ })
+ console.print(json.dumps(output, indent=2))
+ else:
+ table = Table(title="Available Datasets")
+ table.add_column("Name", style="cyan", no_wrap=True)
+ table.add_column("Title", style="green")
+ table.add_column("DOI", style="dim")
+
+ for _, row in results.iterrows():
+ ds = row.FoundryDataset
+ title = ds.dc.titles[0].title if ds.dc.titles else "N/A"
+ doi = str(ds.dc.identifier.identifier) if ds.dc.identifier else "N/A"
+ table.add_row(ds.dataset_name, title[:50] + "..." if len(title) > 50 else title, doi)
+
+ console.print(table)
+ console.print(f"\n[dim]Showing {len(results)} dataset(s).[/dim]")
+
+
+# Alias for list
+app.command(name="list")(list_datasets)
+
+
+@app.command()
+def get(
+ doi: str = typer.Argument(..., help="DOI of the dataset to download"),
+ split: Optional[str] = typer.Option(None, "--split", "-s", help="Specific split to download"),
+ output_dir: Optional[str] = typer.Option(None, "--output", "-o", help="Output directory"),
+):
+ """Download a dataset by DOI."""
+ with Progress(
+ SpinnerColumn(),
+ TextColumn("[progress.description]{task.description}"),
+ console=console,
+ ) as progress:
+ task = progress.add_task("Connecting to Foundry...", total=None)
+ f = get_foundry()
+
+ progress.update(task, description=f"Fetching dataset {doi}...")
+ try:
+ dataset = f.get_dataset(doi)
+ except Exception as e:
+ console.print(f"[red]Error: Could not find dataset '{doi}'[/red]")
+ console.print(f"[dim]{e}[/dim]")
+ raise typer.Exit(1)
+
+ progress.update(task, description="Downloading data...")
+ try:
+ data = dataset.get_as_dict(split=split)
+ except Exception as e:
+ console.print(f"[red]Error downloading data: {e}[/red]")
+ raise typer.Exit(1)
+
+ # Show summary
+ console.print(Panel(
+ f"[green]Successfully downloaded![/green]\n\n"
+ f"[bold]Dataset:[/bold] {dataset.dataset_name}\n"
+ f"[bold]Title:[/bold] {dataset.dc.titles[0].title if dataset.dc.titles else 'N/A'}\n"
+ f"[bold]Splits:[/bold] {', '.join(data.keys()) if isinstance(data, dict) else 'N/A'}",
+ title="Download Complete",
+ ))
+
+
+@app.command()
+def schema(
+ doi: str = typer.Argument(..., help="DOI of the dataset"),
+ output_json: bool = typer.Option(False, "--json", "-j", help="Output as JSON"),
+):
+ """Show the schema of a dataset - what fields it contains."""
+ with Progress(
+ SpinnerColumn(),
+ TextColumn("[progress.description]{task.description}"),
+ console=console,
+ transient=True,
+ ) as progress:
+ progress.add_task("Fetching schema...", total=None)
+ f = get_foundry()
+ try:
+ dataset = f.get_dataset(doi)
+ except Exception as e:
+ console.print(f"[red]Error: Could not find dataset '{doi}'[/red]")
+ raise typer.Exit(1)
+
+ schema_info = {
+ "name": dataset.dataset_name,
+ "title": dataset.dc.titles[0].title if dataset.dc.titles else None,
+ "doi": str(dataset.dc.identifier.identifier) if dataset.dc.identifier else None,
+ "data_type": dataset.foundry_schema.data_type,
+ "splits": [
+ {"name": s.label, "type": s.type}
+ for s in (dataset.foundry_schema.splits or [])
+ ],
+ "fields": [
+ {
+ "name": k.key[0] if k.key else None,
+ "role": k.type,
+ "description": k.description,
+ "units": k.units,
+ }
+ for k in (dataset.foundry_schema.keys or [])
+ ],
+ }
+
+ if output_json:
+ console.print(json.dumps(schema_info, indent=2))
+ else:
+ # Pretty output
+ console.print(Panel(
+ f"[bold]{schema_info['title']}[/bold]\n"
+ f"[dim]DOI: {schema_info['doi']}[/dim]\n"
+ f"[dim]Type: {schema_info['data_type']}[/dim]",
+ title=f"Dataset: {schema_info['name']}",
+ ))
+
+ if schema_info['splits']:
+ console.print("\n[bold]Splits:[/bold]")
+ for split in schema_info['splits']:
+ console.print(f" - {split['name']} ({split['type']})")
+
+ if schema_info['fields']:
+ console.print("\n[bold]Fields:[/bold]")
+ table = Table(show_header=True)
+ table.add_column("Name", style="cyan")
+ table.add_column("Role", style="green")
+ table.add_column("Description")
+ table.add_column("Units", style="dim")
+
+ for field in schema_info['fields']:
+ table.add_row(
+ field['name'] or "N/A",
+ field['role'] or "N/A",
+ (field['description'] or "")[:40],
+ field['units'] or "",
+ )
+ console.print(table)
+
+
+@app.command()
+def status(
+ source_id: str = typer.Argument(..., help="Source ID to check status for"),
+ output_json: bool = typer.Option(False, "--json", "-j", help="Output as JSON"),
+):
+ """Check the publication status of a dataset."""
+ with Progress(
+ SpinnerColumn(),
+ TextColumn("[progress.description]{task.description}"),
+ console=console,
+ transient=True,
+ ) as progress:
+ progress.add_task("Checking status...", total=None)
+ f = get_foundry()
+ try:
+ result = f.check_status(source_id)
+ except Exception as e:
+ console.print(f"[red]Error checking status: {e}[/red]")
+ raise typer.Exit(1)
+
+ if output_json:
+ console.print(json.dumps(result, indent=2, default=str))
+ else:
+ console.print(Panel(str(result), title=f"Status: {source_id}"))
+
+
+@app.command()
+def catalog(
+ output_json: bool = typer.Option(True, "--json", "-j", help="Output as JSON (default)"),
+):
+ """Dump all available datasets as JSON catalog."""
+ with Progress(
+ SpinnerColumn(),
+ TextColumn("[progress.description]{task.description}"),
+ console=console,
+ transient=True,
+ ) as progress:
+ progress.add_task("Building catalog...", total=None)
+ f = get_foundry()
+ results = f.list(limit=1000)
+
+ output = []
+ for _, row in results.iterrows():
+ ds = row.FoundryDataset
+ output.append({
+ "name": ds.dataset_name,
+ "title": ds.dc.titles[0].title if ds.dc.titles else None,
+ "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None,
+ "description": ds.dc.descriptions[0].description if ds.dc.descriptions else None,
+ })
+
+ console.print(json.dumps(output, indent=2))
+
+
+@app.command(name="push-to-hf")
+def push_to_hf(
+ doi: str = typer.Argument(..., help="DOI of the dataset to export"),
+ repo: str = typer.Option(..., "--repo", "-r", help="HuggingFace repo ID (e.g., 'org/dataset-name')"),
+ token: Optional[str] = typer.Option(None, "--token", "-t", help="HuggingFace API token"),
+ private: bool = typer.Option(False, "--private", "-p", help="Create a private repository"),
+):
+ """Export a Foundry dataset to Hugging Face Hub.
+
+ This makes materials science datasets discoverable in the broader ML ecosystem.
+
+ Example:
+ foundry push-to-hf 10.18126/abc123 --repo my-org/bandgap-data
+ """
+ try:
+ from foundry.integrations.huggingface import push_to_hub
+ except ImportError:
+ console.print(
+ "[red]HuggingFace integration not installed.[/red]\n"
+ "Install with: [cyan]pip install foundry-ml[huggingface][/cyan]"
+ )
+ raise typer.Exit(1)
+
+ with Progress(
+ SpinnerColumn(),
+ TextColumn("[progress.description]{task.description}"),
+ console=console,
+ ) as progress:
+ task = progress.add_task("Connecting to Foundry...", total=None)
+ f = get_foundry()
+
+ progress.update(task, description=f"Loading dataset {doi}...")
+ try:
+ dataset = f.get_dataset(doi)
+ except Exception as e:
+ console.print(f"[red]Error: Could not find dataset '{doi}'[/red]")
+ console.print(f"[dim]{e}[/dim]")
+ raise typer.Exit(1)
+
+ progress.update(task, description=f"Exporting to HuggingFace Hub...")
+ try:
+ url = push_to_hub(dataset, repo, token=token, private=private)
+ except Exception as e:
+ console.print(f"[red]Error exporting to HuggingFace: {e}[/red]")
+ raise typer.Exit(1)
+
+ console.print(Panel(
+ f"[green]Successfully exported to HuggingFace Hub![/green]\n\n"
+ f"[bold]Dataset:[/bold] {dataset.dataset_name}\n"
+ f"[bold]Repository:[/bold] {repo}\n"
+ f"[bold]URL:[/bold] [link={url}]{url}[/link]",
+ title="Export Complete",
+ ))
+
+
+# MCP subcommands
+@mcp_app.command()
+def start(
+ host: str = typer.Option("localhost", "--host", "-h", help="Host to bind to"),
+ port: int = typer.Option(8765, "--port", "-p", help="Port to bind to"),
+):
+ """Start the MCP server for agent integration."""
+ console.print(Panel(
+ "[bold green]Starting Foundry MCP Server[/bold green]\n\n"
+ f"Host: {host}\n"
+ f"Port: {port}\n\n"
+ "This server allows AI agents to discover and load materials science datasets.\n"
+ "Connect using the Model Context Protocol.",
+ title="MCP Server",
+ ))
+
+ try:
+ from foundry.mcp.server import run_server
+ run_server(host=host, port=port)
+ except ImportError:
+ console.print("[yellow]MCP server not yet implemented. Coming soon![/yellow]")
+ raise typer.Exit(1)
+
+
+@mcp_app.command()
+def install():
+ """Install Foundry as an MCP server in Claude Code."""
+ console.print(Panel(
+ "[bold]To install Foundry in Claude Code:[/bold]\n\n"
+ "Add this to your Claude Code MCP configuration:\n\n"
+ "[cyan]{\n"
+ ' "mcpServers": {\n'
+ ' "foundry": {\n'
+ ' "command": "python",\n'
+ ' "args": ["-m", "foundry", "mcp", "start"]\n'
+ " }\n"
+ " }\n"
+ "}[/cyan]\n\n"
+ "Then restart Claude Code.",
+ title="MCP Installation",
+ ))
+
+
+@app.command()
+def version():
+ """Show Foundry version."""
+ try:
+ from importlib.metadata import version as get_version
+ v = get_version("foundry_ml")
+ except Exception:
+ v = "unknown"
+ console.print(f"foundry-ml version {v}")
+
+
+def main():
+ """Entry point for the CLI."""
+ app()
+
+
+if __name__ == "__main__":
+ main()
diff --git a/foundry/errors.py b/foundry/errors.py
new file mode 100644
index 00000000..7b0b8bc0
--- /dev/null
+++ b/foundry/errors.py
@@ -0,0 +1,171 @@
+"""Structured error classes for Foundry.
+
+These errors provide:
+1. Error codes for programmatic handling
+2. Human-readable messages
+3. Recovery hints for agents and users
+4. Structured details for debugging
+
+This enables both humans and AI agents to understand and recover from errors.
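+
+Example (illustrative; assumes a Foundry client `f` has already been created):
+
+    try:
+        f.search("no-such-dataset-xyz")
+    except DatasetNotFoundError as err:
+        print(err.code)            # "DATASET_NOT_FOUND"
+        print(err.recovery_hint)   # actionable suggestion for users and agents
+        payload = err.to_dict()    # JSON-serializable for structured responses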
+"""
+
+from dataclasses import dataclass, field
+from typing import Any, Dict, Optional
+
+
+@dataclass
+class FoundryError(Exception):
+ """Base error class with structured information for agents and users.
+
+ Attributes:
+ code: Machine-readable error code (e.g., "DATASET_NOT_FOUND")
+ message: Human-readable error description
+ details: Additional context for debugging
+ recovery_hint: Actionable suggestion for resolving the error
+ """
+
+ code: str
+ message: str
+ details: Optional[Dict[str, Any]] = field(default=None)
+ recovery_hint: Optional[str] = field(default=None)
+
+ def __post_init__(self):
+ super().__init__(self.message)
+
+ def __str__(self) -> str:
+ parts = [f"[{self.code}] {self.message}"]
+ if self.recovery_hint:
+ parts.append(f"Hint: {self.recovery_hint}")
+ return " ".join(parts)
+
+ def to_dict(self) -> Dict[str, Any]:
+ """Serialize error to dictionary for JSON responses."""
+ return {
+ "code": self.code,
+ "message": self.message,
+ "details": self.details,
+ "recovery_hint": self.recovery_hint,
+ }
+
+
+class DatasetNotFoundError(FoundryError):
+ """Raised when a dataset cannot be found."""
+
+ def __init__(self, query: str, search_type: str = "query"):
+ super().__init__(
+ code="DATASET_NOT_FOUND",
+ message=f"No dataset found matching {search_type}: '{query}'",
+ details={"query": query, "search_type": search_type},
+ recovery_hint=(
+ "Try a broader search term, check the DOI format, "
+ "or use foundry.list() to see available datasets."
+ ),
+ )
+
+
+class AuthenticationError(FoundryError):
+ """Raised when authentication fails."""
+
+ def __init__(self, service: str, reason: str = None):
+ msg = f"Authentication failed for {service}"
+ if reason:
+ msg += f": {reason}"
+ super().__init__(
+ code="AUTH_FAILED",
+ message=msg,
+ details={"service": service, "reason": reason},
+ recovery_hint=(
+ "Run Foundry(no_browser=False) to re-authenticate, "
+ "or check your Globus credentials."
+ ),
+ )
+
+
+class DownloadError(FoundryError):
+ """Raised when a file download fails."""
+
+ def __init__(self, url: str, reason: str, destination: str = None):
+ super().__init__(
+ code="DOWNLOAD_FAILED",
+ message=f"Failed to download from {url}: {reason}",
+ details={"url": url, "reason": reason, "destination": destination},
+ recovery_hint=(
+ "Check your network connection. "
+ "Try use_globus=False for HTTPS fallback, or use_globus=True for Globus transfer."
+ ),
+ )
+
+
+class DataLoadError(FoundryError):
+ """Raised when loading data from a file fails."""
+
+ def __init__(self, file_path: str, reason: str, data_type: str = None):
+ super().__init__(
+ code="DATA_LOAD_FAILED",
+ message=f"Failed to load data from {file_path}: {reason}",
+ details={"file_path": file_path, "reason": reason, "data_type": data_type},
+ recovery_hint=(
+ "Check that the file exists and is not corrupted. "
+ "Try clearing the cache with dataset.clear_dataset_cache()."
+ ),
+ )
+
+
+class ValidationError(FoundryError):
+ """Raised when metadata validation fails."""
+
+ def __init__(self, field_name: str, error_msg: str, schema_type: str = "metadata"):
+ super().__init__(
+ code="VALIDATION_FAILED",
+ message=f"Validation failed for {schema_type} field '{field_name}': {error_msg}",
+ details={"field_name": field_name, "error_msg": error_msg, "schema_type": schema_type},
+ recovery_hint=(
+ "Check the field value against the expected schema. "
+ "See documentation for required metadata format."
+ ),
+ )
+
+
+class PublishError(FoundryError):
+ """Raised when publishing a dataset fails."""
+
+ def __init__(self, reason: str, source_id: str = None, status: str = None):
+ super().__init__(
+ code="PUBLISH_FAILED",
+ message=f"Failed to publish dataset: {reason}",
+ details={"source_id": source_id, "status": status, "reason": reason},
+ recovery_hint=(
+ "Check your metadata is complete and valid. "
+ "Use foundry.check_status(source_id) to monitor publication progress."
+ ),
+ )
+
+
+class CacheError(FoundryError):
+ """Raised when cache operations fail."""
+
+ def __init__(self, operation: str, reason: str, cache_path: str = None):
+ super().__init__(
+ code="CACHE_ERROR",
+ message=f"Cache {operation} failed: {reason}",
+ details={"operation": operation, "reason": reason, "cache_path": cache_path},
+ recovery_hint=(
+ "Try clearing the cache directory manually, "
+ "or check disk space and permissions."
+ ),
+ )
+
+
+class ConfigurationError(FoundryError):
+ """Raised when Foundry is misconfigured."""
+
+ def __init__(self, setting: str, reason: str, current_value: Any = None):
+ super().__init__(
+ code="CONFIG_ERROR",
+ message=f"Configuration error for '{setting}': {reason}",
+ details={"setting": setting, "reason": reason, "current_value": current_value},
+ recovery_hint=(
+ "Check your Foundry initialization parameters. "
+ "See documentation for valid configuration options."
+ ),
+ )
diff --git a/foundry/foundry.py b/foundry/foundry.py
index a3d0ca5b..5785dff1 100644
--- a/foundry/foundry.py
+++ b/foundry/foundry.py
@@ -5,7 +5,7 @@
from pydantic import Field, ConfigDict
from mdf_connect_client import MDFConnectClient
-from mdf_forge import Forge
+from .mdf_client import MDFClient
from globus_sdk import AuthClient
from .auth import PubAuths
@@ -91,7 +91,7 @@ class Foundry(FoundryBase):
index: str = Field(default="")
auths: Any = Field(default=None)
- use_globus: bool = Field(default=True)
+ use_globus: bool = Field(default=False)
verbose: bool = Field(default=False)
interval: int = Field(default=10)
parallel_https: int = Field(default=4)
@@ -108,7 +108,7 @@ def __init__(self,
no_local_server: bool = False,
index: str = "mdf",
authorizers: dict = None,
- use_globus: bool = True,
+ use_globus: bool = False,
verbose: bool = False,
interval: int = 10,
parallel_https: int = 4,
@@ -157,7 +157,7 @@ def __init__(self,
# add special SearchAuthorizer object
self.auths['search_authorizer'] = search_auth['search']
- self.forge_client = Forge(
+ self.forge_client = MDFClient(
index=index,
services=None,
search_client=self.auths["search"],
@@ -194,7 +194,7 @@ def __init__(self,
verbose,
local_cache_dir)
- def search(self, query: str = None, limit: int = None, as_list: bool = False) -> List[FoundryDataset]:
+ def search(self, query: str = None, limit: int = None, as_list: bool = False, as_json: bool = False) -> List[FoundryDataset]:
"""Search available Foundry datasets
This method searches for available Foundry datasets based on the provided query string.
@@ -206,9 +206,10 @@ def search(self, query: str = None, limit: int = None, as_list: bool = False) ->
query (str): The query string to match. If a DOI is provided, it retrieves the metadata for that specific dataset.
limit (int): The maximum number of results to return.
as_list (bool): If True, the search results will be returned as a list instead of a DataFrame.
+ as_json (bool): If True, return results as a list of dictionaries (agent-friendly).
Returns:
- List[FoundryDataset] or DataFrame: A list of search results as FoundryDataset objects or a DataFrame if as_list is False.
+ List[FoundryDataset] or DataFrame or List[dict]: Search results in the requested format.
Raises:
Exception: If no results are found for the provided query.
@@ -219,13 +220,15 @@ def search(self, query: str = None, limit: int = None, as_list: bool = False) ->
>>> print(len(results))
10
"""
+ from .errors import DatasetNotFoundError
+
if (query is not None) and (is_doi(query)):
metadata_list = [self.get_metadata_by_doi(query)]
else:
metadata_list = self.get_metadata_by_query(query, limit)
if len(metadata_list) == 0:
- raise Exception(f"load: No results found for the query '{query}'")
+ raise DatasetNotFoundError(query or "all datasets", "query")
foundry_datasets = []
for metadata in metadata_list:
@@ -235,6 +238,9 @@ def search(self, query: str = None, limit: int = None, as_list: bool = False) ->
logger.info(f"Search for '{query}' returned {len(foundry_datasets)} foundry datasets out of {len(metadata_list)} matches")
+ if as_json:
+ return [self._dataset_to_dict(ds) for ds in foundry_datasets]
+
if as_list:
return foundry_datasets
@@ -242,16 +248,37 @@ def search(self, query: str = None, limit: int = None, as_list: bool = False) ->
return foundry_datasets
- def list(self, limit: int = None):
+ def _dataset_to_dict(self, ds: FoundryDataset) -> Dict[str, Any]:
+ """Convert a FoundryDataset to an agent-friendly dictionary.
+
+ Args:
+ ds: The FoundryDataset to convert.
+
+ Returns:
+ Dictionary with dataset metadata suitable for JSON serialization.
+ """
+ return {
+ "name": ds.dataset_name,
+ "title": ds.dc.titles[0].title if ds.dc.titles else None,
+ "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None,
+ "description": ds.dc.descriptions[0].description if ds.dc.descriptions else None,
+ "year": ds.dc.publicationYear if hasattr(ds.dc, 'publicationYear') else None,
+ "fields": [k.key[0] for k in (ds.foundry_schema.keys or []) if k.key],
+ "splits": [s.label for s in (ds.foundry_schema.splits or [])],
+ "data_type": ds.foundry_schema.data_type,
+ }
+
+ def list(self, limit: int = None, as_json: bool = False):
"""List available Foundry datasets
Args:
limit (int): maximum number of results to return
+ as_json (bool): If True, return results as list of dicts (agent-friendly)
Returns:
- List[FoundryDataset]: List of FoundryDataset objects
+ List[FoundryDataset] or DataFrame or List[dict]: Available datasets
"""
- return self.search(limit=limit)
+ return self.search(limit=limit, as_json=as_json)
def dataset_from_metadata(self, metadata: dict) -> FoundryDataset:
""" Converts the result of a forge query to a FoundryDatset object
diff --git a/foundry/foundry_cache.py b/foundry/foundry_cache.py
index 7fbfab8a..f92317e9 100644
--- a/foundry/foundry_cache.py
+++ b/foundry/foundry_cache.py
@@ -4,7 +4,7 @@
from concurrent.futures import ThreadPoolExecutor, as_completed
import h5py
-from mdf_forge import Forge
+from .mdf_client import MDFClient
import numpy as np
import pandas as pd
import shutil
@@ -23,7 +23,7 @@ class FoundryCache():
"""The FoundryCache manages the local storage of FoundryDataset objects"""
def __init__(self,
- forge_client: Forge,
+ forge_client: MDFClient,
transfer_client: Any,
use_globus,
interval,
@@ -34,7 +34,7 @@ def __init__(self,
Initializes a FoundryCache object.
Args:
- forge_client (Forge): The Forge client object.
+ forge_client (MDFClient): The MDF client object.
transfer_client (Any): The transfer client object.
use_globus (bool): Flag indicating whether to use Globus for downloading.
interval (int): How often to wait before checking Globus transfer status.
@@ -354,7 +354,7 @@ def _load_data(self,
# Sort version folders and get the latest one
latest_version = sorted(version_folders, key=lambda x: [int(n) for n in x.split('.')], reverse=True)[0]
path = os.path.join(path, latest_version)
- print(f"Loading from version folder: {latest_version}")
+ logger.info(f"Loading from version folder: {latest_version}")
path_to_file = os.path.join(path, file)
diff --git a/foundry/foundry_dataset.py b/foundry/foundry_dataset.py
index 75195894..e2ec54fc 100644
--- a/foundry/foundry_dataset.py
+++ b/foundry/foundry_dataset.py
@@ -45,24 +45,70 @@ def __init__(self,
try:
self.dc = FoundryDatacite(datacite_entry)
self.foundry_schema = FoundrySchema(foundry_schema)
+ except ValidationError as e:
+ logger.error(f"Invalid metadata for dataset '{dataset_name}': {e}")
+ raise
except Exception as e:
- raise Exception('there was a problem creating the dataset: ', e)
+ raise ValueError(
+ f"Failed to create dataset '{dataset_name}': {e}. "
+ "Check that datacite_entry and foundry_schema are valid."
+ ) from e
self._foundry_cache = foundry_cache
- def get_as_dict(self, split: str = None, as_hdf5: bool = False):
+ def get_as_dict(self, split: str = None, as_hdf5: bool = False, include_schema: bool = False):
"""Returns the data from the dataset as a dictionary
Arguments:
split (string): Split to create dataset on.
**Default:** ``None``
+ as_hdf5 (bool): If True, return raw HDF5 data.
+ **Default:** ``False``
+ include_schema (bool): If True, return data with schema information.
+ Useful for AI agents that need to understand the data structure.
+ **Default:** ``False``
- Returns: (dict) Dictionary of all the data from the specified split
+ Returns:
+ dict: Dictionary of all the data from the specified split.
+ If include_schema is True, returns {"data": ..., "schema": ...}
"""
- return self._foundry_cache.load_as_dict(split,
+ data = self._foundry_cache.load_as_dict(split,
self.dataset_name,
self.foundry_schema,
as_hdf5)
+ if include_schema:
+ return {
+ "data": data,
+ "schema": self.get_schema(),
+ }
+ return data
+
+ def get_schema(self) -> dict:
+ """Get the schema of this dataset - what fields it contains.
+
+ Returns:
+ dict: Schema with fields, splits, data type, and metadata.
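+
+        Example (illustrative shape of the returned value):
+            {
+                "name": "example_dataset",
+                "doi": "10.xxxx/example",
+                "data_type": "tabular",
+                "splits": [{"name": "train", "type": "train"}],
+                "fields": [{"name": "band_gap", "role": "target",
+                            "description": "computed band gap", "units": "eV"}]
+            }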
+ """
+ return {
+ "name": self.dataset_name,
+ "title": self.dc.titles[0].title if self.dc.titles else None,
+ "doi": str(self.dc.identifier.identifier) if self.dc.identifier else None,
+ "data_type": self.foundry_schema.data_type,
+ "splits": [
+ {"name": s.label, "type": s.type}
+ for s in (self.foundry_schema.splits or [])
+ ],
+ "fields": [
+ {
+ "name": k.key[0] if k.key else None,
+ "role": k.type, # "input" or "target"
+ "description": k.description,
+ "units": k.units,
+ }
+ for k in (self.foundry_schema.keys or [])
+ ],
+ }
+
load = get_as_dict
def get_as_torch(self, split: str = None):
@@ -212,7 +258,6 @@ def clear_dataset_cache(self):
def clean_dc_dict(self):
"""Clean the Datacite dictionary of None values"""
- print(json.loads(self.dc.json()))
return self.delete_none(json.loads(self.dc.json()))
def delete_none(self, _dict):
diff --git a/foundry/https_download.py b/foundry/https_download.py
index 77d40a2e..7364fdea 100644
--- a/foundry/https_download.py
+++ b/foundry/https_download.py
@@ -2,6 +2,7 @@
"""
+import logging
import os
from collections import deque
@@ -9,6 +10,9 @@
from globus_sdk import TransferClient
+logger = logging.getLogger(__name__)
+
+
def recursive_ls(tc: TransferClient, ep: str, path: str, max_depth: int = 3):
"""Find all files in a Globus directory recursively
@@ -52,6 +56,16 @@ def _get_files(tc, ep, queue, max_depth):
yield item
+class DownloadError(Exception):
+ """Raised when a file download fails."""
+
+ def __init__(self, url: str, reason: str, destination: str = None):
+ self.url = url
+ self.reason = reason
+ self.destination = destination
+ super().__init__(f"Failed to download {url}: {reason}")
+
+
def download_file(item, base_directory, https_config, timeout=1800):
"""Download a file to disk
@@ -60,6 +74,12 @@ def download_file(item, base_directory, https_config, timeout=1800):
base_directory: Base directory for storing downloaded files
https_config: Configuration defining the URL of the server and the name of the dataset
timeout: Timeout for the download request in seconds (default: 1800)
+
+ Returns:
+ str: Path to the downloaded file
+
+ Raises:
+ DownloadError: If the download fails for any reason
"""
base_url = https_config['base_url'].rstrip('/')
path = item.get('path', '').strip('/')
@@ -89,20 +109,18 @@ def download_file(item, base_directory, https_config, timeout=1800):
response.raise_for_status()
downloaded_size = 0
- print(f"\rStarting Download of: {url}")
+ logger.info(f"Starting download: {url}")
with open(destination, "wb") as f:
for chunk in response.iter_content(chunk_size=8192):
if chunk:
f.write(chunk)
downloaded_size += len(chunk)
- # Calculate and print the download progress
- print(f"\rDownloading... {downloaded_size/(1 << 20):,.2f} MB", end="")
+
+ logger.info(f"Downloaded {downloaded_size / (1 << 20):,.2f} MB to {destination}")
return destination
except requests.exceptions.RequestException as e:
- print(f"Error downloading file: {e}")
+ raise DownloadError(url, str(e), destination) from e
except IOError as e:
- print(f"Error writing file to disk: {e}")
-
- return {destination + " status": True}
+ raise DownloadError(url, f"Failed to write to disk: {e}", destination) from e
diff --git a/foundry/integrations/__init__.py b/foundry/integrations/__init__.py
new file mode 100644
index 00000000..87640f95
--- /dev/null
+++ b/foundry/integrations/__init__.py
@@ -0,0 +1,19 @@
+"""Foundry integrations with external platforms.
+
+This module provides bridges to other data ecosystems:
+- Hugging Face Hub: Export datasets to HF for broader visibility
+
+Usage:
+ from foundry.integrations.huggingface import push_to_hub
+
+ # Export a Foundry dataset to Hugging Face
+ dataset = foundry.get_dataset("10.18126/abc123")
+ push_to_hub(dataset, "my-org/dataset-name")
+"""
+
+try:
+ from .huggingface import push_to_hub
+ __all__ = ["push_to_hub"]
+except ImportError:
+ # huggingface extras not installed
+ __all__ = []
diff --git a/foundry/integrations/huggingface.py b/foundry/integrations/huggingface.py
new file mode 100644
index 00000000..07e409b1
--- /dev/null
+++ b/foundry/integrations/huggingface.py
@@ -0,0 +1,334 @@
+"""Hugging Face Hub integration for Foundry datasets.
+
+This module provides functionality to export Foundry datasets to Hugging Face Hub,
+making materials science datasets discoverable in the broader ML ecosystem.
+
+Requirements:
+ pip install foundry-ml[huggingface]
+
+Usage:
+ from foundry import Foundry
+ from foundry.integrations.huggingface import push_to_hub
+
+ f = Foundry()
+ dataset = f.get_dataset("10.18126/abc123")
+ push_to_hub(dataset, "materials-science/bandgap-data")
+"""
+
+import logging
+from typing import Optional, Dict, Any
+
+logger = logging.getLogger(__name__)
+
+try:
+ from datasets import Dataset, DatasetDict
+ from huggingface_hub import HfApi
+ HF_AVAILABLE = True
+except ImportError:
+ HF_AVAILABLE = False
+
+
+def _check_hf_available():
+ """Check if HuggingFace dependencies are installed."""
+ if not HF_AVAILABLE:
+ raise ImportError(
+ "HuggingFace integration requires additional dependencies. "
+ "Install with: pip install foundry-ml[huggingface]"
+ )
+
+
+def push_to_hub(
+ dataset, # FoundryDataset
+ repo_id: str,
+ token: Optional[str] = None,
+ private: bool = False,
+ split: Optional[str] = None,
+) -> str:
+ """Export a Foundry dataset to Hugging Face Hub.
+
+ This creates a new dataset repository on HF Hub with the Foundry data,
+ automatically generating a Dataset Card from the DataCite metadata.
+
+ Args:
+ dataset: A FoundryDataset object (from foundry.get_dataset())
+ repo_id: HF Hub repository ID (e.g., "materials-science/bandgap-data")
+ token: HuggingFace API token. If None, uses cached credentials.
+ private: If True, create a private repository.
+ split: Specific split to export. If None, exports all splits.
+
+ Returns:
+ str: URL of the created dataset on HF Hub.
+
+ Example:
+ >>> from foundry import Foundry
+ >>> from foundry.integrations.huggingface import push_to_hub
+ >>> f = Foundry()
+ >>> ds = f.get_dataset("10.18126/abc123")
+ >>> url = push_to_hub(ds, "my-org/my-dataset")
+ >>> print(f"Dataset published at: {url}")
+ """
+ _check_hf_available()
+
+ logger.info(f"Exporting Foundry dataset '{dataset.dataset_name}' to HF Hub: {repo_id}")
+
+ # Load data from Foundry
+ data = dataset.get_as_dict(split=split)
+
+ # Convert to HuggingFace Dataset format
+ if isinstance(data, dict) and all(isinstance(v, tuple) for v in data.values()):
+ # Data has splits: {split_name: (inputs, targets)}
+ hf_datasets = {}
+ for split_name, (inputs, targets) in data.items():
+ combined = _combine_inputs_targets(inputs, targets)
+ hf_datasets[split_name] = Dataset.from_dict(combined)
+ hf_dataset = DatasetDict(hf_datasets)
+ else:
+ # Single dataset without splits
+ hf_dataset = Dataset.from_dict(_flatten_data(data))
+
+ # Generate Dataset Card (README.md) from DataCite metadata
+ readme_content = _generate_dataset_card(dataset)
+
+ # Push to Hub
+ hf_dataset.push_to_hub(
+ repo_id,
+ token=token,
+ private=private,
+ )
+
+ # Update the README with our generated card
+ api = HfApi(token=token)
+ api.upload_file(
+ path_or_fileobj=readme_content.encode(),
+ path_in_repo="README.md",
+ repo_id=repo_id,
+ repo_type="dataset",
+ )
+
+ url = f"https://huggingface.co/datasets/{repo_id}"
+ logger.info(f"Successfully published to: {url}")
+ return url
+
+
+def _combine_inputs_targets(inputs: Dict, targets: Dict) -> Dict[str, Any]:
+ """Combine input and target dictionaries into a single flat dict."""
+ import pandas as pd
+ import numpy as np
+
+ combined = {}
+
+ for key, value in inputs.items():
+ col_name = f"input_{key}" if key in targets else key
+ combined[col_name] = _to_list(value)
+
+ for key, value in targets.items():
+ col_name = f"target_{key}" if key in inputs else key
+ combined[col_name] = _to_list(value)
+
+ return combined
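+
+# Note: keys present in both inputs and targets are disambiguated as
+# "input_{key}" and "target_{key}"; non-colliding keys keep their original names.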
+
+
+def _flatten_data(data: Any) -> Dict[str, Any]:
+ """Flatten nested data structure to a dict suitable for HF Dataset."""
+ import pandas as pd
+ import numpy as np
+
+ if isinstance(data, pd.DataFrame):
+ return {col: data[col].tolist() for col in data.columns}
+ elif isinstance(data, dict):
+ result = {}
+ for key, value in data.items():
+ result[key] = _to_list(value)
+ return result
+ else:
+ return {"data": _to_list(data)}
+
+
+def _to_list(value: Any) -> list:
+ """Convert various types to list for HF compatibility."""
+ import pandas as pd
+ import numpy as np
+
+ if isinstance(value, np.ndarray):
+ return value.tolist()
+ elif isinstance(value, pd.Series):
+ return value.tolist()
+ elif isinstance(value, pd.DataFrame):
+ return value.to_dict(orient='records')
+ elif isinstance(value, (list, tuple)):
+ return list(value)
+ else:
+ return [value]
+
+
+def _normalize_license(license_str: str) -> str:
+ """Map license string to a valid HuggingFace license identifier."""
+ if not license_str:
+ return "other"
+
+ license_lower = license_str.lower()
+
+ # Direct matches to HF identifiers
+ hf_licenses = {
+ "cc0", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0", "cc-by-nc-4.0",
+ "cc-by-nc-sa-4.0", "cc-by-nc-nd-4.0", "mit", "apache-2.0",
+ "bsd", "bsd-2-clause", "bsd-3-clause", "gpl-3.0", "lgpl-3.0",
+ "unknown", "other"
+ }
+
+ # Check for direct match
+ if license_lower in hf_licenses:
+ return license_lower
+
+    # Common mappings. Patterns are checked in order, so more specific
+    # patterns (e.g. "cc by-sa", "lgpl") must come before their prefixes.
+    mappings = {
+        "cc0": "cc0-1.0",
+        "public domain": "cc0-1.0",
+        "cc by-sa": "cc-by-sa-4.0",
+        "cc by-nc": "cc-by-nc-4.0",
+        "cc by 4": "cc-by-4.0",
+        "cc by": "cc-by-4.0",
+        "cc-by": "cc-by-4.0",
+        "creative commons": "cc-by-4.0",
+        "apache": "apache-2.0",
+        "mit license": "mit",
+        "bsd license": "bsd-3-clause",
+        "lgpl": "lgpl-3.0",
+        "gpl": "gpl-3.0",
+    }
+
+ for pattern, hf_id in mappings.items():
+ if pattern in license_lower:
+ return hf_id
+
+ # If we can't map it, use "other"
+ return "other"
+
+
+def _generate_dataset_card(dataset) -> str:
+ """Generate a HuggingFace Dataset Card from Foundry DataCite metadata."""
+ dc = dataset.dc
+ schema = dataset.foundry_schema
+
+ # Extract metadata
+ title = dc.titles[0].title if dc.titles else dataset.dataset_name
+ description = dc.descriptions[0].description if dc.descriptions else ""
+
+ # DOI is a RootModel with .root containing the actual string
+ doi = ""
+ if dc.identifier and dc.identifier.identifier:
+ doi_obj = dc.identifier.identifier
+ doi = doi_obj.root if hasattr(doi_obj, 'root') else str(doi_obj)
+
+ # Handle creators (can be dicts or pydantic objects)
+ authors = []
+ for c in (dc.creators or []):
+ if isinstance(c, dict):
+ authors.append(c.get('creatorName', 'Unknown'))
+ elif hasattr(c, 'creatorName'):
+ authors.append(c.creatorName or 'Unknown')
+ else:
+ authors.append(str(c))
+
+ # Year is a RootModel with .root containing the actual int
+ year = ""
+ if hasattr(dc, 'publicationYear') and dc.publicationYear:
+ year_obj = dc.publicationYear
+ year = year_obj.root if hasattr(year_obj, 'root') else str(year_obj)
+
+ # Get license if available (rightsList contains RightsListItem objects)
+ license_raw = None
+ if hasattr(dc, 'rightsList') and dc.rightsList:
+ rights_item = dc.rightsList[0]
+ if isinstance(rights_item, dict):
+ license_raw = rights_item.get('rights')
+ elif hasattr(rights_item, 'rights'):
+ license_raw = rights_item.rights
+
+ license_id = _normalize_license(license_raw)
+ # For display, show original if different from ID
+ license_display = license_raw if license_raw and license_raw != license_id else license_id
+
+ # Build field documentation
+ fields_doc = ""
+ if schema.keys:
+ fields_doc = "\n### Fields\n\n| Field | Role | Description | Units |\n|-------|------|-------------|-------|\n"
+ for key in schema.keys:
+ name = key.key[0] if key.key else "N/A"
+ role = key.type or "N/A"
+ desc = (key.description or "")[:50]
+ units = key.units or ""
+ fields_doc += f"| {name} | {role} | {desc} | {units} |\n"
+
+ # Build splits documentation
+ splits_doc = ""
+ if schema.splits:
+ splits_doc = "\n### Splits\n\n"
+ for split in schema.splits:
+ splits_doc += f"- **{split.label}**: {split.type or 'data'}\n"
+
+ # Generate citation
+ citation = dataset.get_citation()
+
+ return f"""---
+license: {license_id}
+task_categories:
+ - tabular-regression
+ - tabular-classification
+tags:
+ - materials-science
+ - chemistry
+ - foundry-ml
+ - scientific-data
+size_categories:
+  - 1K<n<10K
+---
+
+# {title}
+
+{description}
+
+**Authors:** {', '.join(authors)}
+**Publication Year:** {year}
+**License:** {license_display}
+**DOI:** {doi}
+{fields_doc}{splits_doc}
+## Citation
+
+{citation}
+
+Exported from [Foundry-ML](https://foundry-ml.org/).
+"""
diff --git a/foundry/mcp/server.py b/foundry/mcp/server.py
new file mode 100644
--- /dev/null
+++ b/foundry/mcp/server.py
+"""MCP server tools for Foundry.
+
+Exposes dataset search, schema inspection, and data loading to AI agents
+over a simple JSON-RPC style stdio protocol (see run_server below).
+"""
+
+import json
+import logging
+from typing import Any, Dict, List
+
+logger = logging.getLogger(__name__)
+
+# A single Foundry client is created lazily and reused across tool calls.
+_foundry_instance = None
+
+
+def _get_foundry():
+    """Return a cached Foundry client, creating it on first use."""
+    global _foundry_instance
+    if _foundry_instance is None:
+        from foundry import Foundry
+        _foundry_instance = Foundry(no_browser=True, no_local_server=True)
+    return _foundry_instance
+
+
+def search_datasets(query: str, limit: int = 10) -> List[Dict[str, Any]]:
+ """Search for materials science datasets.
+
+ Args:
+ query: Search terms (e.g., "band gap", "crystal structure", "zeolite")
+ limit: Maximum number of results to return (default: 10)
+
+ Returns:
+        List of datasets with name, title, DOI, description, and publication year
+ """
+ f = _get_foundry()
+ results = f.search(query, limit=limit)
+
+ output = []
+ for _, row in results.iterrows():
+ ds = row.FoundryDataset
+ output.append({
+ "name": ds.dataset_name,
+ "title": ds.dc.titles[0].title if ds.dc.titles else None,
+ "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None,
+ "description": ds.dc.descriptions[0].description if ds.dc.descriptions else None,
+ "year": ds.dc.publicationYear if hasattr(ds.dc, 'publicationYear') else None,
+ })
+ return output
+
+
+def get_dataset_schema(doi: str) -> Dict[str, Any]:
+ """Get the schema of a dataset - what fields it contains.
+
+ Use this to understand the structure of a dataset before loading it.
+
+ Args:
+ doi: The DOI of the dataset (e.g., "10.18126/abc123")
+
+ Returns:
+ Schema with name, splits, fields (with descriptions and units), and data type
+ """
+ f = _get_foundry()
+ ds = f.get_dataset(doi)
+
+ return {
+ "name": ds.dataset_name,
+ "title": ds.dc.titles[0].title if ds.dc.titles else None,
+ "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None,
+ "data_type": ds.foundry_schema.data_type,
+ "splits": [
+ {"name": s.label, "type": s.type}
+ for s in (ds.foundry_schema.splits or [])
+ ],
+ "fields": [
+ {
+ "name": k.key[0] if k.key else None,
+ "role": k.type, # "input" or "target"
+ "description": k.description,
+ "units": k.units,
+ }
+ for k in (ds.foundry_schema.keys or [])
+ ],
+ }
+
+
+def load_dataset(doi: str, split: str = "train") -> Dict[str, Any]:
+ """Load a dataset and return its data with schema.
+
+ This downloads the data if not cached, then returns it along with schema information.
+
+ Args:
+ doi: The DOI of the dataset
+ split: Which split to load (default: "train")
+
+ Returns:
+ Dictionary with "schema" (field information) and "data" (the actual data)
+ """
+ f = _get_foundry()
+ ds = f.get_dataset(doi)
+ data = ds.get_as_dict(split=split)
+ schema = get_dataset_schema(doi)
+
+ return {
+ "schema": schema,
+ "data": _serialize_data(data),
+ "citation": ds.get_citation(),
+ }
+
+
+def list_all_datasets(limit: int = 100) -> List[Dict[str, Any]]:
+ """List all available Foundry datasets.
+
+ Returns a catalog of all datasets that can be loaded.
+
+ Args:
+ limit: Maximum number of datasets to return (default: 100)
+
+ Returns:
+ List of all available datasets with basic info
+ """
+ f = _get_foundry()
+ results = f.list(limit=limit)
+
+ output = []
+ for _, row in results.iterrows():
+ ds = row.FoundryDataset
+ output.append({
+ "name": ds.dataset_name,
+ "title": ds.dc.titles[0].title if ds.dc.titles else None,
+ "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None,
+ "description": ds.dc.descriptions[0].description if ds.dc.descriptions else None,
+ })
+ return output
+
+
+def _serialize_data(data: Any) -> Any:
+ """Convert numpy arrays and other non-JSON-serializable types to lists."""
+ import numpy as np
+ import pandas as pd
+
+ if isinstance(data, np.ndarray):
+ return data.tolist()
+ elif isinstance(data, pd.DataFrame):
+ return data.to_dict(orient='records')
+ elif isinstance(data, pd.Series):
+ return data.tolist()
+ elif isinstance(data, dict):
+ return {k: _serialize_data(v) for k, v in data.items()}
+ elif isinstance(data, (list, tuple)):
+ return [_serialize_data(item) for item in data]
+ elif isinstance(data, (np.integer, np.floating)):
+ return data.item()
+ else:
+ return data
+
+
+# MCP Server Implementation using stdio transport
+TOOLS = [
+ {
+ "name": "search_datasets",
+ "description": "Search for materials science datasets. Returns datasets matching the query with name, title, DOI, and description.",
+ "inputSchema": {
+ "type": "object",
+ "properties": {
+ "query": {
+ "type": "string",
+ "description": "Search terms (e.g., 'band gap', 'crystal structure', 'zeolite')"
+ },
+ "limit": {
+ "type": "integer",
+ "description": "Maximum number of results (default: 10)",
+ "default": 10
+ }
+ },
+ "required": ["query"]
+ }
+ },
+ {
+ "name": "get_dataset_schema",
+ "description": "Get the schema of a dataset - what fields it contains, their descriptions, and units. Use this to understand the data structure before loading.",
+ "inputSchema": {
+ "type": "object",
+ "properties": {
+ "doi": {
+ "type": "string",
+ "description": "The DOI of the dataset (e.g., '10.18126/abc123')"
+ }
+ },
+ "required": ["doi"]
+ }
+ },
+ {
+ "name": "load_dataset",
+ "description": "Load a dataset and return its data with schema information. Downloads data if not cached.",
+ "inputSchema": {
+ "type": "object",
+ "properties": {
+ "doi": {
+ "type": "string",
+ "description": "The DOI of the dataset"
+ },
+ "split": {
+ "type": "string",
+ "description": "Which split to load (default: 'train')",
+ "default": "train"
+ }
+ },
+ "required": ["doi"]
+ }
+ },
+ {
+ "name": "list_all_datasets",
+ "description": "List all available Foundry datasets. Returns a catalog with basic info for each dataset.",
+ "inputSchema": {
+ "type": "object",
+ "properties": {
+ "limit": {
+ "type": "integer",
+ "description": "Maximum number of datasets to return (default: 100)",
+ "default": 100
+ }
+ },
+ "required": []
+ }
+ }
+]
+
+
+def handle_tool_call(name: str, arguments: Dict[str, Any]) -> Any:
+ """Handle a tool call and return the result."""
+ if name == "search_datasets":
+ return search_datasets(
+ query=arguments["query"],
+ limit=arguments.get("limit", 10)
+ )
+ elif name == "get_dataset_schema":
+ return get_dataset_schema(doi=arguments["doi"])
+ elif name == "load_dataset":
+ return load_dataset(
+ doi=arguments["doi"],
+ split=arguments.get("split", "train")
+ )
+ elif name == "list_all_datasets":
+ return list_all_datasets(limit=arguments.get("limit", 100))
+ else:
+ raise ValueError(f"Unknown tool: {name}")
+
+
+def create_server():
+ """Create and return the MCP server configuration."""
+ return {
+ "name": "foundry-ml",
+ "version": "1.0.0",
+ "description": "Materials science dataset discovery and loading for ML",
+ "tools": TOOLS,
+ }
+
+
+def run_server(host: str = "localhost", port: int = 8765):
+ """Run the MCP server using stdio transport.
+
+ This implements a simple JSON-RPC style protocol for MCP.
+ """
+ import sys
+
+ logger.info(f"Starting Foundry MCP server on {host}:{port}")
+
+ # For now, use a simple stdio-based protocol
+ # In production, this would use the full MCP SDK
+ print(json.dumps({
+ "jsonrpc": "2.0",
+ "method": "server/info",
+ "params": create_server()
+ }), flush=True)
+
+ # Read and process requests
+ for line in sys.stdin:
+ try:
+ request = json.loads(line.strip())
+ method = request.get("method", "")
+
+ if method == "tools/list":
+ response = {
+ "jsonrpc": "2.0",
+ "id": request.get("id"),
+ "result": {"tools": TOOLS}
+ }
+ elif method == "tools/call":
+ params = request.get("params", {})
+ tool_name = params.get("name")
+ arguments = params.get("arguments", {})
+ try:
+ result = handle_tool_call(tool_name, arguments)
+ response = {
+ "jsonrpc": "2.0",
+ "id": request.get("id"),
+ "result": {"content": [{"type": "text", "text": json.dumps(result, default=str)}]}
+ }
+ except Exception as e:
+ response = {
+ "jsonrpc": "2.0",
+ "id": request.get("id"),
+ "error": {"code": -32000, "message": str(e)}
+ }
+ else:
+ response = {
+ "jsonrpc": "2.0",
+ "id": request.get("id"),
+ "error": {"code": -32601, "message": f"Method not found: {method}"}
+ }
+
+ print(json.dumps(response), flush=True)
+
+ except json.JSONDecodeError:
+ continue
+ except Exception as e:
+ logger.error(f"Error processing request: {e}")
+ print(json.dumps({
+ "jsonrpc": "2.0",
+ "id": None,
+ "error": {"code": -32603, "message": str(e)}
+ }), flush=True)
diff --git a/foundry/models.py b/foundry/models.py
index 2b490e00..a11d04cc 100644
--- a/foundry/models.py
+++ b/foundry/models.py
@@ -76,15 +76,18 @@ def __init__(self, project_dict: Dict[str, Any]):
try:
super().__init__(**project_dict)
except ValidationError as e:
- print("FoundrySchema validation failed!")
+ logger.error("FoundrySchema validation failed!")
for error in e.errors():
field_name = ".".join([str(item) for item in error['loc']])
error_description = error['msg']
- error_message = f"""There is an issue validating the entry for the field '{field_name}':
- The error message returned is: '{error_description}'.
- The description for this field is: '{FoundryModel.model_json_schema()['properties'][field_name]['description']}'"""
- print(error_message)
- raise e
+ schema_props = FoundryModel.model_json_schema().get('properties', {})
+ field_desc = schema_props.get(field_name, {}).get('description', 'No description available')
+ error_message = (
+ f"Validation error for field '{field_name}': {error_description}. "
+ f"Field description: {field_desc}"
+ )
+ logger.error(error_message)
+ raise
class FoundryDatacite(DataciteModel):
@@ -100,15 +103,18 @@ def __init__(self, datacite_dict: Dict[str, Any], **kwargs):
dc_dict['identifier']['identifier'] = dc_dict['identifier']['identifier']['__root__']
super().__init__(**dc_dict, **kwargs)
except ValidationError as e:
- print("Datacite validation failed!")
+ logger.error("Datacite validation failed!")
+ schema_props = DataciteModel.model_json_schema().get('properties', {})
for error in e.errors():
field_name = ".".join(str(loc) for loc in error["loc"])
error_description = error['msg']
- error_message = f"""There is an issue validating the entry for the field '{field_name}':
- The error message returned is: '{error_description}'.
- The description is: '{self.model_json_schema()['properties'].get(field_name, {}).get('description', 'No description available')}'"""
- print(error_message)
- raise e
+ field_desc = schema_props.get(field_name, {}).get('description', 'No description available')
+ error_message = (
+ f"Validation error for field '{field_name}': {error_description}. "
+ f"Field description: {field_desc}"
+ )
+ logger.error(error_message)
+ raise
class FoundryBase(BaseModel):
diff --git a/scripts/README.md b/scripts/README.md
new file mode 100644
index 00000000..6cd7b218
--- /dev/null
+++ b/scripts/README.md
@@ -0,0 +1,31 @@
+# Foundry Scripts
+
+Utility scripts for managing Foundry datasets.
+
+## batch_push_to_hf.py
+
+Push all Foundry datasets to HuggingFace Hub for broader discoverability.
+
+### Quick Start
+
+```bash
+# 1. Install dependencies
+pip install foundry-ml[huggingface]
+
+# 2. Set your HuggingFace token
+export HF_TOKEN="hf_your_token_here"
+
+# 3. Dry run (see what would be uploaded)
+python scripts/batch_push_to_hf.py --dry-run
+
+# 4. Upload all datasets
+python scripts/batch_push_to_hf.py --org foundry-ml
+```
+
+### Setup
+
+See the full setup instructions at the top of `batch_push_to_hf.py` or run:
+
+```bash
+python scripts/batch_push_to_hf.py --help
+```
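+
+### Resuming and Saving Results
+
+If a run is interrupted, resume where it stopped and keep a record of what was uploaded:
+
+```bash
+# Skip the first 10 datasets (already processed) and write a JSON report
+python scripts/batch_push_to_hf.py --org foundry-ml --skip 10 --output results.json
+```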
diff --git a/scripts/batch_push_to_hf.py b/scripts/batch_push_to_hf.py
new file mode 100755
index 00000000..bd12c3af
--- /dev/null
+++ b/scripts/batch_push_to_hf.py
@@ -0,0 +1,459 @@
+#!/usr/bin/env python3
+"""
+Batch Push Foundry Datasets to HuggingFace Hub
+===============================================
+
+This script exports all Foundry datasets to HuggingFace Hub, making them
+discoverable in the broader ML ecosystem.
+
+SETUP INSTRUCTIONS
+------------------
+
+1. CREATE A HUGGINGFACE ACCOUNT
+ - Go to https://huggingface.co/join
+ - Create an account
+
+2. CREATE AN ORGANIZATION (Recommended)
+ - Go to https://huggingface.co/organizations/new
+ - Create organization named "foundry-ml" (or your preferred name)
+ - This keeps all datasets under one namespace: foundry-ml/dataset-name
+
+3. GET YOUR API TOKEN
+ - Go to https://huggingface.co/settings/tokens
+ - Click "New token"
+ - Name: "foundry-batch-upload"
+ - Role: "Write" (required to create repos)
+ - Copy the token (starts with "hf_...")
+
+4. SET YOUR TOKEN (choose one method):
+
+ Option A - Environment variable (recommended):
+ ```bash
+ export HF_TOKEN="hf_your_token_here"
+ python scripts/batch_push_to_hf.py
+ ```
+
+ Option B - Login via CLI (persists across sessions):
+ ```bash
+ pip install huggingface_hub
+ huggingface-cli login
+ # Paste your token when prompted
+ python scripts/batch_push_to_hf.py
+ ```
+
+ Option C - Pass directly (not recommended for shared scripts):
+ ```bash
+ python scripts/batch_push_to_hf.py --token "hf_your_token_here"
+ ```
+
+5. INSTALL DEPENDENCIES
+ ```bash
+ pip install foundry-ml[huggingface]
+ # or
+ pip install datasets huggingface_hub
+ ```
+
+6. RUN THE SCRIPT
+ ```bash
+ python scripts/batch_push_to_hf.py --org foundry-ml
+ ```
+
+USAGE
+-----
+ python scripts/batch_push_to_hf.py [OPTIONS]
+
+OPTIONS
+-------
+ --org ORG HuggingFace organization name (default: foundry-ml)
+ --token TOKEN HuggingFace API token (or set HF_TOKEN env var)
+ --private Create private repositories
+ --dry-run List datasets without uploading
+ --limit N Only process first N datasets (for testing)
+ --skip N Skip first N datasets (for resuming)
+ --dataset NAME Process only this specific dataset
+ --output FILE Save results to JSON file
+
+EXAMPLES
+--------
+ # Dry run - see what would be uploaded
+ python scripts/batch_push_to_hf.py --dry-run
+
+ # Upload all datasets to foundry-ml organization
+ python scripts/batch_push_to_hf.py --org foundry-ml
+
+ # Test with first 3 datasets
+ python scripts/batch_push_to_hf.py --org foundry-ml --limit 3
+
+ # Resume from dataset 10 (if previous run failed)
+ python scripts/batch_push_to_hf.py --org foundry-ml --skip 10
+
+ # Upload a single specific dataset
+ python scripts/batch_push_to_hf.py --org foundry-ml --dataset "foundry_bandgap_oqmd"
+"""
+
+import argparse
+import json
+import logging
+import os
+import sys
+import time
+from dataclasses import dataclass, asdict
+from datetime import datetime
+from pathlib import Path
+from typing import Optional, List
+
+# Configure logging
+logging.basicConfig(
+ level=logging.INFO,
+ format='%(asctime)s - %(levelname)s - %(message)s',
+ datefmt='%H:%M:%S'
+)
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class UploadResult:
+ """Result of a single dataset upload."""
+ dataset_name: str
+ doi: str
+ repo_id: str
+ status: str # 'success', 'failed', 'skipped'
+ url: Optional[str] = None
+ error: Optional[str] = None
+ duration_seconds: Optional[float] = None
+
+
+def get_hf_token(args_token: Optional[str] = None) -> str:
+ """Get HuggingFace token from args, env, or cached login."""
+ # 1. Check command line argument
+ if args_token:
+ return args_token
+
+ # 2. Check environment variable
+ env_token = os.environ.get('HF_TOKEN') or os.environ.get('HUGGINGFACE_TOKEN')
+ if env_token:
+ return env_token
+
+ # 3. Check if logged in via huggingface-cli
+ try:
+ from huggingface_hub import HfFolder
+ cached_token = HfFolder.get_token()
+ if cached_token:
+ return cached_token
+ except Exception:
+ pass
+
+ # 4. No token found
+ raise ValueError(
+ "No HuggingFace token found. Please either:\n"
+ " 1. Set HF_TOKEN environment variable\n"
+ " 2. Run 'huggingface-cli login'\n"
+ " 3. Pass --token argument\n"
+ "\nGet your token at: https://huggingface.co/settings/tokens"
+ )
+
+
+def sanitize_repo_name(name: str) -> str:
+ """Convert dataset name to valid HF repo name."""
+ # HF repo names: lowercase, alphanumeric, hyphens, underscores
+ # Max 96 characters
+ import re
+ name = name.lower()
+ name = re.sub(r'[^a-z0-9_-]', '-', name)
+ name = re.sub(r'-+', '-', name) # Collapse multiple hyphens
+ name = name.strip('-_')
+ return name[:96]
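+
+
+# Illustrative conversions (assumed dataset names):
+#   sanitize_repo_name("Foundry Band Gap (OQMD)")  -> "foundry-band-gap-oqmd"
+#   sanitize_repo_name("zeolite_synthesis_v1.2")   -> "zeolite_synthesis_v1-2"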
+
+
+def check_repo_exists(api, repo_id: str) -> bool:
+ """Check if a repository already exists on HF Hub."""
+ try:
+ api.repo_info(repo_id=repo_id, repo_type="dataset")
+ return True
+ except Exception:
+ return False
+
+
+def push_dataset(
+ dataset,
+ org: str,
+ token: str,
+ private: bool = False,
+) -> UploadResult:
+ """Push a single dataset to HuggingFace Hub."""
+ from foundry.integrations.huggingface import push_to_hub
+ from huggingface_hub import HfApi
+
+ dataset_name = dataset.dataset_name
+ doi = str(dataset.dc.identifier.identifier) if dataset.dc.identifier else "unknown"
+ repo_name = sanitize_repo_name(dataset_name)
+ repo_id = f"{org}/{repo_name}"
+
+ start_time = time.time()
+
+ try:
+ # Check if already exists
+ api = HfApi(token=token)
+ if check_repo_exists(api, repo_id):
+ logger.info(f" Repository {repo_id} already exists, skipping")
+ return UploadResult(
+ dataset_name=dataset_name,
+ doi=doi,
+ repo_id=repo_id,
+ status='skipped',
+ url=f"https://huggingface.co/datasets/{repo_id}",
+ error="Repository already exists"
+ )
+
+ # Push to hub
+ url = push_to_hub(
+ dataset=dataset,
+ repo_id=repo_id,
+ token=token,
+ private=private,
+ )
+
+ duration = time.time() - start_time
+ logger.info(f" Successfully pushed to {url} ({duration:.1f}s)")
+
+ return UploadResult(
+ dataset_name=dataset_name,
+ doi=doi,
+ repo_id=repo_id,
+ status='success',
+ url=url,
+ duration_seconds=duration
+ )
+
+ except Exception as e:
+ duration = time.time() - start_time
+ error_msg = str(e)
+ logger.error(f" Failed: {error_msg}")
+
+ return UploadResult(
+ dataset_name=dataset_name,
+ doi=doi,
+ repo_id=repo_id,
+ status='failed',
+ error=error_msg,
+ duration_seconds=duration
+ )
+
+
+def main():
+ parser = argparse.ArgumentParser(
+ description="Batch push Foundry datasets to HuggingFace Hub",
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ epilog=__doc__
+ )
+ parser.add_argument(
+ '--org',
+ default='foundry-ml',
+ help='HuggingFace organization name (default: foundry-ml)'
+ )
+ parser.add_argument(
+ '--token',
+ help='HuggingFace API token (or set HF_TOKEN env var)'
+ )
+ parser.add_argument(
+ '--private',
+ action='store_true',
+ help='Create private repositories'
+ )
+ parser.add_argument(
+ '--dry-run',
+ action='store_true',
+ help='List datasets without uploading'
+ )
+ parser.add_argument(
+ '--limit',
+ type=int,
+ help='Only process first N datasets'
+ )
+ parser.add_argument(
+ '--skip',
+ type=int,
+ default=0,
+ help='Skip first N datasets'
+ )
+ parser.add_argument(
+ '--dataset',
+ help='Process only this specific dataset name'
+ )
+ parser.add_argument(
+ '--output',
+ help='Save results to JSON file'
+ )
+
+ args = parser.parse_args()
+
+ # Import Foundry
+ try:
+ from foundry import Foundry
+ except ImportError:
+ logger.error("Foundry not installed. Run: pip install foundry-ml")
+ sys.exit(1)
+
+ # Check HF dependencies
+ try:
+ from datasets import Dataset
+ from huggingface_hub import HfApi
+ except ImportError:
+ logger.error(
+ "HuggingFace dependencies not installed.\n"
+ "Run: pip install foundry-ml[huggingface]"
+ )
+ sys.exit(1)
+
+ # Get token (skip for dry run)
+ token = None
+ if not args.dry_run:
+ try:
+ token = get_hf_token(args.token)
+ logger.info(f"Using HuggingFace token: {token[:10]}...")
+ except ValueError as e:
+ logger.error(str(e))
+ sys.exit(1)
+
+ # Initialize Foundry
+ logger.info("Connecting to Foundry...")
+ f = Foundry()
+
+ # List all datasets
+ logger.info("Fetching dataset list...")
+ datasets_df = f.list()
+ total_datasets = len(datasets_df)
+ logger.info(f"Found {total_datasets} datasets")
+
+ # Apply filters
+ if args.dataset:
+ datasets_df = datasets_df[datasets_df['dataset_name'] == args.dataset]
+ if len(datasets_df) == 0:
+ logger.error(f"Dataset '{args.dataset}' not found")
+ sys.exit(1)
+
+ if args.skip:
+ datasets_df = datasets_df.iloc[args.skip:]
+ logger.info(f"Skipping first {args.skip} datasets")
+
+ if args.limit:
+ datasets_df = datasets_df.iloc[:args.limit]
+ logger.info(f"Limiting to {args.limit} datasets")
+
+ # Preview
+ print("\n" + "=" * 60)
+ print("DATASETS TO PROCESS")
+ print("=" * 60)
+ for i, (_, row) in enumerate(datasets_df.iterrows()):
+ name = row.get('dataset_name', 'N/A')
+ title = str(row.get('title', 'N/A'))[:50]
+ repo_name = sanitize_repo_name(name)
+ print(f"{i+1:3}. {args.org}/{repo_name}")
+ print(f" Title: {title}")
+ print("=" * 60 + "\n")
+
+ if args.dry_run:
+ logger.info("Dry run complete. No datasets were uploaded.")
+ return
+
+ # Confirm
+ response = input(f"Push {len(datasets_df)} datasets to {args.org}? [y/N] ")
+ if response.lower() != 'y':
+ logger.info("Aborted.")
+ return
+
+ # Process datasets
+ results: List[UploadResult] = []
+ success_count = 0
+ failed_count = 0
+ skipped_count = 0
+
+ print("\n" + "=" * 60)
+ print("UPLOADING")
+ print("=" * 60)
+
+ for i, (_, row) in enumerate(datasets_df.iterrows()):
+ name = row.get('dataset_name', 'N/A')
+ doi = row.get('DOI', 'N/A')
+ # Use the pre-loaded FoundryDataset object if available
+ foundry_dataset = row.get('FoundryDataset', None)
+
+ logger.info(f"[{i+1}/{len(datasets_df)}] Processing: {name}")
+
+ try:
+ # Use pre-loaded dataset or fetch it
+ if foundry_dataset is not None:
+ dataset = foundry_dataset
+ else:
+ dataset = f.get_dataset(doi)
+
+ # Push to HF
+ result = push_dataset(
+ dataset=dataset,
+ org=args.org,
+ token=token,
+ private=args.private,
+ )
+ results.append(result)
+
+ if result.status == 'success':
+ success_count += 1
+ elif result.status == 'skipped':
+ skipped_count += 1
+ else:
+ failed_count += 1
+
+ except Exception as e:
+ logger.error(f" Error loading dataset: {e}")
+ results.append(UploadResult(
+ dataset_name=name,
+ doi=doi,
+ repo_id=f"{args.org}/{sanitize_repo_name(name)}",
+ status='failed',
+ error=str(e)
+ ))
+ failed_count += 1
+
+ # Brief pause to avoid rate limiting
+ time.sleep(1)
+
+ # Summary
+ print("\n" + "=" * 60)
+ print("SUMMARY")
+ print("=" * 60)
+ print(f"Total processed: {len(results)}")
+ print(f" Successful: {success_count}")
+ print(f" Skipped: {skipped_count}")
+ print(f" Failed: {failed_count}")
+
+ if failed_count > 0:
+ print("\nFailed datasets:")
+ for r in results:
+ if r.status == 'failed':
+ print(f" - {r.dataset_name}: {r.error}")
+
+ # Save results
+ if args.output:
+ output_data = {
+ 'timestamp': datetime.now().isoformat(),
+ 'organization': args.org,
+ 'total': len(results),
+ 'success': success_count,
+ 'skipped': skipped_count,
+ 'failed': failed_count,
+ 'results': [asdict(r) for r in results]
+ }
+        with open(args.output, 'w') as out_file:
+            json.dump(output_data, out_file, indent=2)
+ logger.info(f"Results saved to {args.output}")
+
+ # Print successful URLs
+ if success_count > 0:
+ print("\nSuccessfully uploaded:")
+ for r in results:
+ if r.status == 'success':
+ print(f" {r.url}")
+
+
+if __name__ == '__main__':
+ main()
diff --git a/setup.py b/setup.py
index 1783614e..811026a9 100644
--- a/setup.py
+++ b/setup.py
@@ -5,7 +5,7 @@
packages = (setuptools.find_packages(),)
setuptools.setup(
name="foundry_ml",
- version="1.0.4",
+ version="1.1.0",
author="""Aristana Scourtas, KJ Schmidt, Isaac Darling, Aadit Ambadkar, Braeden Cullen,
Imogen Foster, Ribhav Bose, Zoa Katok, Ethan Truelove, Ian Foster, Ben Blaiszik""",
author_email="blaiszik@uchicago.edu",
@@ -14,7 +14,7 @@
long_description=long_description,
long_description_content_type="text/markdown",
install_requires=[
- "mdf_forge>=0.8.0",
+ "mdf_toolbox>=0.6.0",
"globus-sdk>=3,<4",
"dlhub_sdk>=1.0.0",
"numpy>=1.15.4",
@@ -23,19 +23,38 @@
"mdf_connect_client>=0.5.0",
"h5py>=2.10.0",
"json2table",
- "openpyxl>=3.1.0"
+ "openpyxl>=3.1.0",
+ # CLI and agent support (core)
+ "typer[all]>=0.9.0",
+ "rich>=13.0.0",
],
- python_requires=">=3.7",
+ extras_require={
+ "huggingface": [
+ "datasets>=2.14.0",
+ "huggingface_hub>=0.17.0",
+ ],
+ },
+ entry_points={
+ "console_scripts": [
+ "foundry=foundry.__main__:main",
+ ],
+ },
+ python_requires=">=3.8",
classifiers=[
- "Development Status :: 3 - Alpha",
+ "Development Status :: 4 - Beta",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: MIT License",
"Natural Language :: English",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3",
+ "Programming Language :: Python :: 3.8",
+ "Programming Language :: Python :: 3.9",
+ "Programming Language :: Python :: 3.10",
+ "Programming Language :: Python :: 3.11",
+ "Programming Language :: Python :: 3.12",
"Topic :: Scientific/Engineering",
],
- keywords=[],
+ keywords=["materials science", "machine learning", "datasets", "MCP", "AI agents"],
license="MIT License",
url="https://github.com/MLMI2-CSSI/foundry",
)
diff --git a/tests/test_errors.py b/tests/test_errors.py
new file mode 100644
index 00000000..8b5f3564
--- /dev/null
+++ b/tests/test_errors.py
@@ -0,0 +1,199 @@
+"""Tests for the structured error classes."""
+
+import pytest
+
+from foundry.errors import (
+ FoundryError,
+ DatasetNotFoundError,
+ AuthenticationError,
+ DownloadError,
+ DataLoadError,
+ ValidationError,
+ PublishError,
+ CacheError,
+ ConfigurationError,
+)
+
+
+class TestFoundryError:
+ """Tests for the base FoundryError class."""
+
+ def test_foundry_error_has_required_fields(self):
+ """Test that FoundryError has all required fields."""
+ error = FoundryError(
+ code="TEST_ERROR",
+ message="Test error message",
+ details={"key": "value"},
+ recovery_hint="Try again"
+ )
+
+ assert error.code == "TEST_ERROR"
+ assert error.message == "Test error message"
+ assert error.details == {"key": "value"}
+ assert error.recovery_hint == "Try again"
+
+ def test_foundry_error_str_includes_code_and_hint(self):
+ """Test that str() includes code and recovery hint."""
+ error = FoundryError(
+ code="TEST_ERROR",
+ message="Test message",
+ recovery_hint="Do this to fix"
+ )
+
+ error_str = str(error)
+ assert "[TEST_ERROR]" in error_str
+ assert "Test message" in error_str
+ assert "Do this to fix" in error_str
+
+ def test_foundry_error_to_dict(self):
+ """Test serialization to dict for JSON responses."""
+ error = FoundryError(
+ code="TEST_ERROR",
+ message="Test message",
+ details={"foo": "bar"},
+ recovery_hint="Fix it"
+ )
+
+ d = error.to_dict()
+ assert d["code"] == "TEST_ERROR"
+ assert d["message"] == "Test message"
+ assert d["details"] == {"foo": "bar"}
+ assert d["recovery_hint"] == "Fix it"
+
+ def test_foundry_error_is_exception(self):
+ """Test that FoundryError can be raised and caught."""
+ with pytest.raises(FoundryError) as exc_info:
+ raise FoundryError(code="TEST", message="Test")
+
+ assert exc_info.value.code == "TEST"
+
+
+class TestDatasetNotFoundError:
+ """Tests for DatasetNotFoundError."""
+
+ def test_dataset_not_found_error(self):
+ """Test DatasetNotFoundError initialization."""
+ error = DatasetNotFoundError("bandgap")
+
+ assert error.code == "DATASET_NOT_FOUND"
+ assert "bandgap" in error.message
+ assert error.details["query"] == "bandgap"
+ assert error.recovery_hint is not None
+
+ def test_dataset_not_found_error_search_type(self):
+ """Test DatasetNotFoundError with different search types."""
+ error = DatasetNotFoundError("10.18126/abc", search_type="DOI")
+
+ assert "DOI" in error.message
+ assert error.details["search_type"] == "DOI"
+
+
+class TestAuthenticationError:
+ """Tests for AuthenticationError."""
+
+ def test_authentication_error(self):
+ """Test AuthenticationError initialization."""
+ error = AuthenticationError("Globus", "Token expired")
+
+ assert error.code == "AUTH_FAILED"
+ assert "Globus" in error.message
+ assert "Token expired" in error.message
+ assert error.details["service"] == "Globus"
+ assert error.recovery_hint is not None
+
+
+class TestDownloadError:
+ """Tests for DownloadError."""
+
+ def test_download_error(self):
+ """Test DownloadError initialization."""
+ error = DownloadError(
+ url="https://example.com/file.dat",
+ reason="Connection timeout",
+ destination="/tmp/file.dat"
+ )
+
+ assert error.code == "DOWNLOAD_FAILED"
+ assert "https://example.com/file.dat" in error.message
+ assert error.details["url"] == "https://example.com/file.dat"
+ assert error.details["destination"] == "/tmp/file.dat"
+
+
+class TestDataLoadError:
+ """Tests for DataLoadError."""
+
+ def test_data_load_error(self):
+ """Test DataLoadError initialization."""
+ error = DataLoadError(
+ file_path="/data/dataset.json",
+ reason="Invalid JSON",
+ data_type="tabular"
+ )
+
+ assert error.code == "DATA_LOAD_FAILED"
+ assert "/data/dataset.json" in error.message
+ assert error.details["data_type"] == "tabular"
+
+
+class TestValidationError:
+ """Tests for ValidationError."""
+
+ def test_validation_error(self):
+ """Test ValidationError initialization."""
+ error = ValidationError(
+ field_name="creators",
+ error_msg="Field required",
+ schema_type="datacite"
+ )
+
+ assert error.code == "VALIDATION_FAILED"
+ assert "creators" in error.message
+ assert error.details["schema_type"] == "datacite"
+
+
+class TestPublishError:
+ """Tests for PublishError."""
+
+ def test_publish_error(self):
+ """Test PublishError initialization."""
+ error = PublishError(
+ reason="Metadata validation failed",
+ source_id="my_dataset_v1.0",
+ status="failed"
+ )
+
+ assert error.code == "PUBLISH_FAILED"
+ assert error.details["source_id"] == "my_dataset_v1.0"
+ assert error.details["status"] == "failed"
+
+
+class TestCacheError:
+ """Tests for CacheError."""
+
+ def test_cache_error(self):
+ """Test CacheError initialization."""
+ error = CacheError(
+ operation="write",
+ reason="Disk full",
+ cache_path="/tmp/cache"
+ )
+
+ assert error.code == "CACHE_ERROR"
+ assert "write" in error.message
+ assert error.details["cache_path"] == "/tmp/cache"
+
+
+class TestConfigurationError:
+ """Tests for ConfigurationError."""
+
+ def test_configuration_error(self):
+ """Test ConfigurationError initialization."""
+ error = ConfigurationError(
+ setting="use_globus",
+ reason="Invalid value",
+ current_value="maybe"
+ )
+
+ assert error.code == "CONFIG_ERROR"
+ assert "use_globus" in error.message
+ assert error.details["current_value"] == "maybe"
diff --git a/tests/test_https_download.py b/tests/test_https_download.py
index f992fe0e..3890f165 100644
--- a/tests/test_https_download.py
+++ b/tests/test_https_download.py
@@ -1,67 +1,99 @@
-# import os
-# import requests
-# import mock
-
-# from foundry.https_download import download_file
-
-
-# def test_download_file(tmp_path):
-# item = {
-# "path": tmp_path,
-# "name": "example_file.txt"
-# }
-# data_directory = tmp_path
-# https_config = {
-# "base_url": "https://example.com/",
-# "source_id": "12345"
-# }
-
-# # Mock the requests.get function to return a response with content
-# with mock.patch.object(requests, "get") as mock_get:
-# mock_get.return_value.content = b"Example file content"
-
-# # Call the function
-# result = download_file(item, data_directory, https_config)
-
-# # Assert that the file was downloaded and written correctly
-# assert os.path.exists(str(tmp_path) + "/12345/example_file.txt")
-# with open(str(tmp_path) + "/12345/example_file.txt", "rb") as f:
-# assert f.read() == b"Example file content"
-
-# # Assert that the result is as expected
-# assert result == {str(tmp_path) + "/12345/example_file.txt status": True}
-
-
-# def test_download_file_with_existing_directories(tmp_path):
-# temp_path_to_file = str(tmp_path) + '/file'
-# os.mkdir(temp_path_to_file)
-# temp_path_to_data = str(tmp_path) + '/data'
-# os.mkdir(temp_path_to_data)
-
-# item = {
-# "path": temp_path_to_file,
-# "name": "example_file.txt"
-# }
-# data_directory = temp_path_to_data
-# https_config = {
-# "base_url": "https://example.com/",
-# "source_id": "12345"
-# }
-
-# # Create the parent directories
-# os.makedirs(temp_path_to_data + "12345")
-
-# # Mock the requests.get function to return a response with content
-# with mock.patch.object(requests, "get") as mock_get:
-# mock_get.return_value.content = b"Example file content"
-
-# # Call the function
-# result = download_file(item, data_directory, https_config)
-
-# # Assert that the file was downloaded and written correctly
-# assert os.path.exists(temp_path_to_data + "/12345/example_file.txt")
-# with open(temp_path_to_data + "/12345/example_file.txt", "rb") as f:
-# assert f.read() == b"Example file content"
-
-# # Assert that the result is as expected
-# assert result == {temp_path_to_data + "/12345/example_file.txt status": True}
+"""Tests for https_download module."""
+
+import os
+import pytest
+from unittest import mock
+
+import requests
+
+from foundry.https_download import download_file, DownloadError
+
+
+class TestDownloadFile:
+ """Tests for the download_file function."""
+
+ def test_download_file_success(self, tmp_path):
+ """Test successful file download."""
+ item = {
+ "path": "/data",
+ "name": "example_file.txt"
+ }
+ https_config = {
+ "base_url": "https://example.com/",
+ "source_id": "test_dataset"
+ }
+
+        # Mock a successful streaming response
+ mock_response = mock.Mock()
+ mock_response.iter_content = mock.Mock(return_value=[b"Example file content"])
+ mock_response.raise_for_status = mock.Mock()
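+        # __enter__/__exit__ are stubbed in case download_file consumes the
+        # response as a context manager ("with requests.get(...) as r:")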
+ mock_response.__enter__ = mock.Mock(return_value=mock_response)
+ mock_response.__exit__ = mock.Mock(return_value=False)
+
+ with mock.patch.object(requests, "get", return_value=mock_response):
+ result = download_file(item, str(tmp_path), https_config)
+
+ # Assert file was downloaded
+ expected_path = tmp_path / "test_dataset" / "example_file.txt"
+ assert os.path.exists(expected_path)
+ assert result == str(expected_path)
+
+ def test_download_file_request_error(self, tmp_path):
+        """Test that a RequestException from requests.get is surfaced as DownloadError."""
+ item = {
+ "path": "/data",
+ "name": "example_file.txt"
+ }
+ https_config = {
+ "base_url": "https://example.com/",
+ "source_id": "test_dataset"
+ }
+
+ # Mock request failure
+ with mock.patch.object(requests, "get", side_effect=requests.exceptions.RequestException("Connection failed")):
+ with pytest.raises(DownloadError) as exc_info:
+ download_file(item, str(tmp_path), https_config)
+
+ error = exc_info.value
+ assert error.url == "https://example.com/data/example_file.txt"
+ assert "Connection failed" in error.reason
+ assert error.destination is not None
+
+ def test_download_file_io_error(self, tmp_path):
+        """Test that an IOError while writing the file is surfaced as DownloadError."""
+ item = {
+ "path": "/data",
+ "name": "example_file.txt"
+ }
+ https_config = {
+ "base_url": "https://example.com/",
+ "source_id": "test_dataset"
+ }
+
+        # Mock a successful response, then make the file write fail
+ mock_response = mock.Mock()
+ mock_response.iter_content = mock.Mock(return_value=[b"data"])
+ mock_response.raise_for_status = mock.Mock()
+ mock_response.__enter__ = mock.Mock(return_value=mock_response)
+ mock_response.__exit__ = mock.Mock(return_value=False)
+
+ with mock.patch.object(requests, "get", return_value=mock_response):
+ with mock.patch("builtins.open", side_effect=IOError("Disk full")):
+ with pytest.raises(DownloadError) as exc_info:
+ download_file(item, str(tmp_path), https_config)
+
+ error = exc_info.value
+ assert "Disk full" in error.reason
+
+ def test_download_error_has_structured_info(self):
+ """Test that DownloadError provides structured information."""
+ error = DownloadError(
+ url="https://example.com/file.txt",
+ reason="Connection timeout",
+ destination="/tmp/file.txt"
+ )
+
+ assert error.url == "https://example.com/file.txt"
+ assert error.reason == "Connection timeout"
+ assert error.destination == "/tmp/file.txt"
+ assert "Connection timeout" in str(error)
diff --git a/tests/test_new_features.py b/tests/test_new_features.py
new file mode 100644
index 00000000..2f40dcc6
--- /dev/null
+++ b/tests/test_new_features.py
@@ -0,0 +1,265 @@
+"""Tests for new features: as_json, include_schema, get_schema, MCP tools."""
+
+import pytest
+from unittest import mock
+
+from foundry import Foundry, FoundryDataset
+from tests.test_data import datacite_data, valid_metadata
+
+
+class TestAsJsonParameter:
+    """Tests for the dataset-to-dict conversion that backs the as_json parameter on search and list."""
+
+ def test_dataset_to_dict_method(self):
+        """Test that _dataset_to_dict converts a dataset into a plain dict of core metadata."""
+ # Create a mock dataset
+ mock_ds = mock.Mock()
+ mock_ds.dataset_name = "test_dataset"
+ mock_ds.dc.titles = [mock.Mock(title="Test Dataset")]
+ mock_ds.dc.identifier = mock.Mock()
+ mock_ds.dc.identifier.identifier = "10.18126/test"
+ mock_ds.dc.descriptions = [mock.Mock(description="A test dataset")]
+ mock_ds.dc.publicationYear = 2024
+ mock_ds.foundry_schema.keys = []
+ mock_ds.foundry_schema.splits = []
+ mock_ds.foundry_schema.data_type = "tabular"
+
+ # Test the _dataset_to_dict method directly
+ from foundry.foundry import Foundry
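+        # Call the helper unbound with None for self; it should only need the
+        # dataset argument, not any Foundry instance state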
+ result = Foundry._dataset_to_dict(None, mock_ds)
+
+ assert isinstance(result, dict)
+ assert result["name"] == "test_dataset"
+ assert result["title"] == "Test Dataset"
+ assert result["doi"] == "10.18126/test"
+ assert result["data_type"] == "tabular"
+
+ def test_dataset_to_dict_includes_fields_and_splits(self):
+ """Test that _dataset_to_dict includes fields and splits."""
+ mock_key = mock.Mock()
+ mock_key.key = ["band_gap"]
+
+ mock_split = mock.Mock()
+ mock_split.label = "train"
+
+ mock_ds = mock.Mock()
+ mock_ds.dataset_name = "test_dataset"
+ mock_ds.dc.titles = [mock.Mock(title="Test")]
+ mock_ds.dc.identifier = mock.Mock()
+ mock_ds.dc.identifier.identifier = "10.18126/test"
+ mock_ds.dc.descriptions = []
+ mock_ds.dc.publicationYear = 2024
+ mock_ds.foundry_schema.keys = [mock_key]
+ mock_ds.foundry_schema.splits = [mock_split]
+ mock_ds.foundry_schema.data_type = "tabular"
+
+ from foundry.foundry import Foundry
+ result = Foundry._dataset_to_dict(None, mock_ds)
+
+ assert result["fields"] == ["band_gap"]
+ assert result["splits"] == ["train"]
+ assert result["data_type"] == "tabular"
+
+
+class TestGetSchema:
+ """Tests for the get_schema method on FoundryDataset."""
+
+ def test_get_schema_returns_dict(self):
+ """Test that get_schema returns a dictionary with expected fields."""
+ ds = FoundryDataset(
+ dataset_name='test_dataset',
+ foundry_schema=valid_metadata,
+ datacite_entry=datacite_data
+ )
+
+ schema = ds.get_schema()
+
+ assert isinstance(schema, dict)
+ assert schema["name"] == "test_dataset"
+ assert "title" in schema
+ assert "doi" in schema
+ assert "data_type" in schema
+ assert "splits" in schema
+ assert "fields" in schema
+
+ def test_get_schema_includes_field_details(self):
+ """Test that get_schema includes field descriptions and units."""
+ ds = FoundryDataset(
+ dataset_name='test_dataset',
+ foundry_schema=valid_metadata,
+ datacite_entry=datacite_data
+ )
+
+ schema = ds.get_schema()
+
+ # Check that fields have the expected structure
+ assert len(schema["fields"]) > 0
+ field = schema["fields"][0]
+ assert "name" in field
+ assert "role" in field
+ assert "description" in field
+ assert "units" in field
+
+ def test_get_schema_includes_splits(self):
+ """Test that get_schema includes split information."""
+ ds = FoundryDataset(
+ dataset_name='test_dataset',
+ foundry_schema=valid_metadata,
+ datacite_entry=datacite_data
+ )
+
+ schema = ds.get_schema()
+
+ assert len(schema["splits"]) == 2 # train and test from valid_metadata
+ split_names = [s["name"] for s in schema["splits"]]
+ assert "train" in split_names
+ assert "test" in split_names
+
+
+class TestIncludeSchema:
+ """Tests for the include_schema parameter on get_as_dict."""
+
+ def test_include_schema_false_returns_data_only(self):
+ """Test that include_schema=False returns just data."""
+ ds = FoundryDataset(
+ dataset_name='test_dataset',
+ foundry_schema=valid_metadata,
+ datacite_entry=datacite_data
+ )
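+        # Stub the internal cache so get_as_dict returns canned data without
+        # reading any real files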
+ ds._foundry_cache = mock.Mock()
+ ds._foundry_cache.load_as_dict.return_value = {"train": ({"x": [1, 2]}, {"y": [0, 1]})}
+
+ result = ds.get_as_dict(include_schema=False)
+
+ assert "schema" not in result
+ assert "train" in result
+
+ def test_include_schema_true_returns_data_and_schema(self):
+ """Test that include_schema=True returns data with schema."""
+ ds = FoundryDataset(
+ dataset_name='test_dataset',
+ foundry_schema=valid_metadata,
+ datacite_entry=datacite_data
+ )
+ ds._foundry_cache = mock.Mock()
+ ds._foundry_cache.load_as_dict.return_value = {"train": ({"x": [1, 2]}, {"y": [0, 1]})}
+
+ result = ds.get_as_dict(include_schema=True)
+
+ assert "data" in result
+ assert "schema" in result
+ assert result["schema"]["name"] == "test_dataset"
+
+
+class TestMCPTools:
+    """Tests for the MCP server tool definitions."""
+
+ def test_search_datasets_tool(self):
+ """Test the search_datasets MCP tool."""
+ from foundry.mcp.server import TOOLS
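+        # TOOLS is assumed to be the static list of tool definitions that the
+        # MCP server advertises to clients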
+
+ search_tool = next(t for t in TOOLS if t["name"] == "search_datasets")
+
+ assert search_tool["name"] == "search_datasets"
+ assert "query" in search_tool["inputSchema"]["properties"]
+ assert "limit" in search_tool["inputSchema"]["properties"]
+ assert "query" in search_tool["inputSchema"]["required"]
+
+ def test_get_dataset_schema_tool(self):
+ """Test the get_dataset_schema MCP tool."""
+ from foundry.mcp.server import TOOLS
+
+ schema_tool = next(t for t in TOOLS if t["name"] == "get_dataset_schema")
+
+ assert schema_tool["name"] == "get_dataset_schema"
+ assert "doi" in schema_tool["inputSchema"]["properties"]
+ assert "doi" in schema_tool["inputSchema"]["required"]
+
+ def test_load_dataset_tool(self):
+ """Test the load_dataset MCP tool."""
+ from foundry.mcp.server import TOOLS
+
+ load_tool = next(t for t in TOOLS if t["name"] == "load_dataset")
+
+ assert load_tool["name"] == "load_dataset"
+ assert "doi" in load_tool["inputSchema"]["properties"]
+ assert "split" in load_tool["inputSchema"]["properties"]
+
+ def test_list_all_datasets_tool(self):
+ """Test the list_all_datasets MCP tool."""
+ from foundry.mcp.server import TOOLS
+
+ list_tool = next(t for t in TOOLS if t["name"] == "list_all_datasets")
+
+ assert list_tool["name"] == "list_all_datasets"
+ assert "limit" in list_tool["inputSchema"]["properties"]
+
+ def test_create_server(self):
+ """Test that create_server returns proper server config."""
+ from foundry.mcp.server import create_server
+
+ config = create_server()
+
+ assert config["name"] == "foundry-ml"
+ assert "version" in config
+ assert "tools" in config
+ assert len(config["tools"]) == 4
+
+
+class TestCLI:
+ """Tests for CLI commands."""
+
+ def test_cli_app_exists(self):
+ """Test that CLI app is properly configured."""
+ from foundry.__main__ import app
+
+ assert app is not None
+ assert app.info.name == "foundry"
+
+ def test_cli_has_search_command(self):
+ """Test that CLI has search command."""
+ from foundry.__main__ import search
+ assert search is not None
+ assert callable(search)
+
+ def test_cli_has_get_command(self):
+ """Test that CLI has get command."""
+ from foundry.__main__ import get
+ assert get is not None
+ assert callable(get)
+
+ def test_cli_has_schema_command(self):
+ """Test that CLI has schema command."""
+ from foundry.__main__ import schema
+ assert schema is not None
+ assert callable(schema)
+
+ def test_cli_has_catalog_command(self):
+ """Test that CLI has catalog command."""
+ from foundry.__main__ import catalog
+ assert catalog is not None
+ assert callable(catalog)
+
+ def test_cli_has_push_to_hf_command(self):
+ """Test that CLI has push_to_hf command."""
+ from foundry.__main__ import push_to_hf
+ assert push_to_hf is not None
+ assert callable(push_to_hf)
+
+ def test_cli_has_version_command(self):
+ """Test that CLI has version command."""
+ from foundry.__main__ import version
+ assert version is not None
+ assert callable(version)
+
+ def test_cli_has_mcp_start_command(self):
+ """Test that CLI has MCP start command."""
+ from foundry.__main__ import start
+ assert start is not None
+ assert callable(start)
+
+ def test_cli_has_mcp_install_command(self):
+ """Test that CLI has MCP install command."""
+ from foundry.__main__ import install
+ assert install is not None
+ assert callable(install)