diff --git a/README.md b/README.md index bc72d143..55f69ae6 100644 --- a/README.md +++ b/README.md @@ -6,105 +6,123 @@ [![PyPI](https://img.shields.io/pypi/v/foundry_ml.svg)](https://pypi.python.org/pypi/foundry_ml) [![Tests](https://github.com/MLMI2-CSSI/foundry/actions/workflows/tests.yml/badge.svg)](https://github.com/MLMI2-CSSI/foundry/actions/workflows/tests.yml) -[![Tests](https://github.com/MLMI2-CSSI/foundry/actions/workflows/python-publish.yml/badge.svg)](https://github.com/MLMI2-CSSI/foundry/actions/workflows/python-publish.yml) [![NSF-1931306](https://img.shields.io/badge/NSF-1931306-blue)](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1931306&HistoricalAwards=false) [](https://ai-materials-and-chemistry.gitbook.io/foundry/) +**Foundry-ML** simplifies access to machine learning-ready datasets in materials science and chemistry. -Foundry-ML simplifies the discovery and usage of ML-ready datasets in materials science and chemistry providing a simple API to access even complex datasets. -* Load ML-ready data with just a few lines of code -* Work with datasets in local or cloud environments. -* Publish your own datasets with Foundry to promote community usage -* (in progress) Run published ML models without hassle +- **Search & Load** - Find and use curated datasets with a few lines of code +- **Understand** - Rich schemas describe what each field means +- **Cite** - Automatic citation generation for publications +- **Publish** - Share your datasets with the community +- **AI-Ready** - MCP server for Claude and other AI assistants -Learn more and see our available datasets on [Foundry-ML.org](https://foundry-ml.org/) +## Quick Start +```bash +pip install foundry-ml +``` + +```python +from foundry import Foundry + +# Connect +f = Foundry() +# Search +results = f.search("band gap", limit=5) -# Documentation -Information on how to install and use Foundry is available in our documentation [here](https://ai-materials-and-chemistry.gitbook.io/foundry/v/docs/). +# Load +dataset = results.iloc[0].FoundryDataset +X, y = dataset.get_as_dict()['train'] -DLHub documentation for model publication and running information can be found [here](https://dlhub-sdk.readthedocs.io/en/latest/servable-publication.html). +# Understand +schema = dataset.get_schema() +print(schema['fields']) -# Quick Start -Install Foundry-ML via command line with: -`pip install foundry_ml` +# Cite +print(dataset.get_citation()) +``` + +## Cloud Environments -You can use the following code to import and instantiate Foundry-ML, then load a dataset. +For Google Colab or remote Jupyter: ```python -from foundry import Foundry -f = Foundry(index="mdf") +f = Foundry(no_browser=True, no_local_server=True) +``` +## CLI -f = f.load("10.18126/e73h-3w6n", globus=True) +```bash +foundry search "band gap" +foundry schema 10.18126/abc123 +foundry --help ``` -*NOTE*: If you run locally and don't want to install the [Globus Connect Personal endpoint](https://www.globus.org/globus-connect-personal), just set the `globus=False`. -If running this code in a notebook, a table of metadata for the dataset will appear: +## AI Agent Integration -metadata +```bash +foundry mcp install # Add to Claude Code +``` -We can use the data with `f.load_data()` and specifying splits such as `train` for different segments of the dataset, then use matplotlib to visualize it. 
+## Documentation -```python -res = f.load_data() +- [Getting Started](https://ai-materials-and-chemistry.gitbook.io/foundry/quickstart) +- [User Guide](https://ai-materials-and-chemistry.gitbook.io/foundry/) +- [API Reference](https://ai-materials-and-chemistry.gitbook.io/foundry/api/foundry) +- [Examples](./examples) + +## Features + +| Feature | Description | +|---------|-------------| +| Search | Find datasets by keyword, DOI, or browse catalog | +| Load | Automatic download, caching, and format conversion | +| PyTorch/TensorFlow | `dataset.get_as_torch()`, `dataset.get_as_tensorflow()` | +| CLI | Terminal-based workflows | +| MCP Server | AI assistant integration | +| HuggingFace Export | Publish to HuggingFace Hub | -imgs = res['train']['input']['imgs'] -desc = res['train']['input']['metadata'] -coords = res['train']['target']['coords'] +## Available Datasets -n_images = 3 -offset = 150 -key_list = list(res['train']['input']['imgs'].keys())[0+offset:n_images+offset] +Browse datasets at [Foundry-ML.org](https://foundry-ml.org/) or: -fig, axs = plt.subplots(1, n_images, figsize=(20,20)) -for i in range(n_images): - axs[i].imshow(imgs[key_list[i]]) - axs[i].scatter(coords[key_list[i]][:,0], coords[key_list[i]][:,1], s = 20, c = 'r', alpha=0.5) +```python +f = Foundry() +f.list(limit=20) # See available datasets ``` -Screen Shot 2022-10-20 at 2 22 43 PM -[See full examples](./examples) +## How to Cite -# How to Cite -If you find Foundry-ML useful, please cite the following [paper](https://doi.org/10.21105/joss.05467) +If you use Foundry-ML, please cite: -``` +```bibtex @article{Schmidt2024, doi = {10.21105/joss.05467}, - url = {https://doi.org/10.21105/joss.05467}, - year = {2024}, publisher = {The Open Journal}, + year = {2024}, + publisher = {The Open Journal}, volume = {9}, number = {93}, pages = {5467}, - author = {Kj Schmidt and Aristana Scourtas and Logan Ward and Steve Wangen and Marcus Schwarting and Isaac Darling and Ethan Truelove and Aadit Ambadkar and Ribhav Bose and Zoa Katok and Jingrui Wei and Xiangguo Li and Ryan Jacobs and Lane Schultz and Doyeon Kim and Michael Ferris and Paul M. Voyles and Dane Morgan and Ian Foster and Ben Blaiszik}, - title = {Foundry-ML - Software and Services to Simplify Access to Machine Learning Datasets in Materials Science}, journal = {Journal of Open Source Software} + author = {Kj Schmidt and Aristana Scourtas and Logan Ward and others}, + title = {Foundry-ML - Software and Services to Simplify Access to Machine Learning Datasets in Materials Science}, + journal = {Journal of Open Source Software} } ``` -# Contributing -Foundry is an Open Source project and we encourage contributions from the community. To contribute, please fork from the `main` branch and open a Pull Request on the `main` branch. A member of our team will review your PR shortly. +## Contributing -## Developer notes -In order to enforce consistency with external schemas for the metadata and datacite structures ([contained in the MDF data schema repository](https://github.com/materials-data-facility/data-schemas)) the `dc_model.py` and `project_model.py` pydantic data models (found in the `foundry/jsonschema_models` folder) were generated using the [datamodel-code-generator](https://github.com/koxudaxi/datamodel-code-generator/) tool. In order to ensure compliance with the flake8 linting, the `--use-annoted` flag was passed to ensure regex patterns in `dc_model.py` were specified using pydantic's `Annotated` type vs the soon to be deprecated `constr` type. 
The command used to run the datamodel-code-generator looks like: -``` -datamodel-codegen --input dc.json --output dc_model.py --use-annotated -``` +Foundry is open source. To contribute: -# Primary Support -This work was supported by the National Science Foundation under NSF Award Number: 1931306 "Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure". +1. Fork from `main` +2. Make your changes +3. Open a Pull Request -# Other Support -Foundry-ML brings together many components in the materials data ecosystem. Including [MAST-ML](https://mastmldocs.readthedocs.io/en/latest/), the [Data and Learning Hub for Science](https://www.dlhub.org) (DLHub), and the [Materials Data Facility](https://materialsdatafacility.org) (MDF). +See [CONTRIBUTING.md](docs/how-to-contribute/contributing.md) for details. -## MAST-ML -This work was supported by the National Science Foundation (NSF) SI2 award No. 1148011 and DMREF award number DMR-1332851 +## Support -## The Data and Learning Hub for Science (DLHub) -This material is based upon work supported by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. -https://www.dlhub.org +This work was supported by the National Science Foundation under NSF Award Number: 1931306 "Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure". -## The Materials Data Facility -This work was performed under financial assistance award 70NANB14H012 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the [Center for Hierarchical Material Design (CHiMaD)](http://chimad.northwestern.edu). This work was performed under the following financial assistance award 70NANB19H005 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design (CHiMaD). This work was also supported by the National Science Foundation as part of the [Midwest Big Data Hub](http://midwestbigdatahub.org) under NSF Award Number: 1636950 "BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, and Disseminate". -https://www.materialsdatafacility.org +Foundry integrates with [Materials Data Facility](https://materialsdatafacility.org), [DLHub](https://www.dlhub.org), and [MAST-ML](https://mastmldocs.readthedocs.io/). diff --git a/docs/README.md b/docs/README.md index abcc4c54..06a5e00c 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,43 +1,87 @@ -# Getting started with Foundry +# Introduction -![](.gitbook/assets/foundry-purple%20%283%29.png) +

+  Foundry logo

-## What is Foundry? +**Foundry-ML** is a Python library that simplifies access to machine learning-ready datasets in materials science and chemistry. -Foundry is a Python package that simplifies the discovery and usage of machine-learning ready datasets in materials science and chemistry. Foundry provides software tools that make it easy to load these datasets and work with them in local or cloud environments. Further, Foundry provides a dataset specification, and defined curation flows, that allow users to create new datasets for the community to use through this same interface. +## Features -## Installation +- **Search & Discover** - Find datasets by keyword or browse the catalog +- **Rich Metadata** - Understand datasets before downloading with detailed schemas +- **Easy Loading** - Get data in Python, PyTorch, or TensorFlow format +- **Automatic Caching** - Fast subsequent access after first download +- **Publishing** - Share your own datasets with the community +- **AI Integration** - MCP server for AI assistant access +- **CLI** - Terminal-based workflows -Foundry can be installed on any operating system with Python with pip +## Quick Example -```text -pip install foundry-ml +```python +from foundry import Foundry + +# Connect +f = Foundry() + +# Search for datasets +results = f.search("band gap", limit=5) + +# Load a dataset +dataset = results.iloc[0].FoundryDataset +X, y = dataset.get_as_dict()['train'] + +# Get citation for your paper +print(dataset.get_citation()) ``` -### Globus +## Installation -Foundry uses the Globus platform for authentication, search, and to optimize some data transfer operations. Follow the steps below to get set up. +```bash +pip install foundry-ml +``` -* [Create a free account.](https://app.globus.org) You can create a free account here with your institutional credentials or with free IDs \(GlobusID, Google, ORCID, etc\). -* [Set up a Globus Connect Personal endpoint ](https://www.globus.org/globus-connect-personal)_**\(optional\)**_. While this step is optional, some Foundry capabilities will work more efficiently when using GCP. +For cloud environments (Colab, remote Jupyter): -## Project Support +```python +f = Foundry(no_browser=True, no_local_server=True) +``` -This work was supported by the National Science Foundation under NSF Award Number: 1931306 "Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure". +## What's Next? -### Other Support + + + + + + +
-Foundry brings together many components in the materials data ecosystem. Including MAST-ML, the Data and Learning Hub for Science \(DLHub\), and The Materials Data Facility \(MDF\). +**Getting Started** +- [Installation](installation.md) +- [Quick Start](quickstart.md) -#### MAST-ML + -This work was supported by the National Science Foundation \(NSF\) SI2 award No. 1148011 and DMREF award number DMR-1332851 +**User Guide** +- [Searching](guide/searching.md) +- [Loading Data](guide/loading-data.md) +- [ML Frameworks](guide/ml-frameworks.md) -#### The Data and Learning Hub for Science \(DLHub\) + -This material is based upon work supported by Laboratory Directed Research and Development \(LDRD\) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. [https://www.dlhub.org](https://www.dlhub.org) +**Features** +- [CLI](features/cli.md) +- [MCP Server](features/mcp-server.md) +- [HuggingFace](features/huggingface.md) -#### The Materials Data Facility +
-This work was performed under financial assistance award 70NANB14H012 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the [Center for Hierarchical Material Design \(CHiMaD\)](http://chimad.northwestern.edu). This work was performed under the following financial assistance award 70NANB19H005 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design \(CHiMaD\). This work was also supported by the National Science Foundation as part of the [Midwest Big Data Hub](http://midwestbigdatahub.org) under NSF Award Number: 1636950 "BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design \(IMaD\): Leverage, Innovate, and Disseminate". [https://www.materialsdatafacility.org](https://www.materialsdatafacility.org) +## Project Support + +This work was supported by the National Science Foundation under NSF Award Number: 1931306 "Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure". +Foundry brings together components from: +- [Materials Data Facility (MDF)](https://materialsdatafacility.org) +- [Data and Learning Hub for Science (DLHub)](https://www.dlhub.org) +- [MAST-ML](https://mastmldocs.readthedocs.io/) diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index aac625cd..d7bb6a42 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -1,14 +1,48 @@ -# Table of contents +# Table of Contents -* [Getting started with Foundry](README.md) +## Getting Started -## How to contribute +* [Introduction](README.md) +* [Installation](installation.md) +* [Quick Start](quickstart.md) -* [Contribution Process](how-to-contribute/contributing.md) -* [Contributor Covenant](how-to-contribute/code_of_conduct.md) +## User Guide ---- +* [Searching for Datasets](guide/searching.md) +* [Loading Data](guide/loading-data.md) +* [Using with ML Frameworks](guide/ml-frameworks.md) +* [Dataset Schemas](guide/schemas.md) -* [Sphinx Autogenerated documentation - markdown](sphinx-autogenerated-documentation.md) -* [foundry package — Foundry\_test 1.1 documentation - HTML AUTOGENERATION](foundry-package-foundry_test-1.1-documentation-html-autogeneration.md) +## Features +* [Command Line Interface](features/cli.md) +* [MCP Server (AI Agents)](features/mcp-server.md) +* [HuggingFace Integration](features/huggingface.md) +* [Error Handling](features/errors.md) + +## Concepts + +* [Overview](concepts/overview.md) +* [Foundry Datasets](concepts/foundry-datasets.md) +* [Data Packages](concepts/foundry-data-packages.md) + +## Publishing + +* [Publishing Datasets](publishing/publishing-datasets.md) +* [Metadata Reference](publishing/metadata-reference.md) + +## Reference + +* [API Reference](api/foundry.md) +* [CLI Reference](api/cli-reference.md) +* [Configuration](api/configuration.md) + +## Community + +* [Contributing](how-to-contribute/contributing.md) +* [Code of Conduct](how-to-contribute/code_of_conduct.md) + +## Support + +* [Troubleshooting](support/troubleshooting.md) +* [FAQ](support/faq.md) diff --git a/docs/concepts/overview.md b/docs/concepts/overview.md index 19be629b..23a80ffc 100644 --- a/docs/concepts/overview.md +++ b/docs/concepts/overview.md @@ -1,11 +1,111 @@ # Overview -TODO: +Foundry-ML is a Python library that simplifies access to machine learning-ready datasets in materials science and chemistry. It provides a unified interface to discover, load, and use curated scientific datasets. 
-* Change the code snippet in the image -* Write the text :\) +## What is Foundry? -![](../.gitbook/assets/foundry-overview.png) +Foundry serves as a bridge between data producers (researchers who create datasets) and data consumers (researchers who use datasets for ML). It standardizes how datasets are: +- **Discovered** - Search by keyword, browse catalogs, or get by DOI +- **Described** - Rich metadata including field descriptions, units, and citations +- **Delivered** - Automatic download, caching, and format conversion +## Key Features +### For Data Users + +```python +from foundry import Foundry + +# Connect and search +f = Foundry() +results = f.search("band gap", limit=5) + +# Load a dataset +dataset = results.iloc[0].FoundryDataset +X, y = dataset.get_as_dict()['train'] + +# Understand the data +schema = dataset.get_schema() +print(schema['fields']) # What columns exist and what they mean +``` + +### For AI Agents + +Foundry includes an MCP (Model Context Protocol) server that enables AI assistants like Claude to discover and use datasets programmatically: + +```bash +foundry mcp install # Add to Claude Code +``` + +### For Data Publishers + +Share your datasets with the community using standardized metadata: + +```python +f.publish(metadata, data_path="./my_data", source_id="my_dataset_v1") +``` + +## Architecture + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Your Code │ +├─────────────────────────────────────────────────────────────┤ +│ Foundry Python API │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ Search │ │ Load │ │ Schema │ │ Publish │ │ +│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ +├─────────────────────────────────────────────────────────────┤ +│ Data Transport │ +│ HTTPS (default) │ Globus (large files) │ +├─────────────────────────────────────────────────────────────┤ +│ Materials Data Facility │ +│ (Storage, Metadata, DOI Registration) │ +└─────────────────────────────────────────────────────────────┘ +``` + +## Core Concepts + +### Datasets + +A Foundry dataset contains: +- **Data files** - The actual data (JSON, CSV, HDF5, etc.) 
+- **Schema** - Description of fields, types, and splits +- **Metadata** - DataCite-compliant citation information + +### Splits + +Datasets are organized into splits (e.g., `train`, `test`, `validation`) with input/target pairs: + +```python +data = dataset.get_as_dict() +X_train, y_train = data['train'] +X_test, y_test = data['test'] +``` + +### Keys (Fields) + +Each field in a dataset has: +- **Name** - The column/field identifier +- **Type** - `input` or `target` +- **Description** - What the field represents +- **Units** - Physical units (if applicable) + +## Ecosystem Integration + +Foundry integrates with the broader ML ecosystem: + +| Integration | Purpose | +|-------------|---------| +| **PyTorch** | `dataset.get_as_torch()` | +| **TensorFlow** | `dataset.get_as_tensorflow()` | +| **HuggingFace Hub** | Export datasets for broader visibility | +| **MCP Server** | AI agent access | +| **CLI** | Terminal-based workflows | + +## Next Steps + +- [Installation](../installation.md) - Get Foundry installed +- [Quick Start](../quickstart.md) - Load your first dataset in 5 minutes +- [Searching for Datasets](../guide/searching.md) - Find the right data diff --git a/docs/features/cli.md b/docs/features/cli.md new file mode 100644 index 00000000..4ab11896 --- /dev/null +++ b/docs/features/cli.md @@ -0,0 +1,175 @@ +# Command Line Interface + +Foundry includes a CLI for terminal-based workflows. + +## Basic Usage + +```bash +foundry --help +``` + +## Commands + +### Search Datasets + +```bash +# Search by keyword +foundry search "band gap" + +# Limit results +foundry search "band gap" --limit 10 + +# JSON output (for scripting) +foundry search "band gap" --json +``` + +### Get Dataset Info + +```bash +# Get info by DOI +foundry get 10.18126/abc123 + +# JSON output +foundry get 10.18126/abc123 --json +``` + +### View Schema + +See what fields a dataset contains: + +```bash +foundry schema 10.18126/abc123 +``` + +Output: +``` +Dataset: foundry_oqmd_band_gaps_v1.1 +Data Type: tabular + +Fields: + - composition (input): Chemical composition + - band_gap (target): Band gap value (eV) + +Splits: + - train + - test +``` + +### List All Datasets + +```bash +# List available datasets +foundry catalog + +# Limit results +foundry catalog --limit 20 + +# JSON output +foundry catalog --json +``` + +### Check Publication Status + +```bash +foundry status my_dataset_v1 +``` + +### Version + +```bash +foundry version +``` + +## HuggingFace Export + +Export a Foundry dataset to HuggingFace Hub: + +```bash +foundry push-to-hf 10.18126/abc123 --repo your-username/dataset-name +``` + +Options: +- `--repo` - HuggingFace repository ID (required) +- `--token` - HuggingFace API token (or set HF_TOKEN env var) +- `--private` - Create a private repository + +## MCP Server + +Start the MCP server for AI agent integration: + +```bash +# Start server +foundry mcp start + +# Install to Claude Code +foundry mcp install +``` + +See [MCP Server](mcp-server.md) for details. 
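+## Calling the CLI from Python
+
+The `--json` flag also makes the CLI easy to drive from other programs. A minimal sketch, assuming the `foundry` CLI is on your `PATH` and that the JSON records expose the `name` and `doi` fields used in the scripting examples below:
+
+```python
+import json
+import subprocess
+
+# Sketch: shell out to the CLI and parse its machine-readable output.
+# Assumes `foundry` is installed and on PATH.
+raw = subprocess.run(
+    ["foundry", "search", "band gap", "--limit", "5", "--json"],
+    capture_output=True, text=True, check=True,
+).stdout
+
+for record in json.loads(raw):
+    print(record.get("name"), "-", record.get("doi"))
+```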
+ +## JSON Output + +Most commands support `--json` for machine-readable output: + +```bash +# Pipe to jq for processing +foundry catalog --json | jq '.[].name' + +# Save to file +foundry search "crystal" --json > results.json +``` + +## Exit Codes + +| Code | Meaning | +|------|---------| +| 0 | Success | +| 1 | Error (see message) | + +## Environment Variables + +| Variable | Purpose | +|----------|---------| +| `HF_TOKEN` | HuggingFace API token | +| `GLOBUS_TOKEN` | Globus authentication | + +## Examples + +### Find and Download a Dataset + +```bash +# Search +foundry search "formation energy" --limit 5 + +# Get the DOI from results, then get details +foundry schema 10.18126/xyz789 + +# Use in Python +python -c " +from foundry import Foundry +f = Foundry() +ds = f.get_dataset('10.18126/xyz789') +print(ds.get_as_dict().keys()) +" +``` + +### Export to HuggingFace + +```bash +# Set token +export HF_TOKEN=hf_xxxxx + +# Export +foundry push-to-hf 10.18126/abc123 --repo materials-science/my-dataset +``` + +### Scripting with JSON + +```bash +#!/bin/bash +# Find all datasets with "band" in the name +foundry search "band" --json | jq -r '.[].doi' | while read doi; do + echo "Processing: $doi" + foundry schema "$doi" +done +``` diff --git a/docs/features/errors.md b/docs/features/errors.md new file mode 100644 index 00000000..a5a0392f --- /dev/null +++ b/docs/features/errors.md @@ -0,0 +1,254 @@ +# Error Handling + +Foundry uses structured error classes that provide clear context for both humans and AI agents. + +## Error Structure + +All Foundry errors include: + +```python +class FoundryError(Exception): + code: str # Machine-readable error code + message: str # Human-readable message + details: dict # Additional context + recovery_hint: str # How to fix the issue +``` + +## Error Types + +### DatasetNotFoundError + +Raised when a search or get operation returns no results. + +```python +from foundry.errors import DatasetNotFoundError + +try: + dataset = f.get_dataset("nonexistent-doi") +except DatasetNotFoundError as e: + print(e.code) # "DATASET_NOT_FOUND" + print(e.message) # "No dataset found matching..." + print(e.recovery_hint) # "Try a broader search term..." +``` + +### AuthenticationError + +Raised when authentication fails. + +```python +from foundry.errors import AuthenticationError + +try: + f = Foundry(use_globus=True) +except AuthenticationError as e: + print(e.code) # "AUTH_FAILED" + print(e.details) # {"service": "Globus"} + print(e.recovery_hint) # "Run Foundry(no_browser=False)..." +``` + +### DownloadError + +Raised when a file download fails. + +```python +from foundry.errors import DownloadError + +try: + data = dataset.get_as_dict() +except DownloadError as e: + print(e.code) # "DOWNLOAD_FAILED" + print(e.details) # {"url": "...", "reason": "..."} + print(e.recovery_hint) # "Check network connection..." +``` + +### DataLoadError + +Raised when data files cannot be parsed. + +```python +from foundry.errors import DataLoadError + +try: + data = dataset.get_as_dict() +except DataLoadError as e: + print(e.code) # "DATA_LOAD_FAILED" + print(e.details) # {"file_path": "...", "data_type": "..."} +``` + +### ValidationError + +Raised when metadata validation fails. + +```python +from foundry.errors import ValidationError + +try: + f.publish(invalid_metadata, ...) +except ValidationError as e: + print(e.code) # "VALIDATION_FAILED" + print(e.details) # {"field_name": "creators", "schema_type": "datacite"} +``` + +### PublishError + +Raised when dataset publication fails. 
+ +```python +from foundry.errors import PublishError + +try: + f.publish(metadata, data_path="./data", source_id="my_dataset") +except PublishError as e: + print(e.code) # "PUBLISH_FAILED" + print(e.details) # {"source_id": "...", "status": "..."} +``` + +### CacheError + +Raised when local cache operations fail. + +```python +from foundry.errors import CacheError + +try: + data = dataset.get_as_dict() +except CacheError as e: + print(e.code) # "CACHE_ERROR" + print(e.details) # {"operation": "write", "cache_path": "..."} +``` + +### ConfigurationError + +Raised when configuration is invalid. + +```python +from foundry.errors import ConfigurationError + +try: + f = Foundry(use_globus="maybe") # Invalid value +except ConfigurationError as e: + print(e.code) # "CONFIG_ERROR" + print(e.details) # {"setting": "use_globus", "current_value": "maybe"} +``` + +## Error Codes Reference + +| Code | Error Class | Common Causes | +|------|-------------|---------------| +| `DATASET_NOT_FOUND` | DatasetNotFoundError | Invalid DOI, no search results | +| `AUTH_FAILED` | AuthenticationError | Expired token, no credentials | +| `DOWNLOAD_FAILED` | DownloadError | Network issues, URL not found | +| `DATA_LOAD_FAILED` | DataLoadError | Corrupted file, wrong format | +| `VALIDATION_FAILED` | ValidationError | Missing required fields | +| `PUBLISH_FAILED` | PublishError | Server error, permission denied | +| `CACHE_ERROR` | CacheError | Disk full, permission denied | +| `CONFIG_ERROR` | ConfigurationError | Invalid parameter values | + +## Handling Errors + +### Basic Pattern + +```python +from foundry import Foundry +from foundry.errors import DatasetNotFoundError, DownloadError + +f = Foundry() + +try: + dataset = f.get_dataset("10.18126/abc123") + data = dataset.get_as_dict() +except DatasetNotFoundError as e: + print(f"Dataset not found: {e.message}") + print(f"Try: {e.recovery_hint}") +except DownloadError as e: + print(f"Download failed: {e.message}") + print(f"URL: {e.details.get('url')}") +``` + +### Catch All Foundry Errors + +```python +from foundry.errors import FoundryError + +try: + # Your code + pass +except FoundryError as e: + print(f"[{e.code}] {e.message}") + if e.recovery_hint: + print(f"Suggestion: {e.recovery_hint}") +``` + +### Serialization for APIs + +Errors can be serialized for JSON responses: + +```python +from foundry.errors import DatasetNotFoundError +import json + +error = DatasetNotFoundError("missing-dataset") +error_dict = error.to_dict() + +print(json.dumps(error_dict, indent=2)) +# { +# "code": "DATASET_NOT_FOUND", +# "message": "No dataset found matching query: 'missing-dataset'", +# "details": {"query": "missing-dataset", "search_type": "query"}, +# "recovery_hint": "Try a broader search term..." +# } +``` + +## For AI Agents + +Structured errors are designed for programmatic handling: + +```python +def handle_foundry_operation(operation): + try: + return operation() + except FoundryError as e: + return { + "success": False, + "error_code": e.code, + "message": e.message, + "recovery_action": e.recovery_hint + } +``` + +The `recovery_hint` field is particularly useful for agents to suggest next steps to users. 
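+Because `to_dict()` already bundles the code, message, details, and recovery hint, a tool wrapper can hand the serialized error straight back to the agent. A minimal sketch, assuming a hypothetical `load_dataset_tool` helper of your own:
+
+```python
+from foundry import Foundry
+from foundry.errors import FoundryError
+
+def load_dataset_tool(doi: str) -> dict:
+    """Hypothetical tool handler: return split names on success, a structured error otherwise."""
+    f = Foundry()
+    try:
+        dataset = f.get_dataset(doi)
+        return {"success": True, "splits": list(dataset.get_as_dict().keys())}
+    except FoundryError as e:
+        # e.to_dict() carries code, message, details, and recovery_hint
+        return {"success": False, **e.to_dict()}
+```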
+ +## Custom Error Handling + +### Retry Logic + +```python +import time +from foundry.errors import DownloadError + +def download_with_retry(dataset, max_retries=3): + for attempt in range(max_retries): + try: + return dataset.get_as_dict() + except DownloadError as e: + if attempt < max_retries - 1: + print(f"Retry {attempt + 1}/{max_retries}...") + time.sleep(2 ** attempt) # Exponential backoff + else: + raise +``` + +### Fallback Strategies + +```python +from foundry.errors import DownloadError + +try: + # Try HTTPS first (default) + f = Foundry() + data = dataset.get_as_dict() +except DownloadError: + # Fall back to Globus + f = Foundry(use_globus=True) + data = dataset.get_as_dict() +``` diff --git a/docs/features/huggingface.md b/docs/features/huggingface.md new file mode 100644 index 00000000..5f6fa6bb --- /dev/null +++ b/docs/features/huggingface.md @@ -0,0 +1,235 @@ +# HuggingFace Integration + +Export Foundry datasets to HuggingFace Hub to increase visibility and enable discovery by the broader ML community. + +## Installation + +```bash +pip install foundry-ml[huggingface] +``` + +## Quick Start + +### Python API + +```python +from foundry import Foundry +from foundry.integrations.huggingface import push_to_hub + +# Get a dataset +f = Foundry() +dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset + +# Export to HuggingFace Hub +url = push_to_hub( + dataset, + repo_id="your-username/dataset-name", + token="hf_xxxxx" # Or set HF_TOKEN env var +) +print(f"Published at: {url}") +``` + +### CLI + +```bash +# Set your HuggingFace token +export HF_TOKEN=hf_xxxxx + +# Export a dataset +foundry push-to-hf 10.18126/abc123 --repo your-username/dataset-name +``` + +## What Gets Created + +When you export a dataset, Foundry creates: + +### 1. Data Files + +The dataset is converted to HuggingFace's format (Parquet/Arrow) with all splits preserved: + +``` +dataset/ + train/ + data-00000.parquet + test/ + data-00000.parquet +``` + +### 2. Dataset Card (README.md) + +A comprehensive README is auto-generated from the Foundry metadata: + +```markdown +--- +license: cc-by-4.0 +tags: + - materials-science + - foundry-ml +--- + +# Band Gap Dataset + +Calculated band gaps for 50,000 materials... + +## Fields +| Field | Role | Description | Units | +|-------|------|-------------|-------| +| composition | input | Chemical formula | - | +| band_gap | target | DFT band gap | eV | + +## Citation +@article{...} + +## Source +Original DOI: 10.18126/abc123 +``` + +### 3. Metadata + +HuggingFace-compatible metadata including: +- License information +- Task categories +- Tags for discoverability +- Size information + +## API Reference + +### push_to_hub + +```python +def push_to_hub( + dataset, # FoundryDataset object + repo_id: str, # HF Hub repo (e.g., "org/name") + token: str = None, # HF API token + private: bool = False, + split: str = None # Specific split to export +) -> str: # Returns URL +``` + +**Parameters:** + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `dataset` | FoundryDataset | Yes | Dataset from Foundry | +| `repo_id` | str | Yes | HuggingFace repository ID | +| `token` | str | No | API token (uses cached if not provided) | +| `private` | bool | No | Create private repository | +| `split` | str | No | Export specific split only | + +**Returns:** URL of the created dataset + +## Author Attribution + +**Important:** The authors listed on HuggingFace come from the original DataCite metadata, not the person pushing. 
This preserves proper scientific attribution. + +```python +# The original creators from DataCite metadata +authors = dataset.dc.creators +# e.g., [{"creatorName": "Smith, John"}, {"creatorName": "Doe, Jane"}] +``` + +## Examples + +### Export All Splits + +```python +from foundry import Foundry +from foundry.integrations.huggingface import push_to_hub + +f = Foundry() +ds = f.get_dataset("10.18126/abc123") + +url = push_to_hub(ds, "materials-science/band-gaps") +``` + +### Export Single Split + +```python +url = push_to_hub( + ds, + "materials-science/band-gaps-train", + split="train" +) +``` + +### Private Repository + +```python +url = push_to_hub( + ds, + "my-org/private-dataset", + private=True +) +``` + +### Using Environment Variable + +```bash +export HF_TOKEN=hf_xxxxx +``` + +```python +# Token is picked up automatically +url = push_to_hub(ds, "org/name") +``` + +## CLI Options + +```bash +foundry push-to-hf --repo [options] + +Options: + --repo TEXT HuggingFace repository ID (required) + --token TEXT HuggingFace API token + --private Create private repository + --help Show this message +``` + +## Best Practices + +### Repository Naming + +Use descriptive, lowercase names with hyphens: +- Good: `materials-science/oqmd-band-gaps` +- Bad: `my_dataset_v1` + +### Organization + +Consider creating an organization for your lab/group: +- `your-lab/dataset-1` +- `your-lab/dataset-2` + +### Documentation + +The auto-generated README is a starting point. Consider adding: +- More detailed description +- Example usage code +- Related papers +- Acknowledgments + +## Troubleshooting + +### Authentication Failed + +```python +# Ensure you're logged in +from huggingface_hub import login +login() # Interactive login + +# Or set token explicitly +push_to_hub(ds, "org/name", token="hf_xxxxx") +``` + +### Repository Already Exists + +HuggingFace won't overwrite existing repos by default. Either: +1. Use a different repo name +2. Delete the existing repo first +3. Use the HuggingFace web interface to update + +### Large Datasets + +For very large datasets (>10GB), the upload may take time. Consider: +- Exporting specific splits: `split="train"` +- Using a stable internet connection +- Running in a cloud environment diff --git a/docs/features/mcp-server.md b/docs/features/mcp-server.md new file mode 100644 index 00000000..d4e8e807 --- /dev/null +++ b/docs/features/mcp-server.md @@ -0,0 +1,218 @@ +# MCP Server (AI Agent Integration) + +Foundry includes an MCP (Model Context Protocol) server that enables AI assistants like Claude to discover and use materials science datasets. + +## What is MCP? + +MCP is a protocol that allows AI assistants to use external tools. With Foundry's MCP server, you can ask Claude: + +- "Find me a materials science dataset for band gap prediction" +- "What fields are in the OQMD dataset?" +- "Load the training data and show me the first few rows" + +## Quick Start + +### Install for Claude Code + +```bash +foundry mcp install +``` + +This adds Foundry to your Claude Code configuration. Restart Claude Code to activate. + +### Manual Start + +For custom integrations: + +```bash +foundry mcp start +``` + +## Available Tools + +The MCP server provides these tools to AI agents: + +### search_datasets + +Search for materials science datasets. 
+ +**Parameters:** +- `query` (string, required) - Search terms +- `limit` (integer, optional) - Maximum results (default: 10) + +**Returns:** List of datasets with name, title, DOI, description + +**Example prompt:** "Search for datasets about crystal structures" + +### get_dataset_schema + +Get the schema of a dataset - what fields it contains. + +**Parameters:** +- `doi` (string, required) - The dataset DOI + +**Returns:** Schema with splits, fields, data types, and descriptions + +**Example prompt:** "What fields are in dataset 10.18126/abc123?" + +### load_dataset + +Load a dataset and return its data with schema. + +**Parameters:** +- `doi` (string, required) - The dataset DOI +- `split` (string, optional) - Which split to load (default: all) + +**Returns:** Data with schema information and citation + +**Example prompt:** "Load the training data from the band gap dataset" + +### list_all_datasets + +List all available Foundry datasets. + +**Parameters:** +- `limit` (integer, optional) - Maximum results (default: 100) + +**Returns:** Complete catalog of available datasets + +**Example prompt:** "What datasets are available in Foundry?" + +## Configuration + +### Claude Code + +The `foundry mcp install` command adds this to your Claude Code config: + +```json +{ + "mcpServers": { + "foundry-ml": { + "command": "foundry", + "args": ["mcp", "start"] + } + } +} +``` + +### Custom Integration + +For other MCP-compatible clients, the server uses stdio transport: + +```python +from foundry.mcp.server import create_server + +config = create_server() +# config contains server name, version, and tool definitions +``` + +## Example Conversations + +### Finding a Dataset + +**You:** Find me a dataset for predicting band gaps of materials + +**Claude:** I'll search for band gap datasets in Foundry. + +*Uses search_datasets tool* + +I found 5 datasets related to band gaps: +1. **OQMD Band Gaps** (10.18126/abc) - 50,000 materials +2. **AFLOW Band Gaps** (10.18126/def) - 30,000 materials +... + +### Understanding a Dataset + +**You:** What's in the OQMD band gaps dataset? + +**Claude:** Let me get the schema for that dataset. + +*Uses get_dataset_schema tool* + +The OQMD Band Gaps dataset contains: +- **Inputs:** composition (chemical formula), structure (crystal structure) +- **Targets:** band_gap (eV) +- **Splits:** train (80%), test (20%) + +### Loading Data + +**You:** Load the training data and show me some examples + +**Claude:** I'll load the training split. + +*Uses load_dataset tool* + +Here are the first 5 rows: +| composition | band_gap | +|-------------|----------| +| Si | 1.12 | +| GaAs | 1.42 | +... + +## Troubleshooting + +### Server Not Starting + +Ensure Foundry is installed correctly: + +```bash +pip install --upgrade foundry-ml +foundry version +``` + +### Tools Not Available + +Restart Claude Code after installing: + +```bash +foundry mcp install +# Restart Claude Code +``` + +### Authentication Issues + +For datasets requiring authentication, ensure you've authenticated: + +```python +from foundry import Foundry +f = Foundry() # This triggers auth flow if needed +``` + +## Technical Details + +### Protocol + +The MCP server uses JSON-RPC 2.0 over stdio. + +### Server Info + +```python +from foundry.mcp.server import create_server + +config = create_server() +print(config) +# { +# "name": "foundry-ml", +# "version": "1.1.0", +# "tools": [...] 
+# } +``` + +### Tool Definitions + +Each tool follows the MCP tool schema: + +```python +{ + "name": "search_datasets", + "description": "Search for materials science datasets...", + "inputSchema": { + "type": "object", + "properties": { + "query": {"type": "string", "description": "..."}, + "limit": {"type": "integer", "default": 10} + }, + "required": ["query"] + } +} +``` diff --git a/docs/guide/loading-data.md b/docs/guide/loading-data.md new file mode 100644 index 00000000..351e86bd --- /dev/null +++ b/docs/guide/loading-data.md @@ -0,0 +1,165 @@ +# Loading Data + +Once you've found a dataset, here's how to load and use it. + +## Basic Loading + +```python +from foundry import Foundry + +f = Foundry() +results = f.search("band gap", limit=1) +dataset = results.iloc[0].FoundryDataset + +# Load all data +data = dataset.get_as_dict() +``` + +## Understanding the Data Structure + +Most datasets have this structure: + +```python +data = { + 'train': (X_train, y_train), # Inputs and targets + 'test': (X_test, y_test), +} +``` + +Access training data: + +```python +X_train, y_train = data['train'] +``` + +## Loading Specific Splits + +```python +# Load only training data +train_data = dataset.get_as_dict(split='train') + +# Load only test data +test_data = dataset.get_as_dict(split='test') +``` + +## Loading with Schema + +Get data and metadata together: + +```python +result = dataset.get_as_dict(include_schema=True) + +data = result['data'] +schema = result['schema'] + +print(f"Dataset: {schema['name']}") +print(f"Fields: {schema['fields']}") +``` + +## Data Types + +### Tabular Data + +Most common format - dictionaries of arrays: + +```python +X, y = data['train'] + +# X might be: +# {'composition': [...], 'structure': [...]} + +# y might be: +# {'band_gap': [...]} +``` + +### Working with DataFrames + +```python +import pandas as pd + +X, y = data['train'] +df = pd.DataFrame(X) +df['target'] = list(y.values())[0] +``` + +## HDF5 Data + +For large datasets, use lazy loading: + +```python +data = dataset.get_as_dict(as_hdf5=True) +# Returns h5py objects that load on access +``` + +## Caching + +Data is cached locally after first download: + +```python +# First call downloads +data = dataset.get_as_dict() # Slow + +# Subsequent calls use cache +data = dataset.get_as_dict() # Fast +``` + +### Custom Cache Location + +```python +f = Foundry(local_cache_dir="/path/to/cache") +``` + +### Clear Cache + +```python +f.delete_dataset_cache("dataset_name") +``` + +## Common Patterns + +### Train/Test Split + +```python +data = dataset.get_as_dict() + +X_train, y_train = data['train'] +X_test, y_test = data['test'] + +# Train model +from sklearn.ensemble import RandomForestRegressor +model = RandomForestRegressor() +model.fit(pd.DataFrame(X_train), list(y_train.values())[0]) +``` + +### Single Target Column + +```python +X, y = data['train'] +target_name = list(y.keys())[0] # Get first target +target_values = y[target_name] +``` + +### Multiple Inputs + +```python +X, y = data['train'] + +# Combine inputs into DataFrame +import pandas as pd +df = pd.DataFrame(X) +print(df.columns) # See all input features +``` + +## Error Handling + +```python +from foundry.errors import DownloadError, DataLoadError + +try: + data = dataset.get_as_dict() +except DownloadError as e: + print(f"Download failed: {e.message}") + print(f"Try: {e.recovery_hint}") +except DataLoadError as e: + print(f"Could not load data: {e.message}") +``` diff --git a/docs/guide/ml-frameworks.md b/docs/guide/ml-frameworks.md new file mode 
100644 index 00000000..9fbaa898 --- /dev/null +++ b/docs/guide/ml-frameworks.md @@ -0,0 +1,190 @@ +# Using with ML Frameworks + +Foundry integrates with popular ML frameworks. + +## PyTorch + +### Load as PyTorch Dataset + +```python +# Get PyTorch-compatible dataset +torch_dataset = dataset.get_as_torch(split='train') + +# Use with DataLoader +from torch.utils.data import DataLoader + +loader = DataLoader(torch_dataset, batch_size=32, shuffle=True) + +for batch in loader: + inputs, targets = batch + # Train your model +``` + +### Full Training Example + +```python +import torch +import torch.nn as nn +from torch.utils.data import DataLoader +from foundry import Foundry + +# Load data +f = Foundry() +ds = f.search("band gap", limit=1).iloc[0].FoundryDataset + +train_dataset = ds.get_as_torch(split='train') +test_dataset = ds.get_as_torch(split='test') + +train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True) +test_loader = DataLoader(test_dataset, batch_size=32) + +# Define model +model = nn.Sequential( + nn.Linear(input_size, 64), + nn.ReLU(), + nn.Linear(64, 1) +) + +# Train +optimizer = torch.optim.Adam(model.parameters()) +criterion = nn.MSELoss() + +for epoch in range(10): + for inputs, targets in train_loader: + optimizer.zero_grad() + outputs = model(inputs) + loss = criterion(outputs, targets) + loss.backward() + optimizer.step() +``` + +## TensorFlow + +### Load as tf.data.Dataset + +```python +# Get TensorFlow-compatible dataset +tf_dataset = dataset.get_as_tensorflow(split='train') + +# Batch and prefetch +tf_dataset = tf_dataset.batch(32).prefetch(1) + +# Use in training +model.fit(tf_dataset, epochs=10) +``` + +### Full Training Example + +```python +import tensorflow as tf +from foundry import Foundry + +# Load data +f = Foundry() +ds = f.search("band gap", limit=1).iloc[0].FoundryDataset + +train_ds = ds.get_as_tensorflow(split='train') +test_ds = ds.get_as_tensorflow(split='test') + +train_ds = train_ds.batch(32).prefetch(tf.data.AUTOTUNE) +test_ds = test_ds.batch(32) + +# Define model +model = tf.keras.Sequential([ + tf.keras.layers.Dense(64, activation='relu'), + tf.keras.layers.Dense(1) +]) + +model.compile( + optimizer='adam', + loss='mse', + metrics=['mae'] +) + +# Train +model.fit(train_ds, validation_data=test_ds, epochs=10) +``` + +## Scikit-learn + +Use the dictionary format: + +```python +from sklearn.ensemble import RandomForestRegressor +from sklearn.model_selection import cross_val_score +import pandas as pd +from foundry import Foundry + +# Load data +f = Foundry() +ds = f.search("band gap", limit=1).iloc[0].FoundryDataset +data = ds.get_as_dict() + +X_train, y_train = data['train'] +X_test, y_test = data['test'] + +# Convert to arrays +X_train_df = pd.DataFrame(X_train) +y_train_arr = list(y_train.values())[0] + +# Train +model = RandomForestRegressor(n_estimators=100) +model.fit(X_train_df, y_train_arr) + +# Evaluate +X_test_df = pd.DataFrame(X_test) +y_test_arr = list(y_test.values())[0] +score = model.score(X_test_df, y_test_arr) +print(f"R² score: {score:.3f}") +``` + +## Generic Python + +For any framework, use the dictionary format: + +```python +data = dataset.get_as_dict() +X, y = data['train'] + +# X is a dict: {'feature1': [...], 'feature2': [...]} +# y is a dict: {'target': [...]} + +# Convert as needed for your framework +import numpy as np +X_array = np.column_stack([X[k] for k in X.keys()]) +y_array = np.array(list(y.values())[0]) +``` + +## Tips + +### Check Data Shape + +```python +data = dataset.get_as_dict() +X, y = 
data['train'] + +print(f"Features: {list(X.keys())}") +print(f"Targets: {list(y.keys())}") +print(f"Samples: {len(list(X.values())[0])}") +``` + +### Handle Missing Values + +```python +import pandas as pd + +X, y = data['train'] +df = pd.DataFrame(X) +print(df.isnull().sum()) # Check for missing values +df = df.fillna(0) # Or handle as appropriate +``` + +### Feature Engineering + +```python +# Get schema to understand features +schema = dataset.get_schema() + +for field in schema['fields']: + print(f"{field['name']}: {field['description']} ({field['units']})") +``` diff --git a/docs/guide/schemas.md b/docs/guide/schemas.md new file mode 100644 index 00000000..12a4319e --- /dev/null +++ b/docs/guide/schemas.md @@ -0,0 +1,179 @@ +# Dataset Schemas + +Schemas describe what data a dataset contains, helping you understand before you load. + +## Getting the Schema + +```python +from foundry import Foundry + +f = Foundry() +dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset + +schema = dataset.get_schema() +``` + +## Schema Structure + +```python +{ + 'name': 'foundry_oqmd_band_gaps_v1.1', + 'title': 'OQMD Band Gaps Dataset', + 'doi': '10.18126/abc123', + 'description': 'Band gaps calculated using DFT...', + 'data_type': 'tabular', + 'fields': [...], + 'splits': [...] +} +``` + +## Fields + +Fields describe each column/feature in the dataset: + +```python +for field in schema['fields']: + print(f"Name: {field['name']}") + print(f"Role: {field['role']}") # 'input' or 'target' + print(f"Description: {field['description']}") + print(f"Units: {field['units']}") + print("---") +``` + +Example output: +``` +Name: composition +Role: input +Description: Chemical composition formula +Units: None +--- +Name: band_gap +Role: target +Description: DFT-calculated band gap +Units: eV +--- +``` + +## Splits + +Splits show how data is divided: + +```python +for split in schema['splits']: + print(f"{split['name']}: {split.get('type', 'data')}") +``` + +Example output: +``` +train: train +test: test +``` + +## Data Types + +The `data_type` field indicates the format: + +| Type | Description | +|------|-------------| +| `tabular` | Rows and columns (most common) | +| `hierarchical` | Nested/tree structure | +| `image` | Image data | + +## Using Schema Information + +### Filter by Field Role + +```python +input_fields = [f for f in schema['fields'] if f['role'] == 'input'] +target_fields = [f for f in schema['fields'] if f['role'] == 'target'] + +print(f"Inputs: {[f['name'] for f in input_fields]}") +print(f"Targets: {[f['name'] for f in target_fields]}") +``` + +### Check Units + +```python +for field in schema['fields']: + if field['units']: + print(f"{field['name']}: {field['units']}") +``` + +### Include Schema with Data + +```python +result = dataset.get_as_dict(include_schema=True) + +data = result['data'] +schema = result['schema'] + +# Now you have both together +X, y = data['train'] +print(f"Loading {schema['name']}...") +``` + +## CLI Schema + +```bash +foundry schema 10.18126/abc123 +``` + +Output: +``` +Dataset: foundry_oqmd_band_gaps_v1.1 +Title: OQMD Band Gaps Dataset +DOI: 10.18126/abc123 +Data Type: tabular + +Fields: + [input ] composition: Chemical composition formula + [target] band_gap: DFT-calculated band gap (eV) + +Splits: + - train + - test +``` + +## Best Practices + +### Always Check Schema First + +```python +# Before loading (no download) +schema = dataset.get_schema() +print(f"Fields: {len(schema['fields'])}") +print(f"Splits: {[s['name'] for s in schema['splits']]}") + +# 
If it looks right, load +data = dataset.get_as_dict() +``` + +### Validate Data Against Schema + +```python +schema = dataset.get_schema() +data = dataset.get_as_dict() + +X, y = data['train'] + +input_names = [f['name'] for f in schema['fields'] if f['role'] == 'input'] +for name in input_names: + if name not in X: + print(f"Warning: {name} not in data") +``` + +### Document Your Usage + +```python +schema = dataset.get_schema() +print(f""" +Using dataset: {schema['name']} +DOI: {schema['doi']} + +Features used: +{chr(10).join(f"- {f['name']}: {f['description']}" for f in schema['fields'] if f['role'] == 'input')} + +Target: +{chr(10).join(f"- {f['name']}: {f['description']}" for f in schema['fields'] if f['role'] == 'target')} +""") +``` diff --git a/docs/guide/searching.md b/docs/guide/searching.md new file mode 100644 index 00000000..a40016a1 --- /dev/null +++ b/docs/guide/searching.md @@ -0,0 +1,129 @@ +# Searching for Datasets + +Foundry provides multiple ways to discover datasets. + +## Keyword Search + +Search by topic, material, or property: + +```python +from foundry import Foundry + +f = Foundry() + +# Search by keyword +results = f.search("band gap") +results = f.search("crystal structure") +results = f.search("formation energy") +``` + +### Limit Results + +```python +results = f.search("band gap", limit=5) +``` + +### JSON Output + +For programmatic access: + +```python +results = f.search("band gap", as_json=True) + +for ds in results: + print(f"{ds['name']}: {ds['title']}") +``` + +## Browse the Catalog + +List all available datasets: + +```python +# List datasets +catalog = f.list(limit=20) + +# As JSON +catalog = f.list(as_json=True) +``` + +## Get by DOI + +If you know the DOI: + +```python +dataset = f.get_dataset("10.18126/abc123") +``` + +## Search Results + +Search returns a DataFrame with columns: + +| Column | Description | +|--------|-------------| +| `dataset_name` | Unique identifier | +| `title` | Human-readable title | +| `DOI` | Digital Object Identifier | +| `year` | Publication year | +| `FoundryDataset` | Dataset object for loading | + +## Accessing Datasets + +From search results: + +```python +# By index +dataset = results.iloc[0].FoundryDataset + +# By name +dataset = results.get_dataset_by_name("foundry_oqmd_band_gaps_v1.1") + +# By DOI +dataset = results.get_dataset_by_doi("10.18126/abc123") +``` + +## CLI Search + +```bash +# Search from terminal +foundry search "band gap" + +# Limit results +foundry search "band gap" --limit 5 + +# JSON output +foundry search "band gap" --json +``` + +## Tips + +### Broad vs. 
Specific + +```python +# Broad (more results) +f.search("energy") + +# Specific (fewer, more relevant) +f.search("formation energy DFT") +``` + +### Check What's Available + +```python +# See all datasets first +all_ds = f.list(limit=100) +print(f"Total datasets: {len(all_ds)}") + +# Then search +results = f.search("your topic") +``` + +### Inspect Before Loading + +```python +# Check schema before downloading +dataset = results.iloc[0].FoundryDataset +schema = dataset.get_schema() + +print(f"Fields: {[f['name'] for f in schema['fields']]}") +print(f"Splits: {[s['name'] for s in schema['splits']]}") +``` diff --git a/docs/installation.md b/docs/installation.md new file mode 100644 index 00000000..559edac9 --- /dev/null +++ b/docs/installation.md @@ -0,0 +1,117 @@ +# Installation + +## Requirements + +- Python 3.8 or higher +- pip package manager + +## Basic Installation + +Install Foundry-ML from PyPI: + +```bash +pip install foundry-ml +``` + +This installs the core package with HTTPS download support. No additional setup required. + +## Optional Dependencies + +### HuggingFace Integration + +To export datasets to HuggingFace Hub: + +```bash +pip install foundry-ml[huggingface] +``` + +### All Optional Dependencies + +```bash +pip install foundry-ml[all] +``` + +## Verify Installation + +```python +from foundry import Foundry + +f = Foundry() +print("Foundry installed successfully!") + +# Test search +results = f.search("band gap", limit=1) +print(f"Found {len(results)} datasets") +``` + +## Cloud Environments + +### Google Colab + +Foundry works in Colab without additional setup: + +```python +!pip install foundry-ml + +from foundry import Foundry +f = Foundry(no_browser=True, no_local_server=True) +``` + +### Jupyter Notebooks + +For Jupyter running on remote servers: + +```python +from foundry import Foundry +f = Foundry(no_browser=True, no_local_server=True) +``` + +## Globus Setup (Optional) + +For large dataset transfers, you can use Globus instead of HTTPS: + +1. Install [Globus Connect Personal](https://www.globus.org/globus-connect-personal) +2. Start the Globus endpoint +3. Enable Globus in Foundry: + +```python +f = Foundry(use_globus=True) +``` + +**Note:** HTTPS is the default and works for most use cases. Only use Globus if you're transferring very large datasets (>10GB) or have institutional Globus endpoints. + +## Troubleshooting + +### Import Errors + +If you get import errors, ensure you have the latest version: + +```bash +pip install --upgrade foundry-ml +``` + +### Network Issues + +Foundry requires internet access to search and download datasets. If behind a proxy: + +```python +import os +os.environ['HTTP_PROXY'] = 'http://proxy:port' +os.environ['HTTPS_PROXY'] = 'http://proxy:port' + +from foundry import Foundry +f = Foundry() +``` + +### Cache Location + +By default, datasets are cached in your home directory. To change: + +```python +f = Foundry(local_cache_dir="/path/to/cache") +``` + +## Next Steps + +- [Quick Start](quickstart.md) - Load your first dataset +- [CLI](features/cli.md) - Use Foundry from the command line diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 00000000..433c56f4 --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,111 @@ +# Quick Start + +Load your first materials science dataset in under 5 minutes. + +## 1. Install + +```bash +pip install foundry-ml +``` + +## 2. 
Connect + +```python +from foundry import Foundry + +f = Foundry() +``` + +For cloud environments (Colab, remote Jupyter): + +```python +f = Foundry(no_browser=True, no_local_server=True) +``` + +## 3. Search + +Find datasets by keyword: + +```python +results = f.search("band gap", limit=5) +results +``` + +Output: +``` + dataset_name ... +0 foundry_oqmd_band_gaps_v1.1 ... +1 foundry_aflow_band_gaps_v1.1 ... +2 foundry_experimental_band_gaps_v1.1 ... +... +``` + +## 4. Load + +Get a dataset and load its data: + +```python +# Get the first result +dataset = results.iloc[0].FoundryDataset + +# Load training data +data = dataset.get_as_dict() +X_train, y_train = data['train'] + +print(f"Samples: {len(X_train)}") +``` + +## 5. Understand + +Check what fields the dataset contains: + +```python +schema = dataset.get_schema() + +print(f"Dataset: {schema['name']}") +print(f"Fields:") +for field in schema['fields']: + print(f" - {field['name']} ({field['role']})") +``` + +## 6. Cite + +Get the citation for publications: + +```python +print(dataset.get_citation()) +``` + +## Complete Example + +```python +from foundry import Foundry + +# Connect +f = Foundry() + +# Search +results = f.search("band gap", limit=5) + +# Load +dataset = results.iloc[0].FoundryDataset +X, y = dataset.get_as_dict()['train'] + +# Use with sklearn +from sklearn.ensemble import RandomForestRegressor +model = RandomForestRegressor() +# model.fit(X, y) # Train your model + +# Cite +print(dataset.get_citation()) +``` + +## What's Next? + +| Task | Guide | +|------|-------| +| Search effectively | [Searching for Datasets](guide/searching.md) | +| Load different formats | [Loading Data](guide/loading-data.md) | +| Use with PyTorch/TensorFlow | [ML Frameworks](guide/ml-frameworks.md) | +| Use from terminal | [CLI](features/cli.md) | +| Publish your data | [Publishing Datasets](publishing/publishing-datasets.md) | diff --git a/docs/support/faq.md b/docs/support/faq.md new file mode 100644 index 00000000..4ef5c431 --- /dev/null +++ b/docs/support/faq.md @@ -0,0 +1,211 @@ +# Frequently Asked Questions + +## General + +### What is Foundry? + +Foundry-ML is a Python library for discovering and loading machine learning-ready datasets in materials science and chemistry. It provides standardized access to curated scientific datasets with rich metadata. + +### Is Foundry free? + +Yes. Foundry is open source and the datasets are freely available. Some datasets may have specific licenses - check the citation information for details. + +### Do I need to create an account? + +No account is required for basic usage with HTTPS download. Some features (like Globus transfers) may require authentication. + +## Installation + +### What Python version do I need? + +Python 3.8 or higher. + +### How do I install Foundry? + +```bash +pip install foundry-ml +``` + +### I get import errors after installing + +Try upgrading: + +```bash +pip install --upgrade foundry-ml +``` + +## Data Loading + +### Why is my first download slow? + +Data is downloaded on first access and cached locally. Subsequent loads are fast. + +### Where is data cached? + +By default, in your home directory. To change: + +```python +f = Foundry(local_cache_dir="/path/to/cache") +``` + +### How do I clear the cache? + +```python +f.delete_dataset_cache("dataset_name") +``` + +### Can I use Foundry offline? + +You need internet to search and download datasets. Once cached, data loads locally. + +## Cloud Environments + +### How do I use Foundry in Google Colab? 
+ +```python +!pip install foundry-ml + +from foundry import Foundry +f = Foundry(no_browser=True, no_local_server=True) +``` + +### Does it work with Jupyter on a remote server? + +Yes, use the same settings: + +```python +f = Foundry(no_browser=True, no_local_server=True) +``` + +## Data Format + +### What format is the data in? + +Most datasets use a dictionary format: + +```python +data = { + 'train': (X_dict, y_dict), + 'test': (X_dict, y_dict) +} +``` + +### How do I get a pandas DataFrame? + +```python +import pandas as pd + +X, y = data['train'] +df = pd.DataFrame(X) +``` + +### Does it work with PyTorch? + +Yes: + +```python +torch_dataset = dataset.get_as_torch(split='train') +``` + +### Does it work with TensorFlow? + +Yes: + +```python +tf_dataset = dataset.get_as_tensorflow(split='train') +``` + +## Publishing + +### How do I publish my own dataset? + +See [Publishing Datasets](../publishing/publishing-datasets.md) for the full workflow. + +### What metadata format is required? + +Foundry uses DataCite-compliant metadata. See [Metadata Reference](../publishing/metadata-reference.md). + +### Can I update a published dataset? + +Create a new version with an updated source_id (e.g., `my_dataset_v2`). + +## Globus + +### Do I need Globus? + +No. HTTPS download is the default and works for most use cases. + +### When should I use Globus? + +For very large datasets (>10GB) or if you have institutional Globus endpoints. + +### How do I enable Globus? + +```python +f = Foundry(use_globus=True) +``` + +You'll need [Globus Connect Personal](https://www.globus.org/globus-connect-personal) running. + +## AI Integration + +### How do I use Foundry with Claude? + +Install the MCP server: + +```bash +foundry mcp install +``` + +Restart Claude Code. You can now ask Claude to find and load datasets. + +### What AI assistants are supported? + +Any MCP-compatible assistant. Currently tested with Claude Code. + +## HuggingFace + +### Can I export to HuggingFace Hub? + +Yes: + +```bash +pip install foundry-ml[huggingface] +foundry push-to-hf 10.18126/abc123 --repo your-username/dataset-name +``` + +### Who is listed as author on HuggingFace? + +The original dataset creators from the DataCite metadata, not the person pushing. + +## Troubleshooting + +### I get "Dataset not found" + +Check: +1. The DOI is correct +2. Try a broader search term +3. List available datasets: `f.list()` + +### Download keeps failing + +Try: +1. Check your internet connection +2. Try again (transient errors) +3. If using Globus, switch to HTTPS: `f = Foundry(use_globus=False)` + +### The data format is unexpected + +Check the schema first: + +```python +schema = dataset.get_schema() +print(schema['fields']) +print(schema['splits']) +``` + +## More Help + +- [Documentation](https://ai-materials-and-chemistry.gitbook.io/foundry/) +- [GitHub Issues](https://github.com/MLMI2-CSSI/foundry/issues) +- [Troubleshooting](troubleshooting.md) diff --git a/examples/00_hello_foundry/README.md b/examples/00_hello_foundry/README.md new file mode 100644 index 00000000..192a339d --- /dev/null +++ b/examples/00_hello_foundry/README.md @@ -0,0 +1,51 @@ +# Hello Foundry! + +This is the **beginner-friendly** example for Foundry-ML. + +No domain expertise required - just Python basics. + +## What You'll Learn + +1. How to connect to Foundry +2. How to search for datasets +3. How to load data into Python +4. How to explore dataset schemas +5. 
How to get proper citations + +## Quick Start + +```python +from foundry import Foundry + +# Connect +f = Foundry() + +# Search +results = f.search("band gap", limit=5) + +# Load +dataset = results.iloc[0].FoundryDataset +data = dataset.get_as_dict() + +# Use +X, y = data['train'] +``` + +## Running in Google Colab + +For cloud environments, use: + +```python +f = Foundry(no_browser=True, no_local_server=True) +``` + +## Next Steps + +After this example, check out: +- `/examples/bandgap/` - Working with band gap datasets +- `/examples/publishing/` - How to publish your own datasets + +## Need Help? + +- Documentation: https://github.com/MLMI2-CSSI/foundry +- CLI help: `foundry --help` diff --git a/examples/00_hello_foundry/hello_foundry.ipynb b/examples/00_hello_foundry/hello_foundry.ipynb new file mode 100644 index 00000000..36b0b65e --- /dev/null +++ b/examples/00_hello_foundry/hello_foundry.ipynb @@ -0,0 +1,371 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Hello Foundry! 🚀\n", + "\n", + "Welcome to Foundry-ML! This notebook will show you how to:\n", + "\n", + "1. **Search** for materials science datasets\n", + "2. **Load** a dataset into Python\n", + "3. **Explore** the data\n", + "\n", + "No domain expertise required - just Python basics!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1: Install Foundry\n", + "\n", + "If you haven't already, install Foundry-ML:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Uncomment the line below to install\n", + "# !pip install foundry-ml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Import and Connect\n", + "\n", + "First, import Foundry and create a client. If you're running this in Google Colab or a cloud environment, use `no_browser=True`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "from foundry import Foundry\n\n# Create a Foundry client (uses HTTPS download by default)\n# For cloud environments (Colab, etc.), add: no_browser=True, no_local_server=True\nf = Foundry()" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Search for Datasets\n", + "\n", + "Let's search for datasets. You can search by keyword - no need to know the exact name!" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " dataset_name \\\n", + "0 foundry_g4mp2_solvation_v1.2 \n", + "1 foundry_assorted_computational_band_gaps_v1.1 \n", + "2 foundry_experimental_band_gaps_v1.1 \n", + "3 foundry_aflow_band_gaps_v1.1 \n", + "4 foundry_oqmd_band_gaps_v1.1 \n", + "\n", + " title year \\\n", + "0 DFT Estimates of Solvation Energy in Multiple ... root=2022 \n", + "1 Graph Network Based Deep Learning of Band Gaps... root=2021 \n", + "2 Graph Network Based Deep Learning of Band Gaps... root=2021 \n", + "3 Graph Network Based Deep Learning of Band Gaps... root=2021 \n", + "4 Graph Network Based Deep Learning of Band Gaps... root=2021 \n", + "\n", + " DOI FoundryDataset \n", + "0 10.18126/jos5-wj65 DFT Estimates of Solvation Energy in Multiple SolventsWard, Logan; Dandu, Naveen; Blaiszik, Ben; Narayanan, Badri; Assary, Rajeev S.; Redfern, Paul C.; Foster, Ian; Curtiss, Larry A.

DOI: 10.18126/jos5-wj65

Dataset

| Property | Value |
|----------|-------|
| short_name | g4mp2_solvation |
| data_type | tabular |
| task_type | supervised |
| domain | materials science, chemistry |
| n_items | 130258 |
| splits | train (path: g4mp2_data.json, label: train) |

Keys

| key | type | description | units |
|-----|------|-------------|-------|
| smiles_0 | input | Input SMILES string | |
| smiles_1 | input | SMILES string after relaxation | |
| inchi_0 | input | InChi after generating coordinates with CORINA | |
| inchi_1 | input | InChi after relaxation | |
| xyz | input | XYZ coordinates after relaxation | |
| atomic_charges | input | Atomic charges on each atom, as predicted from B3LYP | |
| A | input | Rotational constant, A | GHz |
| B | input | Rotational constant, B | GHz |
| C | input | Rotational constant, C | GHz |
| inchi_1 | input | InChi after relaxation | |
| n_electrons | input | Number of electrons | |
| n_heavy_atoms | input | Number of non-hydrogen atoms | |
| n_atom | input | Number of atoms in molecule | |
| mu | input | Dipole moment | D |
| alpha | input | Isotropic polarizability | a_0^3 |
| R2 | input | Electronic spatial extent | a_0^2 |
| cv | input | Heat capacity at 298.15K | cal/mol-K |
| g4mp2_hf298 | target | G4MP2 Standard Enthalpy of Formation, 298K | kcal/mol |
| bandgap | input | B3LYP Band gap energy | Ha |
| homo | input | B3LYP Energy of HOMO | Ha |
| lumo | input | B3LYP Energy of LUMO | Ha |
| zpe | input | B3LYP Zero point vibrational energy | Ha |
| u0 | input | B3LYP Internal energy at 0K | Ha |
| u | input | B3LYP Internal energy at 298.15K | Ha |
| h | input | B3LYP Enthalpy at 298.15K | Ha |
| u0_atom | input | B3LYP atomization energy at 0K | Ha |
| g | input | B3LYP Free energy at 298.15K | Ha |
| g4mp2_0k | target | G4MP2 Internal energy at 0K | Ha |
| g4mp2_energy | target | G4MP2 Internal energy at 298.15K | Ha |
| g4mp2_enthalpy | target | G4MP2 Enthalpy at 298.15K | Ha |
| g4mp2_free | target | G4MP2 Free energy at 0K | Ha |
| g4mp2_atom | target | G4MP2 atomization energy at 0K | Ha |
| sol_acetone | target | Solvation energy, acetone | kcal/mol |
| sol_acn | target | Solvation energy, acetonitrile | kcal/mol |
| sol_dmso | target | Solvation energy, dimethyl sulfoxide | kcal/mol |
| sol_ethanol | target | Solvation energy, ethanol | kcal/mol |
| sol_water | target | Solvation energy, water | kcal/mol |
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Get the first dataset from our search results\n", + "dataset = results.iloc[0].FoundryDataset\n", + "\n", + "# Display dataset info\n", + "dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Understand the Schema\n", + "\n", + "Before loading data, let's see what fields it contains:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# Get the schema - what columns/fields are in this dataset?\nschema = dataset.get_schema()\n\nprint(f\"Dataset: {schema['name']}\")\nprint(f\"Data Type: {schema['data_type']}\")\nprint(f\"\\nSplits: {[s['name'] for s in schema['splits']]}\")\nprint(f\"\\nFields:\")\nfor field in schema['fields']:\n print(f\" - {field['name']} ({field['role']}): {field['description'] or 'No description'}\")" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Load the Data\n", + "\n", + "Now let's load the actual data. Foundry handles downloading and caching automatically!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Processing records: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 3266.59it/s]\n", + "Transferring data: 0%| | 0/1 [00:00 understand -> load -> train.\"\"\"\n \n f = Foundry()\n \n # 1. Discover\n print(\"1. Searching for datasets...\")\n results = f.search(\"band gap\", limit=5, as_json=True)\n \n if not results:\n raise DatasetNotFoundError(\"band gap\")\n \n print(f\" Found {len(results)} datasets\")\n \n # 2. Understand\n print(\"\\n2. Getting dataset schema...\")\n dataset = f.list(limit=1).iloc[0].FoundryDataset\n schema = dataset.get_schema()\n \n print(f\" Dataset: {schema['name']}\")\n print(f\" Fields: {[f['name'] for f in schema['fields']]}\")\n print(f\" Splits: {[s['name'] for s in schema['splits']]}\")\n \n # 3. Load (with schema for context)\n print(\"\\n3. Loading data...\")\n result = dataset.get_as_dict(include_schema=True)\n \n data = result['data']\n print(f\" Loaded splits: {list(data.keys())}\")\n \n # 4. Train\n print(\"\\n4. Ready to train!\")\n if 'train' in data:\n X_train, y_train = data['train']\n print(f\" Training samples available\")\n \n # 5. Cite\n print(\"\\n5. 
Citation:\")\n print(dataset.get_citation())\n \n return dataset\n\n# Run it\ntry:\n ds = train_band_gap_model()\nexcept Exception as e:\n print(f\"Workflow failed: {e}\")" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## Summary\n\n### Publishing\n```python\nf.publish(metadata, data_path=\"./data\", source_id=\"my_dataset_v1\")\nf.check_status(\"my_dataset_v1\")\n```\n\n### HuggingFace Export\n```python\nfrom foundry.integrations.huggingface import push_to_hub\npush_to_hub(dataset, \"org/name\", token=\"hf_xxx\")\n```\n\n### CLI\n```bash\nfoundry search \"query\"\nfoundry schema \nfoundry mcp install\n```\n\n### Error Handling\n```python\nfrom foundry.errors import DatasetNotFoundError\ntry:\n f.get_dataset(doi)\nexcept DatasetNotFoundError as e:\n print(e.recovery_hint)\n```\n\n### Configuration\n```python\n# Default: HTTPS download (no Globus needed)\nf = Foundry()\n\n# For cloud environments (Colab, etc.)\nf = Foundry(no_browser=True, no_local_server=True)\n\n# For Globus transfers (large datasets, institutional endpoints)\nf = Foundry(use_globus=True)\n```\n\n---\n\n**You've completed the Foundry tutorial!**\n\n- Documentation: https://github.com/MLMI2-CSSI/foundry\n- Issues: https://github.com/MLMI2-CSSI/foundry/issues" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.9.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/examples/README.md b/examples/README.md index cd6dfc27..fbed38ea 100644 --- a/examples/README.md +++ b/examples/README.md @@ -1,8 +1,53 @@ # Examples using Foundry + If you're wondering how to get started with Foundry or want to see it in action, you're in the right place! -Each notebook walks through instantiating Foundry, loading data from Foundry, and working with the data in different ways. Some notebooks also use machine learning models with the data. +## Tutorials (Start Here) + +| # | Name | Time | Description | +|---|------|------|-------------| +| 00 | [Hello Foundry](./00_hello_foundry/) | 5 min | Absolute basics - your first dataset | +| 01 | [Quickstart](./01_quickstart/) | 5 min | Search, load, use in ML workflow | +| 02 | [Working with Data](./02_working_with_data/) | 15 min | Schemas, splits, PyTorch/TensorFlow | +| 03 | [Advanced Workflows](./03_advanced_workflows/) | 20 min | Publishing, HuggingFace, CLI, MCP | + +## Domain Examples + +Each folder contains a notebook and `requirements.txt` file. The notebooks can be run locally or in [Google Colab](https://colab.research.google.com/). 
+ +| Example | Domain | Description | +|---------|--------|-------------| +| [bandgap](./bandgap/) | Materials | Band gap prediction | +| [oqmd](./oqmd/) | Materials | Open Quantum Materials Database | +| [zeolite](./zeolite/) | Chemistry | Zeolite structure analysis | +| [dendrite-segmentation](./dendrite-segmentation/) | Imaging | Microscopy segmentation | +| [atom-position-finding](./atom-position-finding/) | Imaging | Atom localization | + +## Quick Reference + +```python +from foundry import Foundry + +f = Foundry() # HTTPS download by default +results = f.search("band gap", limit=5) +dataset = results.iloc[0].FoundryDataset +X, y = dataset.get_as_dict()['train'] +``` + +**Cloud environments (Colab, etc.):** +```python +f = Foundry(no_browser=True, no_local_server=True) +``` + +**For large datasets with Globus:** +```python +f = Foundry(use_globus=True) # Requires Globus Connect Personal +``` -Each folder contains a notebook and `requirements.txt` file. The notebooks can be run locally (using the `requirements.txt`) or in [Google Colab](https://colab.research.google.com/). +**CLI:** +```bash +foundry search "band gap" +foundry schema +``` -If you have any trouble with the notebooks, please check our [documentation](https://ai-materials-and-chemistry.gitbook.io/foundry/v/docs/) or create an issue on the repo. +If you have any trouble, check our [documentation](https://ai-materials-and-chemistry.gitbook.io/foundry/v/docs/) or create an issue. diff --git a/foundry/__init__.py b/foundry/__init__.py index 78883ed9..5721a2c7 100644 --- a/foundry/__init__.py +++ b/foundry/__init__.py @@ -3,3 +3,14 @@ from . import https_download # noqa F401 (import unused) from . import https_upload # noqa F401 (import unused) from .foundry_dataset import FoundryDataset # noqa F401 (import unused) +from .errors import ( # noqa F401 (import unused) + FoundryError, + DatasetNotFoundError, + AuthenticationError, + DownloadError, + DataLoadError, + ValidationError, + PublishError, + CacheError, + ConfigurationError, +) diff --git a/foundry/__main__.py b/foundry/__main__.py new file mode 100644 index 00000000..ca3b1c73 --- /dev/null +++ b/foundry/__main__.py @@ -0,0 +1,419 @@ +"""Foundry CLI - Beautiful command-line interface for materials science datasets. 
+ +Usage: + foundry search "bandgap" # Search datasets + foundry get # Download a dataset + foundry list # List available datasets + foundry schema # Show dataset schema + foundry status # Check publication status + foundry mcp start # Start MCP server for agent integration +""" + +import json +import sys +from typing import Optional + +import typer +from rich.console import Console +from rich.table import Table +from rich.panel import Panel +from rich.progress import Progress, SpinnerColumn, TextColumn +from rich import print as rprint + +app = typer.Typer( + name="foundry", + help="Foundry-ML: Materials science datasets for machine learning.", + add_completion=False, + no_args_is_help=True, +) +mcp_app = typer.Typer(help="MCP server commands for agent integration.") +app.add_typer(mcp_app, name="mcp") + +console = Console() + + +def get_foundry(no_browser: bool = True): + """Get a Foundry client instance.""" + from foundry import Foundry + return Foundry(no_browser=no_browser, no_local_server=True) + + +@app.command() +def search( + query: str = typer.Argument(..., help="Search query (e.g., 'bandgap', 'crystal structure')"), + limit: int = typer.Option(10, "--limit", "-l", help="Maximum number of results"), + output_json: bool = typer.Option(False, "--json", "-j", help="Output as JSON"), +): + """Search for datasets matching a query.""" + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + console=console, + transient=True, + ) as progress: + progress.add_task("Searching datasets...", total=None) + f = get_foundry() + results = f.search(query, limit=limit) + + if len(results) == 0: + console.print(f"[yellow]No datasets found matching '{query}'[/yellow]") + raise typer.Exit(1) + + if output_json: + # Output as JSON for programmatic use + output = [] + for _, row in results.iterrows(): + ds = row.FoundryDataset + output.append({ + "name": ds.dataset_name, + "title": ds.dc.titles[0].title if ds.dc.titles else None, + "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None, + }) + console.print(json.dumps(output, indent=2)) + else: + # Pretty table output + table = Table(title=f"Search Results for '{query}'") + table.add_column("Name", style="cyan", no_wrap=True) + table.add_column("Title", style="green") + table.add_column("DOI", style="dim") + + for _, row in results.iterrows(): + ds = row.FoundryDataset + title = ds.dc.titles[0].title if ds.dc.titles else "N/A" + doi = str(ds.dc.identifier.identifier) if ds.dc.identifier else "N/A" + table.add_row(ds.dataset_name, title[:50] + "..." if len(title) > 50 else title, doi) + + console.print(table) + console.print(f"\n[dim]Found {len(results)} dataset(s). 
Use 'foundry get ' to download.[/dim]") + + +@app.command() +def list_datasets( + limit: int = typer.Option(20, "--limit", "-l", help="Maximum number of results"), + output_json: bool = typer.Option(False, "--json", "-j", help="Output as JSON"), +): + """List all available datasets.""" + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + console=console, + transient=True, + ) as progress: + progress.add_task("Fetching dataset list...", total=None) + f = get_foundry() + results = f.list(limit=limit) + + if output_json: + output = [] + for _, row in results.iterrows(): + ds = row.FoundryDataset + output.append({ + "name": ds.dataset_name, + "title": ds.dc.titles[0].title if ds.dc.titles else None, + "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None, + }) + console.print(json.dumps(output, indent=2)) + else: + table = Table(title="Available Datasets") + table.add_column("Name", style="cyan", no_wrap=True) + table.add_column("Title", style="green") + table.add_column("DOI", style="dim") + + for _, row in results.iterrows(): + ds = row.FoundryDataset + title = ds.dc.titles[0].title if ds.dc.titles else "N/A" + doi = str(ds.dc.identifier.identifier) if ds.dc.identifier else "N/A" + table.add_row(ds.dataset_name, title[:50] + "..." if len(title) > 50 else title, doi) + + console.print(table) + console.print(f"\n[dim]Showing {len(results)} dataset(s).[/dim]") + + +# Alias for list +app.command(name="list")(list_datasets) + + +@app.command() +def get( + doi: str = typer.Argument(..., help="DOI of the dataset to download"), + split: Optional[str] = typer.Option(None, "--split", "-s", help="Specific split to download"), + output_dir: Optional[str] = typer.Option(None, "--output", "-o", help="Output directory"), +): + """Download a dataset by DOI.""" + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + console=console, + ) as progress: + task = progress.add_task("Connecting to Foundry...", total=None) + f = get_foundry() + + progress.update(task, description=f"Fetching dataset {doi}...") + try: + dataset = f.get_dataset(doi) + except Exception as e: + console.print(f"[red]Error: Could not find dataset '{doi}'[/red]") + console.print(f"[dim]{e}[/dim]") + raise typer.Exit(1) + + progress.update(task, description="Downloading data...") + try: + data = dataset.get_as_dict(split=split) + except Exception as e: + console.print(f"[red]Error downloading data: {e}[/red]") + raise typer.Exit(1) + + # Show summary + console.print(Panel( + f"[green]Successfully downloaded![/green]\n\n" + f"[bold]Dataset:[/bold] {dataset.dataset_name}\n" + f"[bold]Title:[/bold] {dataset.dc.titles[0].title if dataset.dc.titles else 'N/A'}\n" + f"[bold]Splits:[/bold] {', '.join(data.keys()) if isinstance(data, dict) else 'N/A'}", + title="Download Complete", + )) + + +@app.command() +def schema( + doi: str = typer.Argument(..., help="DOI of the dataset"), + output_json: bool = typer.Option(False, "--json", "-j", help="Output as JSON"), +): + """Show the schema of a dataset - what fields it contains.""" + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + console=console, + transient=True, + ) as progress: + progress.add_task("Fetching schema...", total=None) + f = get_foundry() + try: + dataset = f.get_dataset(doi) + except Exception as e: + console.print(f"[red]Error: Could not find dataset '{doi}'[/red]") + raise typer.Exit(1) + + schema_info = { + "name": dataset.dataset_name, + 
"title": dataset.dc.titles[0].title if dataset.dc.titles else None, + "doi": str(dataset.dc.identifier.identifier) if dataset.dc.identifier else None, + "data_type": dataset.foundry_schema.data_type, + "splits": [ + {"name": s.label, "type": s.type} + for s in (dataset.foundry_schema.splits or []) + ], + "fields": [ + { + "name": k.key[0] if k.key else None, + "role": k.type, + "description": k.description, + "units": k.units, + } + for k in (dataset.foundry_schema.keys or []) + ], + } + + if output_json: + console.print(json.dumps(schema_info, indent=2)) + else: + # Pretty output + console.print(Panel( + f"[bold]{schema_info['title']}[/bold]\n" + f"[dim]DOI: {schema_info['doi']}[/dim]\n" + f"[dim]Type: {schema_info['data_type']}[/dim]", + title=f"Dataset: {schema_info['name']}", + )) + + if schema_info['splits']: + console.print("\n[bold]Splits:[/bold]") + for split in schema_info['splits']: + console.print(f" - {split['name']} ({split['type']})") + + if schema_info['fields']: + console.print("\n[bold]Fields:[/bold]") + table = Table(show_header=True) + table.add_column("Name", style="cyan") + table.add_column("Role", style="green") + table.add_column("Description") + table.add_column("Units", style="dim") + + for field in schema_info['fields']: + table.add_row( + field['name'] or "N/A", + field['role'] or "N/A", + (field['description'] or "")[:40], + field['units'] or "", + ) + console.print(table) + + +@app.command() +def status( + source_id: str = typer.Argument(..., help="Source ID to check status for"), + output_json: bool = typer.Option(False, "--json", "-j", help="Output as JSON"), +): + """Check the publication status of a dataset.""" + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + console=console, + transient=True, + ) as progress: + progress.add_task("Checking status...", total=None) + f = get_foundry() + try: + result = f.check_status(source_id) + except Exception as e: + console.print(f"[red]Error checking status: {e}[/red]") + raise typer.Exit(1) + + if output_json: + console.print(json.dumps(result, indent=2, default=str)) + else: + console.print(Panel(str(result), title=f"Status: {source_id}")) + + +@app.command() +def catalog( + output_json: bool = typer.Option(True, "--json", "-j", help="Output as JSON (default)"), +): + """Dump all available datasets as JSON catalog.""" + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + console=console, + transient=True, + ) as progress: + progress.add_task("Building catalog...", total=None) + f = get_foundry() + results = f.list(limit=1000) + + output = [] + for _, row in results.iterrows(): + ds = row.FoundryDataset + output.append({ + "name": ds.dataset_name, + "title": ds.dc.titles[0].title if ds.dc.titles else None, + "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None, + "description": ds.dc.descriptions[0].description if ds.dc.descriptions else None, + }) + + console.print(json.dumps(output, indent=2)) + + +@app.command(name="push-to-hf") +def push_to_hf( + doi: str = typer.Argument(..., help="DOI of the dataset to export"), + repo: str = typer.Option(..., "--repo", "-r", help="HuggingFace repo ID (e.g., 'org/dataset-name')"), + token: Optional[str] = typer.Option(None, "--token", "-t", help="HuggingFace API token"), + private: bool = typer.Option(False, "--private", "-p", help="Create a private repository"), +): + """Export a Foundry dataset to Hugging Face Hub. 
+ + This makes materials science datasets discoverable in the broader ML ecosystem. + + Example: + foundry push-to-hf 10.18126/abc123 --repo my-org/bandgap-data + """ + try: + from foundry.integrations.huggingface import push_to_hub + except ImportError: + console.print( + "[red]HuggingFace integration not installed.[/red]\n" + "Install with: [cyan]pip install foundry-ml[huggingface][/cyan]" + ) + raise typer.Exit(1) + + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + console=console, + ) as progress: + task = progress.add_task("Connecting to Foundry...", total=None) + f = get_foundry() + + progress.update(task, description=f"Loading dataset {doi}...") + try: + dataset = f.get_dataset(doi) + except Exception as e: + console.print(f"[red]Error: Could not find dataset '{doi}'[/red]") + console.print(f"[dim]{e}[/dim]") + raise typer.Exit(1) + + progress.update(task, description=f"Exporting to HuggingFace Hub...") + try: + url = push_to_hub(dataset, repo, token=token, private=private) + except Exception as e: + console.print(f"[red]Error exporting to HuggingFace: {e}[/red]") + raise typer.Exit(1) + + console.print(Panel( + f"[green]Successfully exported to HuggingFace Hub![/green]\n\n" + f"[bold]Dataset:[/bold] {dataset.dataset_name}\n" + f"[bold]Repository:[/bold] {repo}\n" + f"[bold]URL:[/bold] [link={url}]{url}[/link]", + title="Export Complete", + )) + + +# MCP subcommands +@mcp_app.command() +def start( + host: str = typer.Option("localhost", "--host", "-h", help="Host to bind to"), + port: int = typer.Option(8765, "--port", "-p", help="Port to bind to"), +): + """Start the MCP server for agent integration.""" + console.print(Panel( + "[bold green]Starting Foundry MCP Server[/bold green]\n\n" + f"Host: {host}\n" + f"Port: {port}\n\n" + "This server allows AI agents to discover and load materials science datasets.\n" + "Connect using the Model Context Protocol.", + title="MCP Server", + )) + + try: + from foundry.mcp.server import run_server + run_server(host=host, port=port) + except ImportError: + console.print("[yellow]MCP server not yet implemented. Coming soon![/yellow]") + raise typer.Exit(1) + + +@mcp_app.command() +def install(): + """Install Foundry as an MCP server in Claude Code.""" + console.print(Panel( + "[bold]To install Foundry in Claude Code:[/bold]\n\n" + "Add this to your Claude Code MCP configuration:\n\n" + "[cyan]{\n" + ' "mcpServers": {\n' + ' "foundry": {\n' + ' "command": "python",\n' + ' "args": ["-m", "foundry", "mcp", "start"]\n' + " }\n" + " }\n" + "}[/cyan]\n\n" + "Then restart Claude Code.", + title="MCP Installation", + )) + + +@app.command() +def version(): + """Show Foundry version.""" + try: + from importlib.metadata import version as get_version + v = get_version("foundry_ml") + except Exception: + v = "unknown" + console.print(f"foundry-ml version {v}") + + +def main(): + """Entry point for the CLI.""" + app() + + +if __name__ == "__main__": + main() diff --git a/foundry/errors.py b/foundry/errors.py new file mode 100644 index 00000000..7b0b8bc0 --- /dev/null +++ b/foundry/errors.py @@ -0,0 +1,171 @@ +"""Structured error classes for Foundry. + +These errors provide: +1. Error codes for programmatic handling +2. Human-readable messages +3. Recovery hints for agents and users +4. Structured details for debugging + +This enables both humans and AI agents to understand and recover from errors. 
+""" + +from dataclasses import dataclass, field +from typing import Any, Dict, Optional + + +@dataclass +class FoundryError(Exception): + """Base error class with structured information for agents and users. + + Attributes: + code: Machine-readable error code (e.g., "DATASET_NOT_FOUND") + message: Human-readable error description + details: Additional context for debugging + recovery_hint: Actionable suggestion for resolving the error + """ + + code: str + message: str + details: Optional[Dict[str, Any]] = field(default=None) + recovery_hint: Optional[str] = field(default=None) + + def __post_init__(self): + super().__init__(self.message) + + def __str__(self) -> str: + parts = [f"[{self.code}] {self.message}"] + if self.recovery_hint: + parts.append(f"Hint: {self.recovery_hint}") + return " ".join(parts) + + def to_dict(self) -> Dict[str, Any]: + """Serialize error to dictionary for JSON responses.""" + return { + "code": self.code, + "message": self.message, + "details": self.details, + "recovery_hint": self.recovery_hint, + } + + +class DatasetNotFoundError(FoundryError): + """Raised when a dataset cannot be found.""" + + def __init__(self, query: str, search_type: str = "query"): + super().__init__( + code="DATASET_NOT_FOUND", + message=f"No dataset found matching {search_type}: '{query}'", + details={"query": query, "search_type": search_type}, + recovery_hint=( + "Try a broader search term, check the DOI format, " + "or use foundry.list() to see available datasets." + ), + ) + + +class AuthenticationError(FoundryError): + """Raised when authentication fails.""" + + def __init__(self, service: str, reason: str = None): + msg = f"Authentication failed for {service}" + if reason: + msg += f": {reason}" + super().__init__( + code="AUTH_FAILED", + message=msg, + details={"service": service, "reason": reason}, + recovery_hint=( + "Run Foundry(no_browser=False) to re-authenticate, " + "or check your Globus credentials." + ), + ) + + +class DownloadError(FoundryError): + """Raised when a file download fails.""" + + def __init__(self, url: str, reason: str, destination: str = None): + super().__init__( + code="DOWNLOAD_FAILED", + message=f"Failed to download from {url}: {reason}", + details={"url": url, "reason": reason, "destination": destination}, + recovery_hint=( + "Check your network connection. " + "Try use_globus=False for HTTPS fallback, or use_globus=True for Globus transfer." + ), + ) + + +class DataLoadError(FoundryError): + """Raised when loading data from a file fails.""" + + def __init__(self, file_path: str, reason: str, data_type: str = None): + super().__init__( + code="DATA_LOAD_FAILED", + message=f"Failed to load data from {file_path}: {reason}", + details={"file_path": file_path, "reason": reason, "data_type": data_type}, + recovery_hint=( + "Check that the file exists and is not corrupted. " + "Try clearing the cache with dataset.clear_dataset_cache()." + ), + ) + + +class ValidationError(FoundryError): + """Raised when metadata validation fails.""" + + def __init__(self, field_name: str, error_msg: str, schema_type: str = "metadata"): + super().__init__( + code="VALIDATION_FAILED", + message=f"Validation failed for {schema_type} field '{field_name}': {error_msg}", + details={"field_name": field_name, "error_msg": error_msg, "schema_type": schema_type}, + recovery_hint=( + "Check the field value against the expected schema. " + "See documentation for required metadata format." 
+ ), + ) + + +class PublishError(FoundryError): + """Raised when publishing a dataset fails.""" + + def __init__(self, reason: str, source_id: str = None, status: str = None): + super().__init__( + code="PUBLISH_FAILED", + message=f"Failed to publish dataset: {reason}", + details={"source_id": source_id, "status": status, "reason": reason}, + recovery_hint=( + "Check your metadata is complete and valid. " + "Use foundry.check_status(source_id) to monitor publication progress." + ), + ) + + +class CacheError(FoundryError): + """Raised when cache operations fail.""" + + def __init__(self, operation: str, reason: str, cache_path: str = None): + super().__init__( + code="CACHE_ERROR", + message=f"Cache {operation} failed: {reason}", + details={"operation": operation, "reason": reason, "cache_path": cache_path}, + recovery_hint=( + "Try clearing the cache directory manually, " + "or check disk space and permissions." + ), + ) + + +class ConfigurationError(FoundryError): + """Raised when Foundry is misconfigured.""" + + def __init__(self, setting: str, reason: str, current_value: Any = None): + super().__init__( + code="CONFIG_ERROR", + message=f"Configuration error for '{setting}': {reason}", + details={"setting": setting, "reason": reason, "current_value": current_value}, + recovery_hint=( + "Check your Foundry initialization parameters. " + "See documentation for valid configuration options." + ), + ) diff --git a/foundry/foundry.py b/foundry/foundry.py index a3d0ca5b..5785dff1 100644 --- a/foundry/foundry.py +++ b/foundry/foundry.py @@ -5,7 +5,7 @@ from pydantic import Field, ConfigDict from mdf_connect_client import MDFConnectClient -from mdf_forge import Forge +from .mdf_client import MDFClient from globus_sdk import AuthClient from .auth import PubAuths @@ -91,7 +91,7 @@ class Foundry(FoundryBase): index: str = Field(default="") auths: Any = Field(default=None) - use_globus: bool = Field(default=True) + use_globus: bool = Field(default=False) verbose: bool = Field(default=False) interval: int = Field(default=10) parallel_https: int = Field(default=4) @@ -108,7 +108,7 @@ def __init__(self, no_local_server: bool = False, index: str = "mdf", authorizers: dict = None, - use_globus: bool = True, + use_globus: bool = False, verbose: bool = False, interval: int = 10, parallel_https: int = 4, @@ -157,7 +157,7 @@ def __init__(self, # add special SearchAuthorizer object self.auths['search_authorizer'] = search_auth['search'] - self.forge_client = Forge( + self.forge_client = MDFClient( index=index, services=None, search_client=self.auths["search"], @@ -194,7 +194,7 @@ def __init__(self, verbose, local_cache_dir) - def search(self, query: str = None, limit: int = None, as_list: bool = False) -> List[FoundryDataset]: + def search(self, query: str = None, limit: int = None, as_list: bool = False, as_json: bool = False) -> List[FoundryDataset]: """Search available Foundry datasets This method searches for available Foundry datasets based on the provided query string. @@ -206,9 +206,10 @@ def search(self, query: str = None, limit: int = None, as_list: bool = False) -> query (str): The query string to match. If a DOI is provided, it retrieves the metadata for that specific dataset. limit (int): The maximum number of results to return. as_list (bool): If True, the search results will be returned as a list instead of a DataFrame. + as_json (bool): If True, return results as a list of dictionaries (agent-friendly). 
Returns: - List[FoundryDataset] or DataFrame: A list of search results as FoundryDataset objects or a DataFrame if as_list is False. + List[FoundryDataset] or DataFrame or List[dict]: Search results in the requested format. Raises: Exception: If no results are found for the provided query. @@ -219,13 +220,15 @@ def search(self, query: str = None, limit: int = None, as_list: bool = False) -> >>> print(len(results)) 10 """ + from .errors import DatasetNotFoundError + if (query is not None) and (is_doi(query)): metadata_list = [self.get_metadata_by_doi(query)] else: metadata_list = self.get_metadata_by_query(query, limit) if len(metadata_list) == 0: - raise Exception(f"load: No results found for the query '{query}'") + raise DatasetNotFoundError(query or "all datasets", "query") foundry_datasets = [] for metadata in metadata_list: @@ -235,6 +238,9 @@ def search(self, query: str = None, limit: int = None, as_list: bool = False) -> logger.info(f"Search for '{query}' returned {len(foundry_datasets)} foundry datasets out of {len(metadata_list)} matches") + if as_json: + return [self._dataset_to_dict(ds) for ds in foundry_datasets] + if as_list: return foundry_datasets @@ -242,16 +248,37 @@ def search(self, query: str = None, limit: int = None, as_list: bool = False) -> return foundry_datasets - def list(self, limit: int = None): + def _dataset_to_dict(self, ds: FoundryDataset) -> Dict[str, Any]: + """Convert a FoundryDataset to an agent-friendly dictionary. + + Args: + ds: The FoundryDataset to convert. + + Returns: + Dictionary with dataset metadata suitable for JSON serialization. + """ + return { + "name": ds.dataset_name, + "title": ds.dc.titles[0].title if ds.dc.titles else None, + "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None, + "description": ds.dc.descriptions[0].description if ds.dc.descriptions else None, + "year": ds.dc.publicationYear if hasattr(ds.dc, 'publicationYear') else None, + "fields": [k.key[0] for k in (ds.foundry_schema.keys or []) if k.key], + "splits": [s.label for s in (ds.foundry_schema.splits or [])], + "data_type": ds.foundry_schema.data_type, + } + + def list(self, limit: int = None, as_json: bool = False): """List available Foundry datasets Args: limit (int): maximum number of results to return + as_json (bool): If True, return results as list of dicts (agent-friendly) Returns: - List[FoundryDataset]: List of FoundryDataset objects + List[FoundryDataset] or DataFrame or List[dict]: Available datasets """ - return self.search(limit=limit) + return self.search(limit=limit, as_json=as_json) def dataset_from_metadata(self, metadata: dict) -> FoundryDataset: """ Converts the result of a forge query to a FoundryDatset object diff --git a/foundry/foundry_cache.py b/foundry/foundry_cache.py index 7fbfab8a..f92317e9 100644 --- a/foundry/foundry_cache.py +++ b/foundry/foundry_cache.py @@ -4,7 +4,7 @@ from concurrent.futures import ThreadPoolExecutor, as_completed import h5py -from mdf_forge import Forge +from .mdf_client import MDFClient import numpy as np import pandas as pd import shutil @@ -23,7 +23,7 @@ class FoundryCache(): """The FoundryCache manages the local storage of FoundryDataset objects""" def __init__(self, - forge_client: Forge, + forge_client: MDFClient, transfer_client: Any, use_globus, interval, @@ -34,7 +34,7 @@ def __init__(self, Initializes a FoundryCache object. Args: - forge_client (Forge): The Forge client object. + forge_client (MDFClient): The MDF client object. transfer_client (Any): The transfer client object. 
use_globus (bool): Flag indicating whether to use Globus for downloading. interval (int): How often to wait before checking Globus transfer status. @@ -354,7 +354,7 @@ def _load_data(self, # Sort version folders and get the latest one latest_version = sorted(version_folders, key=lambda x: [int(n) for n in x.split('.')], reverse=True)[0] path = os.path.join(path, latest_version) - print(f"Loading from version folder: {latest_version}") + logger.info(f"Loading from version folder: {latest_version}") path_to_file = os.path.join(path, file) diff --git a/foundry/foundry_dataset.py b/foundry/foundry_dataset.py index 75195894..e2ec54fc 100644 --- a/foundry/foundry_dataset.py +++ b/foundry/foundry_dataset.py @@ -45,24 +45,70 @@ def __init__(self, try: self.dc = FoundryDatacite(datacite_entry) self.foundry_schema = FoundrySchema(foundry_schema) + except ValidationError as e: + logger.error(f"Invalid metadata for dataset '{dataset_name}': {e}") + raise except Exception as e: - raise Exception('there was a problem creating the dataset: ', e) + raise ValueError( + f"Failed to create dataset '{dataset_name}': {e}. " + "Check that datacite_entry and foundry_schema are valid." + ) from e self._foundry_cache = foundry_cache - def get_as_dict(self, split: str = None, as_hdf5: bool = False): + def get_as_dict(self, split: str = None, as_hdf5: bool = False, include_schema: bool = False): """Returns the data from the dataset as a dictionary Arguments: split (string): Split to create dataset on. **Default:** ``None`` + as_hdf5 (bool): If True, return raw HDF5 data. + **Default:** ``False`` + include_schema (bool): If True, return data with schema information. + Useful for AI agents that need to understand the data structure. + **Default:** ``False`` - Returns: (dict) Dictionary of all the data from the specified split + Returns: + dict: Dictionary of all the data from the specified split. + If include_schema is True, returns {"data": ..., "schema": ...} """ - return self._foundry_cache.load_as_dict(split, + data = self._foundry_cache.load_as_dict(split, self.dataset_name, self.foundry_schema, as_hdf5) + if include_schema: + return { + "data": data, + "schema": self.get_schema(), + } + return data + + def get_schema(self) -> dict: + """Get the schema of this dataset - what fields it contains. + + Returns: + dict: Schema with fields, splits, data type, and metadata. 
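+
+        Example (illustrative, assuming ``dataset`` is a FoundryDataset
+        returned by a search):
+
+            schema = dataset.get_schema()
+            print([field["name"] for field in schema["fields"]])
+            print([split["name"] for split in schema["splits"]])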
+ """ + return { + "name": self.dataset_name, + "title": self.dc.titles[0].title if self.dc.titles else None, + "doi": str(self.dc.identifier.identifier) if self.dc.identifier else None, + "data_type": self.foundry_schema.data_type, + "splits": [ + {"name": s.label, "type": s.type} + for s in (self.foundry_schema.splits or []) + ], + "fields": [ + { + "name": k.key[0] if k.key else None, + "role": k.type, # "input" or "target" + "description": k.description, + "units": k.units, + } + for k in (self.foundry_schema.keys or []) + ], + } + load = get_as_dict def get_as_torch(self, split: str = None): @@ -212,7 +258,6 @@ def clear_dataset_cache(self): def clean_dc_dict(self): """Clean the Datacite dictionary of None values""" - print(json.loads(self.dc.json())) return self.delete_none(json.loads(self.dc.json())) def delete_none(self, _dict): diff --git a/foundry/https_download.py b/foundry/https_download.py index 77d40a2e..7364fdea 100644 --- a/foundry/https_download.py +++ b/foundry/https_download.py @@ -2,6 +2,7 @@ """ +import logging import os from collections import deque @@ -9,6 +10,9 @@ from globus_sdk import TransferClient +logger = logging.getLogger(__name__) + + def recursive_ls(tc: TransferClient, ep: str, path: str, max_depth: int = 3): """Find all files in a Globus directory recursively @@ -52,6 +56,16 @@ def _get_files(tc, ep, queue, max_depth): yield item +class DownloadError(Exception): + """Raised when a file download fails.""" + + def __init__(self, url: str, reason: str, destination: str = None): + self.url = url + self.reason = reason + self.destination = destination + super().__init__(f"Failed to download {url}: {reason}") + + def download_file(item, base_directory, https_config, timeout=1800): """Download a file to disk @@ -60,6 +74,12 @@ def download_file(item, base_directory, https_config, timeout=1800): base_directory: Base directory for storing downloaded files https_config: Configuration defining the URL of the server and the name of the dataset timeout: Timeout for the download request in seconds (default: 1800) + + Returns: + str: Path to the downloaded file + + Raises: + DownloadError: If the download fails for any reason """ base_url = https_config['base_url'].rstrip('/') path = item.get('path', '').strip('/') @@ -89,20 +109,18 @@ def download_file(item, base_directory, https_config, timeout=1800): response.raise_for_status() downloaded_size = 0 - print(f"\rStarting Download of: {url}") + logger.info(f"Starting download: {url}") with open(destination, "wb") as f: for chunk in response.iter_content(chunk_size=8192): if chunk: f.write(chunk) downloaded_size += len(chunk) - # Calculate and print the download progress - print(f"\rDownloading... {downloaded_size/(1 << 20):,.2f} MB", end="") + + logger.info(f"Downloaded {downloaded_size / (1 << 20):,.2f} MB to {destination}") return destination except requests.exceptions.RequestException as e: - print(f"Error downloading file: {e}") + raise DownloadError(url, str(e), destination) from e except IOError as e: - print(f"Error writing file to disk: {e}") - - return {destination + " status": True} + raise DownloadError(url, f"Failed to write to disk: {e}", destination) from e diff --git a/foundry/integrations/__init__.py b/foundry/integrations/__init__.py new file mode 100644 index 00000000..87640f95 --- /dev/null +++ b/foundry/integrations/__init__.py @@ -0,0 +1,19 @@ +"""Foundry integrations with external platforms. 
+ +This module provides bridges to other data ecosystems: +- Hugging Face Hub: Export datasets to HF for broader visibility + +Usage: + from foundry.integrations.huggingface import push_to_hub + + # Export a Foundry dataset to Hugging Face + dataset = foundry.get_dataset("10.18126/abc123") + push_to_hub(dataset, "my-org/dataset-name") +""" + +try: + from .huggingface import push_to_hub + __all__ = ["push_to_hub"] +except ImportError: + # huggingface extras not installed + __all__ = [] diff --git a/foundry/integrations/huggingface.py b/foundry/integrations/huggingface.py new file mode 100644 index 00000000..07e409b1 --- /dev/null +++ b/foundry/integrations/huggingface.py @@ -0,0 +1,334 @@ +"""Hugging Face Hub integration for Foundry datasets. + +This module provides functionality to export Foundry datasets to Hugging Face Hub, +making materials science datasets discoverable in the broader ML ecosystem. + +Requirements: + pip install foundry-ml[huggingface] + +Usage: + from foundry import Foundry + from foundry.integrations.huggingface import push_to_hub + + f = Foundry() + dataset = f.get_dataset("10.18126/abc123") + push_to_hub(dataset, "materials-science/bandgap-data") +""" + +import logging +from typing import Optional, Dict, Any + +logger = logging.getLogger(__name__) + +try: + from datasets import Dataset, DatasetDict + from huggingface_hub import HfApi + HF_AVAILABLE = True +except ImportError: + HF_AVAILABLE = False + + +def _check_hf_available(): + """Check if HuggingFace dependencies are installed.""" + if not HF_AVAILABLE: + raise ImportError( + "HuggingFace integration requires additional dependencies. " + "Install with: pip install foundry-ml[huggingface]" + ) + + +def push_to_hub( + dataset, # FoundryDataset + repo_id: str, + token: Optional[str] = None, + private: bool = False, + split: Optional[str] = None, +) -> str: + """Export a Foundry dataset to Hugging Face Hub. + + This creates a new dataset repository on HF Hub with the Foundry data, + automatically generating a Dataset Card from the DataCite metadata. + + Args: + dataset: A FoundryDataset object (from foundry.get_dataset()) + repo_id: HF Hub repository ID (e.g., "materials-science/bandgap-data") + token: HuggingFace API token. If None, uses cached credentials. + private: If True, create a private repository. + split: Specific split to export. If None, exports all splits. + + Returns: + str: URL of the created dataset on HF Hub. 
+ + Example: + >>> from foundry import Foundry + >>> from foundry.integrations.huggingface import push_to_hub + >>> f = Foundry() + >>> ds = f.get_dataset("10.18126/abc123") + >>> url = push_to_hub(ds, "my-org/my-dataset") + >>> print(f"Dataset published at: {url}") + """ + _check_hf_available() + + logger.info(f"Exporting Foundry dataset '{dataset.dataset_name}' to HF Hub: {repo_id}") + + # Load data from Foundry + data = dataset.get_as_dict(split=split) + + # Convert to HuggingFace Dataset format + if isinstance(data, dict) and all(isinstance(v, tuple) for v in data.values()): + # Data has splits: {split_name: (inputs, targets)} + hf_datasets = {} + for split_name, (inputs, targets) in data.items(): + combined = _combine_inputs_targets(inputs, targets) + hf_datasets[split_name] = Dataset.from_dict(combined) + hf_dataset = DatasetDict(hf_datasets) + else: + # Single dataset without splits + hf_dataset = Dataset.from_dict(_flatten_data(data)) + + # Generate Dataset Card (README.md) from DataCite metadata + readme_content = _generate_dataset_card(dataset) + + # Push to Hub + hf_dataset.push_to_hub( + repo_id, + token=token, + private=private, + ) + + # Update the README with our generated card + api = HfApi(token=token) + api.upload_file( + path_or_fileobj=readme_content.encode(), + path_in_repo="README.md", + repo_id=repo_id, + repo_type="dataset", + ) + + url = f"https://huggingface.co/datasets/{repo_id}" + logger.info(f"Successfully published to: {url}") + return url + + +def _combine_inputs_targets(inputs: Dict, targets: Dict) -> Dict[str, Any]: + """Combine input and target dictionaries into a single flat dict.""" + import pandas as pd + import numpy as np + + combined = {} + + for key, value in inputs.items(): + col_name = f"input_{key}" if key in targets else key + combined[col_name] = _to_list(value) + + for key, value in targets.items(): + col_name = f"target_{key}" if key in inputs else key + combined[col_name] = _to_list(value) + + return combined + + +def _flatten_data(data: Any) -> Dict[str, Any]: + """Flatten nested data structure to a dict suitable for HF Dataset.""" + import pandas as pd + import numpy as np + + if isinstance(data, pd.DataFrame): + return {col: data[col].tolist() for col in data.columns} + elif isinstance(data, dict): + result = {} + for key, value in data.items(): + result[key] = _to_list(value) + return result + else: + return {"data": _to_list(data)} + + +def _to_list(value: Any) -> list: + """Convert various types to list for HF compatibility.""" + import pandas as pd + import numpy as np + + if isinstance(value, np.ndarray): + return value.tolist() + elif isinstance(value, pd.Series): + return value.tolist() + elif isinstance(value, pd.DataFrame): + return value.to_dict(orient='records') + elif isinstance(value, (list, tuple)): + return list(value) + else: + return [value] + + +def _normalize_license(license_str: str) -> str: + """Map license string to a valid HuggingFace license identifier.""" + if not license_str: + return "other" + + license_lower = license_str.lower() + + # Direct matches to HF identifiers + hf_licenses = { + "cc0", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0", "cc-by-nc-4.0", + "cc-by-nc-sa-4.0", "cc-by-nc-nd-4.0", "mit", "apache-2.0", + "bsd", "bsd-2-clause", "bsd-3-clause", "gpl-3.0", "lgpl-3.0", + "unknown", "other" + } + + # Check for direct match + if license_lower in hf_licenses: + return license_lower + + # Common mappings + mappings = { + "creative commons": "cc-by-4.0", + "cc by": "cc-by-4.0", + "cc-by": "cc-by-4.0", + "cc by 
4": "cc-by-4.0", + "cc by-sa": "cc-by-sa-4.0", + "cc by-nc": "cc-by-nc-4.0", + "cc0": "cc0-1.0", + "public domain": "cc0-1.0", + "apache": "apache-2.0", + "mit license": "mit", + "bsd license": "bsd-3-clause", + "gpl": "gpl-3.0", + "lgpl": "lgpl-3.0", + } + + for pattern, hf_id in mappings.items(): + if pattern in license_lower: + return hf_id + + # If we can't map it, use "other" + return "other" + + +def _generate_dataset_card(dataset) -> str: + """Generate a HuggingFace Dataset Card from Foundry DataCite metadata.""" + dc = dataset.dc + schema = dataset.foundry_schema + + # Extract metadata + title = dc.titles[0].title if dc.titles else dataset.dataset_name + description = dc.descriptions[0].description if dc.descriptions else "" + + # DOI is a RootModel with .root containing the actual string + doi = "" + if dc.identifier and dc.identifier.identifier: + doi_obj = dc.identifier.identifier + doi = doi_obj.root if hasattr(doi_obj, 'root') else str(doi_obj) + + # Handle creators (can be dicts or pydantic objects) + authors = [] + for c in (dc.creators or []): + if isinstance(c, dict): + authors.append(c.get('creatorName', 'Unknown')) + elif hasattr(c, 'creatorName'): + authors.append(c.creatorName or 'Unknown') + else: + authors.append(str(c)) + + # Year is a RootModel with .root containing the actual int + year = "" + if hasattr(dc, 'publicationYear') and dc.publicationYear: + year_obj = dc.publicationYear + year = year_obj.root if hasattr(year_obj, 'root') else str(year_obj) + + # Get license if available (rightsList contains RightsListItem objects) + license_raw = None + if hasattr(dc, 'rightsList') and dc.rightsList: + rights_item = dc.rightsList[0] + if isinstance(rights_item, dict): + license_raw = rights_item.get('rights') + elif hasattr(rights_item, 'rights'): + license_raw = rights_item.rights + + license_id = _normalize_license(license_raw) + # For display, show original if different from ID + license_display = license_raw if license_raw and license_raw != license_id else license_id + + # Build field documentation + fields_doc = "" + if schema.keys: + fields_doc = "\n### Fields\n\n| Field | Role | Description | Units |\n|-------|------|-------------|-------|\n" + for key in schema.keys: + name = key.key[0] if key.key else "N/A" + role = key.type or "N/A" + desc = (key.description or "")[:50] + units = key.units or "" + fields_doc += f"| {name} | {role} | {desc} | {units} |\n" + + # Build splits documentation + splits_doc = "" + if schema.splits: + splits_doc = "\n### Splits\n\n" + for split in schema.splits: + splits_doc += f"- **{split.label}**: {split.type or 'data'}\n" + + # Generate citation + citation = dataset.get_citation() + + return f"""--- +license: {license_id} +task_categories: + - tabular-regression + - tabular-classification +tags: + - materials-science + - chemistry + - foundry-ml + - scientific-data +size_categories: + - 1K List[Dict[str, Any]]: + """Search for materials science datasets. 
+ + Args: + query: Search terms (e.g., "band gap", "crystal structure", "zeolite") + limit: Maximum number of results to return (default: 10) + + Returns: + List of datasets with name, title, DOI, and description + """ + f = _get_foundry() + results = f.search(query, limit=limit) + + output = [] + for _, row in results.iterrows(): + ds = row.FoundryDataset + output.append({ + "name": ds.dataset_name, + "title": ds.dc.titles[0].title if ds.dc.titles else None, + "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None, + "description": ds.dc.descriptions[0].description if ds.dc.descriptions else None, + "year": ds.dc.publicationYear if hasattr(ds.dc, 'publicationYear') else None, + }) + return output + + +def get_dataset_schema(doi: str) -> Dict[str, Any]: + """Get the schema of a dataset - what fields it contains. + + Use this to understand the structure of a dataset before loading it. + + Args: + doi: The DOI of the dataset (e.g., "10.18126/abc123") + + Returns: + Schema with name, splits, fields (with descriptions and units), and data type + """ + f = _get_foundry() + ds = f.get_dataset(doi) + + return { + "name": ds.dataset_name, + "title": ds.dc.titles[0].title if ds.dc.titles else None, + "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None, + "data_type": ds.foundry_schema.data_type, + "splits": [ + {"name": s.label, "type": s.type} + for s in (ds.foundry_schema.splits or []) + ], + "fields": [ + { + "name": k.key[0] if k.key else None, + "role": k.type, # "input" or "target" + "description": k.description, + "units": k.units, + } + for k in (ds.foundry_schema.keys or []) + ], + } + + +def load_dataset(doi: str, split: str = "train") -> Dict[str, Any]: + """Load a dataset and return its data with schema. + + This downloads the data if not cached, then returns it along with schema information. + + Args: + doi: The DOI of the dataset + split: Which split to load (default: "train") + + Returns: + Dictionary with "schema" (field information) and "data" (the actual data) + """ + f = _get_foundry() + ds = f.get_dataset(doi) + data = ds.get_as_dict(split=split) + schema = get_dataset_schema(doi) + + return { + "schema": schema, + "data": _serialize_data(data), + "citation": ds.get_citation(), + } + + +def list_all_datasets(limit: int = 100) -> List[Dict[str, Any]]: + """List all available Foundry datasets. + + Returns a catalog of all datasets that can be loaded. 
+ + Args: + limit: Maximum number of datasets to return (default: 100) + + Returns: + List of all available datasets with basic info + """ + f = _get_foundry() + results = f.list(limit=limit) + + output = [] + for _, row in results.iterrows(): + ds = row.FoundryDataset + output.append({ + "name": ds.dataset_name, + "title": ds.dc.titles[0].title if ds.dc.titles else None, + "doi": str(ds.dc.identifier.identifier) if ds.dc.identifier else None, + "description": ds.dc.descriptions[0].description if ds.dc.descriptions else None, + }) + return output + + +def _serialize_data(data: Any) -> Any: + """Convert numpy arrays and other non-JSON-serializable types to lists.""" + import numpy as np + import pandas as pd + + if isinstance(data, np.ndarray): + return data.tolist() + elif isinstance(data, pd.DataFrame): + return data.to_dict(orient='records') + elif isinstance(data, pd.Series): + return data.tolist() + elif isinstance(data, dict): + return {k: _serialize_data(v) for k, v in data.items()} + elif isinstance(data, (list, tuple)): + return [_serialize_data(item) for item in data] + elif isinstance(data, (np.integer, np.floating)): + return data.item() + else: + return data + + +# MCP Server Implementation using stdio transport +TOOLS = [ + { + "name": "search_datasets", + "description": "Search for materials science datasets. Returns datasets matching the query with name, title, DOI, and description.", + "inputSchema": { + "type": "object", + "properties": { + "query": { + "type": "string", + "description": "Search terms (e.g., 'band gap', 'crystal structure', 'zeolite')" + }, + "limit": { + "type": "integer", + "description": "Maximum number of results (default: 10)", + "default": 10 + } + }, + "required": ["query"] + } + }, + { + "name": "get_dataset_schema", + "description": "Get the schema of a dataset - what fields it contains, their descriptions, and units. Use this to understand the data structure before loading.", + "inputSchema": { + "type": "object", + "properties": { + "doi": { + "type": "string", + "description": "The DOI of the dataset (e.g., '10.18126/abc123')" + } + }, + "required": ["doi"] + } + }, + { + "name": "load_dataset", + "description": "Load a dataset and return its data with schema information. Downloads data if not cached.", + "inputSchema": { + "type": "object", + "properties": { + "doi": { + "type": "string", + "description": "The DOI of the dataset" + }, + "split": { + "type": "string", + "description": "Which split to load (default: 'train')", + "default": "train" + } + }, + "required": ["doi"] + } + }, + { + "name": "list_all_datasets", + "description": "List all available Foundry datasets. 
Returns a catalog with basic info for each dataset.", + "inputSchema": { + "type": "object", + "properties": { + "limit": { + "type": "integer", + "description": "Maximum number of datasets to return (default: 100)", + "default": 100 + } + }, + "required": [] + } + } +] + + +def handle_tool_call(name: str, arguments: Dict[str, Any]) -> Any: + """Handle a tool call and return the result.""" + if name == "search_datasets": + return search_datasets( + query=arguments["query"], + limit=arguments.get("limit", 10) + ) + elif name == "get_dataset_schema": + return get_dataset_schema(doi=arguments["doi"]) + elif name == "load_dataset": + return load_dataset( + doi=arguments["doi"], + split=arguments.get("split", "train") + ) + elif name == "list_all_datasets": + return list_all_datasets(limit=arguments.get("limit", 100)) + else: + raise ValueError(f"Unknown tool: {name}") + + +def create_server(): + """Create and return the MCP server configuration.""" + return { + "name": "foundry-ml", + "version": "1.0.0", + "description": "Materials science dataset discovery and loading for ML", + "tools": TOOLS, + } + + +def run_server(host: str = "localhost", port: int = 8765): + """Run the MCP server using stdio transport. + + This implements a simple JSON-RPC style protocol for MCP. + """ + import sys + + logger.info(f"Starting Foundry MCP server on {host}:{port}") + + # For now, use a simple stdio-based protocol + # In production, this would use the full MCP SDK + print(json.dumps({ + "jsonrpc": "2.0", + "method": "server/info", + "params": create_server() + }), flush=True) + + # Read and process requests + for line in sys.stdin: + try: + request = json.loads(line.strip()) + method = request.get("method", "") + + if method == "tools/list": + response = { + "jsonrpc": "2.0", + "id": request.get("id"), + "result": {"tools": TOOLS} + } + elif method == "tools/call": + params = request.get("params", {}) + tool_name = params.get("name") + arguments = params.get("arguments", {}) + try: + result = handle_tool_call(tool_name, arguments) + response = { + "jsonrpc": "2.0", + "id": request.get("id"), + "result": {"content": [{"type": "text", "text": json.dumps(result, default=str)}]} + } + except Exception as e: + response = { + "jsonrpc": "2.0", + "id": request.get("id"), + "error": {"code": -32000, "message": str(e)} + } + else: + response = { + "jsonrpc": "2.0", + "id": request.get("id"), + "error": {"code": -32601, "message": f"Method not found: {method}"} + } + + print(json.dumps(response), flush=True) + + except json.JSONDecodeError: + continue + except Exception as e: + logger.error(f"Error processing request: {e}") + print(json.dumps({ + "jsonrpc": "2.0", + "id": None, + "error": {"code": -32603, "message": str(e)} + }), flush=True) diff --git a/foundry/models.py b/foundry/models.py index 2b490e00..a11d04cc 100644 --- a/foundry/models.py +++ b/foundry/models.py @@ -76,15 +76,18 @@ def __init__(self, project_dict: Dict[str, Any]): try: super().__init__(**project_dict) except ValidationError as e: - print("FoundrySchema validation failed!") + logger.error("FoundrySchema validation failed!") for error in e.errors(): field_name = ".".join([str(item) for item in error['loc']]) error_description = error['msg'] - error_message = f"""There is an issue validating the entry for the field '{field_name}': - The error message returned is: '{error_description}'. 
- The description for this field is: '{FoundryModel.model_json_schema()['properties'][field_name]['description']}'""" - print(error_message) - raise e + schema_props = FoundryModel.model_json_schema().get('properties', {}) + field_desc = schema_props.get(field_name, {}).get('description', 'No description available') + error_message = ( + f"Validation error for field '{field_name}': {error_description}. " + f"Field description: {field_desc}" + ) + logger.error(error_message) + raise class FoundryDatacite(DataciteModel): @@ -100,15 +103,18 @@ def __init__(self, datacite_dict: Dict[str, Any], **kwargs): dc_dict['identifier']['identifier'] = dc_dict['identifier']['identifier']['__root__'] super().__init__(**dc_dict, **kwargs) except ValidationError as e: - print("Datacite validation failed!") + logger.error("Datacite validation failed!") + schema_props = DataciteModel.model_json_schema().get('properties', {}) for error in e.errors(): field_name = ".".join(str(loc) for loc in error["loc"]) error_description = error['msg'] - error_message = f"""There is an issue validating the entry for the field '{field_name}': - The error message returned is: '{error_description}'. - The description is: '{self.model_json_schema()['properties'].get(field_name, {}).get('description', 'No description available')}'""" - print(error_message) - raise e + field_desc = schema_props.get(field_name, {}).get('description', 'No description available') + error_message = ( + f"Validation error for field '{field_name}': {error_description}. " + f"Field description: {field_desc}" + ) + logger.error(error_message) + raise class FoundryBase(BaseModel): diff --git a/scripts/README.md b/scripts/README.md new file mode 100644 index 00000000..6cd7b218 --- /dev/null +++ b/scripts/README.md @@ -0,0 +1,31 @@ +# Foundry Scripts + +Utility scripts for managing Foundry datasets. + +## batch_push_to_hf.py + +Push all Foundry datasets to HuggingFace Hub for broader discoverability. + +### Quick Start + +```bash +# 1. Install dependencies +pip install foundry-ml[huggingface] + +# 2. Set your HuggingFace token +export HF_TOKEN="hf_your_token_here" + +# 3. Dry run (see what would be uploaded) +python scripts/batch_push_to_hf.py --dry-run + +# 4. Upload all datasets +python scripts/batch_push_to_hf.py --org foundry-ml +``` + +### Setup + +See the full setup instructions at the top of `batch_push_to_hf.py` or run: + +```bash +python scripts/batch_push_to_hf.py --help +``` diff --git a/scripts/batch_push_to_hf.py b/scripts/batch_push_to_hf.py new file mode 100755 index 00000000..bd12c3af --- /dev/null +++ b/scripts/batch_push_to_hf.py @@ -0,0 +1,459 @@ +#!/usr/bin/env python3 +""" +Batch Push Foundry Datasets to HuggingFace Hub +=============================================== + +This script exports all Foundry datasets to HuggingFace Hub, making them +discoverable in the broader ML ecosystem. + +SETUP INSTRUCTIONS +------------------ + +1. CREATE A HUGGINGFACE ACCOUNT + - Go to https://huggingface.co/join + - Create an account + +2. CREATE AN ORGANIZATION (Recommended) + - Go to https://huggingface.co/organizations/new + - Create organization named "foundry-ml" (or your preferred name) + - This keeps all datasets under one namespace: foundry-ml/dataset-name + +3. GET YOUR API TOKEN + - Go to https://huggingface.co/settings/tokens + - Click "New token" + - Name: "foundry-batch-upload" + - Role: "Write" (required to create repos) + - Copy the token (starts with "hf_...") + +4. 
SET YOUR TOKEN (choose one method): + + Option A - Environment variable (recommended): + ```bash + export HF_TOKEN="hf_your_token_here" + python scripts/batch_push_to_hf.py + ``` + + Option B - Login via CLI (persists across sessions): + ```bash + pip install huggingface_hub + huggingface-cli login + # Paste your token when prompted + python scripts/batch_push_to_hf.py + ``` + + Option C - Pass directly (not recommended for shared scripts): + ```bash + python scripts/batch_push_to_hf.py --token "hf_your_token_here" + ``` + +5. INSTALL DEPENDENCIES + ```bash + pip install foundry-ml[huggingface] + # or + pip install datasets huggingface_hub + ``` + +6. RUN THE SCRIPT + ```bash + python scripts/batch_push_to_hf.py --org foundry-ml + ``` + +USAGE +----- + python scripts/batch_push_to_hf.py [OPTIONS] + +OPTIONS +------- + --org ORG HuggingFace organization name (default: foundry-ml) + --token TOKEN HuggingFace API token (or set HF_TOKEN env var) + --private Create private repositories + --dry-run List datasets without uploading + --limit N Only process first N datasets (for testing) + --skip N Skip first N datasets (for resuming) + --dataset NAME Process only this specific dataset + --output FILE Save results to JSON file + +EXAMPLES +-------- + # Dry run - see what would be uploaded + python scripts/batch_push_to_hf.py --dry-run + + # Upload all datasets to foundry-ml organization + python scripts/batch_push_to_hf.py --org foundry-ml + + # Test with first 3 datasets + python scripts/batch_push_to_hf.py --org foundry-ml --limit 3 + + # Resume from dataset 10 (if previous run failed) + python scripts/batch_push_to_hf.py --org foundry-ml --skip 10 + + # Upload a single specific dataset + python scripts/batch_push_to_hf.py --org foundry-ml --dataset "foundry_bandgap_oqmd" +""" + +import argparse +import json +import logging +import os +import sys +import time +from dataclasses import dataclass, asdict +from datetime import datetime +from pathlib import Path +from typing import Optional, List + +# Configure logging +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(levelname)s - %(message)s', + datefmt='%H:%M:%S' +) +logger = logging.getLogger(__name__) + + +@dataclass +class UploadResult: + """Result of a single dataset upload.""" + dataset_name: str + doi: str + repo_id: str + status: str # 'success', 'failed', 'skipped' + url: Optional[str] = None + error: Optional[str] = None + duration_seconds: Optional[float] = None + + +def get_hf_token(args_token: Optional[str] = None) -> str: + """Get HuggingFace token from args, env, or cached login.""" + # 1. Check command line argument + if args_token: + return args_token + + # 2. Check environment variable + env_token = os.environ.get('HF_TOKEN') or os.environ.get('HUGGINGFACE_TOKEN') + if env_token: + return env_token + + # 3. Check if logged in via huggingface-cli + try: + from huggingface_hub import HfFolder + cached_token = HfFolder.get_token() + if cached_token: + return cached_token + except Exception: + pass + + # 4. No token found + raise ValueError( + "No HuggingFace token found. Please either:\n" + " 1. Set HF_TOKEN environment variable\n" + " 2. Run 'huggingface-cli login'\n" + " 3. 
Pass --token argument\n" + "\nGet your token at: https://huggingface.co/settings/tokens" + ) + + +def sanitize_repo_name(name: str) -> str: + """Convert dataset name to valid HF repo name.""" + # HF repo names: lowercase, alphanumeric, hyphens, underscores + # Max 96 characters + import re + name = name.lower() + name = re.sub(r'[^a-z0-9_-]', '-', name) + name = re.sub(r'-+', '-', name) # Collapse multiple hyphens + name = name.strip('-_') + return name[:96] + + +def check_repo_exists(api, repo_id: str) -> bool: + """Check if a repository already exists on HF Hub.""" + try: + api.repo_info(repo_id=repo_id, repo_type="dataset") + return True + except Exception: + return False + + +def push_dataset( + dataset, + org: str, + token: str, + private: bool = False, +) -> UploadResult: + """Push a single dataset to HuggingFace Hub.""" + from foundry.integrations.huggingface import push_to_hub + from huggingface_hub import HfApi + + dataset_name = dataset.dataset_name + doi = str(dataset.dc.identifier.identifier) if dataset.dc.identifier else "unknown" + repo_name = sanitize_repo_name(dataset_name) + repo_id = f"{org}/{repo_name}" + + start_time = time.time() + + try: + # Check if already exists + api = HfApi(token=token) + if check_repo_exists(api, repo_id): + logger.info(f" Repository {repo_id} already exists, skipping") + return UploadResult( + dataset_name=dataset_name, + doi=doi, + repo_id=repo_id, + status='skipped', + url=f"https://huggingface.co/datasets/{repo_id}", + error="Repository already exists" + ) + + # Push to hub + url = push_to_hub( + dataset=dataset, + repo_id=repo_id, + token=token, + private=private, + ) + + duration = time.time() - start_time + logger.info(f" Successfully pushed to {url} ({duration:.1f}s)") + + return UploadResult( + dataset_name=dataset_name, + doi=doi, + repo_id=repo_id, + status='success', + url=url, + duration_seconds=duration + ) + + except Exception as e: + duration = time.time() - start_time + error_msg = str(e) + logger.error(f" Failed: {error_msg}") + + return UploadResult( + dataset_name=dataset_name, + doi=doi, + repo_id=repo_id, + status='failed', + error=error_msg, + duration_seconds=duration + ) + + +def main(): + parser = argparse.ArgumentParser( + description="Batch push Foundry datasets to HuggingFace Hub", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=__doc__ + ) + parser.add_argument( + '--org', + default='foundry-ml', + help='HuggingFace organization name (default: foundry-ml)' + ) + parser.add_argument( + '--token', + help='HuggingFace API token (or set HF_TOKEN env var)' + ) + parser.add_argument( + '--private', + action='store_true', + help='Create private repositories' + ) + parser.add_argument( + '--dry-run', + action='store_true', + help='List datasets without uploading' + ) + parser.add_argument( + '--limit', + type=int, + help='Only process first N datasets' + ) + parser.add_argument( + '--skip', + type=int, + default=0, + help='Skip first N datasets' + ) + parser.add_argument( + '--dataset', + help='Process only this specific dataset name' + ) + parser.add_argument( + '--output', + help='Save results to JSON file' + ) + + args = parser.parse_args() + + # Import Foundry + try: + from foundry import Foundry + except ImportError: + logger.error("Foundry not installed. 
Run: pip install foundry-ml") + sys.exit(1) + + # Check HF dependencies + try: + from datasets import Dataset + from huggingface_hub import HfApi + except ImportError: + logger.error( + "HuggingFace dependencies not installed.\n" + "Run: pip install foundry-ml[huggingface]" + ) + sys.exit(1) + + # Get token (skip for dry run) + token = None + if not args.dry_run: + try: + token = get_hf_token(args.token) + logger.info(f"Using HuggingFace token: {token[:10]}...") + except ValueError as e: + logger.error(str(e)) + sys.exit(1) + + # Initialize Foundry + logger.info("Connecting to Foundry...") + f = Foundry() + + # List all datasets + logger.info("Fetching dataset list...") + datasets_df = f.list() + total_datasets = len(datasets_df) + logger.info(f"Found {total_datasets} datasets") + + # Apply filters + if args.dataset: + datasets_df = datasets_df[datasets_df['dataset_name'] == args.dataset] + if len(datasets_df) == 0: + logger.error(f"Dataset '{args.dataset}' not found") + sys.exit(1) + + if args.skip: + datasets_df = datasets_df.iloc[args.skip:] + logger.info(f"Skipping first {args.skip} datasets") + + if args.limit: + datasets_df = datasets_df.iloc[:args.limit] + logger.info(f"Limiting to {args.limit} datasets") + + # Preview + print("\n" + "=" * 60) + print("DATASETS TO PROCESS") + print("=" * 60) + for i, (_, row) in enumerate(datasets_df.iterrows()): + name = row.get('dataset_name', 'N/A') + title = str(row.get('title', 'N/A'))[:50] + repo_name = sanitize_repo_name(name) + print(f"{i+1:3}. {args.org}/{repo_name}") + print(f" Title: {title}") + print("=" * 60 + "\n") + + if args.dry_run: + logger.info("Dry run complete. No datasets were uploaded.") + return + + # Confirm + response = input(f"Push {len(datasets_df)} datasets to {args.org}? [y/N] ") + if response.lower() != 'y': + logger.info("Aborted.") + return + + # Process datasets + results: List[UploadResult] = [] + success_count = 0 + failed_count = 0 + skipped_count = 0 + + print("\n" + "=" * 60) + print("UPLOADING") + print("=" * 60) + + for i, (_, row) in enumerate(datasets_df.iterrows()): + name = row.get('dataset_name', 'N/A') + doi = row.get('DOI', 'N/A') + # Use the pre-loaded FoundryDataset object if available + foundry_dataset = row.get('FoundryDataset', None) + + logger.info(f"[{i+1}/{len(datasets_df)}] Processing: {name}") + + try: + # Use pre-loaded dataset or fetch it + if foundry_dataset is not None: + dataset = foundry_dataset + else: + dataset = f.get_dataset(doi) + + # Push to HF + result = push_dataset( + dataset=dataset, + org=args.org, + token=token, + private=args.private, + ) + results.append(result) + + if result.status == 'success': + success_count += 1 + elif result.status == 'skipped': + skipped_count += 1 + else: + failed_count += 1 + + except Exception as e: + logger.error(f" Error loading dataset: {e}") + results.append(UploadResult( + dataset_name=name, + doi=doi, + repo_id=f"{args.org}/{sanitize_repo_name(name)}", + status='failed', + error=str(e) + )) + failed_count += 1 + + # Brief pause to avoid rate limiting + time.sleep(1) + + # Summary + print("\n" + "=" * 60) + print("SUMMARY") + print("=" * 60) + print(f"Total processed: {len(results)}") + print(f" Successful: {success_count}") + print(f" Skipped: {skipped_count}") + print(f" Failed: {failed_count}") + + if failed_count > 0: + print("\nFailed datasets:") + for r in results: + if r.status == 'failed': + print(f" - {r.dataset_name}: {r.error}") + + # Save results + if args.output: + output_data = { + 'timestamp': datetime.now().isoformat(), + 
'organization': args.org, + 'total': len(results), + 'success': success_count, + 'skipped': skipped_count, + 'failed': failed_count, + 'results': [asdict(r) for r in results] + } + with open(args.output, 'w') as f: + json.dump(output_data, f, indent=2) + logger.info(f"Results saved to {args.output}") + + # Print successful URLs + if success_count > 0: + print("\nSuccessfully uploaded:") + for r in results: + if r.status == 'success': + print(f" {r.url}") + + +if __name__ == '__main__': + main() diff --git a/setup.py b/setup.py index 1783614e..811026a9 100644 --- a/setup.py +++ b/setup.py @@ -5,7 +5,7 @@ packages = (setuptools.find_packages(),) setuptools.setup( name="foundry_ml", - version="1.0.4", + version="1.1.0", author="""Aristana Scourtas, KJ Schmidt, Isaac Darling, Aadit Ambadkar, Braeden Cullen, Imogen Foster, Ribhav Bose, Zoa Katok, Ethan Truelove, Ian Foster, Ben Blaiszik""", author_email="blaiszik@uchicago.edu", @@ -14,7 +14,7 @@ long_description=long_description, long_description_content_type="text/markdown", install_requires=[ - "mdf_forge>=0.8.0", + "mdf_toolbox>=0.6.0", "globus-sdk>=3,<4", "dlhub_sdk>=1.0.0", "numpy>=1.15.4", @@ -23,19 +23,38 @@ "mdf_connect_client>=0.5.0", "h5py>=2.10.0", "json2table", - "openpyxl>=3.1.0" + "openpyxl>=3.1.0", + # CLI and agent support (core) + "typer[all]>=0.9.0", + "rich>=13.0.0", ], - python_requires=">=3.7", + extras_require={ + "huggingface": [ + "datasets>=2.14.0", + "huggingface_hub>=0.17.0", + ], + }, + entry_points={ + "console_scripts": [ + "foundry=foundry.__main__:main", + ], + }, + python_requires=">=3.8", classifiers=[ - "Development Status :: 3 - Alpha", + "Development Status :: 4 - Beta", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.8", + "Programming Language :: Python :: 3.9", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", "Topic :: Scientific/Engineering", ], - keywords=[], + keywords=["materials science", "machine learning", "datasets", "MCP", "AI agents"], license="MIT License", url="https://github.com/MLMI2-CSSI/foundry", ) diff --git a/tests/test_errors.py b/tests/test_errors.py new file mode 100644 index 00000000..8b5f3564 --- /dev/null +++ b/tests/test_errors.py @@ -0,0 +1,199 @@ +"""Tests for the structured error classes.""" + +import pytest + +from foundry.errors import ( + FoundryError, + DatasetNotFoundError, + AuthenticationError, + DownloadError, + DataLoadError, + ValidationError, + PublishError, + CacheError, + ConfigurationError, +) + + +class TestFoundryError: + """Tests for the base FoundryError class.""" + + def test_foundry_error_has_required_fields(self): + """Test that FoundryError has all required fields.""" + error = FoundryError( + code="TEST_ERROR", + message="Test error message", + details={"key": "value"}, + recovery_hint="Try again" + ) + + assert error.code == "TEST_ERROR" + assert error.message == "Test error message" + assert error.details == {"key": "value"} + assert error.recovery_hint == "Try again" + + def test_foundry_error_str_includes_code_and_hint(self): + """Test that str() includes code and recovery hint.""" + error = FoundryError( + code="TEST_ERROR", + message="Test message", + recovery_hint="Do this to fix" + ) + + error_str = str(error) + assert "[TEST_ERROR]" in error_str + assert "Test message" in error_str + 
assert "Do this to fix" in error_str + + def test_foundry_error_to_dict(self): + """Test serialization to dict for JSON responses.""" + error = FoundryError( + code="TEST_ERROR", + message="Test message", + details={"foo": "bar"}, + recovery_hint="Fix it" + ) + + d = error.to_dict() + assert d["code"] == "TEST_ERROR" + assert d["message"] == "Test message" + assert d["details"] == {"foo": "bar"} + assert d["recovery_hint"] == "Fix it" + + def test_foundry_error_is_exception(self): + """Test that FoundryError can be raised and caught.""" + with pytest.raises(FoundryError) as exc_info: + raise FoundryError(code="TEST", message="Test") + + assert exc_info.value.code == "TEST" + + +class TestDatasetNotFoundError: + """Tests for DatasetNotFoundError.""" + + def test_dataset_not_found_error(self): + """Test DatasetNotFoundError initialization.""" + error = DatasetNotFoundError("bandgap") + + assert error.code == "DATASET_NOT_FOUND" + assert "bandgap" in error.message + assert error.details["query"] == "bandgap" + assert error.recovery_hint is not None + + def test_dataset_not_found_error_search_type(self): + """Test DatasetNotFoundError with different search types.""" + error = DatasetNotFoundError("10.18126/abc", search_type="DOI") + + assert "DOI" in error.message + assert error.details["search_type"] == "DOI" + + +class TestAuthenticationError: + """Tests for AuthenticationError.""" + + def test_authentication_error(self): + """Test AuthenticationError initialization.""" + error = AuthenticationError("Globus", "Token expired") + + assert error.code == "AUTH_FAILED" + assert "Globus" in error.message + assert "Token expired" in error.message + assert error.details["service"] == "Globus" + assert error.recovery_hint is not None + + +class TestDownloadError: + """Tests for DownloadError.""" + + def test_download_error(self): + """Test DownloadError initialization.""" + error = DownloadError( + url="https://example.com/file.dat", + reason="Connection timeout", + destination="/tmp/file.dat" + ) + + assert error.code == "DOWNLOAD_FAILED" + assert "https://example.com/file.dat" in error.message + assert error.details["url"] == "https://example.com/file.dat" + assert error.details["destination"] == "/tmp/file.dat" + + +class TestDataLoadError: + """Tests for DataLoadError.""" + + def test_data_load_error(self): + """Test DataLoadError initialization.""" + error = DataLoadError( + file_path="/data/dataset.json", + reason="Invalid JSON", + data_type="tabular" + ) + + assert error.code == "DATA_LOAD_FAILED" + assert "/data/dataset.json" in error.message + assert error.details["data_type"] == "tabular" + + +class TestValidationError: + """Tests for ValidationError.""" + + def test_validation_error(self): + """Test ValidationError initialization.""" + error = ValidationError( + field_name="creators", + error_msg="Field required", + schema_type="datacite" + ) + + assert error.code == "VALIDATION_FAILED" + assert "creators" in error.message + assert error.details["schema_type"] == "datacite" + + +class TestPublishError: + """Tests for PublishError.""" + + def test_publish_error(self): + """Test PublishError initialization.""" + error = PublishError( + reason="Metadata validation failed", + source_id="my_dataset_v1.0", + status="failed" + ) + + assert error.code == "PUBLISH_FAILED" + assert error.details["source_id"] == "my_dataset_v1.0" + assert error.details["status"] == "failed" + + +class TestCacheError: + """Tests for CacheError.""" + + def test_cache_error(self): + """Test CacheError initialization.""" 
+ error = CacheError( + operation="write", + reason="Disk full", + cache_path="/tmp/cache" + ) + + assert error.code == "CACHE_ERROR" + assert "write" in error.message + assert error.details["cache_path"] == "/tmp/cache" + + +class TestConfigurationError: + """Tests for ConfigurationError.""" + + def test_configuration_error(self): + """Test ConfigurationError initialization.""" + error = ConfigurationError( + setting="use_globus", + reason="Invalid value", + current_value="maybe" + ) + + assert error.code == "CONFIG_ERROR" + assert "use_globus" in error.message + assert error.details["current_value"] == "maybe" diff --git a/tests/test_https_download.py b/tests/test_https_download.py index f992fe0e..3890f165 100644 --- a/tests/test_https_download.py +++ b/tests/test_https_download.py @@ -1,67 +1,99 @@ -# import os -# import requests -# import mock - -# from foundry.https_download import download_file - - -# def test_download_file(tmp_path): -# item = { -# "path": tmp_path, -# "name": "example_file.txt" -# } -# data_directory = tmp_path -# https_config = { -# "base_url": "https://example.com/", -# "source_id": "12345" -# } - -# # Mock the requests.get function to return a response with content -# with mock.patch.object(requests, "get") as mock_get: -# mock_get.return_value.content = b"Example file content" - -# # Call the function -# result = download_file(item, data_directory, https_config) - -# # Assert that the file was downloaded and written correctly -# assert os.path.exists(str(tmp_path) + "/12345/example_file.txt") -# with open(str(tmp_path) + "/12345/example_file.txt", "rb") as f: -# assert f.read() == b"Example file content" - -# # Assert that the result is as expected -# assert result == {str(tmp_path) + "/12345/example_file.txt status": True} - - -# def test_download_file_with_existing_directories(tmp_path): -# temp_path_to_file = str(tmp_path) + '/file' -# os.mkdir(temp_path_to_file) -# temp_path_to_data = str(tmp_path) + '/data' -# os.mkdir(temp_path_to_data) - -# item = { -# "path": temp_path_to_file, -# "name": "example_file.txt" -# } -# data_directory = temp_path_to_data -# https_config = { -# "base_url": "https://example.com/", -# "source_id": "12345" -# } - -# # Create the parent directories -# os.makedirs(temp_path_to_data + "12345") - -# # Mock the requests.get function to return a response with content -# with mock.patch.object(requests, "get") as mock_get: -# mock_get.return_value.content = b"Example file content" - -# # Call the function -# result = download_file(item, data_directory, https_config) - -# # Assert that the file was downloaded and written correctly -# assert os.path.exists(temp_path_to_data + "/12345/example_file.txt") -# with open(temp_path_to_data + "/12345/example_file.txt", "rb") as f: -# assert f.read() == b"Example file content" - -# # Assert that the result is as expected -# assert result == {temp_path_to_data + "/12345/example_file.txt status": True} +"""Tests for https_download module.""" + +import os +import pytest +from unittest import mock + +import requests + +from foundry.https_download import download_file, DownloadError + + +class TestDownloadFile: + """Tests for the download_file function.""" + + def test_download_file_success(self, tmp_path): + """Test successful file download.""" + item = { + "path": "/data", + "name": "example_file.txt" + } + https_config = { + "base_url": "https://example.com/", + "source_id": "test_dataset" + } + + # Mock successful response + mock_response = mock.Mock() + mock_response.iter_content = 
mock.Mock(return_value=[b"Example file content"]) + mock_response.raise_for_status = mock.Mock() + mock_response.__enter__ = mock.Mock(return_value=mock_response) + mock_response.__exit__ = mock.Mock(return_value=False) + + with mock.patch.object(requests, "get", return_value=mock_response): + result = download_file(item, str(tmp_path), https_config) + + # Assert file was downloaded + expected_path = tmp_path / "test_dataset" / "example_file.txt" + assert os.path.exists(expected_path) + assert result == str(expected_path) + + def test_download_file_request_error(self, tmp_path): + """Test that RequestException raises DownloadError.""" + item = { + "path": "/data", + "name": "example_file.txt" + } + https_config = { + "base_url": "https://example.com/", + "source_id": "test_dataset" + } + + # Mock request failure + with mock.patch.object(requests, "get", side_effect=requests.exceptions.RequestException("Connection failed")): + with pytest.raises(DownloadError) as exc_info: + download_file(item, str(tmp_path), https_config) + + error = exc_info.value + assert error.url == "https://example.com/data/example_file.txt" + assert "Connection failed" in error.reason + assert error.destination is not None + + def test_download_file_io_error(self, tmp_path): + """Test that IOError raises DownloadError.""" + item = { + "path": "/data", + "name": "example_file.txt" + } + https_config = { + "base_url": "https://example.com/", + "source_id": "test_dataset" + } + + # Mock successful response but write failure + mock_response = mock.Mock() + mock_response.iter_content = mock.Mock(return_value=[b"data"]) + mock_response.raise_for_status = mock.Mock() + mock_response.__enter__ = mock.Mock(return_value=mock_response) + mock_response.__exit__ = mock.Mock(return_value=False) + + with mock.patch.object(requests, "get", return_value=mock_response): + with mock.patch("builtins.open", side_effect=IOError("Disk full")): + with pytest.raises(DownloadError) as exc_info: + download_file(item, str(tmp_path), https_config) + + error = exc_info.value + assert "Disk full" in error.reason + + def test_download_error_has_structured_info(self): + """Test that DownloadError provides structured information.""" + error = DownloadError( + url="https://example.com/file.txt", + reason="Connection timeout", + destination="/tmp/file.txt" + ) + + assert error.url == "https://example.com/file.txt" + assert error.reason == "Connection timeout" + assert error.destination == "/tmp/file.txt" + assert "Connection timeout" in str(error) diff --git a/tests/test_new_features.py b/tests/test_new_features.py new file mode 100644 index 00000000..2f40dcc6 --- /dev/null +++ b/tests/test_new_features.py @@ -0,0 +1,265 @@ +"""Tests for new features: as_json, include_schema, get_schema, MCP tools.""" + +import pytest +from unittest import mock + +from foundry import Foundry, FoundryDataset +from tests.test_data import datacite_data, valid_metadata + + +class TestAsJsonParameter: + """Tests for the as_json parameter on search and list.""" + + def test_dataset_to_dict_method(self): + """Test that _dataset_to_dict converts a dataset to dict properly.""" + # Create a mock dataset + mock_ds = mock.Mock() + mock_ds.dataset_name = "test_dataset" + mock_ds.dc.titles = [mock.Mock(title="Test Dataset")] + mock_ds.dc.identifier = mock.Mock() + mock_ds.dc.identifier.identifier = "10.18126/test" + mock_ds.dc.descriptions = [mock.Mock(description="A test dataset")] + mock_ds.dc.publicationYear = 2024 + mock_ds.foundry_schema.keys = [] + 
mock_ds.foundry_schema.splits = [] + mock_ds.foundry_schema.data_type = "tabular" + + # Test the _dataset_to_dict method directly + from foundry.foundry import Foundry + result = Foundry._dataset_to_dict(None, mock_ds) + + assert isinstance(result, dict) + assert result["name"] == "test_dataset" + assert result["title"] == "Test Dataset" + assert result["doi"] == "10.18126/test" + assert result["data_type"] == "tabular" + + def test_dataset_to_dict_includes_fields_and_splits(self): + """Test that _dataset_to_dict includes fields and splits.""" + mock_key = mock.Mock() + mock_key.key = ["band_gap"] + + mock_split = mock.Mock() + mock_split.label = "train" + + mock_ds = mock.Mock() + mock_ds.dataset_name = "test_dataset" + mock_ds.dc.titles = [mock.Mock(title="Test")] + mock_ds.dc.identifier = mock.Mock() + mock_ds.dc.identifier.identifier = "10.18126/test" + mock_ds.dc.descriptions = [] + mock_ds.dc.publicationYear = 2024 + mock_ds.foundry_schema.keys = [mock_key] + mock_ds.foundry_schema.splits = [mock_split] + mock_ds.foundry_schema.data_type = "tabular" + + from foundry.foundry import Foundry + result = Foundry._dataset_to_dict(None, mock_ds) + + assert result["fields"] == ["band_gap"] + assert result["splits"] == ["train"] + assert result["data_type"] == "tabular" + + +class TestGetSchema: + """Tests for the get_schema method on FoundryDataset.""" + + def test_get_schema_returns_dict(self): + """Test that get_schema returns a dictionary with expected fields.""" + ds = FoundryDataset( + dataset_name='test_dataset', + foundry_schema=valid_metadata, + datacite_entry=datacite_data + ) + + schema = ds.get_schema() + + assert isinstance(schema, dict) + assert schema["name"] == "test_dataset" + assert "title" in schema + assert "doi" in schema + assert "data_type" in schema + assert "splits" in schema + assert "fields" in schema + + def test_get_schema_includes_field_details(self): + """Test that get_schema includes field descriptions and units.""" + ds = FoundryDataset( + dataset_name='test_dataset', + foundry_schema=valid_metadata, + datacite_entry=datacite_data + ) + + schema = ds.get_schema() + + # Check that fields have the expected structure + assert len(schema["fields"]) > 0 + field = schema["fields"][0] + assert "name" in field + assert "role" in field + assert "description" in field + assert "units" in field + + def test_get_schema_includes_splits(self): + """Test that get_schema includes split information.""" + ds = FoundryDataset( + dataset_name='test_dataset', + foundry_schema=valid_metadata, + datacite_entry=datacite_data + ) + + schema = ds.get_schema() + + assert len(schema["splits"]) == 2 # train and test from valid_metadata + split_names = [s["name"] for s in schema["splits"]] + assert "train" in split_names + assert "test" in split_names + + +class TestIncludeSchema: + """Tests for the include_schema parameter on get_as_dict.""" + + def test_include_schema_false_returns_data_only(self): + """Test that include_schema=False returns just data.""" + ds = FoundryDataset( + dataset_name='test_dataset', + foundry_schema=valid_metadata, + datacite_entry=datacite_data + ) + ds._foundry_cache = mock.Mock() + ds._foundry_cache.load_as_dict.return_value = {"train": ({"x": [1, 2]}, {"y": [0, 1]})} + + result = ds.get_as_dict(include_schema=False) + + assert "schema" not in result + assert "train" in result + + def test_include_schema_true_returns_data_and_schema(self): + """Test that include_schema=True returns data with schema.""" + ds = FoundryDataset( + dataset_name='test_dataset', + 
foundry_schema=valid_metadata, + datacite_entry=datacite_data + ) + ds._foundry_cache = mock.Mock() + ds._foundry_cache.load_as_dict.return_value = {"train": ({"x": [1, 2]}, {"y": [0, 1]})} + + result = ds.get_as_dict(include_schema=True) + + assert "data" in result + assert "schema" in result + assert result["schema"]["name"] == "test_dataset" + + +class TestMCPTools: + """Tests for MCP server tools.""" + + def test_search_datasets_tool(self): + """Test the search_datasets MCP tool.""" + from foundry.mcp.server import TOOLS + + search_tool = next(t for t in TOOLS if t["name"] == "search_datasets") + + assert search_tool["name"] == "search_datasets" + assert "query" in search_tool["inputSchema"]["properties"] + assert "limit" in search_tool["inputSchema"]["properties"] + assert "query" in search_tool["inputSchema"]["required"] + + def test_get_dataset_schema_tool(self): + """Test the get_dataset_schema MCP tool.""" + from foundry.mcp.server import TOOLS + + schema_tool = next(t for t in TOOLS if t["name"] == "get_dataset_schema") + + assert schema_tool["name"] == "get_dataset_schema" + assert "doi" in schema_tool["inputSchema"]["properties"] + assert "doi" in schema_tool["inputSchema"]["required"] + + def test_load_dataset_tool(self): + """Test the load_dataset MCP tool.""" + from foundry.mcp.server import TOOLS + + load_tool = next(t for t in TOOLS if t["name"] == "load_dataset") + + assert load_tool["name"] == "load_dataset" + assert "doi" in load_tool["inputSchema"]["properties"] + assert "split" in load_tool["inputSchema"]["properties"] + + def test_list_all_datasets_tool(self): + """Test the list_all_datasets MCP tool.""" + from foundry.mcp.server import TOOLS + + list_tool = next(t for t in TOOLS if t["name"] == "list_all_datasets") + + assert list_tool["name"] == "list_all_datasets" + assert "limit" in list_tool["inputSchema"]["properties"] + + def test_create_server(self): + """Test that create_server returns proper server config.""" + from foundry.mcp.server import create_server + + config = create_server() + + assert config["name"] == "foundry-ml" + assert "version" in config + assert "tools" in config + assert len(config["tools"]) == 4 + + +class TestCLI: + """Tests for CLI commands.""" + + def test_cli_app_exists(self): + """Test that CLI app is properly configured.""" + from foundry.__main__ import app + + assert app is not None + assert app.info.name == "foundry" + + def test_cli_has_search_command(self): + """Test that CLI has search command.""" + from foundry.__main__ import search + assert search is not None + assert callable(search) + + def test_cli_has_get_command(self): + """Test that CLI has get command.""" + from foundry.__main__ import get + assert get is not None + assert callable(get) + + def test_cli_has_schema_command(self): + """Test that CLI has schema command.""" + from foundry.__main__ import schema + assert schema is not None + assert callable(schema) + + def test_cli_has_catalog_command(self): + """Test that CLI has catalog command.""" + from foundry.__main__ import catalog + assert catalog is not None + assert callable(catalog) + + def test_cli_has_push_to_hf_command(self): + """Test that CLI has push_to_hf command.""" + from foundry.__main__ import push_to_hf + assert push_to_hf is not None + assert callable(push_to_hf) + + def test_cli_has_version_command(self): + """Test that CLI has version command.""" + from foundry.__main__ import version + assert version is not None + assert callable(version) + + def test_cli_has_mcp_start_command(self): + """Test 
that CLI has MCP start command.""" + from foundry.__main__ import start + assert start is not None + assert callable(start) + + def test_cli_has_mcp_install_command(self): + """Test that CLI has MCP install command.""" + from foundry.__main__ import install + assert install is not None + assert callable(install)
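
The CLI tests above only assert that each command callable exists in `foundry.__main__`. A minimal invocation-level smoke test could complement them using Typer's test runner. This sketch is not part of the PR: it assumes the commands asserted above (e.g. `search`, `schema`, `version`) are registered directly on the top-level `app` and therefore show up in `--help`, and it avoids network-backed commands so it can run offline.

```python
"""Hypothetical smoke test for the Foundry CLI surface (not part of this PR)."""

from typer.testing import CliRunner

from foundry.__main__ import app  # the Typer app asserted to exist in the tests above

runner = CliRunner()


def test_cli_help_smoke():
    # `--help` should exit cleanly without touching the network.
    result = runner.invoke(app, ["--help"])
    assert result.exit_code == 0

    # Assumption: these commands are registered on the top-level app and
    # therefore appear in the generated help text.
    for command in ("search", "schema", "version"):
        assert command in result.stdout
```

Because Typer's `CliRunner` drives the app in-process, a check like this can run in CI without installing a console script or hitting the Foundry services.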