Skip to content

sodascience/BiodiversityASSET

Repository files navigation

BiodiversityASSET

LLM-powered analysis of biodiversity-related investment activities in financial reports

Python OpenAI uv

BiodiversityASSET is a comprehensive pipeline for extracting, classifying, and analyzing biodiversity-related content from investor reports. The system uses LLMs to evaluate paragraphs across three key dimensions:

  1. 🌿 Biodiversity relevance - Identifies content related to biodiversity and environmental impact
  2. πŸ’° Investment activity - Classifies paragraphs containing concrete investment activities
  3. πŸ“Š Assetization characteristics - Scores content on intrinsic value, cash flow, and ownership/control

Table of Contents

Key Features

✨ Modular Architecture

  • Submit batch jobs and monitor progress independently
  • Resume workflows from any step using batch IDs
  • Cancel running jobs with safety confirmations

πŸ€– LLM-Powered Processing

  • OpenAI Batch API integration for cost-effective analysis (~50% cost reduction)
  • External prompt system for easy customization
  • Support for multiple models and configurations

πŸ“ Organized Output

  • Results saved in batch-specific subfolders
  • Clean filenames without ID conflicts
  • Individual chunk processing for large datasets

πŸ”§ Developer-Friendly

  • Comprehensive CLI tools with intuitive options
  • Detailed progress monitoring and error handling
  • Flexible configuration and custom prompt support

Processing Pipeline

The processing pipeline consists of sequential steps, with LLM-powered batch processing for steps 3-4:

Step Purpose Script Input Output
1 Extract paragraphs from PDFs extract_pdfs.py data/raw/pdfs/ extracted_paragraphs_from_pdfs/
2 Filter biodiversity content filter_biodiversity_paragraphs.py extracted_paragraphs_from_pdfs/ biodiversity_related_paragraphs/
3a Submit investment classification submit_batch_job.py biodiversity_related_paragraphs/ Returns batch ID
3b Monitor batch progress check_batch_status.py Batch ID Status updates
3c Download investment results download_batch_results.py Batch ID investment_activity_classification/
4a Submit assetization scoring submit_batch_job.py investment_activity_classification/ Returns batch ID
4b Monitor batch progress check_batch_status.py Batch ID Status updates
4c Download assetization results download_batch_results.py Batch ID assetization_features_scoring/

πŸ’‘ Key Points:

  • Steps 3-4 use OpenAI's Batch API for cost-effective processing
  • Each batch step can be run independently
  • Step 4 requires a completed investment activity classification batch ID
  • All results are organized in batch-specific subfolders

Quick Start

Prerequisites

Ensure you have uv installed:

πŸ“¦ Install uv (click to expand)

Windows:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Linux/MacOS:

curl -LsSf https://astral.sh/uv/install.sh | sh

πŸš€ Installation

# Clone the repository
git clone <repository-url>
cd BiodiversityASSET

# Install dependencies
uv sync

βš™οΈ Environment Setup

# Set your OpenAI API key
export OPENAI_API_KEY="your-openai-api-key"

# Or create a .env file
echo "OPENAI_API_KEY=your-openai-api-key" > .env

πŸ“ Basic Usage

1. Extract paragraphs from PDFs

python scripts/extract_pdfs.py

2. Filter biodiversity-related content

python scripts/filter_biodiversity_paragraphs.py

3. Classify investment activities

# Submit the batch job
python scripts/submit_batch_job.py --task investment_activity_classification

# Monitor progress (replace <batch-id> with actual ID)
python scripts/check_batch_status.py --batch-id <batch-id> --wait

# Download results
python scripts/download_batch_results.py --batch-id <batch-id>

4. Score assetization features

# Submit dependent job (requires investment batch ID)
python scripts/submit_batch_job.py --task assetization_features_scoring --batch-id <investment_batch_id>

# Monitor and download
python scripts/check_batch_status.py --batch-id <assetization_batch_id> --wait
python scripts/download_batch_results.py --batch-id <assetization_batch_id>

Batch Job Management

πŸ“Š Monitoring Jobs

# List all batch jobs with LAST-CHECKED status and timestamps
python scripts/check_batch_status.py --list-jobs

# Check CURRENT status of a specific job
python scripts/check_batch_status.py --batch-id <batch-id>

# Wait for job completion (polls every 30 seconds)
python scripts/check_batch_status.py --batch-id <batch-id> --wait

# Custom polling interval
python scripts/check_batch_status.py --batch-id <batch-id> --wait --poll-interval 60

❌ Canceling Jobs

# Cancel a running batch job (requires confirmation)
python scripts/check_batch_status.py --batch-id <batch-id> --cancel

πŸ“‹ Example Job Listing Output

=== Batch Jobs (3 found) ===
Batch ID                              Task                      Status          Last Checked      Submitted         Paragraphs  
-------------------------------------------------------------------------------------------------------------------------------
batch_686fc36b2da08190903bc237510c52f5 investment_activity_class completed       07-10 16:55       2025-07-10T15:43  120         
batch_686fd9e4f814819088b69150a57753d6 assetization_features_sc  submitted       never             2025-07-10T17:19  3           
batch_686fdd5143248190aae3f8185f24a415 investment_activity_class in_progress     07-10 14:30       2025-07-10T14:15  274         

Prompt Customization

BiodiversityASSET uses external text files for prompts, making them easy to customize without code changes:

πŸ“ Default Prompt Files

  • prompts/investment_activity_classification_system_prompt.txt - System prompt for investment activity classification
  • prompts/assetization_features_scoring_system_prompt.txt - System prompt for assetization features scoring
  • prompts/user_prompt_template.txt - User prompt template applied to each paragraph

πŸ› οΈ Using Custom Prompts

# Use custom system prompt
python scripts/submit_batch_job.py --task investment_activity_classification \
    --system-prompt prompts/my_custom_system.txt

# Use both custom system and user prompts
python scripts/submit_batch_job.py --task investment_activity_classification \
    --system-prompt prompts/my_custom_system.txt \
    --user-prompt prompts/my_custom_user.txt

# Use different model with custom prompts
python scripts/submit_batch_job.py --task assetization_features_scoring \
    --batch-id <investment_batch_id> \
    --model gpt-4o \
    --max-tokens 750 \
    --system-prompt prompts/my_custom_system.txt

Project Structure

BiodiversityASSET/
β”œβ”€β”€ πŸ“ data/
β”‚   β”œβ”€β”€ πŸ“ raw/
β”‚   β”‚   └── πŸ“ pdfs/                     # πŸ“„ Input: PDF investor reports
β”‚   β”œβ”€β”€ πŸ“ processed/
β”‚   β”‚   β”œβ”€β”€ πŸ“ extracted_paragraphs_from_pdfs/      # Step 1: Extracted paragraphs
β”‚   β”‚   β”œβ”€β”€ πŸ“ biodiversity_related_paragraphs/     # Step 2: Filtered biodiversity content
β”‚   β”‚   β”œβ”€β”€ πŸ“ investment_activity_classification/  # Step 3: Investment classification results
β”‚   β”‚   β”‚   └── πŸ“ <batch_id>/
β”‚   β”‚   β”‚       β”œβ”€β”€ πŸ“Š batch_results.jsonl
β”‚   β”‚   β”‚       β”œβ”€β”€ πŸ“Š chunk_1.csv
β”‚   β”‚   β”‚       └── πŸ“Š chunk_2.csv
β”‚   β”‚   └── πŸ“ assetization_features_scoring/       # Step 4: Assetization scoring results
β”‚   β”‚       └── πŸ“ <batch_id>/
β”‚   β”‚           β”œβ”€β”€ πŸ“Š batch_results.jsonl
β”‚   β”‚           └── πŸ“Š assetization_features_scored.csv
β”‚   └── πŸ“ human_annotations/            # πŸ‘₯ Manual annotations for evaluation
β”œβ”€β”€ πŸ“ prompts/                          # πŸ€– LLM prompt templates
β”‚   β”œβ”€β”€ πŸ“ investment_activity_classification_system_prompt.txt
β”‚   β”œβ”€β”€ πŸ“ assetization_features_scoring_system_prompt.txt
β”‚   └── πŸ“ user_prompt_template.txt
β”œβ”€β”€ πŸ“ results/
β”‚   β”œβ”€β”€ πŸ“ batch_jobs/                   # πŸ“‹ Batch job metadata and raw results
β”‚   β”‚   β”œβ”€β”€ πŸ“„ <batch_id>.json
β”‚   β”‚   β”œβ”€β”€ πŸ“ investment_activity_classification_processing/
β”‚   β”‚   └── πŸ“ assetization_features_scoring_processing/
β”‚   └── πŸ“ evaluation/                   # πŸ“ˆ Evaluation results (future)
β”œβ”€β”€ πŸ“ scripts/                          # 🐍 Python processing scripts
β”œβ”€β”€ βš™οΈ pyproject.toml                    # πŸ“¦ Project dependencies
β”œβ”€β”€ πŸ”’ uv.lock                           # πŸ” Lock file for dependencies
β”œβ”€β”€ πŸ“– README.md                         # πŸ“š Project documentation
β”œβ”€β”€ πŸ“– BATCH_WORKFLOW.md                 # πŸ”„ Detailed batch processing workflow
└── πŸ“– REFACTORING_SUMMARY.md            # πŸ“ Summary of refactoring changes

Output Organization

Results are organized in batch-specific subfolders to prevent conflicts and enable easy tracking:

πŸ’Ό Investment Activity Classification

data/processed/investment_activity_classification/<batch_id>/
β”œβ”€β”€ πŸ“Š batch_results.jsonl              # Raw API responses
β”œβ”€β”€ πŸ“Š chunk_1.csv                      # Processed results for chunk 1
└── πŸ“Š chunk_2.csv                      # Processed results for chunk 2

Contains: Investment activity scores, explanations, and original paragraph metadata

πŸ“ˆ Assetization Features Scoring

data/processed/assetization_features_scoring/<batch_id>/
β”œβ”€β”€ πŸ“Š batch_results.jsonl              # Raw API responses
└── πŸ“Š assetization_features_scored.csv # Scored paragraphs with all dimensions

Contains: Intrinsic value, cash flow, and ownership/control scores with detailed reasoning

πŸ”‘ Key Benefits

  • πŸ”’ Conflict-free: Each batch job gets its own subfolder
  • 🏷️ Clean naming: Filenames without batch ID suffixes
  • πŸ“ Traceable: Easy to identify which batch produced which results
  • πŸ”„ Resumable: Can re-run or reference specific batch outputs

Documentation

πŸ“– BATCH_WORKFLOW.md - Detailed step-by-step workflow guide with examples

πŸ“ REFACTORING_SUMMARY.md - Complete summary of system architecture and changes


Contributing

We welcome contributions! Please see our contribution guidelines for more information.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use BiodiversityASSET in your research, please cite:

@software{biodiversityasset,
  title={BiodiversityASSET: LLM-powered analysis of biodiversity-related investment activities},
  author={SoDa},
  year={2025},
  url={https://github.com/yourusername/BiodiversityASSET}
}

Contact

This project is developed and maintained by the ODISSEI Social Data Science (SoDa) team.

Do you have questions, suggestions, or remarks? File an issue or feel free to contact Qixiang Fang or Catalina Papari.

About

LLM-powered analysis of biodiversity-related investment activities in financial reports

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages