EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

A Generalizable Benchmark for Continuous Plan-and-Execute Decision Making in Interactive Economies

🌟 Overview

Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks are largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies.

EcoGym comprises three diverse environments (Vending, Freelance, and Operation), implemented as a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps when evaluated over 365 day-loops). Evaluation is based on business-relevant outcomes (e.g., net worth, income, and daily active users (DAU)) and targets long-term strategic coherence and robustness under partial observability and stochasticity.
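To make the "budgeted actions over day-loops" idea concrete, here is a minimal toy sketch. It is not EcoGym's actual interface: `ToyEconomy`, `run`, the budget of 3 actions per day, and all numeric values are assumptions chosen only for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyEconomy:
    """Hypothetical stand-in for an environment; not EcoGym's real API."""
    cash: float = 100.0
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def observe(self) -> dict:
        # Partial observability: the agent sees its cash, not the hidden demand.
        return {"cash": self.cash}

    def step(self, action: str) -> None:
        # Stochastic outcome: "sell" earns a random amount, anything else earns nothing.
        if action == "sell":
            self.cash += self.rng.uniform(0.0, 2.0)

    def end_of_day(self) -> None:
        # Persistent dynamics: a fixed daily cost accumulates across days.
        self.cash -= 1.0


def run(policy, days: int = 365, actions_per_day: int = 3) -> tuple[float, int]:
    """Run the day-loop and return (final cash, total steps taken)."""
    env = ToyEconomy()
    steps = 0
    for _ in range(days):
        for _ in range(actions_per_day):  # assumed per-day action budget
            env.step(policy(env.observe()))
            steps += 1
        env.end_of_day()
    return env.cash, steps


final_cash, total_steps = run(lambda obs: "sell")
print(total_steps)  # 365 days * 3 actions per day = 1095 steps
```

With this assumed budget, a 365-day run already exceeds 1000 steps, which is why a single early mistake can compound over the rest of the episode.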

EcoGym's design principles and three economic environments: Vending, Freelance, and Operation.

Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategy or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability–utility trade-offs in realistic economic settings.

📊 Experimental Results

Our empirical evaluation on EcoGym reveals a significant performance gap in current LLMs: no single model consistently achieves superior performance across all scenarios, highlighting the inherent difficulty of long-horizon economic decision-making. Critically, we find that models exhibit significant suboptimality in either high-level strategy or efficient action execution. Furthermore, we conduct a suite of eight diagnostic experiments and case studies, covering factors such as context window length, agent behavior patterns, additional memory modules, and human baselines.

Performance comparison across eleven leading LLMs in the three EcoGym environments.

⚙️ Setup

# Create environment
conda create -n ecogym python=3.10
conda activate ecogym

# Install dependencies
pip install -r requirements.txt

🔧 Configuration

Configure your API keys and model pricing:

Create a .env file with your API keys (the repository's .env.example can serve as a starting point)
Set pricing in config/model_pricing.yaml
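Before launching a long run, it can help to confirm that the .env file actually contains the keys you expect. The sketch below parses plain KEY=VALUE lines using only the standard library. The key name `OPENAI_API_KEY` is a hypothetical example, not a name confirmed by this repository; substitute whatever your provider requires.

```python
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


# Hypothetical key name, used only for illustration.
REQUIRED = ["OPENAI_API_KEY"]

if Path(".env").exists():
    print("Missing keys:", missing_keys(read_env_file(".env"), REQUIRED))
else:
    print("No .env file found in the current directory")
```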

🚀 Usage

python main.py --type {vending, freelance, operation}

Results will be saved to logs/sessions/{session_id}/
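Since each run writes to its own session directory, a small helper can locate the most recent one. This is a sketch that assumes only the logs/sessions/ layout stated above; the helper name `latest_session` is made up for illustration, and the files inside each session directory are not assumed.

```python
from pathlib import Path
from typing import Optional


def latest_session(root: str = "logs/sessions") -> Optional[Path]:
    """Return the most recently modified session directory, or None if there is none."""
    base = Path(root)
    if not base.is_dir():
        return None
    sessions = [p for p in base.iterdir() if p.is_dir()]
    # Pick the directory with the newest modification time.
    return max(sessions, key=lambda p: p.stat().st_mtime, default=None)


session = latest_session()
print(session if session is not None else "No sessions found yet")
```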

🙏 Acknowledgements

This project is adapted from Agno, a framework for building multi-agent systems that learn and improve with every interaction.
