Ayo

This repository contains the prototype implementation for our ASPLOS'25 paper: Towards End-to-End Optimization of LLM-based Applications with Ayo (arXiv preprint).

Ayo is a fine-grained orchestration framework designed for building and optimizing AI-powered applications—such as Retrieval-Augmented Generation (RAG) workflows—in environments where inference engines are deployed locally rather than accessed via remote APIs.

Unlike existing frameworks, which usually treat workflows as coarse-grained, sequential module chains, Ayo introduces a task-primitive-based abstraction that enables highly flexible and dynamic orchestration. With minimal user input, Ayo automatically optimizes workflows for performance, exploiting parallelism, pipelining, and efficient scheduling strategies.

Note: Some parts of the repo are still under construction, e.g., the unified multi-request scheduling for engine schedulers, the user-friendly interface, and the documentation. We will keep updating these.

Key Features

  • Fine-grained task orchestration for LLM workflows
  • Dynamic optimization for performance (e.g., parallelism, pipelining)
    • Dependency pruning
    • Stage decomposition parallelization
    • LLM prefilling splitting
    • LLM decoding pipelining
  • Distributed two-level scheduling (see the sketch after this list)
    • A graph scheduler that schedules the task primitives of each query graph
    • Several engine schedulers, one per engine type, that manage the corresponding operations
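
The listing below is a deliberately simplified, hypothetical illustration of this two-level split, not Ayo's actual scheduler API: a toy graph scheduler dispatches primitives whose dependencies are satisfied to per-engine-type schedulers, which then execute them.

# Hypothetical illustration of the two-level scheduling split (not Ayo's actual API):
# a graph scheduler walks a query's primitive graph and hands ready primitives to
# per-engine-type schedulers, which decide how to run them on their engines.

class EngineScheduler:
    """Toy per-engine-type scheduler: queues primitives and 'executes' them in order."""
    def __init__(self, engine_type):
        self.engine_type = engine_type
        self.queue = []

    def submit(self, primitive):
        self.queue.append(primitive)

    def run_all(self):
        for p in self.queue:
            print(f"[{self.engine_type}] executing {p['name']}")
        self.queue.clear()


def graph_schedule(primitives, engine_schedulers):
    """Toy graph scheduler: dispatches primitives whose dependencies have completed."""
    done = set()
    while len(done) < len(primitives):
        ready = [p for p in primitives
                 if p["name"] not in done and set(p["deps"]) <= done]
        for p in ready:
            engine_schedulers[p["engine"]].submit(p)
        for scheduler in engine_schedulers.values():
            scheduler.run_all()
        done.update(p["name"] for p in ready)


# usage: two engine types and a three-primitive dependency chain
schedulers = {"EMBEDDER": EngineScheduler("EMBEDDER"), "LLM": EngineScheduler("LLM")}
primitives = [
    {"name": "embed_query", "engine": "EMBEDDER", "deps": []},
    {"name": "llm_prefill", "engine": "LLM", "deps": ["embed_query"]},
    {"name": "llm_decode", "engine": "LLM", "deps": ["llm_prefill"]},
]
graph_schedule(primitives, schedulers)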

Quick Start

  1. Install dependencies:

Install postgres and pgvector:

sudo apt-get install postgresql postgresql-contrib libpq-dev # install postgresql
git clone https://github.com/pgvector/pgvector.git # compile and install pgvector; other installation methods work as well
cd pgvector
make
sudo make install
sudo -u postgres psql template1 -c "CREATE EXTENSION vector;" # test
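
To double-check from Python that the extension is visible, a minimal sanity check like the following can be used (it assumes a locally reachable PostgreSQL instance, suitable credentials, and the psycopg2 package; adjust the connection parameters to your setup):

# hypothetical sanity check; adjust dbname/user/host to your PostgreSQL setup
import psycopg2

conn = psycopg2.connect(dbname="template1", user="postgres", host="localhost")
with conn, conn.cursor() as cur:
    # look up the installed pgvector extension and its version
    cur.execute("SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';")
    print(cur.fetchone())  # e.g. ('vector', '0.8.0') once CREATE EXTENSION succeeded
conn.close()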

Install our modified vllm:

git clone --recurse-submodules https://github.com/NetX-lab/Ayo.git # clone the repo and its submodules
cd vllm
pip install -e .

Install Ayo:

cd ..
pip install -r requirements.txt
pip install -e .
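
As an optional sanity check that both editable installs succeeded, the two packages should be importable:

# confirm that the modified vLLM and Ayo are importable from the current environment
import vllm
from Ayo.app import APP

print("vLLM", vllm.__version__, "and Ayo imported successfully")
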
  2. Define the workflow with Nodes (task primitives) and optimize it with Ayo
Click to expand the code
import asyncio
from copy import deepcopy
from typing import List

from Ayo.app import APP
from Ayo.configs.config import EngineConfig
from Ayo.engines.engine_types import EngineType
# Node, NodeType, NodeIOSchema, NodeOps, DAG, Query, replace_placeholders and
# RAG_QUESTION_ANSWERING_PROMPT_TEMPLATE_STRING are imported from the
# corresponding Ayo modules (paths omitted here)

app = APP.init() # initialize the app entry

llm_config = EngineConfig(
    name="llm_service",
    engine_type=EngineType.LLM,
    resources={},
    num_gpus=1,
    num_cpus=1,
    instances=1,
    model_config={
        "model_name": "meta-llama/Llama-2-7b-chat-hf",
        "tensor_parallel_size": 1,
        #other config ...
    },
    latency_profile={
        "timeout": 300,
    }
)

app.register_engine(llm_config)
#register other engines ...


# define the primitive nodes
llm_prefilling_node = Node(
    name="LLMPrefilling",
    node_type=NodeType.COMPUTE,
    engine_type=EngineType.LLM,
    io_schema=NodeIOSchema(
        input_format={"queries": List[str], "reranked_results": List[List[str]]},
        output_format={"prefill_state": bool}
    ),
    op_type=NodeOps.LLM_PREFILLING,
    config={
        'prompt_template': replace_placeholders(RAG_QUESTION_ANSWERING_PROMPT_TEMPLATE_STRING, question="queries", context="reranked_results"),
        'parse_json': True,
        #other config ...
    }
)

llm_decoding_node = Node(
    name="LLMDecoding",
    node_type=NodeType.COMPUTE,
    engine_type=EngineType.LLM,
    io_schema=NodeIOSchema(
        input_format={"prefill_state": bool},
        output_format={"result": str}
    ),
    op_type=NodeOps.LLM_DECODING,
    config={
        'prompt_template': replace_placeholders(RAG_QUESTION_ANSWERING_PROMPT_TEMPLATE_STRING, question="queries", context="reranked_results"),
        'parse_json': True,
        #other config ...
    }
)
#define other nodes ...

# create the DAG
dag = DAG(dag_id="rag_workflow")
dag.register_nodes(llm_prefilling_node, llm_decoding_node, ...)
# set the query inputs
dag.set_query_inputs(
  {
    'queries': ['What is the capital of France?'], ## set the query inputs
  }
)

from Ayo.opt_pass.pruning_dependency import PruningDependencyPass
from Ayo.opt_pass.stage_decomposition import StageDecompositionPass
from Ayo.opt_pass.prefilling_split import PrefillingSpiltPass
from Ayo.opt_pass.decoding_pipeling import LLMDecodingPipeliningPass

dag.optimize([PruningDependencyPass(), StageDecompositionPass(), PrefillingSpiltPass(), LLMDecodingPipeliningPass()])

query_id = 0  # any identifier that uniquely names this query
query = Query(
    uuid=f"random-test-{query_id}",
    query_id=f"random-test-{query_id}",
    DAG=deepcopy(dag)
)

# submit the query and wait for the result (must run inside an async function / event loop)
future = await app.submit_query(
    query=query,
    timeout=300
)

result = await asyncio.wait_for(future, timeout=300)
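
Since submit_query is awaited, the snippet above has to run inside an event loop. A minimal wrapper (assuming the engine registration and DAG construction shown above) could look like:

async def main():
    # engines registered, DAG built and optimized, and `query` created as shown above
    future = await app.submit_query(query=query, timeout=300)
    result = await asyncio.wait_for(future, timeout=300)
    print(result)

asyncio.run(main())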
  3. Define the workflow with high-level task modules, then transform and optimize it with Ayo (some parts are still under construction)
Click to expand the code
from typing import List

from Ayo.modules import IndexingModule, QueryExpandingModule, SearchingModule, RerankingModule
from Ayo.modules_to_primitives import transform_mod_to_prim

indexing_module = IndexingModule(
    input_format={"passages": List[str]},
    output_format={"index_status": bool}
)

query_expanding_module = QueryExpandingModule(
    input_format={"query": str},
    output_format={"expanded_queries": List[str]},
    config={"expanded_query_num": 3}
)

searching_module = SearchingModule(
    input_format={"index_status": bool, "expanded_queries": List[str]},
    output_format={"searching_results": List[str]}
)

reranking_module = RerankingModule(
    input_format={"searching_results": List[str]},
    output_format={"reranking_results": List[str]}
)


# chain the modules into a sequential workflow
indexing_module >> query_expanding_module >> searching_module >> reranking_module


node_list = transform_mod_to_prim([indexing_module, query_expanding_module, searching_module, reranking_module])

### Then optimize the workflow with Ayo as above
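
For example, the transformed primitives could be registered into a DAG and optimized exactly as in the node-level example above (a sketch reusing those APIs; the DAG id and query inputs are illustrative):

# sketch: register the transformed primitive nodes and apply the same optimization passes
dag = DAG(dag_id="rag_workflow_from_modules")
dag.register_nodes(*node_list)
dag.set_query_inputs({
    "passages": ["Paris is the capital of France."],
    "query": "What is the capital of France?",
})
dag.optimize([PruningDependencyPass(), StageDecompositionPass(),
              PrefillingSpiltPass(), LLMDecodingPipeliningPass()])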

Examples

Some examples are in the examples folder.

The testbed is a server with 4x NVIDIA RTX 3090 GPUs and a 52-core Intel(R) Xeon(R) Gold 5320 CPU.

For example 0, examples/optimized_embedding_ingestion_searching_reranking_llm.py implements the Ayo-optimized version of the naive RAG workflow; the unoptimized version is in examples/unoptimized_embedding_ingestion_searching_reranking_llm.py.

A visual comparison of the unoptimized (left) and optimized (right) workflows is available in the same folder.

Figures: unoptimized workflow (left), optimized workflow (right)

The execution latency is:

Workflow Type Latency
Unoptimized 3.72s
Optimized 1.97s

For example 1, examples/optimized_embedding_ingestion_rewriting_searching_reranking_llm.py implements the Ayo-optimized version of the advanced RAG workflow; the unoptimized version is in examples/unoptimized_embedding_ingestion_rewriting_searching_reranking_llm.py.

A visual comparison of the unoptimized (left) and optimized (right) workflows is available in the same folder.

Figures: unoptimized workflow (left), optimized workflow (right)

The execution latency is:

Workflow Type Latency
Unoptimized 7.40s
Optimized 3.17s

To-Do List

  • Add a multi-LLM calling example
  • Refine support for more LLM prompt templates
  • Add unified multi-request scheduling

Acknowledgements

We list the open-source projects we build upon and our modifications to them (if any).

Citation

If you find this work useful, please cite our paper:

@inproceedings{tan2025ayo,
  title     = {Towards End-to-End Optimization of LLM-based Applications with Ayo},
  author    = {Xin Tan and Yimin Jiang and Yitao Yang and Hong Xu},
  booktitle = {Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
  url       = {https://doi.org/10.1145/3676641.3716278},
  doi       = {10.1145/3676641.3716278},
  year      = {2025}
}

@article{tan2024teola,
  title={Teola: Towards end-to-end optimization of llm-based applications},
  author={Tan, Xin and Jiang, Yimin and Yang, Yitao and Xu, Hong},
  journal={arXiv preprint arXiv:2407.00326},
  year={2024}
}

Contact

If you have any questions or feedback, please email Xin Tan (xtan22@cse.cuhk.edu.hk).

License

This project is licensed under the MIT License. See the LICENSE file for details.
