
Streaming on LLM calls #2178

@themaherkhalil

Description

Is there an existing issue for this?

  • I confirm that I have not found an existing or similar issue.

Description

Implement the new streaming endpoint to stream LLM output to the user as results come in.

1) Content streaming (text)

The user invokes the runPixelAsync method with the following Python code:

from ai_server import ModelEngine
model = ModelEngine(engine_id="b0d18f4b-ff2c-4563-8f9d-57efbff53d60")

# Text Generation
command = 'write me a paragraph about soccer'
output = model.ask(command=command, param_dict={'max_completion_tokens': 20000, 'temperature': 0.3})
output

runPixelAsync returns a JSON response containing the jobId:

{
    "jobId": "019a6f97-9c11-73a9-a540-61325b1c2f5c"
}

The new streaming endpoint, POST /Monolith/api/engine/pixelJobStreaming, takes a form-encoded body (e.g. --data-urlencode 'jobId=019a6f96-1192-7cdc-8f61-7e2362f6ed5e') and returns any new messages in the format:

{
    "message": [
        {
            "stream_type": "content",
            "data": {
                "content": ""
            }
        },
        {
            "stream_type": "content",
            "data": {
                "content": "Soccer"
            }
        },
        {
            "stream_type": "content",
            "data": {
                "content": ","
            }
        },
        {
            "stream_type": "content",
            "data": {
                "content": " known"
            }
        },
        {
            "stream_type": "content",
            "data": {
                "content": " as football"
            }
        },
        ......
        {
            "stream_type": "content",
            "data": {
                "content": " cher"
            }
        },
        {
            "stream_type": "content",
            "data": {
                "content": "ished clubs"
            }
        },
        {
            "stream_type": "content",
            "data": {
                "content": "."
            }
        },
        {
            "stream_type": "content",
            "data": {
                "finish_reason": "stop"
            }
        }
    ],
    "status": "ProgressComplete"
}

The FE will continue polling this endpoint until it receives a message whose data contains the key "finish_reason". Its value indicates why generation ended, e.g. "stop" for a naturally completed generation, or "length" if the token limit has been reached and the response is truncated.
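The polling loop described above can be sketched as follows. This is a minimal illustration, not the actual FE implementation: the `drain_stream` helper, the `fetch_next` callable, and the simulated responses are all assumptions for demonstration; only the chunk shape (`stream_type`, `data`, `content`, `finish_reason`) comes from the payload format in this issue.

```python
def drain_stream(fetch_next):
    """Repeatedly call fetch_next() until a chunk carries 'finish_reason'.

    fetch_next is assumed to return one parsed response body from
    POST /Monolith/api/engine/pixelJobStreaming, i.e. a dict whose
    'message' key holds a list of {'stream_type', 'data'} chunks.
    Returns the accumulated text and the finish reason.
    """
    parts = []
    while True:
        response = fetch_next()
        for chunk in response.get("message", []):
            data = chunk.get("data", {})
            if "finish_reason" in data:
                return "".join(parts), data["finish_reason"]
            if chunk.get("stream_type") == "content":
                parts.append(data.get("content", ""))

# Simulated server responses, condensed from the sample payload above.
# A real client would issue an HTTP POST with the jobId instead.
_responses = iter([
    {"message": [
        {"stream_type": "content", "data": {"content": "Soccer"}},
        {"stream_type": "content", "data": {"content": ", known"}},
    ]},
    {"message": [
        {"stream_type": "content", "data": {"content": " as football."}},
        {"stream_type": "content", "data": {"finish_reason": "stop"}},
    ], "status": "ProgressComplete"},
])

text, reason = drain_stream(lambda: next(_responses))
print(text)    # Soccer, known as football.
print(reason)  # stop
```

Separating the accumulation logic from the transport makes the stop condition (the presence of "finish_reason" in data) easy to exercise without a live server.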

2) Tool calling streaming
