[EVAL] Long Horizon Execution#1074

Closed
akshathmangudi wants to merge 11 commits into huggingface:main from akshathmangudi:akshath/issue-1056-v2

Conversation

@akshathmangudi
Contributor

I screwed up my previous git clone, so I had to redo the changes 😅

Description:
Approach described within #1056.

Tasks:

  • Initial scaffolding of /tasks/tasks/long_horizon_execution.py
  • Implement a custom scorer to parse <answer> tags.
  • Complete implementation of /tasks/tasks/long_horizon_execution.py
  • Evaluation and Testing

STATUS: ready for review.

Current behavior:

When we run `lighteval tasks inspect long_horizon_execution`, the output is shown below:

... more lines
           "'basic', 'alive', 'cream', 'dress', 'black', 'brown', 'drama', "
           "'black', 'audio', 'brown', 'album', 'cover', 'avoid', 'aware', "
           "'event', 'dream', 'clean', 'clock', 'apple', 'above', 'close', "
           "'begin', 'allow', 'album', 'draft', 'brain', 'civil', 'faith', "
           "'death', 'coach', 'below', 'doubt', 'aware', 'cover', 'final', "
           "'allow', 'avoid', 'ahead', 'cross', 'child', 'cream', 'error', "
           "'break', 'brief', 'clock', 'final', 'dance', 'award', 'every', "
           "'chief', 'could', 'dream', 'begin', 'burst', 'audio', 'album', "
           "'cross', 'doubt', 'blood', 'child', 'brand', 'brand', 'extra', "
           "'broad', 'cloud', 'check', 'after', 'chart', 'basic', 'child', "
           "'coach', 'chair', 'faith', 'earth', 'audio', 'basic', 'field', "
           "'cloud', 'draft', 'apply', 'court', 'black', 'ahead', 'burst', "
           "'crowd', 'depth', 'enemy', 'drink', 'first', 'could', 'false', "
           "'could', 'blame', 'first', 'album', 'crowd', 'first', 'broad', "
           "'extra', 'clock', 'chart', 'fiber', 'board', 'earth', 'being', "
           "'alive', 'chart', 'avoid', 'dress', 'cloud', 'clean', 'avoid', "
           "'crash', 'clean', 'arise', 'death', 'brand', 'error']\n"
           '\n'
           'Your task: Calculate the cumulative sum after each key. The first '
           'sum is just the value of the first key. The second sum is the '
           'first value plus the second value, and so on.\n'
           '\n'
           'IMPORTANT:\n'
           '- Output your answer as a single line with comma-separated values '
           'inside <answer></answer> tags\n'
           '- Do not include any other text outside the answer tags\n'
           '- Format: <answer>value1,value2,value3,...</answer>\n'
           '- Example: If the cumulative sums are [5, 8, 12], output: '
           '<answer>5,8,12</answer>\n'
           '\n'
           'Your answer:',
  'sampling_methods': [],
  'specific': None,
  'stop_sequences': (),
  'task_name': 'long_horizon_execution',
  'unconditioned_query': None,
  'use_logits': False}

@akshathmangudi akshathmangudi marked this pull request as ready for review November 21, 2025 10:59
@akshathmangudi
Contributor Author

cc: @NathanHB

@NathanHB
Member

Looking good! Will run locally and review today or start of next week :)
Can you share a Hugging Face Space with the samples as described here to make it easier to verify? 🤗

@akshathmangudi
Contributor Author

I ran the benchmark on HF Inference's gpt-4o, but a lot of the results I am seeing are quite poor. Is this expected, or is something wrong with the prompting that I haven't looked at yet?

https://huggingface.co/spaces/akshathmangudi/lhe-gpt4o-single

Member

@NathanHB NathanHB left a comment


Hey! Thanks for the hard work on this, I'm testing it locally right now. I have some small nits, but it's looking almost ready!

Comment thread src/lighteval/tasks/tasks/long_horizon_execution/__init__.py Outdated
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/single_turn.py Outdated
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/constants.py
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/constants.py
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py Outdated
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py Outdated
Member

@NathanHB NathanHB left a comment


Tested on single turn, working great with the few nits I added above. However, I can't seem to make the multi-turn work; can you ping me when it's ready?

@akshathmangudi
Copy link
Copy Markdown
Contributor Author

@NathanHB it should be working now. I've created a link below that tests both single and multi-turn.

https://huggingface.co/spaces/akshathmangudi/lhe-gpt

@NathanHB
Member

NathanHB commented Dec 4, 2025

hey @akshathmangudi that's amazing !!
The link seems broken, or maybe the dataset is private? :)

@HuggingFaceDocBuilderDev
Collaborator

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@akshathmangudi
Contributor Author

sorry! it was private. made it public now :)

@NathanHB
Member

NathanHB commented Dec 4, 2025

Great! Maybe I'm mistaken, but I only see the single-turn eval?

@NathanHB
Member

NathanHB commented Dec 9, 2025

hey @akshathmangudi we are planning a release this week and would love the tasks you started implementing to be in it. I was just wondering if you were planning on finishing those, or if I could take over? Thanks! 🤗

@akshathmangudi
Contributor Author

hey @NathanHB!

sorry, been traveling all week. I'll have some space today and tomorrow. Since a lot of the comments are nits and just things I accidentally overlooked (sorry for that), I'll get them ready ASAP!

@akshathmangudi
Contributor Author

https://huggingface.co/spaces/akshathmangudi/lhe-gpt

I've updated the space to have multi-turn evaluation. Please let me know if any changes have to be made 🤗

Copilot AI review requested due to automatic review settings December 9, 2025 15:44
Contributor

Copilot AI left a comment


Pull request overview

This PR implements the Long Horizon Execution benchmark for evaluating language models' ability to maintain state and perform cumulative operations over long sequences. The implementation follows a research paper approach with both single-turn (process all keys at once) and multi-turn (incremental key processing) evaluation modes.

Key Changes

  • Added complete task implementation with support for 7 context sizes (1024-65536) and 3 turn complexities (K=1, 2, 10)
  • Implemented custom answer tag parsing scorers for extracting <answer> formatted responses
  • Used binary search optimization to fit maximum items within prompt length constraints

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

  • src/lighteval/tasks/tasks/long_horizon_execution/constants.py — Defines prompt templates and configuration constants for context sizes and turn complexities
  • src/lighteval/tasks/tasks/long_horizon_execution/utils.py — Implements binary search logic and prompt-building functions for both single- and multi-turn modes
  • src/lighteval/tasks/tasks/long_horizon_execution/main.py — Provides the single-turn task implementation with scorer and creates task configurations
  • src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py — Implements multi-turn evaluation with conversation state tracking and fractional accuracy scoring
Comments suppressed due to low confidence (2)

src/lighteval/tasks/tasks/long_horizon_execution/utils.py:130

  • Surplus named argument for string format. An argument named 'num_keys' is provided, but it is not required by [format "You are an AI assistant. I will provide you with a dictionary and then give you keys in groups of {k}.
    Your task is to keep a running total (starting from 0) by adding the values associated with the keys I provide.
    In each turn, I'll provide {k} keys (comma-separated).
    Respond with the current running sum, enclosed in tags.

Dictionary to maintain:
{dict_str}

Ready to start!
User: {keys_str}
Assistant:"](1).

        return PROMPT_TEMPLATE_MULTI_START.format(
            dict_str=dict_str, keys_str=keys_str, k=k, num_keys=len(first_turn_keys)
        )

src/lighteval/tasks/tasks/long_horizon_execution/utils.py:194

  • Surplus named argument for string format. An argument named 'num_keys' is provided, but it is not required by [format "You are an AI assistant. I will provide you with a dictionary and then give you keys in groups of {k}.
    Your task is to keep a running total (starting from 0) by adding the values associated with the keys I provide.
    In each turn, I'll provide {k} keys (comma-separated).
    Respond with the current running sum, enclosed in tags.

Dictionary to maintain:
{dict_str}

Ready to start!
User: {keys_str}
Assistant:"](1).

    initial_prompt = PROMPT_TEMPLATE_MULTI_START.format(
        dict_str=dict_str, keys_str=first_turn_keys_str, k=k, num_keys=len(turn_chunks[0])
    )
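For context on these two findings: Python's `str.format` silently ignores surplus keyword arguments, so the extra `num_keys=...` does not raise at runtime; it is dead code that static analysis flags. A quick demonstration:

```python
template = "Keys come in groups of {k}."

# A surplus keyword argument is ignored, not an error:
result = template.format(k=2, num_keys=10)
print(result)  # → Keys come in groups of 2.

# A missing placeholder value, by contrast, does raise:
try:
    template.format(num_keys=10)
except KeyError as exc:
    print(f"KeyError: {exc}")  # → KeyError: 'k'
```

So the fix suggested by the review is simply to drop the unused `num_keys` argument from both `.format(...)` calls.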


Comment thread src/lighteval/tasks/tasks/long_horizon_execution/main.py Outdated
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/constants.py Outdated
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py Outdated
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/constants.py
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/constants.py
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/utils.py Outdated
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/utils.py Outdated
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py Outdated
Comment thread src/lighteval/tasks/tasks/long_horizon_execution/utils.py Outdated
@akshathmangudi
Contributor Author

it seems there are a few valid nits that Copilot has raised; will be fixing them in a few hours

@akshathmangudi
Contributor Author

hey @NathanHB, addressed almost all the comments and verified that the benchmark runs. let me know if there's anything else to address :)

@akshathmangudi
Contributor Author

hi @NathanHB, revisiting this for any comments before requesting to get this merged.

@NathanHB
Member

hey @akshathmangudi sorry for the wait!

I tested your branch and made some modifications to keep everything in one file. Are you ok with me forking this (keeping all the commits and credits) and merging the branch with the modifications?

  • basically, I changed the task to be only one task (multi-turn) where we can choose how many keys we get each turn

@akshathmangudi
Contributor Author

yes of course, not a problem! let me know if there's anything to do from my side :)

@NathanHB
Member

Nothing more! The PR is here: #1119
Does this fit your needs?

@akshathmangudi
Contributor Author

just reviewed the pr, sounds good!

@NathanHB NathanHB closed this Jan 13, 2026
