
[Question]: Mark action labels from N1 trajectory data: Migrate the N1 subset data for use in the CE subset to fine-tune System2 #243

@clairetsai1222


Question

Dear Authors,

Thank you for your excellent work on System2 and the InternData-N1 dataset! The research is very impressive and the codebase is well-structured.

I am attempting to fine-tune System2 using the N1 subset and have a few questions about chat template generation.
(I know that in the InternData design, the CE subset is the one intended for fine-tuning System2. However, the direction I want to optimise is better suited to the N1 subset, hence these questions. I'd be very grateful if you could kindly answer them.)

Question 1: Action Label Semantics in the CE Subset

In the CE subset's trajectory data, each frame contains an action field:

{"action":2,"timestamp":0.03333333507180214,"frame_index":1,"episode_index":0,"index":1,"task_index":0}

Could you clarify the exact semantic meaning of this action field?

  • Does action=2 (TURN_LEFT) represent the action executed at the previous frame that led to the current observation at frame_index=1?
  • Or does it represent the action that should be taken from the current frame to reach the next waypoint?
  • In other words: is this action the historical transition action (from frame_i-1 to frame_i) or the next action to execute (from frame_i to frame_i+1)?

This distinction is critical for understanding how to properly construct the training data where the model predicts action symbols like "↓", "↑", "←", "→".
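To make the question concrete: the two readings differ only by a one-frame shift of the label column. Below is a minimal illustration, assuming the per-frame records look like the JSON above; the pandas usage and everything beyond those column names is just my own illustration, not taken from the codebase.

# A small illustration of the two possible alignments of the `action` column.
# Column names come from the frame record quoted above; everything else here is
# illustrative and not from the released codebase.
import pandas as pd

frames = pd.DataFrame([
    {"action": 1, "frame_index": 0, "episode_index": 0},
    {"action": 2, "frame_index": 1, "episode_index": 0},
    {"action": 3, "frame_index": 2, "episode_index": 0},
])

# Reading A: action[i] is the historical transition that produced observation i
# (executed between frame i-1 and frame i). To supervise "predict the next
# action from the current observation", the labels would need a backward shift:
labels_if_historical = frames["action"].shift(-1)

# Reading B: action[i] is the action to execute from frame i onward, so the
# column can be used as the supervision target directly:
labels_if_next = frames["action"]

print(labels_if_historical.tolist())  # [2.0, 3.0, nan] (the last frame has no successor)
print(labels_if_next.tolist())        # [1, 2, 3]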

Question 2: Reproducing CE Action Labels for the N1 Subset

The N1 subset provides:

  • Task instructions (goal descriptions)
  • 3D trajectory positions (x, y, z)
  • Agent orientations (quaternions or Euler angles)
  • Timestamps
  • RGB observations

However, it lacks the explicit action labels required by the System2 training chat template.

Is there a recommended approach to derive action labels from the available N1 trajectory data?

For example, could we:

  • Calculate the relative orientation change between consecutive frames (using quaternion differences or Euler angle deltas) to determine "←", "→", and "↓"?
  • Compute the displacement vector between consecutive positions to identify "↑" actions?
  • Use timestamp differences to validate action transitions?
  • Apply threshold-based heuristics for action classification (a rough sketch of this idea follows below)?

Or are there any existing scripts or preprocessing pipelines in your codebase that perform this trajectory-to-action conversion that I might have missed?
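To make the threshold-based idea concrete, here is the rough heuristic I have in mind: classify each consecutive pose pair by its yaw change, pitch change, and displacement, using SciPy's Rotation for the quaternion handling. Every convention in it (the quaternion ordering, the sign of a left turn, the half-step thresholds derived from the Habitat 15°/0.25 m defaults) is my own assumption, not something taken from the codebase.

# A rough, threshold-based sketch of the heuristic described above, not the
# authors' pipeline. Coordinate conventions, thresholds (half of the 15 deg /
# 0.25 m Habitat defaults) and the (x, y, z, w) quaternion ordering are all
# assumptions that would need checking against the N1 data.
import numpy as np
from scipy.spatial.transform import Rotation as R

STOP, MOVE_FORWARD, TURN_LEFT, TURN_RIGHT, PITCH_DOWN = 0, 1, 2, 3, 5

def pose_to_yaw_pitch(quat_xyzw):
    """Extract (yaw, pitch) in radians from an (x, y, z, w) quaternion."""
    yaw, pitch, _ = R.from_quat(quat_xyzw).as_euler("zyx")
    return yaw, pitch

def classify_step(pos_a, pos_b, quat_a, quat_b,
                  turn_thresh=np.deg2rad(7.5), move_thresh=0.125,
                  pitch_thresh=np.deg2rad(7.5)):
    """Map one consecutive pose pair to a discrete action index."""
    yaw_a, pitch_a = pose_to_yaw_pitch(quat_a)
    yaw_b, pitch_b = pose_to_yaw_pitch(quat_b)
    # Wrap the yaw difference into [-pi, pi] before thresholding.
    dyaw = (yaw_b - yaw_a + np.pi) % (2 * np.pi) - np.pi
    dpitch = pitch_b - pitch_a
    dist = np.linalg.norm(np.asarray(pos_b) - np.asarray(pos_a))

    if dyaw > turn_thresh:          # assuming a positive yaw delta is a left turn
        return TURN_LEFT
    if dyaw < -turn_thresh:
        return TURN_RIGHT
    if dpitch < -pitch_thresh:      # assuming a negative pitch delta is "look down"
        return PITCH_DOWN
    if dist > move_thresh:
        return MOVE_FORWARD
    return STOP

# Example: a 15 degree yaw change with no translation classifies as TURN_LEFT.
q0 = R.from_euler("zyx", [0.0, 0.0, 0.0]).as_quat()
q1 = R.from_euler("zyx", [np.deg2rad(15.0), 0.0, 0.0]).as_quat()
print(classify_step([0, 0, 0], [0, 0, 0], q0, q1))  # -> 2 (TURN_LEFT)

Timestamp differences (item 3 above) could be layered on top, e.g. to skip frames where the pose barely changes, but I left that out to keep the sketch short.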

Personal Context and Technical Analysis

After analysing the codebase, I understand the System2 training chat template structure:

[
  [
    {
      'from': 'human',
      'value': (
        "You are an autonomous navigation assistant. Your task is to Find the kitchen. "
        "Where should you go next to stay on track? ... "
        "ahead of you is <image>."
      )
    },
    {
      'from': 'gpt',
      'value': "↓"  # Action prediction (PITCH_DOWN in this case)
    },
    {
      'from': 'human',
      'value': "in front of you is <image>."
    },
    {
      'from': 'gpt',
      'value': "0.45 0.32"  # 2D waypoint coordinates
    }
  ]
]

The action space mapping is:

idx2actions = {
    0: 'STOP',
    1: "↑",   # MOVE_FORWARD
    2: "←",   # TURN_LEFT
    3: "→",   # TURN_RIGHT
    5: "↓"    # PITCH_DOWN (pitch angle adjustment)
}

Apart from the additional PITCH_DOWN entry, this corresponds to the paper's description: "The action space adheres to Habitat's default VLN task configuration, comprising four discrete actions: MOVE_FORWARD (0.25m), TURN_LEFT (15°), TURN_RIGHT (15°), and STOP."
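For completeness, this is how I currently imagine assembling one multi-turn training sample from derived actions and waypoints, mirroring the template above. The helper and its field names (instruction, action, waypoint) are hypothetical and purely illustrative; only the prompt wording and idx2actions come from the template I pasted.

# Illustrative only: assembles a conversation in the format shown above from
# per-frame dicts. The helper and the `instruction` / `action` / `waypoint`
# field names are hypothetical, not part of the released codebase.
idx2actions = {0: "STOP", 1: "↑", 2: "←", 3: "→", 5: "↓"}

def build_conversation(instruction, frames):
    """frames: list of dicts holding a discrete `action` index and a 2D `waypoint`."""
    conv = [
        {
            "from": "human",
            "value": (
                f"You are an autonomous navigation assistant. Your task is to {instruction}. "
                "Where should you go next to stay on track? ... "
                "ahead of you is <image>."
            ),
        },
        # First assistant turn: the discrete action symbol.
        {"from": "gpt", "value": idx2actions[frames[0]["action"]]},
    ]
    for frame in frames[1:]:
        conv.append({"from": "human", "value": "in front of you is <image>."})
        x, y = frame["waypoint"]
        # Subsequent assistant turns: the 2D waypoint rendered as a string.
        conv.append({"from": "gpt", "value": f"{x:.2f} {y:.2f}"})
    return [conv]

sample = build_conversation(
    "Find the kitchen",
    [{"action": 5, "waypoint": (0.0, 0.0)}, {"action": 1, "waypoint": (0.45, 0.32)}],
)
print(sample[0][1]["value"])   # ↓
print(sample[0][-1]["value"])  # 0.45 0.32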

My Observations:

  • The CE subset contains shorter trajectories but higher sample density compared to the N1 subset
  • Task instructions and 2D waypoints (projected from 3D positions) are readily available in N1
  • The missing piece is the discrete action supervision signal needed for the intermediate action prediction step in the multi-turn dialogue

My Goal:
I want to leverage the N1 subset for System2 fine-tuning on vision-language navigation tasks, specifically to reproduce the training paradigm demonstrated with the CE subset. Understanding the action derivation methodology would enable me to preprocess the N1 data accordingly.

Any guidance on the action label generation process or pointers to relevant code sections would be greatly appreciated!

Thank you very much for your time and for contributing such valuable work.
