
Added Multi Trajectory context window generation #9

Merged
pietronvll merged 8 commits into Machine-Learning-Dynamical-Systems:dev from Danfoa:multi_trajectory
Apr 11, 2024

Conversation

Contributor

@Danfoa Danfoa commented Mar 27, 2024

I modified TrajectoryContextDataset to accept multiple trajectory data.

As the data stream remains agnostic to the dimensions of the features, I had to add a "multi_traj" boolean argument to the class's initializer to indicate whether the first dimension of the input array is the number of trajectories.

I suggest enforcing an expected input shape of (n_trajs, n_frames, *features_dims), which would avoid this new, unnecessary argument, and relying on the auxiliary methods traj_to_contexts and multi_traj_to_context to handle the more flexible shapes.

Comments: There seems to be an unnecessarily deep hierarchy of classes (TrajectoryContextDataset -> TensorContextDataset -> ContextWindowDataset -> ContextWindow -> Sequence) for handling the time-series data slicing. Several of these classes repeat the parsing of data into the appropriate backends and shapes, which makes the data pipeline quite cumbersome to modify for someone unfamiliar with the code. I propose reducing this hierarchy to one or two classes at most.

Danfoa added 2 commits March 27, 2024 21:15
- Modified `TrajectoryContextDataset` to accept multiple trajectory data.
@Danfoa Danfoa changed the title from "Multi trajectory" to "Added Multi Trajectory context window generation" Mar 28, 2024
Contributor Author

Danfoa commented Mar 28, 2024

It's worth noting that the HuggingFace Dataset class accepts array-like data with the shape (n_samples, context_len, *features), consisting of either np.ndarray or torch.Tensor entries, e.g. via the Dataset.from_generator method:

from datasets import Dataset
import numpy as np

def my_gen():
    for i in range(1, 4):
        # each example is a dict; values may be numpy arrays or torch tensors
        yield {"data": np.zeros((2, 3))}

dataset = Dataset.from_generator(my_gen)

This class seamlessly converts the data to the appropriate backend (numpy, torch, jax, TensorFlow) via the dataset.with_format(...) method:

from datasets import Dataset

data = [[1, 2], [3, 4]]
ds = Dataset.from_dict({"data": data})
ds = ds.with_format("tf")  # rows are now returned as TensorFlow tensors
ds[0]
ds[:2]

Therefore, Kooplearn could focus primarily on trajectory/context slicing, leaving backend management to the HuggingFace datasets library, which is already integrated within the framework.
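To illustrate the slicing that would remain Kooplearn's responsibility, here is a minimal numpy sketch. The function name traj_to_contexts mirrors the helper mentioned above, but this implementation is an assumption for illustration, not the library's actual code:

```python
import numpy as np

def traj_to_contexts(traj: np.ndarray, context_len: int) -> np.ndarray:
    """Slice one trajectory of shape (n_frames, *features) into overlapping
    context windows of shape (n_windows, context_len, *features)."""
    windows = np.lib.stride_tricks.sliding_window_view(traj, context_len, axis=0)
    # sliding_window_view appends the window axis last; move it after the batch axis
    return np.moveaxis(windows, -1, 1)
```

With a trajectory of 10 frames and context_len=4, this yields 7 windows, each a contiguous slice of 4 frames.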

raise ImportError(
    "You selected the 'torch' backend, but kooplearn wasn't able to import it."
)
if isinstance(data_per_traj[0], np.ndarray):
Contributor

By concatenating everything together, in the case of multi-trajectories it becomes hard to answer questions like "What was the n-th frame of the k-th trajectory?"

Of course, if every trajectory has the same length, this can be quickly recovered, but this might not always be the case (see my other comment).

The primary purpose of the idx_map, although the API does not exploit it yet, is to help the user evaluate multi-step forecasting errors. For a single trajectory, this can be done e.g. as

for t in times:
    ref_idxs = test_contexts.idx_map.data[:, 0, 0] + t
    Y_pred = model.predict(test_contexts, t=t)
    Y_true = test_contexts.data[ref_idxs]

Contributor Author

@Danfoa Danfoa commented Apr 2, 2024

It was unclear to me from the documentation and docstring what the role of idx_map was.

I now follow what you mean. Will update.

In the case of multi-trajectory then, idx_map will have an index per trajectory.
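In that direction, here is a minimal sketch of what a per-trajectory idx_map could record. The function and its layout are assumptions for illustration, not the actual kooplearn implementation:

```python
import numpy as np

def build_idx_map(traj_lengths: list[int], context_len: int) -> np.ndarray:
    """For each context window, record the (trajectory index, start frame)
    it was sliced from, so 'the n-th frame of the k-th trajectory' stays
    recoverable even when trajectories have different lengths."""
    entries = [
        (traj_idx, start)
        for traj_idx, n_frames in enumerate(traj_lengths)
        for start in range(n_frames - context_len + 1)
    ]
    return np.asarray(entries)  # shape: (n_windows, 2)
```

For two trajectories of lengths 5 and 3 with context_len=3, this yields windows (0, 0), (0, 1), (0, 2), and (1, 0), making frame provenance explicit even after concatenation.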

- After initializing TensorContextDataset, the `self.backend` attribute stores the desired backend, either torch or numpy.
- `__getitems__` is a method of Dataset classes that enables fast collection of samples already in batched form. As our datasets are assumed to be in memory, we can enable this fast indexing.
elif isinstance(idx, slice):
    return TensorContextDataset(self.data[idx])

def __getitems__(self, indices: list[int]) -> 'TensorContextDataset':
Contributor
Nice, I didn't know of this trick. As far as I can tell, however, it only works with the torch backend, right? I propose changing __getitem__ as follows:

def __getitem__(self, idx) -> 'TensorContextDataset':
    if np.issubdtype(type(idx), np.integer):
        # TODO: The default collate behaviour is to return a list of [type] objects.
        # This additional dimension seems to be introduced only for a very specific
        # custom collate_fn.
        return TensorContextDataset(self.data[idx][None, ...])
    elif isinstance(idx, slice):
        return TensorContextDataset(self.data[idx])
    # CHANGE STARTS HERE
    else:
        if self.backend == 'torch':
            return self.__getitems__(idx)
        else:
            return TensorContextDataset(self.data[idx])

Contributor Author

Yes, this is torch-specific. The __getitems__ method is called by the default data-fetching logic used by PyTorch DataLoaders.
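As a self-contained sketch of that mechanism (the class and data here are made up for illustration): since PyTorch 2.x, the default map-style fetcher checks for a __getitems__ method and, if present, fetches each whole batch in a single call instead of looping over indices one __getitem__ at a time.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class InMemoryDataset(Dataset):
    """Toy in-memory dataset exposing __getitems__ for batched fetching."""

    def __init__(self, data: torch.Tensor):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

    def __getitems__(self, indices):
        # one fancy-indexing call per batch instead of one __getitem__ per index
        return self.data[indices]

ds = InMemoryDataset(torch.arange(12).reshape(6, 2))
# identity collate_fn, since __getitems__ already returns a batched tensor
loader = DataLoader(ds, batch_size=3, collate_fn=lambda batch: batch)
```

Note the identity collate_fn: because __getitems__ already returns a batched tensor, the default collate (which expects a list of per-sample items) would mis-handle it.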

@pietronvll
Contributor

@Danfoa any updates on this pull request?

@pietronvll pietronvll merged commit 88b8a6a into Machine-Learning-Dynamical-Systems:dev Apr 11, 2024