
Non-deterministic data reading of image_data_layer in parallel training #4590

@AIROBOTAI

Description


Hi all, I have a question about the deterministic batch input of image_data_layer when doing parallel training. Suppose we have a dataset that contains only four batches, named A, B, C, and D, and we train with 4 solvers (S1, S2, S3, S4) on 4 GPUs. Suppose also that the dataset is not randomly shuffled during training. I have checked the implementation of BasePrefetchingDataLayer and found that it only guarantees that the solvers receive their input batches sequentially, not in a fixed order. So I wonder whether we may encounter the following problem: at the T-th iteration, the input batches for S1, S2, S3, S4 may be A, B, C, D, respectively, but at the next iteration it is quite possible that the batches for S1 through S4 become B, C, A, D or something else. Such non-deterministic behavior may be dangerous in some cases. Could anyone kindly tell me whether my doubt above is correct?
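To illustrate the concern, here is a minimal sketch (not Caffe's actual prefetching code; all names are invented for the example) of how a shared batch queue consumed by several solver threads leads to a batch-to-solver assignment that is not fixed:

```cpp
// Minimal sketch (not Caffe's actual prefetching code): four "solver" threads
// each take the next available batch from a shared queue. Which solver gets
// which batch depends on thread scheduling, so the assignment is not fixed
// and can change from run to run (or iteration to iteration).
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::queue<std::string> batch_queue;  // filled by the prefetcher with A, B, C, D
std::mutex mtx;

void solver(int id) {
  std::string batch;
  {
    std::lock_guard<std::mutex> lock(mtx);
    batch = batch_queue.front();  // the next batch, whichever solver arrives first
    batch_queue.pop();
  }
  std::cout << "Solver S" << id << " got batch " << batch << "\n";
}

int main() {
  for (const char* b : {"A", "B", "C", "D"}) batch_queue.push(b);
  std::vector<std::thread> solvers;
  for (int i = 1; i <= 4; ++i) solvers.emplace_back(solver, i);
  for (auto& t : solvers) t.join();
}
```

Running this twice can print different assignments (e.g. S1 gets B and S2 gets A on one run, the reverse on another), which is the kind of iteration-to-iteration variation described above.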

Besides, could anyone please explain why the declarations "using Params::size_; using Params::data_; using Params::diff_;" appear in the definitions of the GPUParams and P2PSync classes (defined in parallel.hpp)? To my understanding, using-declarations are generally used to solve the problem that base-class members are shadowed in the derived class, which does not seem to be the case for GPUParams and P2PSync. I therefore wonder whether these declarations are necessary.
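For reference, here is a minimal, simplified sketch of the pattern being asked about (GPUParamsSketch and bytes() are invented names; only the using-declarations mirror parallel.hpp). One thing that may be relevant, though I am not certain it is the actual motivation, is that Params<Dtype> is a dependent base of a class template, so unqualified uses of size_ / data_ / diff_ would not compile without either this-> or a using-declaration:

```cpp
// Minimal, simplified sketch of the pattern in question (GPUParamsSketch and
// bytes() are invented names; only the using-declarations mirror parallel.hpp).
// Params<Dtype> is a dependent base of the derived class template, so an
// unqualified "size_" is not found during template compilation unless it is
// brought into scope with a using-declaration (or written as this->size_).
#include <cstddef>

template <typename Dtype>
class Params {
 protected:
  std::size_t size_ = 0;    // size of the parameter buffers
  Dtype* data_ = nullptr;   // network parameters
  Dtype* diff_ = nullptr;   // gradients
};

template <typename Dtype>
class GPUParamsSketch : public Params<Dtype> {
 public:
  using Params<Dtype>::size_;
  using Params<Dtype>::data_;
  using Params<Dtype>::diff_;

  std::size_t bytes() const {
    // With the using-declaration, size_ can be referred to by its plain name;
    // otherwise this line would need this->size_ or Params<Dtype>::size_.
    return size_ * sizeof(Dtype);
  }
};

int main() {
  GPUParamsSketch<float> p;
  return static_cast<int>(p.bytes());  // 0 here; just exercises the sketch
}
```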

Thanks in advance!
