Skip to content

radix sort for quantile build? #5746

@okuvshynov

Description

@okuvshynov

One of the expensive operations in quantile build process is sorting of the batch of data, which happens here:

std::sort(queue.begin(), queue.begin() + qtail);

Sorting here sorts structures by float key.

While not as trivial as for ints, it's possible to do radix sort (< O(n log n) ) over floating point numbers. float_sort from boost is one such implementation, which uses hybrid approach (both radix and comparison-based): https://www.boost.org/doc/libs/1_67_0/libs/sort/doc/html/sort/single_thread/spreadsort/sort_hpp/float_sort.html

These are results from quick experiment with boost implementation (see https://github.com/okuvshynov/xgboost/commit/7d22b0bdcef92952d44c402512523e418ab3904d in my fork):

4 Core machine, higgs dataset, 256 bins:

std::sort:
[16:32:39] ======== Monitor: DenseCuts ========
[16:32:39] Build: 16.4768s, 1 calls @ 16476767us
[16:32:39] Init: 0.033793s, 1 calls @ 33793us

float_sort:
[16:30:14] ======== Monitor: DenseCuts ========
[16:30:14] Build: 12.3589s, 1 calls @ 12358875us
[16:30:14] Init: 0.063872s, 1 calls @ 63872us

12-core machine, higgs dataset, 256 bins:

std::sort:
[16:28:16] ======== Monitor: DenseCuts ========
[16:28:16] Build: 3.7118s, 1 calls @ 3711796us
[16:28:16] Init: 0.041176s, 1 calls @ 41176us

float_sort:
[16:25:47] ======== Monitor: DenseCuts ========
[16:25:47] Build: 2.83653s, 1 calls @ 2836527us
[16:25:47] Init: 0.040441s, 1 calls @ 40441us

This will depend on number of bins. With less bins, we'll have shorter vector to sort and win will become less pronounced. It should be also helpful for exact split finding method, but I haven't really tested that.

It looks nice, but, unless we want to bring boost as a dependency, it will require careful implementation of radix sort for floats, which would handle NaNs/positive/negative values correctly.

does this sound interesting?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions