Skip to content

Conversation

@Nuullll
Copy link
Contributor

@Nuullll Nuullll commented Aug 26, 2025

Introduce HostHalf wrapper class to eliminate explicit cl_half_from_float
and cl_half_to_float conversions throughout the test code. The wrapper
provides semantic value constructors/operators and automatic conversions,
simplifying half-precision arithmetic operations.

Key improvements:

  • HostHalf class with operator overloading for arithmetic and comparisons
  • Type traits is_host_atomic_fp_v and is_host_fp_v for generic FP handling
  • Unified floating-point atomic operations (add/sub/min/max/exchange)
  • Removed 300+ lines of half-specific conditional branches
  • Consistent calculation for all FP types

@bashbaug
Copy link
Contributor

Hi, I'm having a hard time connecting the dots together. Can you please provide a test command line that I can run that demonstrates the problem? Thanks!

@Nuullll
Copy link
Contributor Author

Nuullll commented Oct 30, 2025

Hi, I'm having a hard time connecting the dots together. Can you please provide a test command line that I can run that demonstrates the problem? Thanks!

Sorry for the late response - I was distracted by other work.

The issue can be reproduced by running:

./OpenCL-CTS/build/test_conformance/c11_atomics/test_c11_atomics svm_atomic_store

with Intel OpenCL CPU implementation (which has fp16 support enabled).

Root cause

In the OpenCL kernel:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void test_atomic_kernel(uint threadCount, uint numDestItems, volatile __global atomic_half *destMemory, __global half *oldValues)
{
  uint  tid = get_global_id(0);
  atomic_store_explicit(&destMemory[tid], tid, memory_order_relaxed, memory_scope_all_devices);
}

When uint tid is stored as a half value, the OpenCL device properly converts the integer to IEEE 754 half-precision floating point format. However, in the host reference code, the original implementation simply cast cl_uint directly to cl_half:

host_atomic_store(&destMemory[tid], (HostDataType)tid, MemoryOrder());

cl_half is defined as uint16_t in OpenCL Header, a direct cast from cl_uint to cl_half just truncates the integer bits, rather than performing proper float-to-half conversion.

@bashbaug
Copy link
Contributor

Ah, gotcha, thank you for the explanation!

FWIW, it was the static cast to a float in the snippet below that looked questionable (note: v is a uint)::

            // For half types, convert from float to proper half-precision bit
            // pattern
            return cl_half_from_float(static_cast<float>(v), gHalfRoundingMode);

But, it looks like this is matching what the kernel is doing (convert the uint to a half), so it's correct.

Interestingly, there does not appear to be a problem with our GPU device, which does not support SVM atomics but does support fp16 atomic load and store. I am running:

./test_conformance/c11_atomics/test_c11_atomics atomic_store

I'll add "focused review" and we'll see if we can get this merged next week.

bashbaug
bashbaug previously approved these changes Oct 30, 2025
@bashbaug
Copy link
Contributor

bashbaug commented Nov 4, 2025

Discussed in the November 4th teleconference. Will merge after @shajder 's review.

@Nuullll
Copy link
Contributor Author

Nuullll commented Nov 6, 2025

Interestingly, there does not appear to be a problem with our GPU device, which does not support SVM atomics but does support fp16 atomic load and store. I am running:

./test_conformance/c11_atomics/test_c11_atomics atomic_store

@bashbaug Thanks for pointing this out. atomic_store passes for both CPU and GPU, because by default it does not use HostFunction for verification:

virtual cl_uint MaxHostThreads()
{
if (UseSVM() || gHost)
return MAX_HOST_THREADS;
else
return 0;
}

hostThreadCount = MaxHostThreads();

if (hostThreadCount > 0)
ThreadPool_Do(HostThreadFunction, hostThreadCount,
&hostThreadContexts[0]);

static cl_int HostThreadFunction(cl_uint job_id, cl_uint thread_id,
void *userInfo)
{
THostThreadContext *threadContext =
((THostThreadContext *)userInfo) + job_id;
threadContext->test->HostFunction(
threadContext->tid, threadContext->threadCount,
threadContext->destMemory, threadContext->oldValues);
return 0;
}

virtual void HostFunction(cl_uint tid, cl_uint threadCount,
volatile HostAtomicType *destMemory,
HostDataType *oldValues)
{
host_atomic_store(&destMemory[tid], (HostDataType)tid, MemoryOrder());
}

The same issue can be exposed on CPU if we force using host threads for verification:

./test_conformance/c11_atomics/test_c11_atomics atomic_store -host

lakshmih
lakshmih previously approved these changes Nov 7, 2025
@Nuullll Nuullll requested a review from shajder November 12, 2025 05:32
@Nuullll Nuullll dismissed stale reviews from lakshmih and bashbaug via e861104 November 24, 2025 08:25
@Nuullll Nuullll force-pushed the review/Nuullll/fix-host-half-conversion branch from 93e0b57 to e861104 Compare November 24, 2025 08:25
@Nuullll Nuullll changed the title c11_atomics: Fix cl_uint --> cl_half conversion on host c11_atomics: unify host half representation and conversion with wrapper class Nov 24, 2025
@Nuullll
Copy link
Contributor Author

Nuullll commented Nov 24, 2025

I've refactored the host half handling with a new wrapper class HostHalf so that we don't need much extra caution for the half conversions everywhere.

@Nuullll Nuullll marked this pull request as draft November 25, 2025 05:45
@Nuullll Nuullll requested a review from shajder November 26, 2025 02:42
shajder
shajder previously approved these changes Nov 26, 2025
@Nuullll Nuullll marked this pull request as ready for review November 27, 2025 03:47
@bashbaug
Copy link
Contributor

bashbaug commented Dec 9, 2025

@Nuullll would you mind resolving the merge conflicts? I took a look and they don't look too bad, but they aren't quite something I'm comfortable resolving via the web UI. Thanks!

…er class

Introduce HostHalf wrapper class to eliminate explicit cl_half_from_float
and cl_half_to_float conversions throughout the test code. The wrapper
provides semantic value constructors/operators and automatic conversions,
simplifying half-precision arithmetic operations.

Key improvements:
- HostHalf class with operator overloading for arithmetic and comparisons
- Type traits is_host_atomic_fp_v and is_host_fp_v for generic FP handling
- Unified floating-point atomic operations (add/sub/min/max/exchange)
- Removed 300+ lines of half-specific conditional branches
- Consistent error tolerance calculation for all FP types

fix windows build

fix format

fix format

fix format
@Nuullll Nuullll force-pushed the review/Nuullll/fix-host-half-conversion branch from 4cc3244 to f6d3199 Compare December 10, 2025 02:18
@Nuullll Nuullll requested a review from shajder December 10, 2025 02:25
return std::isfinite(
static_cast<double>(val));
})
&& "Infinite subtraction value detected!");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{
double y = num - compensation;
double t = sum - y;
compensation = (t - sum) - y;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(t-sum) == -y, so compensation == -2y here, causing sum to explode and overflow to infinity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good catch but this is still not quite correct. It should be:

double y = - num - compensation;
double t = sum + y;
compensation = (t - sum) - y;

@Nuullll
Copy link
Contributor Author

Nuullll commented Dec 10, 2025

@shajder @bashbaug I've resolved the conflict and fixed a minor issue of #2368 , please take another look. Thanks!

lakshmih
lakshmih previously approved these changes Dec 10, 2025
{
double y = num - compensation;
double t = sum - y;
compensation = (t - sum) - y;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good catch but this is still not quite correct. It should be:

double y = - num - compensation;
double t = sum + y;
compensation = (t - sum) - y;

@shajder
Copy link
Contributor

shajder commented Dec 12, 2025

@Nuullll We would like to merge this PR shortly. I’d appreciate it if you could apply any upcoming corrections in separate PRs. Thanks!

@neiltrevett neiltrevett merged commit 119af24 into KhronosGroup:main Dec 16, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants