
Update setBit due to overflow in bit shift #623

Merged
TysonRayJones merged 3 commits into QuEST-Kit:devel from JPRichings:devel on May 23, 2025

Conversation

@JPRichings
Contributor

Hardware target: NVIDIA Grace Hopper, single node

Compile target: COMPILE_CUDA

CMake: cmake .. -D ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=90 -D CMAKE_INSTALL_PREFIX=/work/jriching/Quest/prefix -D USER_SOURCE=../../QFT/qft.cpp -D OUTPUT_EXE=qft

Code: QFT implementation from previous ARCHER2 benchmarking work, updated for QuEST v4.0

Error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  reduce failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)

compute-sanitizer output:

compute-sanitizer qft 32
Total number of gates: 528
Measured probability amplitude of |0..0> state: 2.32831e-10
Calculated probability amplitude of |0..0>, C0 = 1 / 2^32: 2.32831e-10
Measuring final state: (all probabilities should be 0.5)
========= Invalid __global__ read of size 16 bytes
=========     at void cub::CUB_200200_900_NS::DeviceReduceKernel<cub::CUB_200200_900_NS::DeviceReducePolicy<double, unsigned int, thrust::plus<double>>::Policy600, thrust::transform_iterator<functor_getAmpNorm, thrust::permutation_iterator<thrust::device_ptr<double2>, thrust::transform_iterator<functor_insertBits<(int)1>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>>, thrust::use_default, thrust::use_default>, unsigned int, thrust::plus<double>, double>(T2, T5 *, T3, cub::CUB_200200_900_NS::GridEvenShare<T3>, T4)+0x2e0
=========     by thread (96,0,0) in block (100,0,0)
=========     Address 0xffd240320600 is out of bounds
=========     and is 33,855,240,704 bytes before the nearest allocation at 0xffda22200000 of size 4 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2dabd0]
=========                in /lib/aarch64-linux-gnu/libcuda.so.1
=========     Host Frame:libcudart_static_4d8b33a106dceb3c07a56e26de61f2d53bb62a68 [0x408b68]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:cudaLaunchKernel [0x45b678]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:void cub::CUB_200200_900_NS::DeviceReduceKernel<cub::CUB_200200_900_NS::DeviceReducePolicy<double, unsigned int, thrust::plus<double> >::Policy600, thrust::transform_iterator<functor_getAmpNorm, thrust::permutation_iterator<thrust::device_ptr<double2>, thrust::transform_iterator<functor_insertBits<1>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default> >, thrust::use_default, thrust::use_default>, unsigned int, thrust::plus<double>, double>(thrust::transform_iterator<functor_getAmpNorm, thrust::permutation_iterator<thrust::device_ptr<double2>, thrust::transform_iterator<functor_insertBits<1>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default> >, thrust::use_default, thrust::use_default>, double*, unsigned int, cub::CUB_200200_900_NS::GridEvenShare<unsigned int>, thrust::plus<double>) in /work/shared/nvhpc/24.3/Linux_aarch64/24.3/cuda/12.3/include/cub/device/dispatch/dispatch_reduce.cuh:183 [0x3ff008]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:cub::CUB_200200_900_NS::DispatchReduce<thrust::transform_iterator<functor_getAmpNorm, thrust::permutation_iterator<thrust::device_ptr<double2>, thrust::transform_iterator<functor_insertBits<1>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default> >, thrust::use_default, thrust::use_default>, double*, unsigned int, thrust::plus<double>, double, double, cub::CUB_200200_900_NS::DeviceReducePolicy<double, unsigned int, thrust::plus<double> > >::Dispatch(void*, unsigned long&, thrust::transform_iterator<functor_getAmpNorm, thrust::permutation_iterator<thrust::device_ptr<double2>, thrust::transform_iterator<functor_insertBits<1>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default> >, thrust::use_default, thrust::use_default>, double*, unsigned int, thrust::plus<double>, double, CUstream_st*) in /work/shared/nvhpc/24.3/Linux_aarch64/24.3/cuda/12.3/include/cub/device/dispatch/dispatch_reduce.cuh:0 [0x3e0bd4]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:double thrust::cuda_cub::detail::reduce_n_impl<thrust::cuda_cub::tag, thrust::transform_iterator<functor_getAmpNorm, thrust::permutation_iterator<thrust::device_ptr<double2>, thrust::transform_iterator<functor_insertBits<1>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default> >, thrust::use_default, thrust::use_default>, long, double, thrust::plus<double> >(thrust::cuda_cub::execution_policy<thrust::cuda_cub::tag>&, thrust::transform_iterator<functor_getAmpNorm, thrust::permutation_iterator<thrust::device_ptr<double2>, thrust::transform_iterator<functor_insertBits<1>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default> >, thrust::use_default, thrust::use_default>, long, double, thrust::plus<double>) in /work/shared/nvhpc/24.3/Linux_aarch64/24.3/cuda/12.3/include/thrust/system/cuda/detail/reduce.h:975 [0x3f9520]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:double thrust_statevec_calcProbOfMultiQubitOutcome_sub<1>(Qureg, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >) in /work/jriching/Quest/QuEST/quest/src/gpu/gpu_thrust.cuh:81 [0x3ca86c]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:double gpu_statevec_calcProbOfMultiQubitOutcome_sub<1>(Qureg, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >) in /work/jriching/Quest/QuEST/quest/src/gpu/gpu_subroutines.cpp:1470 [0x39fcdc]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:accel_statevec_calcProbOfMultiQubitOutcome_sub(Qureg, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >) in /work/jriching/Quest/QuEST/quest/src/core/accelerator.cpp:874 [0xf7954]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:localiser_statevec_calcProbOfMultiQubitOutcome(Qureg, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >) in /work/jriching/Quest/QuEST/quest/src/core/localiser.cpp:0 [0x1159d8]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:calcProbOfMultiQubitOutcome in /work/jriching/Quest/QuEST/quest/src/api/calculations.cpp:259 [0xb4b44]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:calcProbOfQubitOutcome in /work/jriching/Quest/QuEST/quest/src/api/calculations.cpp:247 [0xb47c8]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:applyQubitMeasurementAndGetProb in /work/jriching/Quest/QuEST/quest/src/api/operations.cpp:1799 [0xdfb08]
=========                in /work/jriching/Quest/QuEST/build/libQuEST.so
=========     Host Frame:main in /work/jriching/Quest/QFT/qft.cpp:91 [0x15d0]
=========                in /work/jriching/Quest/QuEST/build/qft
=========     Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:74 [0x273fc]
=========                in /lib/aarch64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main in ../csu/libc-start.c:379 [0x274cc]
=========                in /lib/aarch64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x1170]
=========                in /work/jriching/Quest/QuEST/build/qft
=========

Output with additional printing from inside https://github.com/QuEST-Kit/QuEST/blob/53f8f3ad60e5b0171646ee8250c0e6ea65878b44/quest/src/gpu/gpu_thrust.cuh#L789C7-L789C54

after second calProbOfQubitOutcomeBefore first calProbOfQubitOutcomequbits
31
outcomes
0
valueMask: 0
qureg.logNumColsPerNode: 0
powerOf2(qureg.logNumColsPerNode - qubits.size()): -9223372036854775808
qureg.numAmpsPerNode: 4294967296
qubits.size(): 1
powerOf2(qubits.size()): 2
numIts: 2147483648
before prob.
Func end
Before second calProbOfQubitOutcomequbits
31
outcomes
1
valueMask: -2147483648
qureg.logNumColsPerNode: 0
powerOf2(qureg.logNumColsPerNode - qubits.size()): -9223372036854775808
qureg.numAmpsPerNode: 4294967296
qubits.size(): 1
powerOf2(qubits.size()): 2
numIts: 2147483648
before prob.
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  reduce failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)

With proposed change:

Total number of gates: 528
Measured probability amplitude of |0..0> state: 2.32831e-10
Calculated probability amplitude of |0..0>, C0 = 1 / 2^32: 2.32831e-10
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 0 with probability 0.5
Qubit 1 measured in state 1 with probability 0.5
Qubit 2 measured in state 0 with probability 0.5
Qubit 3 measured in state 0 with probability 0.5
Qubit 4 measured in state 1 with probability 0.5
Qubit 5 measured in state 0 with probability 0.5
Qubit 6 measured in state 1 with probability 0.5
Qubit 7 measured in state 1 with probability 0.5
Qubit 8 measured in state 1 with probability 0.5
Qubit 9 measured in state 0 with probability 0.5
Qubit 10 measured in state 0 with probability 0.5
Qubit 11 measured in state 1 with probability 0.5
Qubit 12 measured in state 1 with probability 0.5
Qubit 13 measured in state 1 with probability 0.5
Qubit 14 measured in state 1 with probability 0.5
Qubit 15 measured in state 1 with probability 0.5
Qubit 16 measured in state 0 with probability 0.5
Qubit 17 measured in state 0 with probability 0.5
Qubit 18 measured in state 0 with probability 0.5
Qubit 19 measured in state 0 with probability 0.5
Qubit 20 measured in state 1 with probability 0.5
Qubit 21 measured in state 1 with probability 0.5
Qubit 22 measured in state 1 with probability 0.5
Qubit 23 measured in state 1 with probability 0.5
Qubit 24 measured in state 1 with probability 0.5
Qubit 25 measured in state 1 with probability 0.5
Qubit 26 measured in state 0 with probability 0.5
Qubit 27 measured in state 1 with probability 0.5
Qubit 28 measured in state 0 with probability 0.5
Qubit 29 measured in state 1 with probability 0.5
Qubit 30 measured in state 1 with probability 0.5
Qubit 31 measured in state 0 with probability 0.5

Final state:
|01001011100111110000111111010110>
QFT run time: 11.8524s
Total run time: 15.4872s

@otbrown
Collaborator

otbrown commented May 23, 2025

I am reasonably confident that this might solve #618, as the issue occurs when the probability of qubit $n$ being in the state $|1\rangle$ is calculated for $n \geq 31$.

Bit strange that it doesn't happen on ARCHER2, but my best guess is that int maps to int64_t there!

@TysonRayJones
Member

TysonRayJones commented May 23, 2025

Aha, brilliant spot! The destination variable remains a qindex so it makes sense to type parameter bitValue as a qindex rather than an explicit unsigned long long (I've generally tried to use semantic types). But still, the type then no longer recognises/communicates that bitValue is strictly either 0 or 1.

How do you feel about instead using a cast in the body of setBit? We could even comment the pitfall:

INLINE qindex setBit(qindex number, int bitIndex, int bitValue) {
    
    // beware that shifting the raw int would overflow
    qindex bitInPlace = ((qindex) bitValue) << bitIndex;
    qindex oneInPlace = QINDEX_ONE << bitIndex;
    return (number & ~oneInPlace) | bitInPlace;
}

Any alternative to the C-style cast is of course fine too, like static_cast, QINDEX_ONE & bitValue, or just pre-assignment:

INLINE qindex setBit(qindex number, int bitIndex, int bitValue) {
    
    // beware that shifting the raw int would overflow
    qindex bitSafe = bitValue;
    qindex bitInPlace = bitSafe    << bitIndex;
    qindex oneInPlace = QINDEX_ONE << bitIndex;
    return (number & ~oneInPlace) | bitInPlace;
}

@otbrown That's a brilliant realisation! 🎉

@otbrown
Collaborator

otbrown commented May 23, 2025

Any of those works from my point of view! Although unsigned long long was chosen on the grounds that it's explicitly safe to bit shift, whereas qindex may not be.

Though now that I'm thinking about it, you're unlikely to be able to go above 32 qubits in that scenario anyway...

@JPRichings
Contributor Author

Thanks, I'll test the suggested changes now and update the PR. I'm happy with either. I agree that explicit use of unsigned long long is not ideal, but I also didn't want to confuse the semantics by using qindex.

@JPRichings
Contributor Author

Updated and retested.

@TysonRayJones
Member

Fair point about qindex not being self-evidently safe (could overflow, or be negative for which shifting is UB), though all of bitwise.hpp so far uses it. If one wanted to be really rigorous, we could change the backend to use its own unsigned type, reserving signed qindex just for the frontend (so we can still protect users from their own underflow errors). This is likely overkill since qindex overflows or negatives can only occur given another bug (e.g. validation failed to reject a negative qindex or an overflowingly-large Qureg creation). Or so I naively believe :^)

Btw my (albeit very incomplete) integration tests were insufficiently big to detect this bug (max = 14 qubit density matrix = 28 qubit statevector). I suppose it's crucial to test Quregs with 31-33 qubits to detect these potential overflows. At non-distributed single precision, 33 qubits requires 64 GiB, which can fit into the 32-core, 128 GiB RAM paid GitHub runner - I'll set this up when I get a sec. That doesn't alleviate the risk of CUDA-specific overflows; maybe one day the tests can be run on an 80 GiB H100 😌

@TysonRayJones
Member

Btw I tentatively added James as an external-collab to the authorlist, though of course he'll imminently be migrated to "current team" anyhow. We could update it all in a handover version/release.

@TysonRayJones TysonRayJones merged commit 49e5ff3 into QuEST-Kit:devel May 23, 2025
130 checks passed