Which component impacted?
Decode
Is it regression? Good in old configuration?
No, this issue has existed for a long time
What happened?
From intel/torchlib-xpu#35
The https://github.com/intel/torchlib-xpu project implements an Intel GPU plugin for the https://github.com/meta-pytorch/torchcodec project, which enables media codec support within the PyTorch ecosystem. The primary goal is to enable decoding and encoding on Intel GPU platforms. It is a requirement, however, to fall back to the CPU in cases where media HW acceleration is not available, for example on the PVC platform. That is where the issue was found, since PVC does not support hardware decoding and encoding.
In effect, the media driver crashes on PVC during initialization due to a use-after-free issue described in the intel/torchlib-xpu#35 (comment). Repeating the conclusion here:
- There is an attempt to dereference the m_osInterface pointer at media_interfaces.cpp:199
- Which was already freed at media_interfaces_pvc.cpp:342
- And which was initially allocated at media_interfaces.cpp:435
Thus the root cause is that deleterOnFailure = [&](bool deleteOsInterface, bool deleteMhwInterface) deletes objects in the wrong order. MhwInterface depends on OsInterface, so it must be destroyed before OsInterface; destroying it afterwards dereferences the already freed OsInterface. Reordering the deletions so that MhwInterface is destroyed first addresses the issue.
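To illustrate the ordering, here is a minimal standalone sketch. It is not the driver code: OsInterfaceSketch, MhwInterfacesSketch and TeardownOrder are hypothetical stand-ins, and only the deleterOnFailure lambda signature is taken from the analysis above. It shows the dependent object being destroyed before the object it points to.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the OS interface (not the real _MOS_INTERFACE).
struct OsInterfaceSketch {
    bool initialized = true;
};

// Hypothetical stand-in for MhwInterfaces: keeps a non-owning pointer to the
// OS interface and reads through it in Destroy().
struct MhwInterfacesSketch {
    OsInterfaceSketch *os;

    void Destroy(std::vector<std::string> &log) const {
        assert(os != nullptr);
        // This read is only valid while *os has not been freed yet.
        if (os->initialized) {
            log.push_back("mhw");
        }
    }
};

// Runs the failure-path cleanup and returns the order in which objects were
// torn down by the deleter.
std::vector<std::string> TeardownOrder(bool deleteOsInterface, bool deleteMhwInterface) {
    std::vector<std::string> order;
    auto *os = new OsInterfaceSketch{};
    auto *mhw = new MhwInterfacesSketch{os};

    auto deleterOnFailure = [&](bool delOs, bool delMhw) {
        // Dependent object first: Destroy() still dereferences os.
        if (delMhw && mhw != nullptr) {
            mhw->Destroy(order);
            delete mhw;
            mhw = nullptr;
        }
        // Only now is it safe to release the OS interface.
        if (delOs && os != nullptr) {
            order.push_back("os");
            delete os;
            os = nullptr;
        }
    };

    deleterOnFailure(deleteOsInterface, deleteMhwInterface);

    // Release whatever the flags left alone so the sketch itself does not leak.
    delete mhw;
    delete os;
    return order;
}
```

Calling TeardownOrder(true, true) records the MHW teardown before the OS interface teardown. With the two branches swapped, Destroy() would read freed memory, which is the pattern valgrind reports.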
The above analysis was done on intel-media-25.2.4, as that is the driver Intel supports for PVC. However, the same issue happens on the most recent master (838be24) as well.
What's the usage scenario when you are seeing the problem?
Video Analytics
What impacted?
No response
Debug Information
For reproduction steps and debug information, see intel/torchlib-xpu#35. The key debug information can be obtained using valgrind:
valgrind --leak-check=full --log-file=log.txt python -m pytest --basetemp=$HOME/tmp \
test/test_decoders.py::TestVideoDecoder::test_get_frame_played_at[exact-xpu]
The relevant part of the log is:
==3178123== Invalid read of size 8
==3178123== at 0x16222493D: MhwInterfaces::Destroy() [clone .part.0] (media_interfaces.cpp:199)
==3178123== by 0x1621AD58E: McpyDeviceXe_Xpm_Plus::Initialize(_MOS_INTERFACE*)::{lambda(bool, bool)#1}::operator()(bool, bool) const (media_interfaces_pvc.cpp:347)
==3178123== by 0x1621AD962: McpyDeviceXe_Xpm_Plus::Initialize(_MOS_INTERFACE*) (media_interfaces_pvc.cpp:366)
...
==3178123== Address 0x13ab997c8 is 1,656 bytes inside a block of size 63,616 free'd
==3178123== at 0x484988F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==3178123== by 0x1621AD577: McpyDeviceXe_Xpm_Plus::Initialize(_MOS_INTERFACE*)::{lambda(bool, bool)#1}::operator()(bool, bool) const (media_interfaces_pvc.cpp:342)
==3178123== by 0x1621AD962: McpyDeviceXe_Xpm_Plus::Initialize(_MOS_INTERFACE*) (media_interfaces_pvc.cpp:366)
...
==3178123== Block was alloc'd at
==3178123== at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==3178123== by 0x161FC2292: MosUtilities::MosAllocAndZeroMemory(unsigned long) (mos_utilities_next.cpp:308)
==3178123== by 0x162224275: McpyDevice::CreateFactory(_MOS_OS_CONTEXT*) (media_interfaces.cpp:435)
==3178123== by 0x161FB1F0F: MosMediaCopy::MosMediaCopy(_MOS_OS_CONTEXT*) (mos_mediacopy.cpp:37)
Do you want to contribute a patch to fix the issue?
Yes, I'm glad to submit a patch to fix it