[Bug]: driver segfaults on initialization on PVC #2001

@dvrogozh

Which component impacted?

Decode

Is it regression? Good in old configuration?

No, this issue has existed for a long time

What happened?

From intel/torchlib-xpu#35

The https://github.com/intel/torchlib-xpu project implements an Intel GPU plugin for the https://github.com/meta-pytorch/torchcodec project, which enables media codec support within the PyTorch ecosystem. The primary goal is to enable decoding and encoding on Intel GPU platforms. It is a requirement, however, to fall back to CPU in cases where media HW acceleration is not available, for example on the PVC platform. That is where the issue was found, since PVC does not support hardware decoding and encoding.

Effectively, the media driver crashes on PVC during initialization due to a use-after-free issue pointed out in the intel/torchlib-xpu#35 (comment) comment. Repeating the conclusion here:

  1. There is an attempt to dereference the m_osInterface pointer at media_interfaces.cpp:199
  2. That pointer was already freed at media_interfaces_pvc.cpp:342
  3. It was initially allocated at media_interfaces.cpp:435

Thus the root cause is that the deleterOnFailure = [&](bool deleteOsInterface, bool deleteMhwInterface) lambda deletes objects in the wrong order: it frees OsInterface first and then destroys MhwInterface. MhwInterface depends on OsInterface, so it must be destroyed first; otherwise its destruction dereferences the already freed OsInterface. Reversing the deletion order addresses the issue.
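The ordering problem described above can be illustrated with a minimal, self-contained sketch. This is not the driver's code: `OsInterface`, `MhwInterfaces`, `TeardownLog`, `CleanupOnFailure` and `CleanupOnFailureBuggy` are hypothetical stand-ins, and freeing is only simulated with an `alive` flag so the sketch never touches freed memory.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the driver's types; NOT the real media-driver API.
struct OsInterface {
    bool alive = true;  // false simulates "already freed" without real undefined behavior
};

struct MhwInterfaces {
    OsInterface *m_osInterface = nullptr;

    // Destroy() reads m_osInterface, mirroring the dereference that valgrind
    // reports at media_interfaces.cpp:199. Returns true if the OS interface
    // was still valid at that point.
    bool Destroy() const { return m_osInterface != nullptr && m_osInterface->alive; }
};

// Records the teardown order and whether Destroy() saw a live OS interface.
struct TeardownLog {
    std::vector<std::string> steps;
    bool destroySawLiveOs = false;
};

// Order reported in the issue: the OS interface is released first, then the
// dependent MHW interface is destroyed and reads the released object.
TeardownLog CleanupOnFailureBuggy(MhwInterfaces &mhw, OsInterface &os,
                                  bool deleteOsInterface, bool deleteMhwInterface)
{
    TeardownLog log;
    if (deleteOsInterface) {
        os.alive = false;  // stands in for freeing the OS interface
        log.steps.push_back("os");
    }
    if (deleteMhwInterface) {
        log.destroySawLiveOs = mhw.Destroy();
        log.steps.push_back("mhw");
    }
    return log;
}

// Corrected order: destroy the dependent object first, then what it depends on.
TeardownLog CleanupOnFailure(MhwInterfaces &mhw, OsInterface &os,
                             bool deleteOsInterface, bool deleteMhwInterface)
{
    TeardownLog log;
    if (deleteMhwInterface) {
        log.destroySawLiveOs = mhw.Destroy();
        log.steps.push_back("mhw");
    }
    if (deleteOsInterface) {
        os.alive = false;  // stands in for freeing the OS interface
        log.steps.push_back("os");
    }
    return log;
}
```

In this sketch, `CleanupOnFailure` lets `Destroy()` observe a live OS interface, while `CleanupOnFailureBuggy` does not; in the real driver the latter case corresponds to the invalid read reported by valgrind.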

The above analysis was done on intel-media-25.2.4, as that is the driver Intel supports for PVC. However, the same issue happens on the most recent master (838be24) as well.

What's the usage scenario when you are seeing the problem?

Video Analytics

What impacted?

No response

Debug Information

For reproduction steps and debug information, see intel/torchlib-xpu#35. The key debug information can be obtained using valgrind:

valgrind --leak-check=full --log-file=log.txt python -m pytest --basetemp=$HOME/tmp \
  test/test_decoders.py::TestVideoDecoder::test_get_frame_played_at[exact-xpu]

The relevant part of the log is:

==3178123== Invalid read of size 8
==3178123==    at 0x16222493D: MhwInterfaces::Destroy() [clone .part.0] (media_interfaces.cpp:199)
==3178123==    by 0x1621AD58E: McpyDeviceXe_Xpm_Plus::Initialize(_MOS_INTERFACE*)::{lambda(bool, bool)#1}::operator()(bool, bool) const (media_interfaces_pvc.cpp:347)
==3178123==    by 0x1621AD962: McpyDeviceXe_Xpm_Plus::Initialize(_MOS_INTERFACE*) (media_interfaces_pvc.cpp:366)
...
==3178123==  Address 0x13ab997c8 is 1,656 bytes inside a block of size 63,616 free'd
==3178123==    at 0x484988F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==3178123==    by 0x1621AD577: McpyDeviceXe_Xpm_Plus::Initialize(_MOS_INTERFACE*)::{lambda(bool, bool)#1}::operator()(bool, bool) const (media_interfaces_pvc.cpp:342)
==3178123==    by 0x1621AD962: McpyDeviceXe_Xpm_Plus::Initialize(_MOS_INTERFACE*) (media_interfaces_pvc.cpp:366)
...
==3178123==  Block was alloc'd at
==3178123==    at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==3178123==    by 0x161FC2292: MosUtilities::MosAllocAndZeroMemory(unsigned long) (mos_utilities_next.cpp:308)
==3178123==    by 0x162224275: McpyDevice::CreateFactory(_MOS_OS_CONTEXT*) (media_interfaces.cpp:435)
==3178123==    by 0x161FB1F0F: MosMediaCopy::MosMediaCopy(_MOS_OS_CONTEXT*) (mos_mediacopy.cpp:37)

Do you want to contribute a patch to fix the issue?

Yes, I'm glad to submit a patch to fix it
