Spark executors failing occasionally on SIGSEGV #1714

@mixermt

Description

Hi,

We are experiencing occasional failures of Spark executors:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f079663f84e, pid=18, tid=0x00007f07347ff700
#
# JRE version: OpenJDK Runtime Environment (Zulu 8.74.0.17-CA-linux64) (8.0_392-b08) (build 1.8.0_392-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.392-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x8a584e]  MallocSiteTable::malloc_site(unsigned long, unsigned long)+0xe
#
# Core dump written. Default location: /opt/spark/work-dir/core or core.18
#
# An error report file with more information is saved as:
# /opt/spark/work-dir/hs_err_pid18.log
[thread 139669100558080 also had an error]
[thread 139669096355584 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://www.azul.com/support/
#

From the Spark UI:

ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: 
The executor with id 7 exited with exit code -1(unexpected).

The API gave the following container statuses:
	 container name: spark-executor
	 container image: OUR_SPARK_DOCKER_IMAGE 
	 container state: terminated
	 container started at: 2025-05-04T12:16:23Z
	 container finished at: 2025-05-04T12:17:14Z
	 exit code: 134
	 termination reason: Error
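
Note that container exit code 134 is 128 + 6, i.e. the container died on SIGABRT, which is consistent with the JVM aborting itself after writing the error report. A quick sanity check of the arithmetic (the `kill -l` builtin maps a signal number to its name):

```shell
# Container exit codes above 128 encode "killed by signal (code - 128)".
# 134 - 128 = 6, and signal 6 is abort:
kill -l $((134 - 128))
```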

At first this looked like an OOM, but when I checked the memory graphs of the pods, none of them reached even half of the requested memory.
After a number of retries the job succeeded, once the work was rescheduled onto another executor.
The input and shuffle bytes are really small compared to the allocated executor memory and off-heap memory (50g and 30g).


Our env:
Spark 3.5.4 - Comet version 0.8.0
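
For context, the problematic frame (`MallocSiteTable::malloc_site`) belongs to the JVM's Native Memory Tracking (NMT) code, so one thing we are considering is checking whether NMT is enabled for the executors and turning it off. A minimal sketch, assuming executor JVM flags are passed through `spark.executor.extraJavaOptions` (the `-XX:NativeMemoryTracking` flag is standard HotSpot; how our launcher actually sets flags is an assumption):

```shell
# Hypothetical workaround sketch: disable NMT on executors, assuming the
# crash is triggered inside NMT's malloc-site bookkeeping. Adjust to however
# the job is actually submitted in our setup.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:NativeMemoryTracking=off" \
  ...
```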

Any ideas?
