Possible memory leak: job is killed by signal 9 after several SCF steps #2935

@QG-phy

Description

Describe the bug

When running ABACUS jobs, I found that some of them were killed by signal 9 (SIGKILL) after many SCF steps.

The error message goes like this:
[image: screenshot of the error message]

Because the jobs only die after many SCF steps, and a signal 9 late in a run is often the kernel's out-of-memory killer, I suspect there may be a memory leak.
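One way to check this suspicion would be to log the total resident memory of the ABACUS processes while the job runs and see whether it keeps growing across SCF steps. The following is only a minimal sketch, assuming a Linux environment with procps (`ps`, `pgrep`) and that the processes are named `abacus`; the function names `total_rss_kib`, `log_rss`, and `rss_growth_kib` and the log file name are hypothetical.

```shell
#!/bin/bash
# Sum the resident set size (RSS, in KiB) of all processes with a given name.
total_rss_kib() {
    local name="$1"
    # ps prints one RSS value per matching process; awk adds them up
    # and prints 0 when nothing matches.
    ps -C "$name" -o rss= 2>/dev/null | awk '{ sum += $1 } END { print sum + 0 }'
}

# Append "elapsed_seconds total_rss_kib" to a log file for as long as
# at least one process with the given name is alive.
log_rss() {
    local name="$1" logfile="$2" interval="${3:-10}"
    local start rss
    start=$(date +%s)
    while pgrep -x "$name" > /dev/null; do
        rss=$(total_rss_kib "$name")
        echo "$(( $(date +%s) - start )) $rss" >> "$logfile"
        sleep "$interval"
    done
}

# Print the difference between the last and the first RSS sample in a log.
rss_growth_kib() {
    awk 'NR == 1 { first = $2 } { last = $2 }
         END { if (NR > 0) print last - first; else print 0 }' "$1"
}
```

For example, after starting `mpirun -np 16 abacus &` in the job script, calling `log_rss abacus rss.log 10` would sample every 10 seconds until the run ends, and `rss_growth_kib rss.log` would then show how much the total RSS changed. A value that rises steadily with the SCF step count would support the leak hypothesis, while a flat profile would point elsewhere.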

Expected behavior

Figure out the cause and fix it, if possible.

To Reproduce

InSb.tar.gz

Environment

env and image:

registry.dp.tech/deepmodeling/abacus-intel:latest

machine:

"bohrium": {
    "scass_type": "c32_m128_cpu",
    "job_type": "container",
    "platform": "ali"
},

command:

#!/bin/bash
source /opt/intel/oneapi/setvars.sh
export OMP_NUM_THREADS=1
cp ./scf/* ./
mpirun -np 16 abacus
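
To tell whether the signal 9 came from the kernel's out-of-memory killer rather than from something else, the kernel log could be searched once the job has died. This is a rough sketch only: it assumes `dmesg` is readable from inside the container (often it is not), the helper names `filter_oom_messages` and `report_oom_kills` are hypothetical, and the search patterns are a guess at typical kernel wording rather than an exhaustive list.

```shell
#!/bin/bash
# Keep only lines from stdin that look like OOM-killer messages.
# "|| true" keeps the exit status at 0 when grep finds nothing.
filter_oom_messages() {
    grep -iE 'out of memory|oom-kill|oom_reaper' || true
}

# Print OOM-killer messages from the kernel log, or a note if none are found.
report_oom_kills() {
    local hits
    hits=$(dmesg 2>/dev/null | filter_oom_messages)
    if [ -n "$hits" ]; then
        echo "$hits"
    else
        echo "no OOM-killer messages found (or kernel log not readable)"
    fi
}
```

Calling `report_oom_kills` right after the `mpirun` line would print any matching kernel messages. If none show up, the platform's own job log may still record why the container was killed.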

Additional Context

No response

Task list for Issue attackers

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).

Metadata

Labels

Bugs: bugs that are only solvable with sufficient knowledge of DFT
