[POWER9] OCC locks up over time if power saving modes enabled #26

@madscientist159

After a random amount of time, the OCC locks up and downclocks the host if the default Linux power saving modes are enabled. When this happens, the machine is slowed down and fan control is completely disabled; a full host power-off and reboot is required to restore functionality. Running opal-prd occ reset can sometimes restore proper clocks without a reboot, but fan control remains offline until a full reboot.

Host dmesg:

[32809.485411] powernv-cpufreq: PMSR = 6363630040000000
[32809.520296] powernv-cpufreq: CPU Frequency could be throttled
[32844.323314] powernv-cpufreq: PMSR =  12d630000000000
[32844.356229] powernv-cpufreq: CPU Frequency could be throttled

BMC dmesg:

[390435.513335] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[390435.521174] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a
[390436.569249] sbefifo 00:00:00:06: SBE error cmd a4:04 status=00fe:000a
[390436.575975] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[390436.583663] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a
[390436.706403] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390436.713044] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390436.720885] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a
[390437.780615] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390437.787314] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390437.795153] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a
[390441.406828] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390441.413412] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390441.421240] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a
[390442.895970] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390442.902534] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390442.910341] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a

The following command, run at host start, completely prevents the issue from occurring, but is undesirable as the machines use significantly more power at idle:

echo 1 | tee /sys/devices/system/cpu/cpu*/cpuidle/state?/disable
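
For anyone who wants to toggle the workaround rather than apply it once, here is a rough sketch of the same sysfs write wrapped in a function. The function name set_idle_states and its optional second argument (an alternate root directory, useful for a dry run against a scratch tree) are my own additions for illustration, not part of any existing tool. It uses the same cpu*/cpuidle/state?/disable glob as the command above and prints how many flags it wrote.

```shell
#!/bin/sh
# Sketch only: write VALUE to the "disable" flag of every cpuidle state.
# set_idle_states is a hypothetical helper name, not an existing utility.
#   $1: 1 disables all idle states, 0 re-enables them
#   $2: optional root directory (assumed default: /sys/devices/system/cpu)
set_idle_states() {
    value="$1"
    root="${2:-/sys/devices/system/cpu}"
    count=0
    for f in "$root"/cpu*/cpuidle/state?/disable; do
        # Skip the literal pattern if the glob matched nothing.
        [ -e "$f" ] || continue
        if echo "$value" > "$f"; then
            count=$((count + 1))
        fi
    done
    # Report the number of flags written.
    echo "$count"
}
```

Run as root, set_idle_states 1 should have the same effect as the one-liner above, and set_idle_states 0 should re-enable the idle states without a reboot, assuming nothing else has changed them in the meantime.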

Host Linux version: 4.19.0-5-powerpc64le
OCC GIT hash: 58e422d
