@nuclearcat
Member

@nuclearcat nuclearcat commented Oct 3, 2022

After analysing Jenkins job runs, we identified a few weak spots where a job might fail early even though it is still possible to recover and continue.
These patches address them in an attempt to improve the job execution success rate.

Fixes #1451
Fixes kernelci/kernelci-project#124
Fixes #1461
Fixes #1462

@nuclearcat nuclearcat changed the title Improve reliability patches Improving jobs reliability Oct 3, 2022
@nuclearcat nuclearcat force-pushed the IMPROVE-reliability-patches branch 5 times, most recently from cea8bdc to 7eb3d08 Compare October 4, 2022 09:49
@nuclearcat
Member Author

nuclearcat commented Oct 5, 2022

The commit "build.py: Add retry to _download_file" works:

+ cd /scratch
+ set +x
+ kci_build pull_tarball --url http://storage.staging.kernelci.org/kernelci/staging-next/staging-next-20221005.0/linux-src_kernelci_staging-next.tar.gz --retries 3 --delete
_download_file exception HTTPConnectionPool(host='storage.staging.kernelci.org', port=80): Max retries exceeded with url: /kernelci/staging-next/staging-next-20221005.0/linux-src_kernelci_staging-next.tar.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbad9b0a5e0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')), might retry
_download_file exception HTTPConnectionPool(host='storage.staging.kernelci.org', port=80): Max retries exceeded with url: /kernelci/staging-next/staging-next-20221005.0/linux-src_kernelci_staging-next.tar.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbad9b0a0d0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')), might retry
+ kci_build generate_fragments --build-config=kernelci_staging-next
kernel/configs/kselftest.config
kernel/configs/crypto.config

The build then completed successfully.

@nuclearcat nuclearcat force-pushed the IMPROVE-reliability-patches branch 2 times, most recently from ec1598a to 6fae1c0 Compare October 10, 2022 12:27
@mgalka mgalka left a comment

Some remarks

In some cases reading the pod log might fail even though the pod
built successfully. As we verify the existence of the build files,
the log retrieval state is not a reason to invalidate the k8s job.

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
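The idea in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: `job_result`, `read_pod_log` and `expected_files` are hypothetical names, and the real check for build files may look quite different.

```python
import os


def job_result(read_pod_log, expected_files):
    """Return (ok, log); ok depends only on the build files being present.

    read_pod_log and expected_files are hypothetical stand-ins for however
    the real job reads the pod log and knows which files a build produces.
    """
    try:
        log = read_pod_log()
    except Exception as exc:  # log retrieval is treated as best effort
        print(f"pod log read failed: {exc!r}, continuing")
        log = None
    # The verdict comes from the artifacts, not from the log retrieval state.
    ok = all(os.path.isfile(path) for path in expected_files)
    return ok, log
```

With this shape, a failed log read yields `log = None` but leaves the success verdict to the file check.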
One of the most common failures is a 401 error from the
core.list_namespaced_pod function. This error happens only on GKE, and when
it occurs, many jobs often fail at the same time.
Adding several retries with a config reload might fix this issue.

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
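A minimal sketch of the "retry with config reload" approach described above, written so it runs without a cluster. The helper name `call_with_reload` and its parameters are assumptions for illustration. In real use, `make_call` would wrap `core.list_namespaced_pod(...)` and `reload_config` would wrap something like `kubernetes.config.load_kube_config()`; the patch itself may be structured differently.

```python
import time


def call_with_reload(make_call, reload_config, retries=3, delay=1.0):
    """Call make_call(); on an HTTP 401, reload the config and try again.

    Hypothetical helper: make_call and reload_config are injected callables.
    """
    last_exc = None
    for attempt in range(1, retries + 1):
        try:
            return make_call()
        except Exception as exc:
            # The Kubernetes client's ApiException carries the HTTP code in
            # .status; anything that is not a 401 is re-raised unchanged.
            if getattr(exc, "status", None) != 401:
                raise
            last_exc = exc
            if attempt < retries:
                reload_config()  # refresh credentials before the next try
                time.sleep(delay)
    raise last_exc
```

Only 401 responses are retried here, so unrelated API errors still surface immediately.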
One of the most common failures: _download_file attempts to fetch a file,
gets an exception and fails as a result.
This needs proper logic to handle the exception and retry.

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
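The handle-and-retry logic described here could look roughly like the sketch below. It is not the actual `_download_file` from build.py: the function name `download_file`, the `opener` parameter and the fixed delay are assumptions, and the standard library's `urllib.request` is used in place of whatever HTTP library the real code relies on.

```python
import time
import urllib.request


def download_file(url, dest_path, retries=3, delay=1.0,
                  opener=urllib.request.urlopen):
    """Fetch url into dest_path; return True on success, False otherwise.

    Hypothetical sketch: opener is injectable so the retry logic can be
    exercised without network access.
    """
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as resp, open(dest_path, "wb") as out:
                out.write(resp.read())
            return True
        except OSError as exc:  # urllib.error.URLError subclasses OSError
            print(f"download_file exception {exc}, might retry")
            if attempt < retries:
                time.sleep(delay)
    return False
```

Returning `False` after the last attempt leaves the decision about failing the job to the caller.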
We already have retry code in wait.py,
but it does not handle the urllib3.exceptions.MaxRetryError
exception. This patch adds proper handling for it.

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
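As a rough illustration of a wait loop that tolerates a transient exception instead of aborting, consider the sketch below. It does not reproduce wait.py: `wait_for_job`, `get_status` and the error budget are hypothetical, and the default exception type is the built-in `ConnectionError` so the example runs without urllib3 installed.

```python
import time


def wait_for_job(get_status, transient_errors=(ConnectionError,),
                 max_errors=5, poll_interval=1.0):
    """Poll get_status() until it returns a non-None result.

    Exceptions listed in transient_errors are counted and retried; any
    other exception propagates immediately. All names are hypothetical.
    """
    errors = 0
    while True:
        try:
            status = get_status()
        except transient_errors as exc:
            errors += 1
            print(f"transient error {exc!r} ({errors}/{max_errors})")
            if errors >= max_errors:
                raise
        else:
            errors = 0  # a successful poll resets the error budget
            if status is not None:
                return status
        time.sleep(poll_interval)
```

To cover the case from the commit message, a caller would pass `transient_errors=(urllib3.exceptions.MaxRetryError,)`.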
@nuclearcat nuclearcat force-pushed the IMPROVE-reliability-patches branch from 6fae1c0 to 8246a4b Compare October 11, 2022 07:12
@nuclearcat nuclearcat marked this pull request as ready for review October 11, 2022 15:32
@mgalka mgalka merged commit 61f7dc0 into kernelci:main Oct 11, 2022