Skip to content

Johnsonshi/refactor azure report ready codepath#468

Merged
OddBloke merged 30 commits into
canonical:masterfrom
johnsonshi:johnsonshi/refactor-azure-report-ready-codepath
Aug 13, 2020
Merged

Johnsonshi/refactor azure report ready codepath#468
OddBloke merged 30 commits into
canonical:masterfrom
johnsonshi:johnsonshi/refactor-azure-report-ready-codepath

Conversation

@johnsonshi
Copy link
Copy Markdown
Contributor

Refactor Azure report ready and Goal State code path.

@johnsonshi
Copy link
Copy Markdown
Contributor Author

Hi @OddBloke, as discussed the closed PR #422, I have removed the diagnostic/error logging changes to the PR. The code changes now solely focus on refactoring the Azure report ready codepath and Goal State related code. Thanks for the suggestions!

@johnsonshi
Copy link
Copy Markdown
Contributor Author

Adding @anhvoms as well for visibility.

@OddBloke OddBloke self-assigned this Jul 3, 2020
@OddBloke OddBloke self-requested a review July 3, 2020 13:46
Copy link
Copy Markdown
Collaborator

@OddBloke OddBloke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @johnsonshi, thanks for this! This was much easier to review and, at a high-level, it looks good. I have a bunch of inline comments: most of these are minor (including a few suggested type annotations).

I have one high-level comment on the tests: per our Unit Testing docs:

Variables/parameter names for Mock or MagicMock instances should start with m_ to clearly distinguish them from non-mock variables

Could you go through and ensure that this naming convention is being followed for the new tests you are adding here?

Thanks again!

Comment thread cloudinit/sources/helpers/azure.py Outdated
Comment thread cloudinit/sources/helpers/azure.py Outdated
Comment thread cloudinit/sources/helpers/azure.py Outdated
Comment thread cloudinit/sources/helpers/azure.py Outdated
Comment thread cloudinit/sources/helpers/azure.py
Comment thread cloudinit/sources/helpers/azure.py
Comment thread cloudinit/sources/helpers/azure.py Outdated
Comment thread tests/unittests/test_datasource/test_azure_helper.py Outdated
Comment thread tests/unittests/test_datasource/test_azure_helper.py Outdated
Comment thread tests/unittests/test_datasource/test_azure_helper.py Outdated
@johnsonshi
Copy link
Copy Markdown
Contributor Author

Hi @OddBloke, thanks for the very detailed and thorough review! I have followed the mock test naming conventions you mentioned earlier, your type annotation PR comments, refactoring AzureEndpointHttpClient to be in self (instead of getting passed around), and a bunch of other PR comments. Let me know if you need anything else.

@OddBloke
Copy link
Copy Markdown
Collaborator

Hey @johnsonshi, thanks for the update! I was off at the end of last week, so I'm playing catch-up; I'm expecting to get to a re-review of this tomorrow.

Copy link
Copy Markdown
Collaborator

@OddBloke OddBloke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good now, @johnsonshi, thanks for your work! I have a few inline comments: the two most important are a concern remaining around flushing, and around some testing. But we're very close!

Comment thread tools/.github-cla-signers
Comment thread cloudinit/sources/helpers/azure.py
Comment thread cloudinit/sources/helpers/azure.py
Comment thread cloudinit/sources/helpers/azure.py Outdated
Comment thread cloudinit/sources/helpers/azure.py Outdated
Comment thread tests/unittests/test_datasource/test_azure_helper.py Outdated
Comment thread tests/unittests/test_datasource/test_azure_helper.py
Comment thread tests/unittests/test_datasource/test_azure_helper.py Outdated
@johnsonshi
Copy link
Copy Markdown
Contributor Author

Looking at the Hyper-V KVP Reporter's threaded publisher subroutine (here and here), we see that it calls _append_kvp_item in order to write encoded_data (encoded kvp diagnostic messages) to the kvp pool file.

We see that _append_kvp_item does hold an exclusive lock and flushes output to the file, so we can be sure that diagnostic messages are indeed flushed to the Hyper-V kvp file.

We also see that HyperVKvpReportingHandler.flush() does block the main thread until the publisher thread completely empties queued kvp diagnostic messages.

You are right that time.sleep(0) is not needed since (1) the threaded publishing subroutine flushes output when it writes and (2) the main thread blocks until threaded publisher empties and writes all queued events. Thus we can be sure that all queued events are written up to this point.

In fact, it seems like HyperVKvpReportingHandler.flush() is a reliable method to ensure that queued events are indeed empties and flushed. The previous time.sleep(0) did not provide any guarantees of flushing all events. In fact it was just relying on yielding to the task scheduler and random luck. :)

@johnsonshi
Copy link
Copy Markdown
Contributor Author

Thank you so much @OddBloke. I can sense that this is getting really close indeed. Thanks for your extensive reviews each pass :)

@anhvoms
Copy link
Copy Markdown
Contributor

anhvoms commented Jul 17, 2020

Looking at the Hyper-V KVP Reporter's threaded publisher subroutine (here and here), we see that it calls _append_kvp_item in order to write encoded_data (encoded kvp diagnostic messages) to the kvp pool file.

We see that _append_kvp_item does hold an exclusive lock and flushes output to the file, so we can be sure that diagnostic messages are indeed flushed to the Hyper-V kvp file.

We also see that HyperVKvpReportingHandler.flush() does block the main thread until the publisher thread completely empties queued kvp diagnostic messages.

You are right that time.sleep(0) is not needed since (1) the threaded publishing subroutine flushes output when it writes and (2) the main thread blocks until threaded publisher empties and writes all queued events. Thus we can be sure that all queued events are written up to this point.

In fact, it seems like HyperVKvpReportingHandler.flush() is a reliable method to ensure that queued events are indeed empties and flushed. The previous time.sleep(0) did not provide any guarantees of flushing all events. In fact it was just relying on yielding to the task scheduler and random luck. :)

Looking at the Hyper-V KVP Reporter's threaded publisher subroutine (here and here), we see that it calls _append_kvp_item in order to write encoded_data (encoded kvp diagnostic messages) to the kvp pool file.

We see that _append_kvp_item does hold an exclusive lock and flushes output to the file, so we can be sure that diagnostic messages are indeed flushed to the Hyper-V kvp file.

We also see that HyperVKvpReportingHandler.flush() does block the main thread until the publisher thread completely empties queued kvp diagnostic messages.

You are right that time.sleep(0) is not needed since (1) the threaded publishing subroutine flushes output when it writes and (2) the main thread blocks until threaded publisher empties and writes all queued events. Thus we can be sure that all queued events are written up to this point.

In fact, it seems like HyperVKvpReportingHandler.flush() is a reliable method to ensure that queued events are indeed empties and flushed. The previous time.sleep(0) did not provide any guarantees of flushing all events. In fact it was just relying on yielding to the task scheduler and random luck. :)

HyperVKvpReportingHandler.flush() is reliable since it calls a queue join, but you risk blocking the report ready if the KVPs could not be flushed due to some reasons. I ran into this once when I had some malformed KVP due to utf-8 encoding. Regarldess, my opinion is that the telemetry channel should not fail the VM reporting ready. There could be bugs in the telemetry channel that we should figure out to make sure it would never cause issues, but that should not be the scope of this PR.

What we observed was that at the time of reporting ready, sometimes there were KVPs that weren't processed yet in the queue. Time.sleep(0) was a low cost best-effort to yield the scheduler so that some of these events could be processed. Is it random luck? I'm not quite sure. When i looked into this the last time, running strace showed that time.sleep(0) did trigger a context switch and schedule yielding.
https://stackoverflow.com/questions/7273474/behavior-of-pythons-time-sleep0-under-linux-does-it-cause-a-context-switch

@johnsonshi
Copy link
Copy Markdown
Contributor Author

After discussing with @anhvoms, I've decided to revert time.sleep(0) back as it was a proven and reliable method for KVPs to be written to disk before the VM reports ready. Agree with @anhvoms and his points. This PR should be scoped to refactoring only.

@johnsonshi
Copy link
Copy Markdown
Contributor Author

@OddBloke I believe this is ready for merging now :)

Copy link
Copy Markdown
Collaborator

@OddBloke OddBloke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Johnson, apologies for the delay in this review! Two minor requests, plus a follow-up on that time.sleep(0) part. This is really close, thanks for your continuing work!

Comment thread cloudinit/sources/helpers/azure.py Outdated
Comment thread cloudinit/sources/helpers/azure.py
Comment thread tests/unittests/test_datasource/test_azure_helper.py
@johnsonshi
Copy link
Copy Markdown
Contributor Author

@OddBloke As per @anhvoms 's comment above, there were issues with Hyper-V's flush KVP method due to encoding errors, causing the telemetry channel to be blocked and preventing cloud-init from ever flushing the KVPs. We'd prefer to scope this PR to only be refactoring, and have the next PR deal with telemetry/reporting/logging/flushing KVPs as previously discussed :)

@OddBloke
Copy link
Copy Markdown
Collaborator

@OddBloke As per @anhvoms 's comment above, there were issues with Hyper-V's flush KVP method due to encoding errors, causing the telemetry channel to be blocked and preventing cloud-init from ever flushing the KVPs. We'd prefer to scope this PR to only be refactoring, and have the next PR deal with telemetry/reporting/logging/flushing KVPs as previously discussed :)

I'm perfectly happy with that approach. If we are going to deal with such things in a follow-up PR, then we should drop the time.sleep(0) call: I don't want us to land only part of the full change which you intend to make.

@johnsonshi
Copy link
Copy Markdown
Contributor Author

Just to clarify @OddBloke, the time.sleep(0) call was previously there (currently in master). Then there was a proposal to change the time.sleep(0) to use flush KVP instead. However, after consulting with @anhvoms, we discovered that he did try to use flush KVP but ran into issues, so I reverted back to using time.sleep(0), which was the original implementation.

@OddBloke
Copy link
Copy Markdown
Collaborator

Also @OddBloke should the type hints in this PR be removed?

TL;DR: no. :-)

Type hints were introduced in Python 3.5, and cloud-init still supports Python 3.4 I believe.

Function annotations have been supported in every version of Python 3: https://www.python.org/dev/peps/pep-3107/

What was introducing in Python 3.5 was the typing module, which our hacking doc prohibits the use of (for the compatibility reason you identified); see https://cloudinit.readthedocs.io/en/latest/topics/hacking.html#type-annotations for more details.

Additionally, older distros such as RHEL/CentOS (which only ships with Python 2.7) might not be able to easily backport changes that are Python 3 specific. What's your take on this?

cloud-init upstream is Python 3-only, we have already introduced changes which break Python 2 compatibility (including function annotations), and our codebase is better with type annotations than without: I don't think we should drop them.

@OddBloke
Copy link
Copy Markdown
Collaborator

Just to clarify @OddBloke, the time.sleep(0) call was previously there (currently in master). Then there was a proposal to change the time.sleep(0) to use flush KVP instead. However, after consulting with @anhvoms, we discovered that he did try to use flush KVP but ran into issues, so I reverted back to using time.sleep(0), which was the original implementation.

Aha, I see that now: thanks for pushing back! Given that, we should certainly leave it in.

Copy link
Copy Markdown
Collaborator

@OddBloke OddBloke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!!

@OddBloke OddBloke merged commit c3556ae into canonical:master Aug 13, 2020
@johnsonshi johnsonshi deleted the johnsonshi/refactor-azure-report-ready-codepath branch August 14, 2020 01:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants