Skip to content

{Telemetry} Fix telemetry lost due to concurrency issue#29717

Merged
evelyn-ys merged 3 commits intoAzure:devfrom
evelyn-ys:fix_telemetry_lost
Aug 20, 2024
Merged

{Telemetry} Fix telemetry lost due to concurrency issue#29717
evelyn-ys merged 3 commits intoAzure:devfrom
evelyn-ys:fix_telemetry_lost

Conversation

@evelyn-ys
Copy link
Member

@evelyn-ys evelyn-ys commented Aug 15, 2024

Description

This PR is for fixing telemetry lost due to concurrency issue.

At the stage when we still had telemetry local cache, here's the process of uploading telemetry:

  1. save cli records in telemetry cache file
  2. check telemetry note file
    2.1 if it's not enough time since last telemetry uploading time, exit
    2.2 if it has been long enough since last telemetry uploading time, start a subprocess for uploading
  3. within the uploading subprocess
    3.1 check telemetry note file again to ensure it has been enough time since last telemetry sent time
    3.2 get last_sent_time from telemetry note file
    3.3 read all cli records from telemetry cache file where record_time > last_sent_time
    3.4 remove cache file as it has been loaded in 3.3
    3.5 uploading all records from 3.3
    3.6 update last_sent_time for telemetry note file

Since #26854, we have dropped telemetry cache. So step 2 and step 3.1 are removed. We won't check if it has been "enough time" since last uploading, we do the uploading every time. So here's the new process:

  1. save cli records in telemetry cache file
  2. start a subprocess for uploading
  3. within the uploading subprocess
    3.1 get last_sent_time from telemetry note file
    3.2 read all cli records from telemetry cache file where record_time > last_sent_time
    3.3 remove cache file as it has been loaded in 3.2
    3.4 uploading all records from 3.3
    3.5 update last_sent_time for telemetry note file

But this enlarges concurrency issue as we increased the frequency. Operations on two files may encounter concurrency issue, one is the telemetry cache file (step 1, step 3.2, step 3.3), another is telemetry note file (step 3.1, step 3.5):

  • Imagine a subprocess for command A and main process for command B are executing at the same time and the real execution order is subprocess A 3.2 -> main process B -> subprocess A 3.3, then we will lose the cli record for command B because it's not loaded in 3.2 but deleted in 3.3
  • Then for telemetry note file, there's lock mechanism. If subprocess A and subprocess B are both acquiring a lock to access telemetry note file, only one will success, the other will fail with portallocker.AlreadyLocked error, see
    def __enter__(self):
    self._logger.debug('acquiring lock')
    self.acquire()
    self._logger.debug('acquired lock')
    return self
    except portalocker.AlreadyLocked:
    # another upload process is running.
    logger.info('Lock out from note file under %s which means another process is running. Exit 0.', config_dir)
    sys.exit(0)
    The failed one will exit uploading directly which means we may also lose the telemetry

So the fix also comes from two part:

  • for the telemetry cache file, we log in separate cache file for each cli execution to avoid reading and writing for the same file
  • for the telemetry note file, as we have already dropped the last sent time check, we don't need telemetry not file either, so just simply remove step 3.1 and step 3.5

Testing Guide

Run a cli script which contains many cli commands
Check $HOME/.azure/logs/telemetry.log and see if all command records have been uploaded


This checklist is used to make sure that common guidelines for a pull request are followed.

@azure-client-tools-bot-prd
Copy link

azure-client-tools-bot-prd bot commented Aug 15, 2024

️✔️AzureCLI-FullTest
️✔️acr
️✔️2020-09-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️acs
️✔️2020-09-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️advisor
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️ams
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️apim
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️appconfig
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️appservice
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️aro
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️backup
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️batch
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️batchai
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️billing
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️botservice
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️cdn
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️cloud
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️cognitiveservices
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️compute_recommender
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️config
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️configure
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️consumption
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️container
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️containerapp
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️core
️✔️2018-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2019-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2020-09-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️cosmosdb
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️databoxedge
️✔️2019-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2020-09-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️dla
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️dls
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️dms
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️eventgrid
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️eventhubs
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️feedback
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️find
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️hdinsight
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️identity
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️iot
️✔️2019-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2020-09-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️keyvault
️✔️2018-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2020-09-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️kusto
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️lab
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️managedservices
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️maps
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️marketplaceordering
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️monitor
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️mysql
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️netappfiles
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️network
️✔️2018-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️policyinsights
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️privatedns
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️profile
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️rdbms
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️redis
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️relay
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️resource
️✔️2018-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2019-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️role
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️search
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️security
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️servicebus
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️serviceconnector
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️servicefabric
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️signalr
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️sql
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️sqlvm
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️storage
️✔️2018-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2019-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2020-09-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️synapse
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️telemetry
️✔️2018-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2019-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2020-09-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️util
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9
️✔️vm
️✔️2018-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2019-03-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️2020-09-01-hybrid
️✔️3.11
️✔️3.12
️✔️3.9
️✔️latest
️✔️3.11
️✔️3.12
️✔️3.9

@azure-client-tools-bot-prd
Copy link

azure-client-tools-bot-prd bot commented Aug 15, 2024

️✔️AzureCLI-BreakingChangeTest
️✔️Non Breaking Changes

@yonzhan
Copy link
Collaborator

yonzhan commented Aug 15, 2024

Telemetry

@evelyn-ys evelyn-ys requested a review from MoChilia August 15, 2024 06:43
@evelyn-ys evelyn-ys changed the title {Telemetry} Fix telemetry lost due to currency issue {Telemetry} Fix telemetry lost due to concurrency issue Aug 15, 2024
Copy link
Member

@MoChilia MoChilia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on GitHub Actions. Everything looks great, and telemetry is fully recorded.

@evelyn-ys evelyn-ys merged commit 556d198 into Azure:dev Aug 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants