158 remote agent upgrade support by vladikD · Pull Request #201 · ITA-Dnipro/PyDataCenter5000

vladikD · 2025-07-22T14:29:45Z

This PR is not ready yet; I created it to check whether I am on the right track with this task.
I would be very grateful for your review and, if possible, some advice on what should be added and what could be changed.

On the agent side

A new upgrade method has been added to the agent base class (base.py), which:

downloads a new agent tar.gz package from the specified URL;
checks the integrity of the package using SHA256;
backs up the current version of the agent;
extracts the new version, replacing the old files;
in case of an error, rolls back to the backup;
updates the self.version attribute after a successful update.
The agent now includes a version field in its /status payload, so it reports its current version.

On the controller side

Added the maybe_dispatch_upgrade function to helpers.py:

compares the agent version with the value of settings.LATEST_AGENT_VERSION;
if the agent is outdated, it creates a command of type agent with the upgrade action for a specific host;
creates an entry in AgentUpgradeHistory with the status pending.

A new model has been added to models.py:
AgentUpgradeHistory - stores the history of agent upgrades (from version to version, status, messages, start/end times).

Added to views.py:
reads the version reported by the agent and calls maybe_dispatch_upgrade immediately after saving the status.

Added to the command result handler (submit_command_result):
updates the AgentUpgradeHistory record when the upgrade command completes (done or failed).

mehalyna

Architectural Suggestions

A manifest.json alongside the tar.gz (with version, hash, files) can give you:

upgrade metadata
better flexibility for future formats
checkpoint resuming (future)
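A minimal sketch of reading such a manifest on the agent side, assuming a JSON object with the keys `version`, `sha256` and `files` (the key names, `UpgradeManifest` and `parse_manifest` are hypothetical, not part of the PR). Failing early on a malformed manifest keeps the later download and extraction steps simpler.

```python
import json
from dataclasses import dataclass, field

# Hypothetical manifest layout:
# {"version": "1.3.0", "sha256": "<hex digest of the tar.gz>", "files": ["agent.py", ...]}
REQUIRED_KEYS = ("version", "sha256", "files")


@dataclass
class UpgradeManifest:
    version: str
    sha256: str
    files: list = field(default_factory=list)


def parse_manifest(raw: str) -> UpgradeManifest:
    """Parse manifest.json text and reject it early if required keys are missing."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("manifest.json must contain a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError("manifest.json is missing keys: {}".format(", ".join(missing)))
    if not isinstance(data["files"], list):
        raise ValueError('manifest.json "files" must be a list')
    # Normalise the digest so upper-case hex from a build tool still matches
    return UpgradeManifest(
        version=data["version"],
        sha256=data["sha256"].lower(),
        files=data["files"],
    )
```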

Agent should:

load current version from a version.txt
expose it in /status
fail gracefully if missing or unreadable
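A minimal sketch of that graceful loading, assuming version.txt holds a single version string; `load_agent_version` and the `"unknown"` fallback value are hypothetical names. The agent keeps running and reports the fallback in /status instead of crashing at startup.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

UNKNOWN_VERSION = "unknown"  # hypothetical value reported in /status when version.txt is unusable


def load_agent_version(version_file) -> str:
    """Read the agent version from version.txt, falling back instead of crashing."""
    try:
        version = Path(version_file).read_text(encoding="utf-8").strip()
    except (OSError, UnicodeDecodeError) as exc:  # missing file, no permission, bad encoding
        logger.warning("Cannot read %s: %s", version_file, exc)
        return UNKNOWN_VERSION
    if not version:
        logger.warning("%s is empty", version_file)
        return UNKNOWN_VERSION
    return version
```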

Prevent concurrent upgrades or overlapping retries (e.g., via a .upgrading lockfile or atomic flags).
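One way to make the lockfile atomic is to create it with `O_CREAT | O_EXCL`, so that when two upgrades race, exactly one wins; a separate `os.path.exists()` check followed by `open()` leaves a window for both to proceed. This is a sketch; `upgrade_lock` and `UpgradeInProgress` are hypothetical names.

```python
import os
from contextlib import contextmanager


class UpgradeInProgress(Exception):
    """Raised when another upgrade already holds the lock."""


@contextmanager
def upgrade_lock(lock_path):
    """Hold an exclusive .upgrading lockfile for the duration of an upgrade."""
    try:
        # O_EXCL makes creation fail if the file already exists, atomically
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise UpgradeInProgress(lock_path) from None
    try:
        with os.fdopen(fd, "w") as lock_file:
            lock_file.write(str(os.getpid()))  # record the owner for debugging
        yield
    finally:
        os.remove(lock_path)
```

If the agent process is killed mid-upgrade, the lockfile stays behind, so a real implementation probably also needs a stale-lock policy (for example, checking whether the recorded PID is still alive).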
Allow agent CLI fallback:

python agent.py --upgrade 1.3.0 --url=https://...
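A sketch of parsing that fallback with `argparse`, mirroring only the two flags in the suggested invocation; `build_parser` and `parse_cli` are hypothetical, and a checksum flag would likely be needed as well.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Agent command line")
    parser.add_argument("--upgrade", metavar="VERSION", help="target version to upgrade to")
    parser.add_argument("--url", help="URL of the agent tar.gz package")
    return parser


def parse_cli(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # An upgrade without a package URL cannot proceed; error() exits with status 2
    if args.upgrade and not args.url:
        parser.error("--upgrade requires --url")
    return args
```

In agent.py, the `if __name__ == "__main__":` block could call `parse_cli()` and, when `args.upgrade` is set, run the upgrade directly instead of waiting for a controller command.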

mehalyna · 2025-07-22T16:17:41Z

agents_infra/agents_infra/agents/base.py

The base.py agent class currently combines:

Core agent lifecycle logic

Status reporting

Communication

And now: a full upgrade system (download, validate, extract, rollback…)

This violates SRP because the agent class is now responsible for:

Its main business logic and

System upgrade orchestration (a completely separate concern)

Create a new file: upgrade_manager.py

```python
# agents_infra/upgrade_manager.py
class AgentUpgradeManager:
    def __init__(self, logger, agent_dir):
        self.logger = logger
        self.agent_dir = agent_dir

    def upgrade(self, target_version: str, url: str, sha256: str):
        # move all upgrade logic here
        # optionally return a result object (success, message, version)
        ...
```

Usage in base.py:

```python
import os

from agents_infra.upgrade_manager import AgentUpgradeManager

# ...

def upgrade(self, target_version, url, sha256):
    manager = AgentUpgradeManager(self.logger, os.getcwd())
    result = manager.upgrade(target_version, url, sha256)
    if result.success:
        self.version = target_version
```

Optional: later we may add the Strategy pattern if we expect multiple upgrade approaches (tarball, git pull, .deb):

```python
from abc import ABC, abstractmethod


class UpgradeStrategy(ABC):
    @abstractmethod
    def upgrade(self): ...


class TarballUpgrade(UpgradeStrategy): ...


class GitPullUpgrade(UpgradeStrategy): ...
```

mehalyna · 2025-07-22T16:19:16Z

agents_infra/agents_infra/agents/base.py

+            self.logger.error('Upgrade failed: {}. '
+                              'Restoring from backup.'.format(e))


Use logger.exception() here to include traceback automatically.
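A small sketch of what that looks like around a rollback; `run_with_rollback` and its two callables are hypothetical stand-ins for the upgrade and restore steps. `logger.exception()` logs at ERROR level and appends the active traceback, so the exception no longer needs to be formatted into the message.

```python
import logging

logger = logging.getLogger("agent.upgrade")


def run_with_rollback(do_upgrade, restore_backup):
    """Run the upgrade step; on any error, log the traceback and roll back."""
    try:
        do_upgrade()
    except Exception:
        # Must be called from inside an except block to pick up the traceback
        logger.exception("Upgrade failed. Restoring from backup.")
        restore_backup()
        return False
    return True
```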

mehalyna · 2025-07-22T16:20:02Z

agents_infra/agents_infra/agents/base.py

+            with open(package_path, 'rb') as f:
+                while True:
+                    chunk = f.read(4096)
+                    if not chunk:
+                        break
+                    sha256_actual.update(chunk)


Cleaner alternative:

```python
for chunk in iter(lambda: f.read(4096), b''):
    sha256_actual.update(chunk)
```

mehalyna · 2025-07-22T16:29:12Z

agents_infra/agents_infra/agents/base.py

+                             'completed successfully'.format(target_version))
+
+            # Updating the version
+            self.version = target_version


This won't persist across agent restarts. Consider storing this version in a persistent file (e.g., version.txt) and loading it on agent startup.
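A sketch of persisting the version so a crash mid-write never leaves a truncated version.txt: write to a temporary file in the same directory, then swap it in with `os.replace()`, which is atomic on POSIX when source and target are on the same filesystem. `save_agent_version` is a hypothetical name.

```python
import os
import tempfile


def save_agent_version(version_file, version):
    """Atomically write the agent version to version_file."""
    directory = os.path.dirname(os.path.abspath(version_file))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".version-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(version + "\n")
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit disk before the swap
        os.replace(tmp_path, version_file)
    except BaseException:
        os.remove(tmp_path)  # do not leave temp files behind on failure
        raise
```

After a successful upgrade, the agent would call this and then set `self.version = target_version`; on startup it reads the file back.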

mehalyna · 2025-07-22T16:30:45Z

agents_infra/agents_infra/agents/base.py

+                tar.extractall(path=tmp_dir)
+
+            # Assuming the extracted dir has same name as current agent dir
+            extracted_dir = os.path.join(tmp_dir, 'agent')


Not all packages may extract into an agent/ subfolder. Either:

Validate it exists before assuming

Extract and dynamically find the top-level directory

Use a manifest.json or similar
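A sketch of the second option, assuming `tmp_dir` is a fresh, empty directory: extract, then require exactly one top-level directory instead of hard-coding `agent/`. It also rejects members whose resolved path would land outside `tmp_dir` (such as `../x`); `extract_package` is a hypothetical name. On Python versions that support tarfile extraction filters, `tar.extractall(path, filter="data")` adds protection against symlink and device tricks as well.

```python
import os
import tarfile


def extract_package(package_path, tmp_dir):
    """Extract a tar.gz into tmp_dir and return its single top-level directory."""
    root = os.path.realpath(tmp_dir)
    with tarfile.open(package_path, "r:gz") as tar:
        # Check every member before writing anything to disk
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(root, member.name))
            if os.path.commonpath([root, target]) != root:
                raise ValueError("Unsafe path in archive: {}".format(member.name))
        tar.extractall(path=root)

    entries = os.listdir(root)
    if len(entries) != 1 or not os.path.isdir(os.path.join(root, entries[0])):
        raise ValueError("Expected exactly one top-level directory, found: {}".format(entries))
    return os.path.join(root, entries[0])
```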

mehalyna · 2025-07-22T16:32:13Z

pydata_center/monitoring/helpers.py

+        logger.warning(f'Agent {hostname} did not report version.')
+        return
+
+    if agent_version < latest_version:


String comparison can fail, e.g.:

"1.12" < "1.3" # evaluates True, which is wrong

Suggestion:

```python
from packaging import version

if version.parse(agent_version) < version.parse(latest_version):
    ...  # agent is outdated: dispatch the upgrade command
```
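If adding the `packaging` dependency to the controller is not desirable, a stdlib-only sketch for plain dotted integer versions is to compare tuples of ints; `parse_numeric_version` and `is_outdated` are hypothetical names. It deliberately rejects pre-release strings such as `1.3.0rc1`, which is exactly where `packaging.version` is the better tool.

```python
def parse_numeric_version(text):
    """Turn "1.12.0" into (1, 12) so tuples compare numerically, not lexically."""
    parts = tuple(int(part) for part in text.strip().split("."))
    # Drop trailing zeros so "1.3" and "1.3.0" compare equal
    while len(parts) > 1 and parts[-1] == 0:
        parts = parts[:-1]
    return parts


def is_outdated(agent_version, latest_version):
    return parse_numeric_version(agent_version) < parse_numeric_version(latest_version)
```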

mehalyna · 2025-07-22T16:33:13Z

pydata_center/monitoring/helpers.py

Future-proofing:
Since upgrade URLs are remote:

Add signature verification (GPG, HMAC, etc.)

Or ensure HTTPS + signed manifest
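A minimal sketch of the HMAC option, assuming the controller and agents share a secret and the controller sends an HMAC-SHA256 tag alongside the raw manifest bytes; `sign_manifest` and `verify_manifest` are hypothetical names. Note the trade-off: every agent holding the secret could also forge signatures, which is why an asymmetric scheme (GPG or similar) may be preferable.

```python
import hashlib
import hmac


def sign_manifest(manifest_bytes: bytes, secret: bytes) -> str:
    """Controller side: compute an HMAC-SHA256 tag over the raw manifest bytes."""
    return hmac.new(secret, manifest_bytes, hashlib.sha256).hexdigest()


def verify_manifest(manifest_bytes: bytes, signature_hex: str, secret: bytes) -> bool:
    """Agent side: recompute the tag and compare it in constant time."""
    expected = sign_manifest(manifest_bytes, secret)
    # Compare as bytes so a non-ASCII signature string cannot raise TypeError
    return hmac.compare_digest(expected.encode("utf-8"), signature_hex.encode("utf-8"))
```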

mehalyna · 2025-07-22T16:59:26Z

pydata_center/monitoring/views.py

+            AgentUpgradeHistory.objects.filter(
+                hostname=command.hostname,
+                to_version=command.params.get('target'),
+                status='pending'
+            ).update(
+                status='success' if final_status == 'done' else 'failed',
+                finished_at=now(),
+                message=final_result
+            )
+            logger.info(
+                f'Updated AgentUpgradeHistory for {command.hostname} '
+                f'to {final_status}'
+            )


Add audit logs for upgrade initiation in CommandHistory

Consider marking CommandHistory.status = done/failed as well

mehalyna · 2025-07-22T17:00:21Z

pydata_center/pydata_center/settings.py

+AGENT_PACKAGE_URL = 'https://storage.example.com/agent-latest.tar.gz'
+AGENT_PACKAGE_SHA256 = '123abc456def789...'


These should be environment-specific. Consider a .env file or a config management tool (Vault, AWS SSM, etc.).
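A minimal sketch of reading these settings from the environment in settings.py, failing loudly at startup when a value is missing; `require_env` is a hypothetical helper (Django's `django.core.exceptions.ImproperlyConfigured` would be a natural exception to raise instead of `RuntimeError`). The environment variables themselves can come from a .env loader, Vault, or SSM.

```python
import os


def require_env(name):
    """Fetch a required setting from the environment or fail at startup."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError("Missing required environment variable: {}".format(name))
    return value


# In settings.py, each environment supplies its own values:
# AGENT_PACKAGE_URL = require_env("AGENT_PACKAGE_URL")
# AGENT_PACKAGE_SHA256 = require_env("AGENT_PACKAGE_SHA256")
```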

vladikD · 2025-07-27T12:59:52Z

@CatSonbenim when you have time, please review my PR. I'd be very grateful!

CatSonbenim

Overall, great job on a large feature :)

CatSonbenim · 2025-07-28T16:52:55Z

agents_infra/agents_infra/managers/upgrade_manager.py

+        self.message = message
+
+
+class AgentUpgradeManager:


It would be better to make this class abstract with

```python
def upgrade(self, *args, **kwargs):
    raise NotImplementedError
```

and encapsulate the tar upgrade logic in a separate child class. Otherwise we will break OCP once we want to add another download source.
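A sketch of that split, using `abc.abstractmethod` so that instantiating the base class fails immediately rather than only when `upgrade()` is called; `BaseUpgradeManager` and `TarballUpgradeManager` are hypothetical names, and the tarball body is left as a placeholder for the existing steps.

```python
from abc import ABC, abstractmethod


class BaseUpgradeManager(ABC):
    """Common interface; each download source gets its own subclass."""

    def __init__(self, logger, agent_dir):
        self.logger = logger
        self.agent_dir = agent_dir

    @abstractmethod
    def upgrade(self, *args, **kwargs):
        raise NotImplementedError


class TarballUpgradeManager(BaseUpgradeManager):
    """Tar.gz-specific logic lives here, so new sources never modify this class."""

    def upgrade(self, target_version, url, sha256, **kwargs):
        self.logger.info("Upgrading to %s from %s", target_version, url)
        # download -> verify sha256 -> backup -> extract -> swap or roll back
        ...
```

Adding a git-pull or .deb source then means adding a sibling subclass, leaving both existing classes untouched.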

CatSonbenim · 2025-07-28T16:54:12Z

agents_infra/agents_infra/agents/base.py

            return command_history
+
+    def upgrade(self, target_version, url, sha256):
+        manager = AgentUpgradeManager(self.logger, os.getcwd())


Please pass the manager in as an argument; it should be an object that exposes the .upgrade interface.

CatSonbenim · 2025-07-28T16:55:05Z

agents_infra/agents_infra/agents/base.py


            return command_history
+
+    def upgrade(self, target_version, url, sha256):


Add *args, **kwargs here and pass them through to the .upgrade method (we don't know what a future upgrade might need).
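A sketch combining this suggestion with the previous one (injecting the manager): the agent method forwards extra arguments untouched, so it works with any object that has an `.upgrade(...)` method. `Agent` here is a simplified, hypothetical stand-in for the class in base.py, and `result.success` follows the current UpgradeResult.

```python
class Agent:
    """Simplified stand-in for the agent base class in base.py."""

    def __init__(self, version):
        self.version = version

    def upgrade(self, manager, target_version, *args, **kwargs):
        # The agent no longer constructs the manager itself; anything with an
        # .upgrade(...) method works, and extra arguments pass straight through.
        result = manager.upgrade(target_version, *args, **kwargs)
        if result.success:
            self.version = target_version
        return result
```

A side benefit of the injection is testability: a fake manager can be passed in without touching the filesystem or the network.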

CatSonbenim · 2025-07-28T16:56:05Z

agents_infra/agents_infra/managers/upgrade_manager.py

+        self.logger = logger
+        self.agent_dir = agent_dir
+
+    def upgrade(self, target_version, url, sha256):


Let's make this method accept its arguments in *args, **kwargs format.

CatSonbenim · 2025-07-28T16:59:55Z

agents_infra/agents_infra/managers/upgrade_manager.py

+
+class UpgradeResult:
+    def __init__(self, success, message):
+        self.success = success


Let's use status codes instead of a boolean. For example, I saw a case where the upgrade is skipped: we can't say it was successful or unsuccessful, it was skipped. In the future this can help us filter what to alert the user about (I'm not sure we have an alerts feature, but it could be introduced later). We could then say that the user should get alerts on failed upgrades (or even add more specific instructions with 2- or 3-digit status codes).
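A sketch of what that could look like with an `IntEnum`; the member names and numeric values are hypothetical, with gaps left so finer-grained codes can be added later. A `success` property keeps existing `result.success` call sites working during the transition.

```python
from dataclasses import dataclass
from enum import IntEnum


class UpgradeStatus(IntEnum):
    # Hypothetical values; gaps leave room for more specific codes later
    SUCCESS = 100
    SKIPPED = 200
    FAILED = 300


@dataclass
class UpgradeResult:
    status: UpgradeStatus
    message: str = ""

    @property
    def success(self):
        return self.status == UpgradeStatus.SUCCESS

    @property
    def should_alert(self):
        # A skipped upgrade (e.g. lockfile present) is not worth alerting on
        return self.status >= UpgradeStatus.FAILED
```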

CatSonbenim · 2025-07-28T17:00:34Z

agents_infra/agents_infra/managers/upgrade_manager.py

+        if os.path.exists(lockfile_path):
+            self.logger.warning('Upgrade already in progress. '
+                                'Lockfile exists.')
+            return UpgradeResult(False, 'Upgrade skipped: '


As a follow-up to the comment above: pass a skipped status code here.

CatSonbenim · 2025-07-28T17:04:03Z

agents_infra/agents_infra/managers/upgrade_manager.py

+            shutil.copytree(self.agent_dir, backup_dir)
+            self.logger.info('Backup created at {}'.format(backup_dir))


Let's create a separate file with upgrade utils and split the code of this method into separate functions. We expect to have more UpgradeManagers in the future, and all of them will need backup creation etc.

This will also help encapsulation: right now it's quite hard to read the code and follow the flow. It will also allow more specific, possibly custom, error handling, so we can distinguish the causes of a failed upgrade and inform the user about them.

Create separate functions for the blocks you currently have under comments.
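A sketch of two such utility functions with their own exception types, so callers can tell a failed backup from a failed restore; the module layout, function names and exception classes are hypothetical. `shutil.Error` subclasses `OSError`, so catching `OSError` covers both.

```python
import os
import shutil


class UpgradeError(Exception):
    """Base class so callers can tell upgrade failures apart from other bugs."""


class BackupError(UpgradeError):
    pass


class RestoreError(UpgradeError):
    pass


def create_backup(agent_dir, backup_dir):
    """Copy the current agent tree to backup_dir, which must not exist yet."""
    try:
        shutil.copytree(agent_dir, backup_dir)
    except OSError as exc:
        raise BackupError("Backup of {} failed: {}".format(agent_dir, exc)) from exc
    return backup_dir


def restore_backup(backup_dir, agent_dir):
    """Replace agent_dir with the contents of backup_dir."""
    try:
        if os.path.exists(agent_dir):
            shutil.rmtree(agent_dir)
        shutil.copytree(backup_dir, agent_dir)
    except OSError as exc:
        raise RestoreError("Restore from {} failed: {}".format(backup_dir, exc)) from exc
```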

CatSonbenim · 2025-07-28T17:09:49Z

pydata_center/monitoring/views.py

        final_result = serializer.validated_data.get('result', '')

        alert_if_command_failed(command.hostname, final_result)
+        if final_status in ['done', 'failed']:


I think the status code in UpgradeResult can help you here.

vladikD added 6 commits July 22, 2025 17:13

Implement upgrade mechanism with backup, SHA256 check and rollback

7f4d4ac

Add maybe_dispatch_upgrade to trigger agent upgrade if outdated

d0605f3

Add AgentUpgradeHistory model to track agent upgrades

577944c

add version

bc0d515

add agent upgrade settings (latest version, package URL, SHA256)

59aa494

trigger upgrade check on agent status POST

9d0405d

vladikD requested review from NovitskaMariia, YelyzavetaV and mehalyna July 22, 2025 14:29

vladikD self-assigned this Jul 22, 2025

mehalyna reviewed Jul 22, 2025


vladikD added 6 commits July 25, 2025 15:15

Corrected base.py after comments

da44914

Corrected helpers.py after comments

d0a0eef

Corrected settings.py after comments

b5e2268

Moved the implementation of the update to upgrade_manager.py

afb80be

Corrected views.py after comments

4526836

created manifest.json and version.txt for each agent

d457f0b

vladikD requested review from CatSonbenim and mehalyna July 25, 2025 12:50

CatSonbenim requested changes Jul 28, 2025


vladikD added 4 commits July 30, 2025 16:27

Support UpgradeResult status codes in command result API

df79740

Add upgrade utility functions for backup and extraction

8f1c811

Use upgrade utils and manifest.json in upgrade manager

a12d2e4

Refactor: Integrate upgrade manager into agent base class

6123c12


3 participants
