
158 remote agent upgrade support #201

Open
vladikD wants to merge 16 commits into develop from 158-remote-agent-upgrade-support

@vladikD vladikD commented Jul 22, 2025

This PR is not ready yet; I opened it to check whether I am on the right track with this task.
I would be very grateful for your review and, if possible, for advice on what should be added and what could be changed.

On the agent side

A new upgrade method has been added to the agent base class (base.py), which:

  • downloads a new agent tar.gz package from the specified URL;
  • checks the integrity of the package using SHA256;
  • backs up the current version of the agent;
  • unzips the new version, replacing the old files;
  • in case of an error, rolls back to the backup;
  • updates the self.version attribute after a successful update.

The agent now also passes a version field in the /status payload, so that it reports its current version.
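For discussion, the upgrade steps listed above could be sketched roughly as follows. This is a minimal sketch and not the code from this PR: the function name `upgrade_agent`, the `.bak` backup location, and the `agent/` top-level folder inside the archive are all assumptions made for illustration.

```python
import hashlib
import shutil
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def upgrade_agent(agent_dir, url, expected_sha256):
    """Hypothetical helper: True on success, False after a rollback.

    Raises ValueError on a checksum mismatch, before anything is changed.
    """
    agent_dir = Path(agent_dir)
    backup_dir = agent_dir.with_name(agent_dir.name + '.bak')  # assumed location
    with tempfile.TemporaryDirectory() as tmp:
        package = Path(tmp) / 'agent.tar.gz'
        # 1. Download the package.
        with urllib.request.urlopen(url) as resp, open(package, 'wb') as out:
            shutil.copyfileobj(resp, out)
        # 2. Verify integrity before touching the installed agent.
        digest = hashlib.sha256(package.read_bytes()).hexdigest()
        if digest != expected_sha256:
            raise ValueError('SHA256 mismatch')
        # 3. Back up the current version.
        if backup_dir.exists():
            shutil.rmtree(backup_dir)
        shutil.copytree(agent_dir, backup_dir)
        try:
            # 4. Extract the new version and replace the old files.
            extract_to = Path(tmp) / 'extracted'
            extract_to.mkdir()
            with tarfile.open(package, 'r:gz') as tar:
                tar.extractall(path=extract_to)
            new_dir = extract_to / 'agent'  # assumed archive layout
            if not new_dir.is_dir():
                raise FileNotFoundError('agent/ not found in package')
            shutil.rmtree(agent_dir)
            shutil.copytree(new_dir, agent_dir)
            return True
        except Exception:
            # 5. Roll back to the backup on any failure.
            if agent_dir.exists():
                shutil.rmtree(agent_dir)
            shutil.copytree(backup_dir, agent_dir)
            return False
```

The checksum is verified before the backup step so that a corrupted download never touches the installed files. Reading the whole package into memory keeps the sketch short; a chunked read would be kinder to large packages.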

On the controller side

Added the maybe_dispatch_upgrade function to helpers.py:

  • compares the agent version with the value of settings.LATEST_AGENT_VERSION;
  • if the agent is outdated, it creates a command of type agent with the upgrade action for a specific host;
  • creates an entry in AgentUpgradeHistory with the status pending.

A new model has been added to models.py:
AgentUpgradeHistory - stores the history of agent upgrades (from version to version, status, messages, start/end times).

Added to views.py:
reading the transferred version of the agent and calling maybe_dispatch_upgrade immediately after saving the status.

Added to the command result handler (submit_command_result):
updating the AgentUpgradeHistory record when the upgrade command completes (done or failed).

@vladikD vladikD self-assigned this Jul 22, 2025

@mehalyna mehalyna left a comment

Architectural Suggestions

  1. A manifest.json alongside the tar.gz (with version, hash, files) can give you:
       • upgrade metadata
       • better flexibility for future formats
       • checkpoint resuming (future)
  2. The agent should:
       • load its current version from a version.txt file
       • expose it in /status
       • fail gracefully if the file is missing or unreadable
  3. Prevent concurrent upgrades or overlapping retries (e.g., via a .upgrading lockfile or atomic flags).
  4. Allow an agent CLI fallback:

python agent.py --upgrade 1.3.0 --url=https://...
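A minimal sketch of how that command line could be parsed with argparse. The function name `parse_args` and the rule that both flags must be given together are assumptions; the comment above only shows the two flags.

```python
import argparse


def parse_args(argv=None):
    """Parse the hypothetical upgrade flags for agent.py."""
    parser = argparse.ArgumentParser(prog='agent.py')
    parser.add_argument('--upgrade', metavar='VERSION',
                        help='target version to upgrade to, e.g. 1.3.0')
    parser.add_argument('--url', help='URL of the agent package')
    args = parser.parse_args(argv)
    # Assumption: the two flags only make sense together.
    if bool(args.upgrade) != bool(args.url):
        parser.error('--upgrade and --url must be given together')
    return args
```

argparse accepts both `--url=https://...` and `--url https://...`, so the invocation shown above works unchanged.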

The base.py agent class currently combines:

  • Core agent lifecycle logic
  • Status reporting
  • Communication
  • And now: a full upgrade system (download, validate, extract, rollback…)

This violates SRP because the agent class is now responsible for:

  • Its main business logic and
  • System upgrade orchestration (a completely separate concern)

Create a new file: upgrade_manager.py

# agents_infra/upgrade_manager.py
class AgentUpgradeManager:
    def __init__(self, logger, agent_dir):
        self.logger = logger
        self.agent_dir = agent_dir

    def upgrade(self, target_version: str, url: str, sha256: str):
        # move all upgrade logic here
        # optionally return a result object (success, message, version)
        raise NotImplementedError

Usage in base.py:

import os

from agents_infra.upgrade_manager import AgentUpgradeManager

...

def upgrade(self, target_version, url, sha256):
    manager = AgentUpgradeManager(self.logger, os.getcwd())
    result = manager.upgrade(target_version, url, sha256)
    if result.success:
        self.version = target_version

Optional: later we may add the Strategy pattern if we expect multiple upgrade approaches (tarball, git pull, .deb):

from abc import ABC, abstractmethod

class UpgradeStrategy(ABC):
    @abstractmethod
    def upgrade(self): ...

class TarballUpgrade(UpgradeStrategy): ...
class GitPullUpgrade(UpgradeStrategy): ...

Comment on lines 656 to 657
self.logger.error('Upgrade failed: {}. '
                  'Restoring from backup.'.format(e))

Use logger.exception() here to include traceback automatically.
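A small self-contained demonstration of the difference. The logger name and the in-memory stream are only there so the output can be inspected; they are not part of the PR.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger('agent.upgrade.demo')  # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger

try:
    raise RuntimeError('checksum mismatch')
except Exception:
    # Logs at ERROR level, like logger.error(), and appends the traceback.
    logger.exception('Upgrade failed. Restoring from backup.')

output = stream.getvalue()
```

`logger.exception()` is meant to be called from inside an `except` block, which is exactly where the current `logger.error()` call sits.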

Comment on lines 609 to 614
with open(package_path, 'rb') as f:
    while True:
        chunk = f.read(4096)
        if not chunk:
            break
        sha256_actual.update(chunk)

Cleaner alternative:

for chunk in iter(lambda: f.read(4096), b''):
    sha256_actual.update(chunk)
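Put together as a standalone helper, the loop could look like this. The function name `sha256_of_file` is an assumption; the chunk size matches the 4096 bytes used in the PR.

```python
import hashlib


def sha256_of_file(path, chunk_size=4096):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops once read() returns b'' at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```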

'completed successfully'.format(target_version))

# Updating the version
self.version = target_version

This won't persist across agent restarts. Consider storing this version in a persistent file (e.g., version.txt) and loading it on agent startup.
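One possible shape for this, as a sketch only. The helper names, the `'unknown'` fallback value, and the temp-file-then-rename write are assumptions, not something already in the PR.

```python
from pathlib import Path

UNKNOWN_VERSION = 'unknown'  # hypothetical fallback value


def load_version(agent_dir):
    """Read the version from version.txt, falling back gracefully."""
    try:
        text = (Path(agent_dir) / 'version.txt').read_text(encoding='utf-8')
    except (OSError, UnicodeDecodeError):
        # Missing or unreadable file: report a placeholder instead of crashing.
        return UNKNOWN_VERSION
    return text.strip() or UNKNOWN_VERSION


def save_version(agent_dir, version):
    """Persist the version by writing a temp file and renaming it."""
    target = Path(agent_dir) / 'version.txt'
    tmp = target.with_name('version.txt.tmp')
    tmp.write_text(version + '\n', encoding='utf-8')
    tmp.replace(target)  # rename within the same directory
```

Calling `save_version` right where `self.version = target_version` is set, and `load_version` at startup, would keep the reported version stable across restarts.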

tar.extractall(path=tmp_dir)

# Assuming the extracted dir has same name as current agent dir
extracted_dir = os.path.join(tmp_dir, 'agent')

Not all packages may extract into an agent/ subfolder. Either:

  • Validate it exists before assuming
  • Extract and dynamically find the top-level directory
  • Use a manifest.json or similar
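A sketch of the second option, discovering the top-level entry instead of assuming `agent/`. The function name is hypothetical, and the caller would still need to check after extraction that the entry is really a directory.

```python
import tarfile
from pathlib import PurePosixPath


def find_top_level_entry(package_path):
    """Return the single top-level name in a tar.gz archive.

    Raises ValueError if the archive is empty or has several top-level
    entries, so the caller never has to guess the layout.
    """
    with tarfile.open(package_path, 'r:gz') as tar:
        roots = {PurePosixPath(name).parts[0]
                 for name in tar.getnames()
                 if PurePosixPath(name).parts}
    if len(roots) != 1:
        raise ValueError('expected one top-level entry, found: {}'
                         .format(sorted(roots)))
    return roots.pop()
```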

logger.warning(f'Agent {hostname} did not report version.')
return

if agent_version < latest_version:

String comparison can fail, e.g.:

"1.12" < "1.3"  # evaluates True, which is wrong

Suggestion:

from packaging import version
if version.parse(agent_version) < version.parse(latest_version):
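If adding the `packaging` dependency is not an option, a stdlib-only stand-in could look like the sketch below. It is a simplification, not a replacement for `packaging.version`: it only handles plain dotted numeric versions such as 1.12.0, and both helper names are assumptions.

```python
def parse_version(text):
    """Turn '1.12.0' into (1, 12, 0); raises ValueError on non-numeric parts."""
    return tuple(int(part) for part in text.strip().split('.'))


def is_outdated(agent_version, latest_version):
    """True if agent_version is numerically older than latest_version."""
    current = parse_version(agent_version)
    latest = parse_version(latest_version)
    # Pad with zeros so that '1.3' and '1.3.0' compare as equal.
    width = max(len(current), len(latest))
    current += (0,) * (width - len(current))
    latest += (0,) * (width - len(latest))
    return current < latest
```

Pre-release tags such as 1.3.0rc1 would raise ValueError here, which is one reason the `packaging` route is the safer default.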

Future-proofing:
Since upgrade URLs are remote:

  • Add signature verification (GPG, HMAC, etc.)
  • Or ensure HTTPS + signed manifest
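As an illustration of the HMAC variant, here is a minimal sketch using only the standard library. The function names and the idea of shipping a hex signature next to the package are assumptions. HMAC needs the same secret on the controller and on every agent; an asymmetric scheme such as GPG avoids distributing that secret.

```python
import hashlib
import hmac


def sign_package(data, secret_key):
    """Return a hex HMAC-SHA256 signature for the package bytes."""
    return hmac.new(secret_key, data, hashlib.sha256).hexdigest()


def verify_package(data, secret_key, signature_hex):
    """Check the signature using a constant-time comparison."""
    expected = sign_package(data, secret_key)
    return hmac.compare_digest(expected, signature_hex)
```

Unlike a bare SHA256 hash, this signature cannot be recomputed by someone who can replace the package but does not know the key.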

Comment on lines 288 to 300
AgentUpgradeHistory.objects.filter(
    hostname=command.hostname,
    to_version=command.params.get('target'),
    status='pending'
).update(
    status='success' if final_status == 'done' else 'failed',
    finished_at=now(),
    message=final_result
)
logger.info(
    f'Updated AgentUpgradeHistory for {command.hostname} '
    f'to {final_status}'
)

  • Add audit logs for upgrade initiation in CommandHistory
  • Consider marking CommandHistory.status = done/failed as well

Comment on lines 341 to 342
AGENT_PACKAGE_URL = 'https://storage.example.com/agent-latest.tar.gz'
AGENT_PACKAGE_SHA256 = '123abc456def789...'

These should be environment-specific. Consider a .env file or a config management tool (Vault, AWS SSM, etc.).
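The simplest version of this is reading both values from environment variables. The sketch below assumes the variable names match the setting names and that a missing value should fail fast; the helper name is hypothetical.

```python
import os


def load_agent_package_settings(environ=os.environ):
    """Read the package settings from the environment; fail fast if missing."""
    try:
        return {
            'AGENT_PACKAGE_URL': environ['AGENT_PACKAGE_URL'],
            'AGENT_PACKAGE_SHA256': environ['AGENT_PACKAGE_SHA256'],
        }
    except KeyError as exc:
        raise RuntimeError(
            'Missing required setting: {}'.format(exc.args[0])) from exc
```

A .env loader or a secrets manager would populate the same variables, so the settings module would not need to change when the backing store does.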

@vladikD vladikD requested review from CatSonbenim and mehalyna July 25, 2025 12:50

vladikD commented Jul 27, 2025

@CatSonbenim when you have time, please take a look at my PR. I would be very grateful!


@CatSonbenim CatSonbenim left a comment

Overall, great job on a large feature :)

self.message = message


class AgentUpgradeManager:

@CatSonbenim CatSonbenim Jul 28, 2025

It would be better to make this class abstract with
def upgrade(self, *args, **kwargs): raise NotImplementedError

and encapsulate the upgrade-from-tar logic in a separate child class. Otherwise we will break OCP (the open/closed principle) as soon as we want to add another download source.
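A sketch of what that split could look like. The class names `BaseUpgradeManager` and `TarballUpgradeManager` are assumptions, and the child's body is a placeholder for the logic that currently lives in AgentUpgradeManager.

```python
import logging
from abc import ABC, abstractmethod


class BaseUpgradeManager(ABC):
    """Common interface; concrete sources of upgrades subclass this."""

    def __init__(self, logger, agent_dir):
        self.logger = logger
        self.agent_dir = agent_dir

    @abstractmethod
    def upgrade(self, *args, **kwargs):
        raise NotImplementedError


class TarballUpgradeManager(BaseUpgradeManager):
    """Hypothetical child class for the tar.gz flow from this PR."""

    def upgrade(self, target_version, url, sha256, *args, **kwargs):
        # The download / verify / extract steps would live here.
        self.logger.info('Upgrading to %s from %s', target_version, url)
        return target_version
```

With `ABC` and `@abstractmethod`, forgetting to implement `upgrade` in a new child fails at instantiation time rather than at the first upgrade attempt.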

return command_history

def upgrade(self, target_version, url, sha256):
    manager = AgentUpgradeManager(self.logger, os.getcwd())

Please pass the manager in as an argument; it should be an instance of a class that exposes the .upgrade interface.


return command_history

def upgrade(self, target_version, url, sha256):

Add *args, **kwargs here and pass them on to the .upgrade method (we don't know what might be needed for the upgrade).

self.logger = logger
self.agent_dir = agent_dir

def upgrade(self, target_version, url, sha256):

Let's make this method accept its arguments in *args, **kwargs form.
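Taken together, the last three comments suggest something like the sketch below. `Agent` is a stripped-down stand-in for the class in base.py, and `FakeManager` exists only to show that any object with an `.upgrade` method can be injected; both are illustrative assumptions.

```python
from types import SimpleNamespace


class Agent:
    """Stand-in for the agent base class; only the upgrade path is shown."""

    def __init__(self, upgrade_manager, version='unknown'):
        # Any object with an .upgrade(...) method can be injected.
        self.upgrade_manager = upgrade_manager
        self.version = version

    def upgrade(self, target_version, *args, **kwargs):
        # Forward everything; the manager decides what it needs.
        result = self.upgrade_manager.upgrade(target_version, *args, **kwargs)
        if result.success:
            self.version = target_version
        return result


class FakeManager:
    """Test double that records the call and reports success."""

    def upgrade(self, target_version, *args, **kwargs):
        self.received = (target_version, args, kwargs)
        return SimpleNamespace(success=True, message='ok')
```

Injecting the manager also makes the agent's upgrade path testable without touching the network or the filesystem.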


class UpgradeResult:
    def __init__(self, success, message):
        self.success = success

Let's make this a status code rather than a boolean. For example, I saw a case where the upgrade is skipped. In that case we cannot say it was unsuccessful or successful: it was skipped. In the future this can help us filter what to alert the user about (I'm not sure whether we have an alerts feature, but it can be introduced later). We could then say that the user should get alerts on failed upgrades, or even add more specific instructions with 2- or 3-digit status codes.
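A sketch of one way to do this with an enum. The numeric values, the `UpgradeStatus` name, and the `should_alert` rule are assumptions for illustration; the `success` property is kept so existing `if result.success:` callers continue to work.

```python
from enum import IntEnum


class UpgradeStatus(IntEnum):
    SUCCESS = 0
    SKIPPED = 1
    FAILED = 2


class UpgradeResult:
    def __init__(self, status, message):
        self.status = UpgradeStatus(status)
        self.message = message

    @property
    def success(self):
        # Backwards-compatible view for callers that expect a boolean.
        return self.status is UpgradeStatus.SUCCESS

    @property
    def should_alert(self):
        # Assumption: only failures are worth alerting the user about.
        return self.status is UpgradeStatus.FAILED
```

More specific 2- or 3-digit codes could be added later as extra enum members without changing the callers.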

if os.path.exists(lockfile_path):
    self.logger.warning('Upgrade already in progress. '
                        'Lockfile exists.')
    return UpgradeResult(False, 'Upgrade skipped: '

As a follow-up to the comment above: return a "skipped" status code here.
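A related point: checking `os.path.exists()` and then creating the lockfile leaves a small window in which two upgrades can both pass the check. Creating the file with `O_CREAT | O_EXCL` makes check and create a single atomic step. The sketch below assumes a plain integer `SKIPPED` code and hypothetical helper names.

```python
import os

SKIPPED = 1  # hypothetical code; would match whatever UpgradeResult uses


def acquire_lock(lockfile_path):
    """Atomically create the lockfile; return False if it already exists."""
    try:
        fd = os.open(lockfile_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, 'w') as f:
        f.write(str(os.getpid()))  # record the owner for debugging
    return True


def release_lock(lockfile_path):
    try:
        os.remove(lockfile_path)
    except FileNotFoundError:
        pass


def run_exclusive(lockfile_path, do_upgrade):
    """Run do_upgrade() under the lock, or report a skipped status."""
    if not acquire_lock(lockfile_path):
        return SKIPPED, 'Upgrade skipped: already in progress'
    try:
        return do_upgrade()
    finally:
        release_lock(lockfile_path)
```

A lockfile left behind by a crashed agent would block later upgrades, so some stale-lock policy (for example, based on the recorded PID) would still be needed.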

Comment on lines 44 to 45
shutil.copytree(self.agent_dir, backup_dir)
self.logger.info('Backup created at {}'.format(backup_dir))

Let's make a separate file with upgrade utilities and split the code of this method into separate functions. We are saying that in the future we may have more UpgradeManagers, and all of them will need backup creation and so on.

That will also be good for encapsulation: right now it's quite hard to read the code and follow the flow. It will also allow more specific, maybe custom, error handling, so we can distinguish the causes of a failed upgrade and inform the user about them.

Create separate functions for the blocks that currently sit under comments.
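As a rough illustration of the last three comments, the backup and restore blocks could move into a utilities module along these lines. The exception names and function signatures are assumptions, not existing code.

```python
import shutil
from pathlib import Path


class UpgradeError(Exception):
    """Base class for upgrade failures."""


class BackupError(UpgradeError):
    """The current agent could not be backed up."""


class RestoreError(UpgradeError):
    """The backup could not be restored."""


def create_backup(agent_dir, backup_dir):
    """Copy agent_dir to backup_dir, replacing any older backup."""
    try:
        if Path(backup_dir).exists():
            shutil.rmtree(backup_dir)
        shutil.copytree(agent_dir, backup_dir)
    except OSError as exc:  # shutil.Error is a subclass of OSError
        raise BackupError('could not back up {}: {}'
                          .format(agent_dir, exc)) from exc


def restore_backup(backup_dir, agent_dir):
    """Replace agent_dir with the contents of backup_dir."""
    try:
        if Path(agent_dir).exists():
            shutil.rmtree(agent_dir)
        shutil.copytree(backup_dir, agent_dir)
    except OSError as exc:
        raise RestoreError('could not restore {}: {}'
                           .format(agent_dir, exc)) from exc
```

With one exception type per step, the manager can catch `UpgradeError` once and still report which step failed.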

final_result = serializer.validated_data.get('result', '')

alert_if_command_failed(command.hostname, final_result)
if final_status in ['done', 'failed']:

I suppose the status code in UpgradeResult can help you here.
