Skip to content

Conversation

@ephraimbuddy
Copy link
Contributor

@ephraimbuddy ephraimbuddy commented Mar 20, 2025

We currently use Psutil's create_time for process start time,
which gives the time the process started using the system clock. We can use
time.time to track when the process started processing the files instead.

In breeze, once the laptop hibbernates, you would have to restart the dag
processor but this fixes it. Since this does not happen in other deployments,
we suspect this issue is peculiar to breeze.

Closes: #47937, Closes: #47294

pierrejeambrun
pierrejeambrun previously approved these changes Mar 20, 2025
@property
def start_time(self) -> float:
return self._process.create_time()
return timezone.utcnow().timestamp()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've probably found the right root cause, but returning now for start_time isn't the fix I don't think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@pierrejeambrun pierrejeambrun dismissed their stale review March 20, 2025 13:27

Dismiss following Jed comment

@ashb
Copy link
Member

ashb commented Mar 20, 2025

We don't compare it to time.monotonic though -- we compare it to time.time()

now = time.time()
processors_to_remove = []
for file, processor in self._processors.items():
duration = now - processor.start_time

@ephraimbuddy ephraimbuddy force-pushed the change-process-starttime-utc branch from 31ba78b to 551e735 Compare March 20, 2025 23:47
@ephraimbuddy
Copy link
Contributor Author

I ha

We don't compare it to time.monotonic though -- we compare it to time.time()

now = time.time()
processors_to_remove = []
for file, processor in self._processors.items():
duration = now - processor.start_time

Yeah. My description is wrong. I will update the commit message

@ephraimbuddy ephraimbuddy force-pushed the change-process-starttime-utc branch from 551e735 to 900d291 Compare March 20, 2025 23:55
@ephraimbuddy ephraimbuddy changed the title Use timezone.utcnow for dag file process start time Use time.time() for dag file process start time Mar 21, 2025
@ephraimbuddy ephraimbuddy force-pushed the change-process-starttime-utc branch 2 times, most recently from 1d58774 to 4cb35fd Compare March 21, 2025 07:51
@ashb
Copy link
Member

ashb commented Mar 21, 2025

https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L766-L773

If you are worried about process reuse and the start_time being wrong, then call proc.create_time() once first.

However the chance of process reuse happening and being the cause of the bugs is tiny. I'm not saying it can't happen, but that the for it to, a large number of processes need to have been created, enough for the pid to overflow and wrap around. I don't know what the limit is, but I suspect it's somewhere around 64k processes on OSX.

We currently use Psutil's create_time for process start time,
which gives the time the process started using the system clock. We can use
time.time to track when the process started processing the files instead.

In breeze, once the laptop hibbernates, you would have to restart the dag
processor but this fixes it. Since this does not happen in other deployments,
we suspect that this issue is peculiar to breeze.

Closes: apache#47937, Closes: apache#47294
@ephraimbuddy
Copy link
Contributor Author

https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L766-L773

If you are worried about process reuse and the start_time being wrong, then call proc.create_time() once first.

However the chance of process reuse happening and being the cause of the bugs is tiny. I'm not saying it can't happen, but that the for it to, a large number of processes need to have been created, enough for the pid to overflow and wrap around. I don't know what the limit is, but I suspect it's somewhere around 64k processes on OSX.

I was able to determine that it’s not process reuse by printing out the previous PIDs before process creation and checking it with the new ID.

My observation is that any process created after the heartbeat recovery message of say recovered after 290ms use the time before the recovery instead of the time after the recovery which is when the process started being used for file processing. So instead of using the process start time, we can use the time that the process started processing the files.

@ephraimbuddy ephraimbuddy force-pushed the change-process-starttime-utc branch from 4cb35fd to 7732ffc Compare March 22, 2025 07:32
@github-actions
Copy link

github-actions bot commented May 7, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label May 7, 2025
@ashb
Copy link
Member

ashb commented May 7, 2025

Closed in favour of #50238

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:DAG-processing stale Stale PRs per the .github/workflows/stale.yml policy file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DAGs disappearing after some idle time

4 participants