Conversation

@molcay
Contributor

@molcay molcay commented Sep 23, 2024

This PR changes the type of the priority_weight property of the TaskInstance to float. For different databases, we have different maximum values for integer columns.

There have been several attempts to fix this problem:

This PR contains the changes from #38160 (created by Taragolis) on top of #38222, as mentioned here.



@boring-cyborg bot added the labels area:API (Airflow's REST/HTTP API), area:db-migrations (PRs with DB migration), area:UI (Related to UI/UX), area:webserver (Webserver related issues), and kind:documentation on Sep 23, 2024
@molcay force-pushed the fix/out-of-range-for-priority-weight branch from a82c385 to 7385aad on September 23, 2024 13:06
@jscheffl
Contributor

I don't understand the reason for and the need to change the priority weight from int to float. To me this just adds extra complexity, and I see no benefit.

@molcay
Contributor Author

molcay commented Sep 27, 2024

Hi @jscheffl,
There was this issue/discussion: #22781, which is about the integer out-of-range error for the priority weight field. This error might happen on PostgreSQL. There were a lot of discussions around it in different PRs (which I mentioned in this PR's description). As far as I understand, the next planned move was to make this change after #38222, as mentioned in #38160 (comment).

@pierrejeambrun
Member

pierrejeambrun commented Sep 27, 2024

This seems a little odd to me. Just to be sure I understand what we are trying to solve: are we trying to address integer overflow for weights and operations on weights?

I don't see the use case for using such big numbers for priority weights.

@molcay
Contributor Author

molcay commented Oct 10, 2024

Hi @pierrejeambrun,

The primary target is to address integer overflow, because the limit is different for different database engines.
Since we don't give users any documented range, I think we just need to be sure that it does not fail, or do this to avoid failing. I read all the discussions that were held before; this was one of the previous implementations, but it was closed and put on hold until after #38222.

@jscheffl
Contributor

Do we know what the smallest supported (positive and negative) integer range is across the databases? Then we could build a proper validation to prevent roll-over and document it.

Making this a float will, I assume, create other problems... like if you have a very large number, adding one will still give the same number :-D
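For illustration, that precision effect in plain Python (IEEE-754 doubles stop representing every integer once values exceed 2**53):

# Float precision sketch: above 2**53 a double cannot represent every integer.
big = float(2**53)          # 9007199254740992.0
print(big + 1 == big)       # True -- adding one is lost to rounding
print(big + 2 == big)       # False -- steps of 2 are still representable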

@VladaZakharova
Contributor

Hi @potiuk!
As I understand, there was a discussion regarding the way we want to implement this logic (the whole list is in the description of the PR). It seems we now have some new ideas about what we want here, so maybe we can agree with @jscheffl on the best way to implement it. WDYT?

@potiuk
Member

potiuk commented Oct 31, 2024

Yeah. Actually I like the idea of changing it to float - this has far fewer problems than int (rollover etc.), and the problem with +1/-1 is virtually non-existent IMHO - because by the time you get to a value where +1/-1 gives the same number, it is already such a big number that it does not matter any more.

I'd be for rebasing and merging this change (@jscheffl ?)

@jscheffl
Contributor

jscheffl commented Nov 1, 2024

I am not a big fan of floats - maybe because in my job history I have always had long, long discussions about FMA and reproducibility of execution... but whatever.

I would not block this, but I still feel this is somewhat strange. Integers have a very wide range by default, and we are talking about potentially a couple of thousand tasks for which we want to set priorities. I don't see that there is a real "functional need" to set priorities exponentially, or in another manner where, with normal modelling, you need such large ranges. Anyway, that would be hard to manage from a functional standpoint. Also, I don't see a realistic setup where the priority weight strategy would reach such boundaries because of a very huge, long DAG.
As for the argument that at a certain level a big, big float with +1 is the same big, big float and no harm... then we could also figure out the smallest bound of INT supported by Postgres/MySQL/whatever DB and just cap the upper/lower boundaries to prevent a roll-over.

It would be more convincing if there were a real problem/use case for such big priorities, and not one caused because DAG developers just add t00000000 many zeroes to the priority.

@potiuk
Member

potiuk commented Nov 1, 2024

I don't see that there is a real "functional need" to set priorities exponentially, or in another manner where, with normal modelling, you need such large ranges

Let's not forget that we have priority-weight accumulation: upstream, downstream, and eventually custom: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/priority-weight.html#custom-weight-rule

I can quite easily imagine some big numbers when we have huge DAGs with 1000s of tasks and multiple upstream/downstream tasks (and layers), and especially with custom weights I can easily imagine those numbers adding up (or maybe multiplying, in custom rules, if one wants more aggressive priority propagation).
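As a back-of-the-envelope sketch (illustrative numbers only, not from any real deployment): with the default downstream rule, a root task's effective weight is its own weight plus the sum over every task below it, so a wide DAG with user-chosen "high" values overflows quickly.

# Illustrative arithmetic only -- how downstream accumulation can exceed a 32-bit INT.
layers, width = 10, 500            # hypothetical DAG: 10 layers of 500 tasks each
per_task_weight = 1_000_000        # a user-chosen "high priority" value
root_effective = per_task_weight + layers * width * per_task_weight
print(root_effective)              # 5001000000
print(root_effective > 2**31 - 1)  # True -- past the signed 32-bit limit of 2147483647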

@jscheffl
Contributor

jscheffl commented Nov 2, 2024

I don't see that there is a real "functional need" to set priorities exponentially, or in another manner where, with normal modelling, you need such large ranges

Let's not forget that we have priority-weight accumulation: upstream, downstream, and eventually custom: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/priority-weight.html#custom-weight-rule

I can quite easily imagine some big numbers when we have huge DAGs with 1000s of tasks and multiple upstream/downstream tasks (and layers), and especially with custom weights I can easily imagine those numbers adding up (or maybe multiplying, in custom rules, if one wants more aggressive priority propagation).

I still don't see a real need for, and use of, such high numbers. Yes, we accumulate priority weights by making sums. Assume we have a DAG with 1000 tasks chained (I hope nobody is modelling this; it would really run a long time) and we use a priority of 10k (=10000). Then the accumulated priority is 10 million.

Looking at the INT type we use today, the supported databases have integer ranges of:

PostgreSQL INTEGER: -2,147,483,648 to 2,147,483,647
MySQL INT: -2,147,483,648 to 2,147,483,647

This means I can still have 1000 tasks with a priority of 1 million each in my DAG, which is also something you can model to fit into the INT range.

Instead of switching to float, I think we should rather cap the values and ensure they cannot roll over, and add documentation about the limits. The limits of Postgres and MySQL are the same and sound reasonable (otherwise there is also the option to switch to BIGINT, of course, if you want to support incredibly large non-float numbers).
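A minimal sketch of that capping idea (a hypothetical helper, not the change that was eventually merged):

# Hypothetical clamping helper -- sketch of "cap instead of float".
DB_INT_MIN = -(2**31)       # -2147483648, signed 32-bit lower bound (Postgres/MySQL INT)
DB_INT_MAX = 2**31 - 1      #  2147483647, signed 32-bit upper bound

def clamp_priority_weight(weight: int) -> int:
    """Keep an accumulated priority weight inside the database INT range."""
    return max(DB_INT_MIN, min(DB_INT_MAX, weight))

print(clamp_priority_weight(2 * (2**31 - 1)))   # 2147483647 instead of overflowing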

@pierrejeambrun
Member

pierrejeambrun commented Nov 4, 2024

I am a little undecided on this one. On one hand, if that is a use case, then why not; on the other hand, I tend to agree with Jens. I am having a hard time imagining a legitimate scenario for such big priority weights where downscaling to more reasonable values is not possible.

@potiuk
Member

potiuk commented Nov 4, 2024

@molcay -> is there a case that you might give as an example where it would be needed?

@molcay
Contributor Author

molcay commented Nov 7, 2024

Hi @potiuk, I am sorry but I don't have any specific case that can be used as an example. I only know that #22781 was the start. Maybe @kosteev has an idea?

@kosteev
Contributor

kosteev commented Nov 7, 2024

I am not pretending that this is a very practical case; however, in Cloud Composer we have seen this (and not only once):

Imagine that you give a customer an example DAG that looks like this:

...

dag = DAG(
    'dag_id',
    default_args=default_args,
    schedule_interval='*/10 * * * *',
)

# task with highest priority
t1 = BashOperator(
    task_id='task_id',
    bash_command='echo test',
    dag=dag,
    priority_weight=2**31 - 1)

The customer modifies the DAG and adds an extra task t2 by copying t1 and setting a dependency between them:

...

dag = DAG(
    'dag_id',
    default_args=default_args,
    schedule_interval='*/10 * * * *',
)

# task with highest priority
t1 = BashOperator(
    task_id='task_id',
    bash_command='echo test',
    dag=dag,
    priority_weight=2**31 - 1)

t2 = BashOperator(
    task_id='task_id2',
    bash_command='echo test2',
    dag=dag,
    priority_weight=2**31 - 1)

t1 >> t2

Then this DAG will cause an issue and break the scheduler (because the accumulated priority_weight will overflow).

Btw, I found an example DAG like this on Stack Overflow: https://stackoverflow.com/questions/66098050/airflow-dag-not-triggered-at-schedule-time.

I am not saying that this is at all common, but it is very unexpected for a user to have the scheduler broken after a slight modification of the DAG like this.
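To spell out the failure (my reading of the example, assuming the default downstream weight rule): the upstream task's accumulated weight becomes the sum of both values, which no longer fits a signed 32-bit column.

# Why the modified DAG overflows (assuming the default "downstream" accumulation).
t1_weight = 2**31 - 1              # 2147483647
t2_weight = 2**31 - 1
t1_accumulated = t1_weight + t2_weight
print(t1_accumulated)              # 4294967294
print(t1_accumulated > 2**31 - 1)  # True -- out of range for a Postgres INTEGER column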

@pierrejeambrun
Member

pierrejeambrun commented Nov 7, 2024

Those are just unreasonable values, and it is not a surprise to me that things break down the line.

If I code a mapped task that expands to (2**32 - 1) instances, most likely things will also break.

We can always add a safeguard, or better error handling on the scheduler side to make that more explicit, but changing the underlying datatype to allow usage of extremely high values does not achieve much (IMO).

Also, I believe that this can already be solved by a custom cluster policy if a company encounters this edge case (preventing people from using excessively high values).
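A rough sketch of such a cluster policy (hypothetical limit value; the hook itself is the standard task_policy in airflow_local_settings.py):

# airflow_local_settings.py -- sketch of a cluster policy rejecting extreme weights.
from airflow.exceptions import AirflowClusterPolicyViolation
from airflow.models.baseoperator import BaseOperator

MAX_ALLOWED_PRIORITY = 1_000_000   # hypothetical company-wide limit

def task_policy(task: BaseOperator) -> None:
    # Called for every task at DAG parse time; raising here surfaces as a DAG import error.
    if abs(task.priority_weight) > MAX_ALLOWED_PRIORITY:
        raise AirflowClusterPolicyViolation(
            f"priority_weight={task.priority_weight} on task {task.task_id} exceeds "
            f"the allowed limit of {MAX_ALLOWED_PRIORITY}"
        )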

@kosteev
Contributor

kosteev commented Nov 7, 2024

I agree that changing the datatype doesn't solve the issue radically.

I haven't read all the comments here, but my vote is to have a validation/safeguard for this (e.g. validation during DAG parsing that throws an import error for such a DAG, so that the customer can see it).

In general, the fact that the Airflow scheduler (or DAG processor) can break because of user code doesn't look right, IMHO.

@potiuk
Member

potiuk commented Nov 11, 2024

Yeah. I am convinced now that we should rather cap the values, not change the type.

@molcay
Contributor Author

molcay commented Nov 13, 2024

Since #43611 is already approved and it looks like it is about to be merged, I am closing this PR.

@molcay molcay closed this Nov 13, 2024