
Conversation

@jscheffl (Contributor) commented Nov 26, 2023

Add a migration tool in order to deprecate MSSQL support in Airflow 2.8.0

I placed this in the /scripts/tools folder for now, as it is not (yet) targeted to be supported in main Airflow core. Nevertheless, given its nature, it could be turned into an airflow db ... command.
I am open to where we store it (for the moment).

How to test?

  • Jump on any (dev or test) Airflow instance running 2.7.3 and copy the Python file migrate_script.py to a place where the Airflow Python env is available
  • Stop the running Airflow processes and execute python migrate_script.py --extract; confirm that a migration.db SQLite file is created in your current folder
  • Copy migration.db and migrate_script.py to a test environment with Airflow 2.7.3, stop the Airflow processes and execute python migrate_script.py --restore - before hitting enter, ensure the target contains no valuable data, as the restore wipes the DB content before copying!
  • Start the Airflow processes and check that the content from the source system is visible in the UI. Check the scheduler and worker logs for errors. Trigger a DAG to confirm the system can run new work, and log on with another user as well

I tested with:

  • Source MSSQL; targets MySQL, Postgres and SQLite
  • Example DAGs, executed some runs, added a user, a variable and a pool

There might be other data "in the wild"; there will probably be "glitches" as more situations are encountered. Feedback welcome!
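For context on what the extract/restore steps do conceptually, here is a minimal sketch of a generic table-copy routine using SQLAlchemy reflection. This is illustrative only - the actual migrate_script.py may work differently, and the function name and structure here are assumptions, not the script's real code:

```python
# Hypothetical sketch of the extract/restore mechanism: reflect every table
# on a source engine (e.g. MSSQL) and bulk-copy its rows into a target
# engine (e.g. the intermediate sqlite:///migration.db file).
# Illustrative only -- not the actual code of migrate_script.py.
from sqlalchemy import MetaData, create_engine
from sqlalchemy.engine import Engine


def copy_all_tables(src: Engine, dst: Engine) -> int:
    """Copy all tables and rows from src to dst; return the row count copied."""
    meta = MetaData()
    meta.reflect(bind=src)                # discover the schema on the source
    meta.create_all(bind=dst)             # recreate the same tables on the target
    copied = 0
    with src.connect() as sc, dst.begin() as dc:
        for table in meta.sorted_tables:  # parents before children (FK order)
            rows = [dict(r._mapping) for r in sc.execute(table.select())]
            if rows:
                dc.execute(table.insert(), rows)
                copied += len(rows)
    return copied


if __name__ == "__main__":
    # Extract step would look roughly like:
    #   copy_all_tables(create_engine(mssql_url),
    #                   create_engine("sqlite:///migration.db"))
    pass
```

The restore direction is the same call with the engines swapped, which is presumably why a single script can serve both --extract and --restore.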

@jscheffl jscheffl marked this pull request as ready for review November 26, 2023 15:08
@jscheffl jscheffl requested review from ashb and potiuk as code owners November 26, 2023 15:08
@potiuk (Member) commented Nov 26, 2023

I am not going to block it in this form - but - as explained in a number of messages - I have huge concerns about

a) making it part of official airflow tooling
b) making it part of our repository in general
c) and especially about making it part of the airflow db command (for me this is a "no-go")

The problem is that if any issues are found that need to be fixed, we will have to RELEASE fixes for it. Users are not able to modify and manipulate the export file itself, so any problems with the migration will come to us as issues to be solved, and users' expectation will be that we release a fix. Currently we have no mechanism to do so. We release stuff in providers, airflow and the helm chart, and we have a formal process for those. Once we make the tool part of any of these, users will expect us to release a .1, .2 or whatever version in case there are any issues to be fixed.

There are many problems if we try to support this as an "official tool" released by the Airflow PMC (and having it in our repo will make people expect that).

For example, what happens if a new version of the sqlite library gets released that we will have to support?

By the time we release Airflow 2.8 or 2.9, our DAG objects will be different from 2.7.3. So what should we do if we make it part of airflow db? Should we still support users dumping version 2.7.3 and importing it into a new version, with a version of this script that imports airflow from 2.9? How do we test that it still works? How do we keep up with dependencies? What are the exact instructions users should follow in that case?

There will be implicit expectations that this tool should evolve together with Airflow. So eventually we will have to add --version-from and --version-to and all that, because users will be confused. Is this script working with the current version of Airflow? Is this script only supposed to be used on 2.7.3? What if I take the version of the script released in 2.9 and run it on my 2.7.3 database? Should I expect it to work? Should I get my target Airflow DB migrated to the latest version before using the sqlite dump to import the data?

What happens if someone is on Airflow 2.5.3 with MSSQL? Can they migrate without upgrading to 2.7.3 first? Will it still work?

etc. etc.

And most importantly - should we support issues and questions from users when, for whatever reason, the sqlite export stops working (for example because they use sqlite 9.13 (imaginary), released 2 years from now)? Should we release new versions of the script when sqlite 9.13 gets released? I'd love to avoid all those questions and issues, to be honest.

That's why I am hesitant to take it in this form and as part of the "airflow" repository.

PROPOSAL:

I am quite OK with having this as an MSSQL migration tool for 2.7.3 -> Postgres/SQLite 2.7.3 ONLY. Nothing else. No generic migration tool. We should limit it to specific versions of sqlite, Airflow and whatever external dependencies are needed; ideally we should have a requirements.txt with fixed versions of everything needed to run it. AND we should have it outside of the airflow repo.
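A pinned requirements.txt for such a repo might look like the following; the package names and version numbers here are illustrative assumptions only, not taken from the actual migration tool:

```text
# Hypothetical pinned requirements.txt for a frozen one-time migration tool.
# Package names and versions are examples, not the real tool's dependencies.
apache-airflow==2.7.3
SQLAlchemy==1.4.50
pyodbc==5.0.1
```

Pinning exact versions (== rather than >=) is what makes the tool reproducible years later without implying ongoing maintenance.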

It's very easy (I can do it in minutes) to create a separate repository (apache/airflow-mssql-2-7-3-migration, for example) where we could put ONLY this script and all the documentation/requirements.txt files, and keep it unreleased. Just keep it there, separate from Airflow, pinning all versions of everything, and making it very clear that this is an MSSQL migration tool only for the 2.7.3 version and that we will NOT RELEASE IT - just point people to that separate repository.

I think then I will even be OK with using a sqlite db as the intermediary form, rather than a text file.

@potiuk (Member) commented Nov 26, 2023

I wonder what others think about it - especially those who were around in Airflow when we migrated from 1.10.15 to Airflow 2.

And just to add a bit more context: we did release https://pypi.org/project/apache-airflow-upgrade-check/ when we started the Airflow 1.10 -> Airflow 2 migration. And that was not even a migration tool - it was a checker that verified whether you were ready to migrate. My concerns are heavily based on the experience we had then. The more we limit and constrain the users, and the less of a "black" (or even "gray") box our migration tool is, the better for any kind of issues.

If we RELEASE something that looks like a ready-to-use CLI tool, users will simply expect it to work as a generic tool. The more we make it a "do-it-yourself" solution, the more likely they are to look inside and fix it, knowing that this is what they are supposed to do.

@jscheffl (Contributor, Author) commented

Hi @potiuk! Yes, maybe this is a misunderstanding. I did NOT want to make this script a released piece. If we see we need it in other cases, we could leverage it.

So it is now intended solely to support the MSSQL migration purpose. If you make another repo, I am fine with using this PR just for review purposes and dropping the file in the other location (or doing the review on the other side).

@potiuk (Member) commented Nov 26, 2023

And... fantastic that you are doing it! Thanks for that!

@eladkal (Contributor) commented Nov 26, 2023

making it part of our repository in general

I see no problem with that.
MSSQL is considered experimental. We can publish it under a statement of best effort.
In the mailing list thread, one of the comments was that we should provide tooling to ease migration for users. That means the tool/script is maintained and released by Airflow. I don't think it matters where the script is saved (from the user's point of view).

@potiuk (Member) commented Nov 28, 2023

I see no problem with that. MSSQL is considered experimental. We can publish it under a statement of best effort. In the mailing list thread, one of the comments was that we should provide tooling to ease migration for users. That means the tool/script is maintained and released by Airflow. I don't think it matters where the script is saved (from the user's point of view).

I don't think we ever said we would maintain and release the scripts. We discussed that we should tell users how to do it, but not that it would become part of standard Airflow tooling and that we would keep it maintained.

Let me explain why I think it should be in a separate repo.

I personally think we should never have code in our main airflow repo whose dependencies we are not planning to update regularly. This opens us up to CVE/security vulnerabilities. There are already CVEs/vulnerabilities reported to us about code that just sits somewhere in our repo, even when it is merely an "example". This basically means we would be expected to update to newer versions of dependencies in the future, because some 3rd-party dependency might have a security issue.

For example, if we add a requirements.txt - even in a subfolder of the main airflow repo - and we pin sqlite==3.14.5, and a security fix is found for it, then (especially in the near future) we will get users scanning the airflow repo and pointing out to us: "you are using an outdated and insecure version of sqlite".

And we will at least have to respond to those reports and explain, and if more people complain we will likely have to do something about it - publish VEX information stating "this is only a tool that we are not really maintaining with Airflow any more".

Having a separate repo with a very clear README explaining its purpose, and stating that this is only an example/one-time-conversion tool that users are going to use and modify on their own as needed, makes it much less likely that such reports will be raised - especially if we just write a blog post about it on the Airflow publication on Medium and don't link to it from our documentation. And I think it is quite fair from our side as well: we will provide a way for our users to migrate, but we won't commit to maintaining it in the future. Which we can also clearly say - the longer you delay, the more "future" problems YOU will be dealing with; it's your choice, dear user.

I think it's an assertive and fair way of communicating with our MSSQL users. The more explicitly we do it now, and the more "separate" it is from the airflow code, the easier (and fairer) it will be to tell users who come to us 2 years from now, saying the migration script does not work, that they are on their own because they did not do what we advised. They had a chance to do it while our eyes and hands were on it, but 2 years later our eyes and hands will be looking elsewhere.

This is a bit of a problem we have now, with users who come to us 3 years after 1.10 reached end-of-life. They see our docs (if they look at them at all), they see the released "migration check", and if it does not work (for example because a setuptools release broke it) they come to us to "fix" it. We refuse, of course, but I would like this migration effort to make it very, very clear from day one that it is largely their problem if they delay the migration. I do not want to give users the impression that we are going to support them 3 years from now when they come and say "this script does not work any more".

@ephraimbuddy (Contributor) commented

+1 to keeping this separate from the Airflow repo, considering the added maintenance time it would cost and, of course, the CVE issues

@jscheffl (Contributor, Author) commented

Migrated script to https://github.com/apache/airflow-mssql-migration

@jscheffl jscheffl closed this Dec 29, 2023
@ephraimbuddy ephraimbuddy removed this from the Airflow 2.9.0 milestone Mar 31, 2024