KAFKA-14995: Automate asf.yaml collaborators refresh#17124
KAFKA-14995: Automate asf.yaml collaborators refresh#17124mumrah merged 15 commits intoapache:trunkfrom
Conversation
…ntributors by lifetime commits, rather than by commits from the last year (as in git shortlog --email --numbered --summary --since=2022-04-28), since we have some prolific contributors in the past who are no longer active.'
…an be 'None'. Ensure the updated '.asf.yaml' file retains the previous format ( retain the comments and only update the 'github_whitelist' list and the 'collaborators' list). Instead of directly merging to trunk, the script will create a new branch for the changes, commit the changes to the new branch, and open a pull request with the updated '.asf.yaml' file.
…nges to the branch, and open a pull request with the updated '.asf.yaml' file. Otherwise, the script will create a new branch for the changes, commit the changes to the new branch, and open a pull request with the updated '.asf.yaml' file.
|
@mimaison, @vvcephei, as you had been on the original PR, I am mentioning you here and @stevenbooke as co-author. I also mention @jlprat as we are discussing in the Jira task. Thank you all! :) |
mumrah
left a comment
There was a problem hiding this comment.
Thanks for the patch @fonsdant!
Left some comments inline.
This looks like a useful addition. It does seem like there are a lot of moving pieces to this. I wonder if it would be sufficient to make it a script that committers can run manually and open their own PR.
| on: | ||
| schedule: | ||
| - cron: "0 0 1 */3 *" |
There was a problem hiding this comment.
Can we make this workflow dispatch rather than cron?
There was a problem hiding this comment.
I think so. I will give it a try!
| one_year_ago: datetime = datetime.now() - timedelta(days=365) | ||
| contributors: Dict[str, int] = {} | ||
|
|
||
| last_year_commits: PaginatedList[Commit] = repo.get_commits(since=one_year_ago) |
There was a problem hiding this comment.
How many pages will this result in? Do we need to worry about API limits?
Since the action runs in the context of the Git repo, can we get the collaborator info there?
E.g., git shortlog -s -n | head -10
There was a problem hiding this comment.
-
I will check it, have only ran it and it seems to work fine for the current repo state.
-
Yes! I have already used collaborator info (login) in:
for contributor in repo.get_contributors(): if contributor.login not in committers:We can perform these operations.
There was a problem hiding this comment.
Since we are authenticated (have a token), I think rate limit is not a problem. See Rate limits for the REST API.
You can use a personal access token to make API requests. Additionally, you can authorize a GitHub App or OAuth app, which can then make API requests on your behalf.
All of these requests count towards your personal rate limit of 5,000 requests per hour. Requests made on your behalf by a GitHub App that is owned by a GitHub Enterprise Cloud organization have a higher rate limit of 15,000 requests per hour. Similarly, requests made on your behalf by a OAuth app that is owned or approved by a GitHub Enterprise Cloud organization have a higher rate limit of 15,000 requests per hour if you are a member of the GitHub Enterprise Cloud organization.
In the past 4 years, we count 4248 commits. Even if we request each, we will be within the rate limit for authenticated users.
$ git rev-list --count --since="3 year ago" origin/trunk
4248
| logging.info(f"Fetching committers from the repository {REPO_KAFKA_SITE}") | ||
| committers_file: ContentFile = repo.get_contents("committers.html") | ||
| content: bytes = committers_file.decoded_content | ||
| soup: BeautifulSoup = BeautifulSoup(content, "html.parser") |
There was a problem hiding this comment.
This seems a little brittle. Is there another way we could identify committer vs contributor? Perhaps the GH API?
There was a problem hiding this comment.
I have picked it from the previous code. I will try to look for another approach.
There was a problem hiding this comment.
Ah, I didn't realize there was some extant code for this. Maybe we can keep this approach, even if it's not perfect.
Thanks for reviewing, David! :) Yes, it is possible to run the script locally/manually. To do this, I have generated a GitHub personal access token and used it as GITHUB_TOKEN. One can replace I have thought about to create a Gradle task to run it, like |
|
@fonsdant, let's do this for now:
If the committers find ourselves really wanting some automation, we can address it in a future PR. For now, the complications of making the script reliable and dealing with automatically opening a PR seem unnecessary. Just having a script to analyze the history is quite useful and accomplishes 90% of the goal :) WDYT? |
Alright! LGTM! With Remove the Github bits, should I remove the PR generation, right? We still need GitHub client to get repo contributors as Josep have stated:
|
Yes, sorry I really meant "Github Actions bits" |
|
Test with updated script: $ python refresh_collaborators.py
{"timestamp":"2024-09-10 13:52:22,013","level":"INFO","message":"Successfully initialized GitHub client"}
{"timestamp":"2024-09-10 13:52:22,726","level":"INFO","message":"Fetching committers from the repository apache/kafka-site"}
{"timestamp":"2024-09-10 13:52:23,906","level":"INFO","message":"Found 57 committers"}
{"timestamp":"2024-09-10 13:52:24,955","level":"INFO","message":"Fetching contributors from the repository apache/kafka"}
{"timestamp":"2024-09-10 13:53:58,883","level":"INFO","message":"Found 10 top contributors who are not committers"}
{"timestamp":"2024-09-10 13:53:58,890","level":"INFO","message":"Updating ../.asf.yaml with 10 new collaborators"}
{"timestamp":"2024-09-10 13:53:58,894","level":"INFO","message":"Local file ../.asf.yaml updated successfully"}Diff for $ git diff .asf.yaml | cat
diff --git a/.asf.yaml b/.asf.yaml
index ed20450a6a..136d5ac3d1 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -16,8 +16,8 @@
# under the License.
notifications:
- commits: commits@kafka.apache.org
- issues: jira@kafka.apache.org
+ commits: commits@kafka.apache.org
+ issues: jira@kafka.apache.org
pullrequests: jira@kafka.apache.org
jira_options: link label
@@ -25,28 +25,25 @@ notifications:
# https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-JenkinsPRwhitelisting
jenkins:
github_whitelist:
- - FrankYang0529
- - kamalcph
- - apoorvmittal10
- - lianetm
- - brandboat
- - kirktrue
- - nizhikov
- - OmniaGM
- - dongnuo123
- - frankvicky
-
-# This list allows you to triage pull requests. It can have a maximum of 10 people.
-# https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-AssigningexternalcollaboratorswiththetriageroleonGitHub
+ - FrankYang0529
+ - apoorvmittal10
+ - brandboat
+ - frankvicky
+ - AndrewJSchofield
+ - m1a2st
+ - kirktrue
+ - kamalcph
+ - nizhikov
+ - OmniaGM
github:
collaborators:
- - FrankYang0529
- - kamalcph
- - apoorvmittal10
- - lianetm
- - brandboat
- - kirktrue
- - nizhikov
- - OmniaGM
- - dongnuo123
- - frankvicky
\ No newline at end of file
+ - FrankYang0529
+ - apoorvmittal10
+ - brandboat
+ - frankvicky
+ - AndrewJSchofield
+ - m1a2st
+ - kirktrue
+ - kamalcph
+ - nizhikov
+ - OmniaGM |
mumrah
left a comment
There was a problem hiding this comment.
Looks great! Just two small things
|
@fonsdant Looks like the license checker is failing due to requirements.txt https://github.com/apache/kafka/actions/runs/10797389243/job/29948652044?pr=17124#step:6:4792 |
Fixed! I have missed adding license to it |
| from github.ContentFile import ContentFile | ||
| from github.PaginatedList import PaginatedList | ||
| from github.Repository import Repository | ||
| from ruamel.yaml import YAML |
There was a problem hiding this comment.
Is there a reason why we need this particular YAML parser instead of using the more common PyYAML?
There was a problem hiding this comment.
Yep, PyYAML does not handle comments well and removes them. ruamel.yaml can handle them since they are inside collections as I have described at PR description (see paragraph For update_yaml_content function)
mumrah
left a comment
There was a problem hiding this comment.
I tested this out locally and it seems to work. LGTM!
|
@mumrah many, many thanks! It have been my first contribution merged! I am very happy! 😁 |
|
@fonsdant thanks for your contribution -- hope to see more in the future 😄 |
Add a Python script that analyzes our Git history to find top contributors. This can be used by committers to update the list of contributors in .asf.yaml without a lot of tedious effort. Co-authored-by: stevenbooke <steviebeee55@gmail.com> Co-authored-by: Joao Pedro Fonseca <fonsdant@gmail.com> Reviewers: David Arthur <mumrah@gmail.com>
This PR resumes and improves the work done on #13842.
I have refacted the script and workflow files, breaking the code into functions and adjusting naming and operations. Now we have three files:
refresh_collaborators.py,requirements.txt, andrefresh_collaborators.yml. I have replaced the dash with an underscore in filenames to adequate them to the standards for Python (see PEP-8) and GitHub Workflow YAML files as other workflows in the project use snake case. I have created a requirements file to facilitate dependency management.For
refresh_collaborators.py, I have broken the code into functions. I hope it helps us read and maintain the code. There are some observations forget_top_contributorsandupdate_yaml_contentfunctions.For
get_top_contributorsfunction, the best approach I have used to do it is to filter commits of last year and apply a search by its authors in repo contributors. I cannot find a way as out-of-the-box as insight graphs of the contributor's page as neither PyGitHub nor GitHub REST API offers an interface to get contributors by top contributions for last year (see GitHub-List-Contributors and PyGitHub-Get-Contributors).For
update_yaml_contentfunction, I have usedcollaborators.copy()to avoid to create anchors. Although ruamel preserves comments, I have noticed that ruamel has been unable to parse them correctly, causing missing comments forgithub.collaborators, like:So I have discovered it works well since we move comments into collections, like:
This approach seems right for me too as it pretty documents the list (github_whitelist/collaborators), not the collection (jenkins/github).
I have run a test on my fork (see Test). Disregard the missing Jenkins comment as it was not fixed when I ran this test.