KAFKA-14995: Automate asf.yaml collaborators refresh by fonsdant · Pull Request #17124 · apache/kafka

fonsdant · 2024-09-08T03:49:23Z

This PR resumes and improves the work done on #13842.

I have refacted the script and workflow files, breaking the code into functions and adjusting naming and operations. Now we have three files: refresh_collaborators.py, requirements.txt, and refresh_collaborators.yml. I have replaced the dash with an underscore in filenames to adequate them to the standards for Python (see PEP-8) and GitHub Workflow YAML files as other workflows in the project use snake case. I have created a requirements file to facilitate dependency management.

For refresh_collaborators.py, I have broken the code into functions. I hope it helps us read and maintain the code. There are some observations for get_top_contributors and update_yaml_content functions.

For get_top_contributors function, the best approach I have used to do it is to filter commits of last year and apply a search by its authors in repo contributors. I cannot find a way as out-of-the-box as insight graphs of the contributor's page as neither PyGitHub nor GitHub REST API offers an interface to get contributors by top contributions for last year (see GitHub-List-Contributors and PyGitHub-Get-Contributors).

For update_yaml_content function, I have used collaborators.copy() to avoid to create anchors. Although ruamel preserves comments, I have noticed that ruamel has been unable to parse them correctly, causing missing comments for github.collaborators, like:

# This list allows you to trigger builds on pull requests. It can have a maximum of 10 people.
# https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-JenkinsPRwhitelisting
jenkins:
  github_whitelist:
    - blabla
github:
  collaborators:
    - blabla

So I have discovered it works well since we move comments into collections, like:

jenkins:
  # This list allows you to trigger builds on pull requests. It can have a maximum of 10 people.
  # https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-JenkinsPRwhitelisting
  github_whitelist:
github:
  # This list allows you to triage pull requests. It can have a maximum of 10 people.
  # https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-AssigningexternalcollaboratorswiththetriageroleonGitHub
  collaborators:

This approach seems right for me too as it pretty documents the list (github_whitelist/collaborators), not the collection (jenkins/github).

I have run a test on my fork (see Test). Disregard the missing Jenkins comment as it was not fixed when I ran this test.

…ntributors by lifetime commits, rather than by commits from the last year (as in git shortlog --email --numbered --summary --since=2022-04-28), since we have some prolific contributors in the past who are no longer active.'

…an be 'None'. Ensure the updated '.asf.yaml' file retains the previous format ( retain the comments and only update the 'github_whitelist' list and the 'collaborators' list). Instead of directly merging to trunk, the script will create a new branch for the changes, commit the changes to the new branch, and open a pull request with the updated '.asf.yaml' file.

…nges to the branch, and open a pull request with the updated '.asf.yaml' file. Otherwise, the script will create a new branch for the changes, commit the changes to the new branch, and open a pull request with the updated '.asf.yaml' file.

fonsdant · 2024-09-08T03:57:04Z

@mimaison, @vvcephei, as you had been on the original PR, I am mentioning you here and @stevenbooke as co-author. I also mention @jlprat as we are discussing in the Jira task.

Thank you all! :)

mumrah

Thanks for the patch @fonsdant!

Left some comments inline.

This looks like a useful addition. It does seem like there are a lot of moving pieces to this. I wonder if it would be sufficient to make it a script that committers can run manually and open their own PR.

mumrah · 2024-09-08T14:49:32Z

+on:
+  schedule:
+    - cron: "0 0 1 */3 *"


Can we make this workflow dispatch rather than cron?

I think so. I will give it a try!

mumrah · 2024-09-08T14:51:16Z

+    one_year_ago: datetime = datetime.now() - timedelta(days=365)
+    contributors: Dict[str, int] = {}
+
+    last_year_commits: PaginatedList[Commit] = repo.get_commits(since=one_year_ago)


How many pages will this result in? Do we need to worry about API limits?

Since the action runs in the context of the Git repo, can we get the collaborator info there?

E.g., git shortlog -s -n | head -10

I will check it, have only ran it and it seems to work fine for the current repo state.

Yes! I have already used collaborator info (login) in:

for contributor in repo.get_contributors(): if contributor.login not in committers:

We can perform these operations.

Since we are authenticated (have a token), I think rate limit is not a problem. See Rate limits for the REST API.

You can use a personal access token to make API requests. Additionally, you can authorize a GitHub App or OAuth app, which can then make API requests on your behalf.

All of these requests count towards your personal rate limit of 5,000 requests per hour. Requests made on your behalf by a GitHub App that is owned by a GitHub Enterprise Cloud organization have a higher rate limit of 15,000 requests per hour. Similarly, requests made on your behalf by a OAuth app that is owned or approved by a GitHub Enterprise Cloud organization have a higher rate limit of 15,000 requests per hour if you are a member of the GitHub Enterprise Cloud organization.

In the past 4 years, we count 4248 commits. Even if we request each, we will be within the rate limit for authenticated users.

$ git rev-list --count --since="3 year ago" origin/trunk 4248

mumrah · 2024-09-08T14:55:27Z

+    logging.info(f"Fetching committers from the repository {REPO_KAFKA_SITE}")
+    committers_file: ContentFile = repo.get_contents("committers.html")
+    content: bytes = committers_file.decoded_content
+    soup: BeautifulSoup = BeautifulSoup(content, "html.parser")


This seems a little brittle. Is there another way we could identify committer vs contributor? Perhaps the GH API?

I have picked it from the previous code. I will try to look for another approach.

Ah, I didn't realize there was some extant code for this. Maybe we can keep this approach, even if it's not perfect.

fonsdant · 2024-09-08T16:09:48Z

Thanks for the patch @fonsdant!

Left some comments inline.

This looks like a useful addition. It does seem like there are a lot of moving pieces to this. I wonder if it would be sufficient to make it a script that committers can run manually and open their own PR.

Thanks for reviewing, David! :)

Yes, it is possible to run the script locally/manually. To do this, I have generated a GitHub personal access token and used it as GITHUB_TOKEN. One can replace os.getenv("GITHUB_TOKEN") with its token or set it to this variable in its environment.

I have thought about to create a Gradle task to run it, like gradle refreshCollaborators. This way, one does not need to worry about install Python dependencies and run the script. How about it?

mumrah · 2024-09-09T17:37:14Z

@fonsdant, let's do this for now:

Move the new files into a "committer-tools" directory
Remove the Github bits
Add a readme in this directory with install/setup instructions

If the committers find ourselves really wanting some automation, we can address it in a future PR. For now, the complications of making the script reliable and dealing with automatically opening a PR seem unnecessary. Just having a script to analyze the history is quite useful and accomplishes 90% of the goal :)

WDYT?

fonsdant · 2024-09-09T23:24:15Z

@fonsdant, let's do this for now:

Move the new files into a "committer-tools" directory

Remove the Github bits

Add a readme in this directory with install/setup instructions

If the committers find ourselves really wanting some automation, we can address it in a future PR. For now, the complications of making the script reliable and dealing with automatically opening a PR seem unnecessary. Just having a script to analyze the history is quite useful and accomplishes 90% of the goal :)

WDYT?

Alright! LGTM!

With Remove the Github bits, should I remove the PR generation, right? We still need GitHub client to get repo contributors as Josep have stated:

As we discovered in #16679 using either email or author name doesn't work that nicely as people contribute with different emails (work + personal) and some people with multiple given names commit sometimes with different variations of their name.

The fairer approach we found is applying a filter in the GitHub's collaborators page and remove committers from there. Another benefit of this is that we directly have the GitHub handles at hand. (for example https://github.com/apache/kafka/graphs/contributors?from=2023-07-24&to=2024-07-24&type=c)

mumrah · 2024-09-09T23:52:21Z

Remove the Github bits

Yes, sorry I really meant "Github Actions bits"

fonsdant · 2024-09-10T16:58:29Z

Test with updated script:

$ python refresh_collaborators.py 
{"timestamp":"2024-09-10 13:52:22,013","level":"INFO","message":"Successfully initialized GitHub client"}
{"timestamp":"2024-09-10 13:52:22,726","level":"INFO","message":"Fetching committers from the repository apache/kafka-site"}
{"timestamp":"2024-09-10 13:52:23,906","level":"INFO","message":"Found 57 committers"}
{"timestamp":"2024-09-10 13:52:24,955","level":"INFO","message":"Fetching contributors from the repository apache/kafka"}
{"timestamp":"2024-09-10 13:53:58,883","level":"INFO","message":"Found 10 top contributors who are not committers"}
{"timestamp":"2024-09-10 13:53:58,890","level":"INFO","message":"Updating ../.asf.yaml with 10 new collaborators"}
{"timestamp":"2024-09-10 13:53:58,894","level":"INFO","message":"Local file ../.asf.yaml updated successfully"}

Diff for .asf.yaml:

$ git diff .asf.yaml | cat
diff --git a/.asf.yaml b/.asf.yaml
index ed20450a6a..136d5ac3d1 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -16,8 +16,8 @@
 # under the License.
 
 notifications:
-  commits:      commits@kafka.apache.org
-  issues:       jira@kafka.apache.org
+  commits: commits@kafka.apache.org
+  issues: jira@kafka.apache.org
   pullrequests: jira@kafka.apache.org
   jira_options: link label
 
@@ -25,28 +25,25 @@ notifications:
 # https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-JenkinsPRwhitelisting
 jenkins:
   github_whitelist:
-    - FrankYang0529
-    - kamalcph
-    - apoorvmittal10
-    - lianetm
-    - brandboat
-    - kirktrue
-    - nizhikov
-    - OmniaGM
-    - dongnuo123
-    - frankvicky
-
-# This list allows you to triage pull requests. It can have a maximum of 10 people.
-# https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-AssigningexternalcollaboratorswiththetriageroleonGitHub
+  - FrankYang0529
+  - apoorvmittal10
+  - brandboat
+  - frankvicky
+  - AndrewJSchofield
+  - m1a2st
+  - kirktrue
+  - kamalcph
+  - nizhikov
+  - OmniaGM
 github:
   collaborators:
-    - FrankYang0529
-    - kamalcph
-    - apoorvmittal10
-    - lianetm
-    - brandboat
-    - kirktrue
-    - nizhikov
-    - OmniaGM
-    - dongnuo123
-    - frankvicky
\ No newline at end of file
+  - FrankYang0529
+  - apoorvmittal10
+  - brandboat
+  - frankvicky
+  - AndrewJSchofield
+  - m1a2st
+  - kirktrue
+  - kamalcph
+  - nizhikov
+  - OmniaGM

mumrah

Looks great! Just two small things

mumrah · 2024-09-10T18:21:33Z

@fonsdant Looks like the license checker is failing due to requirements.txt https://github.com/apache/kafka/actions/runs/10797389243/job/29948652044?pr=17124#step:6:4792

fonsdant · 2024-09-10T21:33:40Z

@fonsdant Looks like the license checker is failing due to requirements.txt https://github.com/apache/kafka/actions/runs/10797389243/job/29948652044?pr=17124#step:6:4792

Fixed! I have missed adding license to it

mumrah · 2024-09-10T22:38:08Z

+from github.ContentFile import ContentFile
+from github.PaginatedList import PaginatedList
+from github.Repository import Repository
+from ruamel.yaml import YAML


Is there a reason why we need this particular YAML parser instead of using the more common PyYAML?

Yep, PyYAML does not handle comments well and removes them. ruamel.yaml can handle them since they are inside collections as I have described at PR description (see paragraph For update_yaml_content function)

Thanks, I missed that 👍

mumrah

I tested this out locally and it seems to work. LGTM!

fonsdant · 2024-09-11T01:36:14Z

@mumrah many, many thanks! It have been my first contribution merged! I am very happy! 😁

mumrah · 2024-09-11T15:11:02Z

@fonsdant thanks for your contribution -- hope to see more in the future 😄

Add a Python script that analyzes our Git history to find top contributors. This can be used by committers to update the list of contributors in .asf.yaml without a lot of tedious effort. Co-authored-by: stevenbooke <steviebeee55@gmail.com> Co-authored-by: Joao Pedro Fonseca <fonsdant@gmail.com> Reviewers: David Arthur <mumrah@gmail.com>

stevenbooke and others added 5 commits September 8, 2024 00:05

Automate asf.yaml collaborators refresh

3c72094

resume and refact work

40e4c83

fonsdant marked this pull request as ready for review September 8, 2024 03:53

mumrah reviewed Sep 8, 2024

View reviewed changes

move script to committer tools dir

33e5762

fonsdant added 3 commits September 10, 2024 13:59

delete refresh collaborators workflow

58b6f1e

remove commit creation and pr generation

a18af22

add readme

9ee7099

fonsdant requested a review from mumrah September 10, 2024 17:14

mumrah reviewed Sep 10, 2024

View reviewed changes

Comment thread committer-tools/README.md Outdated

Comment thread committer-tools/refresh_collaborators.py Outdated

fonsdant added 2 commits September 10, 2024 18:30

rename readme title

059516f

add license into requirements file

06c5bf8

fonsdant added 4 commits September 10, 2024 18:35

turn logs human friendly

f268b13

add instruction github token with gh cli

2a695b2

add instruction github token with gh cli

b55e9fa

fix typo

597e472

mumrah reviewed Sep 10, 2024

View reviewed changes

mumrah approved these changes Sep 10, 2024

View reviewed changes

mumrah merged commit 794e9a4 into apache:trunk Sep 11, 2024

fonsdant deleted the automate-asf-yaml-contributors-refresh branch November 20, 2024 18:33

Conversation

fonsdant commented Sep 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fonsdant commented Sep 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mumrah left a comment

Choose a reason for hiding this comment

Uh oh!

mumrah Sep 8, 2024

Choose a reason for hiding this comment

Uh oh!

fonsdant Sep 8, 2024

Choose a reason for hiding this comment

Uh oh!

mumrah Sep 8, 2024

Choose a reason for hiding this comment

Uh oh!

fonsdant Sep 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fonsdant Sep 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mumrah Sep 8, 2024

Choose a reason for hiding this comment

Uh oh!

fonsdant Sep 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mumrah Sep 9, 2024

Choose a reason for hiding this comment

Uh oh!

fonsdant commented Sep 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mumrah commented Sep 9, 2024

Uh oh!

fonsdant commented Sep 9, 2024

Uh oh!

mumrah commented Sep 9, 2024

Uh oh!

fonsdant commented Sep 10, 2024

Uh oh!

mumrah left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mumrah commented Sep 10, 2024

Uh oh!

fonsdant commented Sep 10, 2024

Uh oh!

mumrah Sep 10, 2024

Choose a reason for hiding this comment

Uh oh!

fonsdant Sep 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mumrah Sep 10, 2024

Choose a reason for hiding this comment

Uh oh!

mumrah left a comment

Choose a reason for hiding this comment

Uh oh!

fonsdant commented Sep 11, 2024

Uh oh!

mumrah commented Sep 11, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

fonsdant commented Sep 8, 2024 •

edited

Loading

fonsdant commented Sep 8, 2024 •

edited

Loading

fonsdant Sep 8, 2024 •

edited

Loading

fonsdant Sep 9, 2024 •

edited

Loading

fonsdant Sep 8, 2024 •

edited

Loading

fonsdant commented Sep 8, 2024 •

edited

Loading

fonsdant Sep 10, 2024 •

edited

Loading