Skip to content

KAFKA-14995: Automate asf.yaml collaborators refresh#17124

Merged
mumrah merged 15 commits intoapache:trunkfrom
fonsdant:automate-asf-yaml-contributors-refresh
Sep 11, 2024
Merged

KAFKA-14995: Automate asf.yaml collaborators refresh#17124
mumrah merged 15 commits intoapache:trunkfrom
fonsdant:automate-asf-yaml-contributors-refresh

Conversation

@fonsdant
Copy link
Copy Markdown
Contributor

@fonsdant fonsdant commented Sep 8, 2024

This PR resumes and improves the work done on #13842.

I have refacted the script and workflow files, breaking the code into functions and adjusting naming and operations. Now we have three files: refresh_collaborators.py, requirements.txt, and refresh_collaborators.yml. I have replaced the dash with an underscore in filenames to adequate them to the standards for Python (see PEP-8) and GitHub Workflow YAML files as other workflows in the project use snake case. I have created a requirements file to facilitate dependency management.

For refresh_collaborators.py, I have broken the code into functions. I hope it helps us read and maintain the code. There are some observations for get_top_contributors and update_yaml_content functions.

For get_top_contributors function, the best approach I have used to do it is to filter commits of last year and apply a search by its authors in repo contributors. I cannot find a way as out-of-the-box as insight graphs of the contributor's page as neither PyGitHub nor GitHub REST API offers an interface to get contributors by top contributions for last year (see GitHub-List-Contributors and PyGitHub-Get-Contributors).

For update_yaml_content function, I have used collaborators.copy() to avoid to create anchors. Although ruamel preserves comments, I have noticed that ruamel has been unable to parse them correctly, causing missing comments for github.collaborators, like:

# This list allows you to trigger builds on pull requests. It can have a maximum of 10 people.
# https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-JenkinsPRwhitelisting
jenkins:
  github_whitelist:
    - blabla
github:
  collaborators:
    - blabla

So I have discovered it works well since we move comments into collections, like:

jenkins:
  # This list allows you to trigger builds on pull requests. It can have a maximum of 10 people.
  # https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-JenkinsPRwhitelisting
  github_whitelist:
github:
  # This list allows you to triage pull requests. It can have a maximum of 10 people.
  # https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-AssigningexternalcollaboratorswiththetriageroleonGitHub
  collaborators:

This approach seems right for me too as it pretty documents the list (github_whitelist/collaborators), not the collection (jenkins/github).

I have run a test on my fork (see Test). Disregard the missing Jenkins comment as it was not fixed when I ran this test.

stevenbooke and others added 5 commits September 8, 2024 00:05
…ntributors by lifetime commits, rather than by commits from the last year (as in git shortlog --email --numbered --summary --since=2022-04-28), since we have some prolific contributors in the past who are no longer active.'
…an be 'None'. Ensure the updated '.asf.yaml' file retains the previous format ( retain the comments and only update the 'github_whitelist' list and the 'collaborators' list). Instead of directly merging to trunk, the script will create a new branch for the changes, commit the changes to the new branch, and open a pull request with the updated '.asf.yaml' file.
…nges to the branch, and open a pull request with the updated '.asf.yaml' file. Otherwise, the script will create a new branch for the changes, commit the changes to the new branch, and open a pull request with the updated '.asf.yaml' file.
@fonsdant fonsdant marked this pull request as ready for review September 8, 2024 03:53
@fonsdant
Copy link
Copy Markdown
Contributor Author

fonsdant commented Sep 8, 2024

@mimaison, @vvcephei, as you had been on the original PR, I am mentioning you here and @stevenbooke as co-author. I also mention @jlprat as we are discussing in the Jira task.

Thank you all! :)

Copy link
Copy Markdown
Member

@mumrah mumrah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch @fonsdant!

Left some comments inline.

This looks like a useful addition. It does seem like there are a lot of moving pieces to this. I wonder if it would be sufficient to make it a script that committers can run manually and open their own PR.

Comment on lines +20 to +22
on:
schedule:
- cron: "0 0 1 */3 *"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this workflow dispatch rather than cron?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. I will give it a try!

Comment thread refresh_collaborators.py
one_year_ago: datetime = datetime.now() - timedelta(days=365)
contributors: Dict[str, int] = {}

last_year_commits: PaginatedList[Commit] = repo.get_commits(since=one_year_ago)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many pages will this result in? Do we need to worry about API limits?

Since the action runs in the context of the Git repo, can we get the collaborator info there?

E.g., git shortlog -s -n | head -10

Copy link
Copy Markdown
Contributor Author

@fonsdant fonsdant Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I will check it, have only ran it and it seems to work fine for the current repo state.

  2. Yes! I have already used collaborator info (login) in:

    for contributor in repo.get_contributors():
        if contributor.login not in committers:
    

    We can perform these operations.

Copy link
Copy Markdown
Contributor Author

@fonsdant fonsdant Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are authenticated (have a token), I think rate limit is not a problem. See Rate limits for the REST API.

You can use a personal access token to make API requests. Additionally, you can authorize a GitHub App or OAuth app, which can then make API requests on your behalf.

All of these requests count towards your personal rate limit of 5,000 requests per hour. Requests made on your behalf by a GitHub App that is owned by a GitHub Enterprise Cloud organization have a higher rate limit of 15,000 requests per hour. Similarly, requests made on your behalf by a OAuth app that is owned or approved by a GitHub Enterprise Cloud organization have a higher rate limit of 15,000 requests per hour if you are a member of the GitHub Enterprise Cloud organization.

In the past 4 years, we count 4248 commits. Even if we request each, we will be within the rate limit for authenticated users.

$ git rev-list --count --since="3 year ago" origin/trunk
4248

Comment thread refresh_collaborators.py
Comment on lines +76 to +79
logging.info(f"Fetching committers from the repository {REPO_KAFKA_SITE}")
committers_file: ContentFile = repo.get_contents("committers.html")
content: bytes = committers_file.decoded_content
soup: BeautifulSoup = BeautifulSoup(content, "html.parser")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little brittle. Is there another way we could identify committer vs contributor? Perhaps the GH API?

Copy link
Copy Markdown
Contributor Author

@fonsdant fonsdant Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have picked it from the previous code. I will try to look for another approach.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't realize there was some extant code for this. Maybe we can keep this approach, even if it's not perfect.

@fonsdant
Copy link
Copy Markdown
Contributor Author

fonsdant commented Sep 8, 2024

Thanks for the patch @fonsdant!

Left some comments inline.

This looks like a useful addition. It does seem like there are a lot of moving pieces to this. I wonder if it would be sufficient to make it a script that committers can run manually and open their own PR.

Thanks for reviewing, David! :)

Yes, it is possible to run the script locally/manually. To do this, I have generated a GitHub personal access token and used it as GITHUB_TOKEN. One can replace os.getenv("GITHUB_TOKEN") with its token or set it to this variable in its environment.

I have thought about to create a Gradle task to run it, like gradle refreshCollaborators. This way, one does not need to worry about install Python dependencies and run the script. How about it?

@mumrah
Copy link
Copy Markdown
Member

mumrah commented Sep 9, 2024

@fonsdant, let's do this for now:

  1. Move the new files into a "committer-tools" directory
  2. Remove the Github bits
  3. Add a readme in this directory with install/setup instructions

If the committers find ourselves really wanting some automation, we can address it in a future PR. For now, the complications of making the script reliable and dealing with automatically opening a PR seem unnecessary. Just having a script to analyze the history is quite useful and accomplishes 90% of the goal :)

WDYT?

@fonsdant
Copy link
Copy Markdown
Contributor Author

fonsdant commented Sep 9, 2024

@fonsdant, let's do this for now:

  1. Move the new files into a "committer-tools" directory
  2. Remove the Github bits
  3. Add a readme in this directory with install/setup instructions

If the committers find ourselves really wanting some automation, we can address it in a future PR. For now, the complications of making the script reliable and dealing with automatically opening a PR seem unnecessary. Just having a script to analyze the history is quite useful and accomplishes 90% of the goal :)

WDYT?

Alright! LGTM!

With Remove the Github bits, should I remove the PR generation, right? We still need GitHub client to get repo contributors as Josep have stated:

As we discovered in #16679 using either email or author name doesn't work that nicely as people contribute with different emails (work + personal) and some people with multiple given names commit sometimes with different variations of their name.

The fairer approach we found is applying a filter in the GitHub's collaborators page and remove committers from there. Another benefit of this is that we directly have the GitHub handles at hand. (for example https://github.com/apache/kafka/graphs/contributors?from=2023-07-24&to=2024-07-24&type=c)

@mumrah
Copy link
Copy Markdown
Member

mumrah commented Sep 9, 2024

Remove the Github bits

Yes, sorry I really meant "Github Actions bits"

@fonsdant
Copy link
Copy Markdown
Contributor Author

Test with updated script:

$ python refresh_collaborators.py 
{"timestamp":"2024-09-10 13:52:22,013","level":"INFO","message":"Successfully initialized GitHub client"}
{"timestamp":"2024-09-10 13:52:22,726","level":"INFO","message":"Fetching committers from the repository apache/kafka-site"}
{"timestamp":"2024-09-10 13:52:23,906","level":"INFO","message":"Found 57 committers"}
{"timestamp":"2024-09-10 13:52:24,955","level":"INFO","message":"Fetching contributors from the repository apache/kafka"}
{"timestamp":"2024-09-10 13:53:58,883","level":"INFO","message":"Found 10 top contributors who are not committers"}
{"timestamp":"2024-09-10 13:53:58,890","level":"INFO","message":"Updating ../.asf.yaml with 10 new collaborators"}
{"timestamp":"2024-09-10 13:53:58,894","level":"INFO","message":"Local file ../.asf.yaml updated successfully"}

Diff for .asf.yaml:

$ git diff .asf.yaml | cat
diff --git a/.asf.yaml b/.asf.yaml
index ed20450a6a..136d5ac3d1 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -16,8 +16,8 @@
 # under the License.
 
 notifications:
-  commits:      commits@kafka.apache.org
-  issues:       jira@kafka.apache.org
+  commits: commits@kafka.apache.org
+  issues: jira@kafka.apache.org
   pullrequests: jira@kafka.apache.org
   jira_options: link label
 
@@ -25,28 +25,25 @@ notifications:
 # https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-JenkinsPRwhitelisting
 jenkins:
   github_whitelist:
-    - FrankYang0529
-    - kamalcph
-    - apoorvmittal10
-    - lianetm
-    - brandboat
-    - kirktrue
-    - nizhikov
-    - OmniaGM
-    - dongnuo123
-    - frankvicky
-
-# This list allows you to triage pull requests. It can have a maximum of 10 people.
-# https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=Git+-+.asf.yaml+features#Git.asf.yamlfeatures-AssigningexternalcollaboratorswiththetriageroleonGitHub
+  - FrankYang0529
+  - apoorvmittal10
+  - brandboat
+  - frankvicky
+  - AndrewJSchofield
+  - m1a2st
+  - kirktrue
+  - kamalcph
+  - nizhikov
+  - OmniaGM
 github:
   collaborators:
-    - FrankYang0529
-    - kamalcph
-    - apoorvmittal10
-    - lianetm
-    - brandboat
-    - kirktrue
-    - nizhikov
-    - OmniaGM
-    - dongnuo123
-    - frankvicky
\ No newline at end of file
+  - FrankYang0529
+  - apoorvmittal10
+  - brandboat
+  - frankvicky
+  - AndrewJSchofield
+  - m1a2st
+  - kirktrue
+  - kamalcph
+  - nizhikov
+  - OmniaGM

@fonsdant fonsdant requested a review from mumrah September 10, 2024 17:14
Copy link
Copy Markdown
Member

@mumrah mumrah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just two small things

Comment thread committer-tools/README.md Outdated
Comment thread committer-tools/refresh_collaborators.py Outdated
@mumrah
Copy link
Copy Markdown
Member

mumrah commented Sep 10, 2024

@fonsdant Looks like the license checker is failing due to requirements.txt https://github.com/apache/kafka/actions/runs/10797389243/job/29948652044?pr=17124#step:6:4792

@fonsdant
Copy link
Copy Markdown
Contributor Author

@fonsdant Looks like the license checker is failing due to requirements.txt https://github.com/apache/kafka/actions/runs/10797389243/job/29948652044?pr=17124#step:6:4792

Fixed! I have missed adding license to it

from github.ContentFile import ContentFile
from github.PaginatedList import PaginatedList
from github.Repository import Repository
from ruamel.yaml import YAML
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we need this particular YAML parser instead of using the more common PyYAML?

Copy link
Copy Markdown
Contributor Author

@fonsdant fonsdant Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, PyYAML does not handle comments well and removes them. ruamel.yaml can handle them since they are inside collections as I have described at PR description (see paragraph For update_yaml_content function)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I missed that 👍

Copy link
Copy Markdown
Member

@mumrah mumrah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this out locally and it seems to work. LGTM!

@mumrah mumrah merged commit 794e9a4 into apache:trunk Sep 11, 2024
@fonsdant
Copy link
Copy Markdown
Contributor Author

@mumrah many, many thanks! It have been my first contribution merged! I am very happy! 😁

@mumrah
Copy link
Copy Markdown
Member

mumrah commented Sep 11, 2024

@fonsdant thanks for your contribution -- hope to see more in the future 😄

@fonsdant fonsdant deleted the automate-asf-yaml-contributors-refresh branch November 20, 2024 18:33
tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025
Add a Python script that analyzes our Git history to find top contributors. This can be used by committers to update
the list of contributors in .asf.yaml without a lot of tedious effort. 

Co-authored-by: stevenbooke <steviebeee55@gmail.com>
Co-authored-by: Joao Pedro Fonseca <fonsdant@gmail.com>
Reviewers: David Arthur <mumrah@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants