The GitHub crawler does not seem to wait between each query to the GitHub API (API rate limit exceeded) #109

@SyleKu

Summary

When crawling a GitHub organisation that has a lot of repositories, the API rate limit is exceeded at the very beginning of the crawl.

Type of Issue

It is a :

  • bug
  • request
  • question regarding the documentation

Motivation

I am trying to crawl a GitHub organisation (https://github.com/python), but at a very early stage of the crawl github-crawler-starter-2.0.1-exec.jar throws this error:

2022-04-27 11:35:00.382 ERROR 23962 --- [           main] ication$$EnhancerBySpringCGLIB$$1b1fa732 : problem while running github crawler


com.fasterxml.jackson.module.kotlin.MissingKotlinParameterException: Instantiation of [simple type, class com.societegenerale.githubcrawler.model.SearchResult] value failed for JSON property items due to missing (therefore NULL) value for creator parameter items which is a non-nullable type
 at [Source: (String)"{"message":"API rate limit exceeded for user ID [USERID].","documentation_url":"https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"}"; line: 1, column: 158] (through reference chain: com.societegenerale.githubcrawler.model.SearchResult["items"])

[...]

or

com.fasterxml.jackson.module.kotlin.MissingKotlinParameterException: Instantiation of [simple type, class com.societegenerale.githubcrawler.model.SearchResult] value failed for JSON property items due to missing (therefore NULL) value for creator parameter items which is a non-nullable type
 at [Source: (String)"{
  "documentation_url": "https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits",
  "message": "You have exceeded a secondary rate limit. Please wait a few minutes before you try again."
}
"; line: 4, column: 1] (through reference chain: com.societegenerale.githubcrawler.model.SearchResult["items"])
        at com.fasterxml.jackson.module.kotlin.KotlinValueInstantiator.createFromObjectWith(KotlinValueInstantiator.kt:116) ~[jackson-module-kotlin-2.12.6.jar!/:2.12.6]
        at com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build(PropertyBasedCreator.java:202) ~[jackson-databind-2.12.6.jar!/:2.12.6]
        at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:520) ~[jackson-databind-2.12.6.jar!/:2.12.6]

[...]

Both errors are caused by the rate limit: GitHub returns an error message instead of a search result, so the items property that SearchResult requires is missing from the response.
Unfortunately, the application terminates at this point.
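
To illustrate the point above, here is a minimal Java sketch of how a client could recognise a rate-limit response before trying to deserialize it into SearchResult, and work out how long to wait. It is not the crawler's actual code: the class RateLimitCheck and the method waitBeforeRetry are hypothetical names, the header map is assumed to use lower-case keys, and the one-minute fallback is an assumption. The header names (retry-after, x-ratelimit-remaining, x-ratelimit-reset) and the 403/429 status codes are the ones described in GitHub's rate-limit documentation.

```java
import java.time.Duration;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of github-crawler.
class RateLimitCheck {

    // Assumed fallback when the response gives no hint about how long to wait.
    static final Duration DEFAULT_WAIT = Duration.ofMinutes(1);

    /**
     * Returns the time to wait before retrying, or empty if the response
     * does not look like a rate-limit error.
     * Assumption: header names in the map are lower-case.
     */
    static Optional<Duration> waitBeforeRetry(int status, String body,
                                              Map<String, String> headers,
                                              long nowEpochSeconds) {
        boolean limitStatus = status == 403 || status == 429;
        // Crude text check; both messages quoted in this issue contain "rate limit".
        boolean limitMessage = body != null
                && body.toLowerCase(Locale.ROOT).contains("rate limit");
        if (!limitStatus || !limitMessage) {
            return Optional.empty();
        }
        // Secondary rate limits may come with an explicit retry-after (seconds).
        String retryAfter = headers.get("retry-after");
        if (retryAfter != null) {
            return Optional.of(Duration.ofSeconds(Long.parseLong(retryAfter.trim())));
        }
        // Primary rate limit: wait until the reset time (epoch seconds, UTC).
        String reset = headers.get("x-ratelimit-reset");
        if ("0".equals(headers.get("x-ratelimit-remaining")) && reset != null) {
            long seconds = Long.parseLong(reset.trim()) - nowEpochSeconds;
            return Optional.of(Duration.ofSeconds(Math.max(0, seconds)));
        }
        return Optional.of(DEFAULT_WAIT);
    }
}
```

With a check like this, the caller could sleep for the returned duration and retry the query instead of passing the error body to the JSON deserializer.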

Current Behavior

Whether crawl-in-parallel is set to true or false, the rate limit is always exceeded.

Expected Behavior

There should be a default setting that respects the GitHub API rate limits by making the application wait between queries (for example, 10 seconds) to avoid getting banned. Users should also be able to change this wait time in the config file.
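
As a rough illustration of this request, the following Java sketch enforces a minimum interval between queries. Everything in it is hypothetical: RequestThrottle is not an existing class in the crawler, and the interval passed to the constructor stands in for a value that would be read from the config file. The reservation step is synchronized so that the same spacing would also hold when queries are issued from several threads.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

// Hypothetical throttle, not part of github-crawler.
class RequestThrottle {

    private final long minIntervalMillis;
    private final LongSupplier clockMillis;
    // Earliest time (in clock millis) at which the next query may start.
    private long nextAllowedAt = Long.MIN_VALUE;

    // minInterval would come from a (hypothetical) config setting, e.g. 10 seconds.
    RequestThrottle(Duration minInterval, LongSupplier clockMillis) {
        this.minIntervalMillis = minInterval.toMillis();
        this.clockMillis = clockMillis;
    }

    RequestThrottle(Duration minInterval) {
        this(minInterval, System::currentTimeMillis);
    }

    /**
     * Reserves the next query slot and returns how many milliseconds
     * the caller has to wait before using it (0 if it can start now).
     */
    synchronized long reserveDelayMillis() {
        long now = clockMillis.getAsLong();
        long start = Math.max(now, nextAllowedAt);
        nextAllowedAt = start + minIntervalMillis;
        return start - now;
    }

    /** Blocks until the reserved slot is reached; call before each API query. */
    void awaitTurn() throws InterruptedException {
        long delay = reserveDelayMillis();
        if (delay > 0) {
            Thread.sleep(delay);
        }
    }
}
```

A caller would create one shared instance, for example new RequestThrottle(Duration.ofSeconds(10)), and call awaitTurn() before every request to the GitHub API. A fixed delay is the simplest option; it does not by itself guarantee that the hourly quota is never exceeded.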

I hope this is already possible in the current release. If I missed it, could you please let me know what I have to configure so that the crawler respects the GitHub API rate limits?

Please do not hesitate to contact me if you need more information.
