Skip to content

Conversation

@shubham36deshpande
Copy link
Contributor

@shubham36deshpande shubham36deshpande commented Jun 16, 2025

…Helm chart

This PR introduces a TTL (time-to-live) mechanism in conjunction with an LRU (least-recently-used) cache for StatsD metric aggregation. The change is designed to automatically clean up stale or unused metric entries, preventing uncontrolled memory growth in long-running statsd daemons—a problem highlighted in issue #50645.

closes: #50645


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@boring-cyborg
Copy link

boring-cyborg bot commented Jun 16, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@boring-cyborg boring-cyborg bot added the area:helm-chart Airflow Helm Chart label Jun 16, 2025
@jscheffl
Copy link
Contributor

Some static checks fail....

though I am not an statsd expert... @AutomationDev85 Do you have an opinion to the configs?

Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for my side. Thanks.

But as I am not a real expert in Helm... @jedcunningham can you make a second pass opinion on the PR?

@eladkal
Copy link
Contributor

eladkal commented Jul 3, 2025

tests are failing

Copy link
Member

@jedcunningham jedcunningham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add a newsfragment describing the new values and explaining how to get the old behavior back.

@jscheffl jscheffl requested a review from Miretpl December 10, 2025 21:36
Copy link
Contributor

@Miretpl Miretpl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to drop the re-review request, as there were no changes related to my previous review (despite newsfragment).

@shubham36deshpande, when the changes will be introduced, please request re-review

@shubham36deshpande
Copy link
Contributor Author

@Miretpl can you please review he latest changes, I have refactored the options like cache.sizem cache.type and cache.ttl as you suggested.

@jscheffl
Copy link
Contributor

@shubham36deshpande Can you please also run (at least) static checks before pushing? There are not 83 commits on this PR but either it was semantically incorrect of in most cases simple tests failed in CI. Would save time for you, CI and me taking a look at this PR if it is basic checked locally before pushing over-and-over basic errors.

Copy link
Contributor

@Miretpl Miretpl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+ what Jens mentioned. I guess you did not configure the pre-commit for checking the code. You have a nice guide here https://github.com/apache/airflow/blob/main/contributing-docs/README.rst

@@ -0,0 +1,11 @@
StatsD metrics aggregation now supports configurable TTL-enabled LRU cache to prevent memory growth in long-running daemons.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After changing in the values, you need to update the information in newsfragment file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thats for pointing it out, I have fixed the mistake.

Comment on lines 99 to 111
{{- if .Values.statsd.cache.size }}
- "--statsd.cache-size={{ .Values.statsd.cache.size }}"
{{- end }}
{{- if .Values.statsd.cache.type }}
- "--statsd.cache-type={{ .Values.statsd.cache.type }}"
{{- end }}
{{- if .Values.statsd.cache.ttl }}
- "--ttl={{ .Values.statsd.cache.ttl }}"
{{- end }}
{{- if .Values.statsd.args }}
{{- range $arg := .Values.statsd.args }}
- {{ $arg | quote }}
{{- end }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe inside of statsd.cache.enabled flag 🤔, but I don't have a strong opinion about that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Miretpl - statsd.enabled is outside of cache as people might want to enable statsd without cache options.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shubham36deshpande I mentioned statsd.cache.enabled (it does not exist in the current version, but it replicates the behaviour of statsd.cache flag from the first version of the PR), not statsd.enabled. As you mentioned, people might want to have statsd enabled without cache (it is also for preserving backward compatibility). In the first version of the PR, when we had the statsd.cache flag for it (btw. it is still mentioned in the newsfragment file) it was easier to do and I think that when we have the statsd.cache section now, it would be beneficial to add a flag for it e.g. statsd.cache.enabled with default false value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense, I will implement it in the next commit.

@shubham36deshpande
Copy link
Contributor Author

@jscheffl - Sorry for that, I was running breeze tests but didnt run the CI static check, thats why it failed. I have fixed the error and ran the CI and helm-tests again.
I will be careful from next time.

@kunaljubce
Copy link
Contributor

kunaljubce commented Jan 2, 2026

@jscheffl - Sorry for that, I was running breeze tests but didnt run the CI static check, thats why it failed. I have fixed the error and ran the CI and helm-tests again. I will be careful from next time.

Hey @shubham36deshpande, most of the static checks are configured as prek pre-commit hooks, so just have the pre-commit hooks installed and most of your linting/static check issues would get autofixed when you're pushing your changes.

@shubham36deshpande
Copy link
Contributor Author

image @jscheffl , the test for airflow_core are passing in my local but it is failing here, what might be the reason? Is there anything I am doing incorrectly in PR?

@jscheffl
Copy link
Contributor

@jscheffl , the test for airflow_core are passing in my local but it is failing here, what might be the reason? Is there anything I am doing incorrectly in PR?

Also not sure but I can say that the same error like in the CI happens on my machine. Maybe can you breeze cleanup and rebuild all containers?

# from the exported /metrics output.
# Format: Go duration string (e.g. "30s", "5m", "1h")
# Default: "0s" (disabled, never expires)
ttl: "0s"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to apply this parameter but then statsd fails to start with:
statsd_exporter: error: unknown long flag '--ttl', try --help

--help gives me:

usage: statsd_exporter [<flags>]                                                                                                                                                                                                            
                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                            
Flags:                                                                                                                                                                                                                                      
  -h, --[no-]help               Show context-sensitive help (also try                                                                                                                                                                       
                                --help-long and --help-man).                                                                                                                                                                                
      --web.listen-address=":9102"                                                                                                                                                                                                          
                                The address on which to expose the web interface                                                                                                                                                            
                                and generated Prometheus metrics.                                                                                                                                                                           
      --[no-]web.enable-lifecycle                                                                                                                                                                                                           
                                Enable shutdown and reload via HTTP request.                                                                                                                                                                
      --web.telemetry-path="/metrics"                                                                                                                                                                                                       
                                Path under which to expose metrics.                                                                                                                                                                         
      --statsd.listen-udp=":9125"                                                                                                                                                                                                           
                                The UDP address on which to receive statsd                                                                                                                                                                  
                                metric lines. "" disables it.                                                                                                                                                                               
      --statsd.listen-tcp=":9125"                                                                                                                                                                                                           
                                The TCP address on which to receive statsd                                                                                                                                                                  
                                metric lines. "" disables it.                                                                                                                                                                               
      --statsd.listen-unixgram=""                                                                                                                                                                                                           
                                The Unixgram socket path to receive statsd                                                                                                                                                                  
                                metric lines in datagram. "" disables it.                                                                                                                                                                   
      --statsd.unixsocket-mode="755"                                                                                                                                                                                                        
                                The permission mode of the unix socket.                                                                                                                                                                     
      --statsd.mapping-config=STATSD.MAPPING-CONFIG                                                                                                                                                                                         
                                Metric mapping configuration file name.                                                                                                                                                                     
      --statsd.read-buffer=STATSD.READ-BUFFER                                                                                                                                                                                               
                                Size (in bytes) of the operating system's                                                                                                                                                                   
                                transmit read buffer associated with the UDP or                                                                                                                                                             
                                Unixgram connection. Please make sure the kernel                                                                                                                                                            
                                parameters net.core.rmem_max is set to a value                                                                                                                                                              
                                greater than the value specified.                                                                                                                                                                           
      --statsd.cache-size=1000  Maximum size of your metric mapping cache.                                                                                                                                                                  
                                Relies on least recently used replacement policy                                                                                                                                                            
                                if max size is reached.                                                                                                                                                                                     
      --statsd.cache-type=lru   Metric mapping cache type. Valid options are                                                                                                                                                                
                                "lru" and "random"                                                                                                                                                                                          
      --statsd.event-queue-size=10000                                                                                                                                                                                                       
                                Size of internal queue for processing events.                                                                                                                                                               
      --statsd.event-flush-threshold=1000                                                                                                                                                                                                   
                                Number of events to hold in queue before                                                                                                                                                                    
                                flushing.                                                                                                                                                                                                   
      --statsd.event-flush-interval=200ms                                                                                                                                                                                                   
                                Maximum time between event queue flushes.                                                                                                                                                                   
      --debug.dump-fsm=""       The path to dump internal FSM generated for glob                                                                                                                                                            
                                matching as Dot file.                                                                                                                                                                                       
      --[no-]check-config       Check configuration and exit.                                                                                                                                                                               
      --[no-]statsd.parse-dogstatsd-tags                                                                                                                                                                                                    
                                Parse DogStatsd style tags. Enabled by default.                                                                                                                                                             
      --[no-]statsd.parse-influxdb-tags                                                                                                                                                                                                     
                                Parse InfluxDB style tags. Enabled by default.                                                                                                                                                              
      --[no-]statsd.parse-librato-tags                                                                                                                                                                                                      
                                Parse Librato style tags. Enabled by default.                                                                                                                                                               
      --[no-]statsd.parse-signalfx-tags                                                                                                                                                                                                     
                                Parse SignalFX style tags. Enabled by default.                                                                                                                                                              
      --statsd.relay.address=STATSD.RELAY.ADDRESS                                                                                                                                                                                           
                                The UDP relay target address (host:port)                                                                                                                                                                    
      --statsd.relay.packet-length=1400                                                                                                                                                                                                     
                                Maximum relay output packet length to avoid                                                                                                                                                                 
                                fragmentation                                                                                                                                                                                               
      --statsd.udp-packet-queue-size=10000                                                                                                                                                                                                  
                                Size of internal queue for processing UDP                                                                                                                                                                   
                                packets.                                                                                                                                                                                                    
      --log.level=info          Only log messages with the given severity or                                                                                                                                                                
                                above. One of: [debug, info, warn, error]                                                                                                                                                                   
      --log.format=logfmt       Output format of log messages. One of: [logfmt,                                                                                                                                                             
                                json]                                                                                                                                                                                                       
      --[no-]version            Show application version.    

I tested with statsd_expoerter v0.28.0 - is this only available of a fork or a manually built image?

@jscheffl
Copy link
Contributor

Indirectly via PR #60933 the fix is landed on main, closing this PR - thanks for the (indirect) contribution! With this also your efforts landed on main!

@jscheffl jscheffl closed this Jan 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:helm-chart Airflow Helm Chart

Projects

None yet

Development

Successfully merging this pull request may close these issues.

StatsD memory accumulation causes probe failures and restarts

7 participants