Add sample data for cloud scatterplots #30

bouweandela · 2025-04-29T12:12:18Z

Description

Add sample data for cloud scatterplots in Add cloud scatterplots climate-ref#261
Change regridding so it does not crash on parametric vertical coordinates
Apply temporal subsetting before regridding
Use Dask arrays instead of NumPy arrays
Use the Dask distributed scheduler

Checklist

Please confirm that this pull request has done the following:

Data registry up to date (regenerate if necessary with a comment on this PR of /regenerate)
Documentation added (where applicable)
Changelog item added to changelog/

bouweandela · 2025-04-29T12:13:16Z

/regenerate

github-actions · 2025-04-29T12:51:47Z

PR comment handling
Something went wrong!

Details: https://github.com/Climate-REF/ref-sample-data/actions/runs/14730933211

bouweandela · 2025-04-29T16:37:06Z

/regenerate

github-actions · 2025-04-29T16:51:04Z

PR comment handling
Something went wrong!

Details: https://github.com/Climate-REF/ref-sample-data/actions/runs/14736606473

bouweandela · 2025-04-29T17:10:35Z

It looks like there is an issue with regridding the data

Time slice before regridding Make regridding work with data on model levels Add log messages

…-scatterplots-data

bouweandela · 2025-05-20T15:13:31Z

@lewisjared Is there an easy way to test if Climate-REF/climate-ref#261 works with the data added in this pull request without first merging this?

bouweandela · 2025-05-20T15:18:24Z

@nocollier I tried to use the streaming feature in the fetch_test_data.py script, but I get an assertion error coming from intake-esgf if I do that. Would you have time to take a look?

nocollier · 2025-05-20T16:29:41Z

Sigh... so I run this:

import intake_esgf

intake_esgf.conf.set(all_indices=True)
cat = intake_esgf.ESGFCatalog().search(
    project="obs4MIPs", variable_id="ta", source_id="ERA-5"
)
dpd = cat.to_path_dict(prefer_streaming=True, minimal_keys=False)

and 1 out of 3 times I get the 44 links. The assertion you are seeing is a check I have to make sure that when I partition out the file info based on how we access/transfer the data, that we didn't lose a file somehow in the logic. If a user prefers streaming, I added this select link function to return the fastest link which first returns a response. If it fails to do so, the logic should break and the file will be queued for https download. I am not sure what is happening, but it looks like checking the OPENDAP server for the status of 44 links maybe makes it fail.

In the short term we could try something that should work but is a little messy. You could:

import intake_esgf

intake_esgf.conf.set(all_indices=True)
cat = intake_esgf.ESGFCatalog().search(
    project="obs4MIPs", variable_id="ta", source_id="ERA-5"
)
infos = cat._get_file_info()

This will return a list of dictionaries which look like:

{
    "key": "obs4MIPs.ECMWF.ERA-5.mon.ta.gn",
    "dataset_id": "obs4MIPs.ECMWF.ERA-5.mon.ta.gn.v20250220|esgf-data2.llnl.gov",
    "checksum_type": "SHA256",
    "checksum": "d3c14ba6fb16a49cef03ea622bae954f29555c1fe7df4af6d4a272cf160a2eaa",
    "size": 494866132,
    "HTTPServer": [
        "https://esgf-data2.llnl.gov/thredds/fileServer/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
        "https://esgf-data2.llnl.gov/thredds/fileServer/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
    ],
    "OPENDAP": [
        "https://esgf-data2.llnl.gov/thredds/dodsC/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
        "https://esgf-data2.llnl.gov/thredds/dodsC/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
    ],
    "Globus": [
        "globus:1889ea03-25ad-4f9f-8110-1ce8833a9d7e/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
        "globus:1889ea03-25ad-4f9f-8110-1ce8833a9d7e/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
    ],
    "path": PosixPath(
        "obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc"
    ),
}

Longer term I may need to think of a way a user can get detailed information that I haven't tried to reduce for them. OPENDAP may not be around much longer but I have a feeling users will break things no matter what and sometimes all the information is the best we can do.

nocollier · 2025-05-20T19:18:16Z

I need to generate some large datasets also (thetao for an ocean diagnostic). I am going to try to get the timestamp thing implemented into intake-esgf and maybe that will work for this too. Sorry for the trouble.

bouweandela · 2025-05-20T19:36:09Z

The timestamp thing would definitely help a bit here. By the way, are you aware of NetCDF Byterange Support? Most servers on ESGF support this type of access, so that could be a way to support streaming without using OPeNDAP.

lewisjared · 2025-05-21T04:27:08Z

@lewisjared Is there an easy way to test if Climate-REF/climate-ref#261 works with the data added in this pull request without first merging this?

You can use a different directory for the test data by specifying REF_TEST_OUTPUT. If REF_TEST_OUTPUT is specified no fetching is done. This is largely untested, but may likely work.

https://climate-ref.readthedocs.io/en/latest/configuration/#ref_test_output

You might want to merge main back into this so you can use the --slug argument to the script to only run a specified set of requests.

bouweandela · 2025-05-22T12:24:53Z

@lewisjared Do you understand why installing climate-ref as a dependency is failing in CI? It works fine on my laptop. I'm not sure why climate-ref is needed as a dependency, do you use this for testing purposes?

bouweandela · 2025-05-23T07:25:01Z

@lewisjared Is there an easy way to test if https://github.com/Climate-REF/clim>ate-ref/pull/261 works with the data added in this pull request without first merging this?

You can use a different directory for the test data by specifying REF_TEST_OUTPUT

Unfortunately that didn't work. Making a change like this worked, though it still downloads the existing sample data too. Is that something you would like to have included in the ref package or is it too specific to create yet another environmental variable?

diff --git a/conftest.py b/conftest.py
index b15cc97e..a9881067 100644
--- a/conftest.py
+++ b/conftest.py
@@ -93,6 +93,8 @@ def test_data_dir() -> Path:
 
 @pytest.fixture(scope="session")
 def sample_data_dir(test_data_dir) -> Path:
+    if "REF_SAMPLE_DATA_DIR" in os.environ:
+        return Path(os.environ["REF_SAMPLE_DATA_DIR"])
     return test_data_dir / "sample-data"

lewisjared · 2025-05-23T07:43:40Z

Apologies. I must have had that functionality in another branch.

bouweandela · 2025-05-23T07:51:10Z

I'm not sure why climate-ref is needed as a dependency, do you use this for testing purposes?

I figured it out by removing it, it's used to get the obs4REF data..

bouweandela · 2025-05-23T11:49:14Z

why installing climate-ref as a dependency is failing in CI?

It looks like this is a bug in pixi, the way to generates the URL for pip does not seem valid with recent versions.

bouweandela · 2025-05-26T21:28:04Z

Do you have the ERA 5 datasets locally so we can add them to the S3 bucket? They can then be processed via the obs4REF request

@lewisjared Files uploaded and checksums added in Climate-REF/climate-ref#334. I guess I would need that merged before I can use the files in this PR?

bouweandela · 2025-05-26T21:33:32Z

@nocollier Thanks a lot for your patience and all your help! It looks like the information I needed was all already provided by intake-esgf, but I'm just not familiar enough with it. To make it even more obvious, maybe a slightly different exception could be raised if files have been found but all fileservers providing them are offline, e.g. something like

NoFileServersAvailableError: No servers were available that provide the files in this dataset: {'obs4MIPs.ECMWF.ERA-5.mon.ta.gn'}. Your access options could affect the 
possibilities.

bouweandela · 2025-05-27T09:37:48Z

/regenerate

github-actions · 2025-05-27T10:03:20Z