
Conversation

@bouweandela
Contributor

@bouweandela bouweandela commented Apr 29, 2025

Description

Checklist

Please confirm that this pull request has done the following:

  • Data registry is up to date (if necessary, regenerate it by commenting /regenerate on this PR)
  • Documentation added (where applicable)
  • Changelog item added to changelog/

@bouweandela
Contributor Author

/regenerate

@github-actions

PR comment handling
Something went wrong!

Details: https://github.com/Climate-REF/ref-sample-data/actions/runs/14730933211

@bouweandela
Contributor Author

/regenerate

@github-actions

PR comment handling
Something went wrong!

Details: https://github.com/Climate-REF/ref-sample-data/actions/runs/14736606473

@bouweandela
Contributor Author

It looks like there is an issue with regridding the data.

[Image: Time slice before regridding]

Commits:
  • Make regridding work with data on model levels
  • Add log messages
@bouweandela
Contributor Author

@lewisjared Is there an easy way to test if Climate-REF/climate-ref#261 works with the data added in this pull request without first merging this?

@bouweandela
Contributor Author

bouweandela commented May 20, 2025

@nocollier I tried to use the streaming feature in the fetch_test_data.py script, but I get an assertion error coming from intake-esgf if I do that. Would you have time to take a look?

@nocollier
Contributor

nocollier commented May 20, 2025

Sigh... so I run this:

import intake_esgf

# Query all ESGF index nodes, not just the default one.
intake_esgf.conf.set(all_indices=True)
cat = intake_esgf.ESGFCatalog().search(
    project="obs4MIPs", variable_id="ta", source_id="ERA-5"
)
# Prefer streaming access (OPeNDAP links) over downloading where available.
dpd = cat.to_path_dict(prefer_streaming=True, minimal_keys=False)

and 1 out of 3 times I get the 44 links. The assertion you are seeing is a check I have in place to make sure that when I partition the file info based on how we access/transfer the data, we don't lose a file somewhere in the logic. If a user prefers streaming, I added a select-link function that returns the fastest link, i.e. the first one that returns a response. If none of them respond, the logic should break out and the file will be queued for HTTPS download. I am not sure what is happening, but it looks like checking the OPeNDAP server for the status of 44 links may be what makes it fail.
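
For illustration only, here is a rough sketch of that select-link idea, not the actual intake-esgf implementation: probe each OPeNDAP link in turn and return the first one that responds, or None so the caller can queue the file for HTTPS download instead (a real "fastest link" check would probe the links concurrently).

import requests


def select_responsive_link(links: list[str], timeout: float = 10.0) -> str | None:
    """Return the first link whose server responds, or None if none do."""
    for link in links:
        try:
            # A lightweight request is enough to see whether the server is up.
            response = requests.head(link, timeout=timeout, allow_redirects=True)
            if response.ok:
                return link
        except requests.RequestException:
            continue
    # No server responded; the caller should queue the file for HTTPS download.
    return None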

In the short term we could try something that should work but is a little messy. You could:

import intake_esgf

intake_esgf.conf.set(all_indices=True)
cat = intake_esgf.ESGFCatalog().search(
    project="obs4MIPs", variable_id="ta", source_id="ERA-5"
)
# _get_file_info() is a private method, hence "a little messy", but it returns
# the raw per-file access information without any reduction.
infos = cat._get_file_info()

This will return a list of dictionaries which look like:

{
    "key": "obs4MIPs.ECMWF.ERA-5.mon.ta.gn",
    "dataset_id": "obs4MIPs.ECMWF.ERA-5.mon.ta.gn.v20250220|esgf-data2.llnl.gov",
    "checksum_type": "SHA256",
    "checksum": "d3c14ba6fb16a49cef03ea622bae954f29555c1fe7df4af6d4a272cf160a2eaa",
    "size": 494866132,
    "HTTPServer": [
        "https://esgf-data2.llnl.gov/thredds/fileServer/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
        "https://esgf-data2.llnl.gov/thredds/fileServer/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
    ],
    "OPENDAP": [
        "https://esgf-data2.llnl.gov/thredds/dodsC/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
        "https://esgf-data2.llnl.gov/thredds/dodsC/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
    ],
    "Globus": [
        "globus:1889ea03-25ad-4f9f-8110-1ce8833a9d7e/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
        "globus:1889ea03-25ad-4f9f-8110-1ce8833a9d7e/user_pub_work/obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc",
    ],
    "path": PosixPath(
        "obs4MIPs/ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc"
    ),
}
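
As an illustration only (not code from this PR), a file could be fetched from the first HTTPServer link in such a dictionary and verified against the reported checksum with something like the sketch below; download_and_verify is a hypothetical helper name.

import hashlib
from pathlib import Path

import requests


def download_and_verify(info: dict, dest_root: Path) -> Path:
    """Download one file described by an intake-esgf info dict and check its checksum."""
    dest = dest_root / info["path"]
    dest.parent.mkdir(parents=True, exist_ok=True)
    digest = hashlib.new(info["checksum_type"].lower())  # e.g. "SHA256" -> sha256
    with requests.get(info["HTTPServer"][0], stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(dest, "wb") as handle:
            for chunk in response.iter_content(chunk_size=1 << 20):
                handle.write(chunk)
                digest.update(chunk)
    if digest.hexdigest() != info["checksum"]:
        raise ValueError(f"Checksum mismatch for {dest}")
    return dest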

Longer term, I may need to think of a way for a user to get the detailed information that I haven't tried to reduce for them. OPeNDAP may not be around much longer, but I have a feeling users will find ways to break things no matter what, and sometimes providing all the information is the best we can do.

@nocollier
Contributor

I also need to generate some large datasets (thetao for an ocean diagnostic). I am going to try to get the timestamp feature implemented in intake-esgf, and maybe that will work for this too. Sorry for the trouble.

@bouweandela
Contributor Author

The timestamp thing would definitely help a bit here. By the way, are you aware of NetCDF Byterange Support? Most servers on ESGF support this type of access, so that could be a way to support streaming without using OPeNDAP.
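
For reference, a minimal sketch of what byte-range streaming could look like from xarray, assuming the local netCDF-C library (4.7+) was built with byte-range support: appending #mode=bytes to a plain HTTPServer URL makes the netcdf4 engine read the file via HTTP range requests instead of OPeNDAP. The URL is the ERA-5 HTTPServer link from the file info above.

import xarray as xr

# Plain HTTPServer link (not the dodsC/OPeNDAP endpoint); "#mode=bytes" asks
# netCDF-C to read the file via HTTP byte-range requests.
url = (
    "https://esgf-data2.llnl.gov/thredds/fileServer/user_pub_work/obs4MIPs/"
    "ECMWF/ERA-5/mon/ta/gn/v20250220/ta_mon_ERA-5_PCMDI_gn_197901-197912.nc"
    "#mode=bytes"
)
ds = xr.open_dataset(url, engine="netcdf4")  # data is read lazily over HTTP
print(ds["ta"])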

@lewisjared
Contributor

@lewisjared Is there an easy way to test if Climate-REF/climate-ref#261 works with the data added in this pull request without first merging this?

You can use a different directory for the test data by specifying REF_TEST_OUTPUT. If REF_TEST_OUTPUT is specified, no fetching is done. This is largely untested, but will likely work.

https://climate-ref.readthedocs.io/en/latest/configuration/#ref_test_output

You might want to merge main back into this so you can use the --slug argument to the script to only run a specified set of requests.

@bouweandela
Contributor Author

@lewisjared Do you understand why installing climate-ref as a dependency is failing in CI? It works fine on my laptop. I'm also not sure why climate-ref is needed as a dependency; do you use it for testing purposes?

@bouweandela
Contributor Author

@lewisjared Is there an easy way to test if https://github.com/Climate-REF/climate-ref/pull/261 works with the data added in this pull request without first merging this?

You can use a different directory for the test data by specifying REF_TEST_OUTPUT

Unfortunately, that didn't work. Making a change like this worked, though it still downloads the existing sample data too. Is that something you would like included in the ref package, or is it too specific to justify yet another environment variable?

diff --git a/conftest.py b/conftest.py
index b15cc97e..a9881067 100644
--- a/conftest.py
+++ b/conftest.py
@@ -93,6 +93,8 @@ def test_data_dir() -> Path:
 
 @pytest.fixture(scope="session")
 def sample_data_dir(test_data_dir) -> Path:
+    if "REF_SAMPLE_DATA_DIR" in os.environ:
+        return Path(os.environ["REF_SAMPLE_DATA_DIR"])
     return test_data_dir / "sample-data"

@lewisjared
Contributor

Apologies. I must have had that functionality in another branch.

@bouweandela
Contributor Author

I'm not sure why climate-ref is needed as a dependency, do you use this for testing purposes?

I figured it out by removing it; it's used to get the obs4REF data.

@bouweandela force-pushed the cloud-scatterplots-data branch from 12a19fe to 3e25d55 on May 23, 2025 at 08:09
@bouweandela
Contributor Author

why installing climate-ref as a dependency is failing in CI?

It looks like this is a bug in pixi: the way it generates the URL for pip does not seem to be valid with recent versions.

@bouweandela
Contributor Author

bouweandela commented May 26, 2025

Do you have the ERA 5 datasets locally so we can add them to the S3 bucket? They can then be processed via the obs4REF request

@lewisjared Files uploaded and checksums added in Climate-REF/climate-ref#334. I guess I would need that merged before I can use the files in this PR?

@bouweandela
Contributor Author

@nocollier Thanks a lot for your patience and all your help! It looks like the information I needed was already provided by intake-esgf; I'm just not familiar enough with it. To make it even more obvious, maybe a slightly different exception could be raised when files have been found but all file servers providing them are offline, e.g. something like

NoFileServersAvailableError: No servers were available that provide the files in this dataset: {'obs4MIPs.ECMWF.ERA-5.mon.ta.gn'}. Your access options could affect the possibilities.
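
A sketch of what such an exception could look like; the class name and message just follow the suggestion above and are not an existing intake-esgf API:

class NoFileServersAvailableError(Exception):
    """Raised when files were found but every server providing them is offline."""

    def __init__(self, dataset_keys: set[str]):
        super().__init__(
            "No servers were available that provide the files in this dataset: "
            f"{dataset_keys}. Your access options could affect the possibilities."
        )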

@bouweandela
Contributor Author

/regenerate

@github-actions

PR comment handling
The regenerate task is done!

You can find the workflow here:
https://github.com/Climate-REF/ref-sample-data/actions/runs/15271865717

@bouweandela
Contributor Author

/regenerate

@github-actions

PR comment handling
The regenerate task is done!

You can find the workflow here:
https://github.com/Climate-REF/ref-sample-data/actions/runs/15274091552

@bouweandela
Contributor Author

/regenerate

@github-actions

PR comment handling
The regenerate task is done!

You can find the workflow here:
https://github.com/Climate-REF/ref-sample-data/actions/runs/15277486068

@bouweandela
Contributor Author

/regenerate

@github-actions

PR comment handling
The regenerate task is done!

You can find the workflow here:
https://github.com/Climate-REF/ref-sample-data/actions/runs/15279819477

@bouweandela
Contributor Author

/regenerate

@github-actions

PR comment handling
The regenerate task is done!

You can find the workflow here:
https://github.com/Climate-REF/ref-sample-data/actions/runs/15282069379

@lewisjared
Contributor

@bouweandela Do you think the Dask processing will be more performant than just naively processing each dataset in parallel?
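
By "naively processing each dataset in parallel" I mean something along the lines of the sketch below, one worker process per dataset via the standard library and no Dask scheduler; process_dataset and dataset_paths are placeholder names, not names from this repository.

from concurrent.futures import ProcessPoolExecutor


def process_dataset(path):
    ...  # decimate/regrid one input dataset and write the sample file


def process_all(dataset_paths):
    # Each dataset is handled independently in its own worker process.
    with ProcessPoolExecutor() as executor:
        return list(executor.map(process_dataset, dataset_paths))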

I'm happy to merge main back into this for you

@bouweandela
Contributor Author

I'll start a new pull request with the updated main branch.
