Initialize STACFrames project #1
Conversation
- Update README with docs
- Add CHANGELOG
- Add RELEASE checklist
- Add test and cibuild scripts
Example generates a catalog for the JPL AVIRIS dataset.
ebd27c5 to 55d0574
pd_dataframe (pandas.DataFrame): A DataFrame of rows structured as described
    by stacframes.item_from.
catalog (pystac.Catalog): The Catalog to add the items in pd_dataframe to.
parents (list): An ordered list of column names on the passed dataframe.
@echeipesh and I discussed a few alternatives to the strategy implemented here, where we only encode the catalog structure when it's written:
- A STAC Extension that would allow a user to encode the hierarchy on an Item. Rejected because it duplicates implicit, flexible structure on the Item that the Item doesn't care about.
- Do the same as in (1), but use the STAC Layer Extension layers the Item belongs to. Rejected because it overloads Layers.
- Do nothing: users define catalogs and collections outside this method, filter their primary dataframe to subsets for each collection, and write each individually. Rejected because it requires too much boilerplate, which eliminates the advantage of using pandas in the first place.
- Write Catalog and Collection objects into the DataFrame to be written. Rejected because, while iterating over each item to write, you can't know whether you've already visited the catalog the item belongs to.
- Write your catalogs and collections to a separate dataframe and use join operations. Rejected because it required the user to write more complicated code.
The selected strategy ended up requiring relatively little extra code, provides a very simple API, and can be extended in the future to let the caller customize the catalog ID for each of these generated sub-catalogs by defining a custom function.
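A minimal sketch of the selected strategy, with plain dicts standing in for pystac Catalog objects (the `add` signature and the get-or-create behavior here are illustrative assumptions, not the actual stacframes code):

```python
import pandas as pd

def add(df, catalog, parents=None):
    """Nest each row under sub-catalogs derived from the values of the
    columns named in `parents` (plain dicts stand in for pystac objects)."""
    for _, row in df.iterrows():
        node = catalog
        for col in parents or []:
            key = str(row[col])
            # get-or-create a child catalog keyed by the column value
            node = node.setdefault("children", {}).setdefault(key, {"id": key})
        node.setdefault("items", []).append(row["id"])
    return catalog

df = pd.DataFrame({
    "id": ["item-1", "item-2"],
    "Year": [2015, 2015],
    "Flight": ["f1", "f2"],
})
root = {"id": "root"}
add(df, root, parents=["Year", "Flight"])
# item-1 is nested under root/2015/f1, item-2 under root/2015/f2
```

The catalog structure only materializes at write time; the DataFrame itself stays flat.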
I'd like to suggest handling the parents relationship at the per-row level rather than on import. The current implementation requires that all information used to determine the parents relationship be part of the item properties. This may not always be desirable.
I'm thinking something like this may be a better alternative:
df = df.apply(map_series_to_item, axis=1)
# manually specify
df['parents'] = ['aviris', 'aviris-demo']
# something handles nesting for you
df['parents'] = stacframes.utils.prefix_parents(['aviris', 'demo'])
# something handles pulling values from properties and nesting for you
df['parents'] = df.apply(stacframes.utils.parents_from_props_fn('Year', 'Flight'), axis=1)
# user writes his own lambda using prefix_parents
df['parents'] = df.apply(lambda r: prefix_parents(r['Year'], r['Flight'], 'extra'), axis=1)
# parents_col can be default
stacframes.add(df, catalog, parents_col='parents')
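Hypothetical implementations of the two helpers proposed above (`prefix_parents` and `parents_from_props_fn` are names from this suggestion; their semantics here are my assumptions):

```python
import pandas as pd

def prefix_parents(*values):
    """Assumed semantics: return the given values as a parents list."""
    return [str(v) for v in values]

def parents_from_props_fn(*columns):
    """Assumed semantics: return a function mapping a row (pd.Series)
    to a parents list built from the named property columns."""
    def fn(row):
        return [str(row[c]) for c in columns]
    return fn

df = pd.DataFrame({"Year": [2015], "Flight": ["f1"]})
# pull parents from property columns
df["parents"] = df.apply(parents_from_props_fn("Year", "Flight"), axis=1)
# or build them in a user-written lambda
df["parents2"] = df.apply(
    lambda r: prefix_parents(r["Year"], r["Flight"], "extra"), axis=1)
```

With default `result_type`, `df.apply(..., axis=1)` returns list results as a Series of lists, so the column assignment works row-wise.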
I think collection membership should be explicit. In reality it's just as valid for a user to want to add items to the closest-to-root collection as to the closest-to-leaf. I think we can add parameters to the add method to cover these use cases, but on the face of it I don't see why not just make it manual.
Good suggestion. This feels way more dataframe-y and is much more flexible. Happy to make the code changes necessary to trial this.
Just to clarify... your comment here:
I think collection membership should be explicit. In reality it's just as valid for a user to want to add items to the closest-to-root collection as to the closest-to-leaf. I think we can add parameters to the add method to cover these use cases, but on the face of it I don't see why not just make it manual.
Is just additional support for your previous comment:
I'd like to suggest handling the parents relationship on per-row level
In that, setting parents as a df column on a per-item basis is more flexible and allows the user greater control over how the item is attached to its parents.
Implementation for the above discussion is in 228086e
Note that the helper methods operate on the entire dataframe rather than a single column; this is so the user doesn't always have to write the boilerplate df.apply(helper_method, axis=1).
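As a sketch of that trade-off (the whole-DataFrame signature below is my assumption about the shape of the helpers, not the actual code in 228086e):

```python
import pandas as pd

def parents_from_props(df, *columns):
    """Whole-DataFrame form: hides the df.apply(..., axis=1) boilerplate
    by returning a Series of parents lists (assumed signature)."""
    return df.apply(lambda row: [str(row[c]) for c in columns], axis=1)

df = pd.DataFrame({"Year": [2015, 2016], "Flight": ["f1", "f2"]})

# whole-DataFrame helper: one call, boilerplate hidden
df["parents"] = parents_from_props(df, "Year", "Flight")

# equivalent per-row form: the apply is visible in the calling code
df["parents_explicit"] = df.apply(
    lambda row: [str(row["Year"]), str(row["Flight"])], axis=1)
```

Both produce the same column; the difference is whether the per-row nature of the operation is visible at the call site.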
Hmm, I softly disagree with the choice to cut down the df.apply boilerplate. To my addled mind it's better to have the clarity that these are per-row dataframe operations from the shape of the code, and have no doubt, rather than having to learn a whole new function. But I think this area, generating the parents column, is something we're going to iterate on either way, so I'm not insisting on code changes. LGTM
I think collection membership should be explicit. In reality it's just as valid for a user to want to add items to the closest-to-root collection as to the closest-to-leaf. I think we can add parameters to the add method to cover these use cases, but on the face of it I don't see why not just make it manual.
To clarify this comment:
Membership of a collection and membership of a catalog are not the same thing. I'm thinking about how to tease those apart. As a mental test case, I'm thinking about how I'd generate the Landsat 8 STAC catalog using stacframes.
There, all items are part of the lc8 collection:
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/catalog.json
Note that all items have a collection property of landsat-8-l1:
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/122/2015-01-02/LC81791222015002LGN00.json
However, each item is part of Row/Path catalogs:
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/catalog.json
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/122/catalog.json
So in stacframes parlance the parents column would be like ['landsat-8-l1', '179', '122'], but the collection column would be landsat-8-l1.
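In DataFrame terms, the Landsat 8 case might look like this (the `collection` column is the hypothetical addition being discussed; `parents` follows the existing convention):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["LC81791222015002LGN00"],
    # catalog nesting: landsat-8-l1/179/122/<item>
    "parents": [["landsat-8-l1", "179", "122"]],
    # Collection membership, independent of the nesting path
    "collection": ["landsat-8-l1"],
})
```

The item's nesting path and its Collection id coincide only at the root here, which is why the two concepts need separate columns.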
Also, for humor, please note that the Landsat STAC catalog breaks the global uniqueness "requirement": https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/122/catalog.json has an id of 122.
I don't necessarily want to drag this PR out, so if it feels like there's more discussion to be had here, or this isn't a PR review change, I think it can be broken out into a new issue.
Hmm, I softly disagree with the choice to cut down the df.apply boilerplate
Noted. I don't have a strong preference, and it was difficult to write a docstring explaining what these methods do. I think it would be easier to describe what they do if these methods operated on Series. I think "it's better to have the clarity that these are per-row dataframe operations from the shape of the code" gets at this same idea.
Membership of collection and membership of catalog is not the same thing
Ah. Your landsat-8 example is instructive. To support weaving catalogs and collections on write, we'd have to add some way for the parents column to indicate whether a particular entry is to be a catalog or a collection. Added #2 to track this.
Landsat STAC catalog breaks global uniqueness "requirement"
👀 That's not good, oops. We shouldn't support writing invalid catalogs. AFAIK, though, stacframes will read that invalid catalog just fine; it would be up to the user to fix it in a DataFrame or pystac before re-writing.
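One way a user might detect and repair non-unique ids in a DataFrame before re-writing (a sketch, not a stacframes feature; the thread only says the fix is up to the user):

```python
import pandas as pd

# ids read from an invalid catalog, e.g. colliding sub-catalog ids
df = pd.DataFrame({"id": ["122", "122", "123"]})

# find rows whose id collides with another row's id
dupes = df[df.duplicated("id", keep=False)]

# make ids unique, e.g. by suffixing a running count per id
df["id"] = df["id"] + "-" + df.groupby("id").cumcount().astype(str)
# ids become 122-0, 122-1, 123-0
```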
echeipesh left a comment
Added comments on stacframes.add API for discussion
Allows more user control over generating parent collection ids to nest the items under. Adds two helper methods for generating parents column values in `stacframes.parents` and updates the AVIRIS example to use them.
e54d78d to 228086e
echeipesh left a comment
There is some ongoing discussion about API but overall I think this is an amazing initial commit and I'm OK with merging it as is.
At least the "NASA Log" column also has NaN values.
Initial implementation allows users to load a STAC catalog into a DataFrame via pystac, convert pystac.Item to pd.Series and back, and write a DataFrame to a STAC Catalog, with optional nesting of Items in Collections based on Item properties.
Complete with: