Initialize STACFrames project #1
Conversation
- Update README with docs
- Add CHANGELOG
- Add RELEASE checklist
- Add test and cibuild scripts
Example generates a catalog for the JPL AVIRIS dataset.
ebd27c5 to 55d0574
pd_dataframe (pandas.DataFrame): A DataFrame of rows structured as described
    by stacframes.item_from.
catalog (pystac.Catalog): The Catalog to add the items in pd_dataframe to.
parents (list): An ordered list of column names on the passed dataframe.
@echeipesh and I discussed a few alternatives to the strategy implemented here, where we only encode the catalog structure when it's written:
- A STAC Extension that would allow a user to encode the hierarchy on an Item. Rejected because it duplicates implicit, flexible structure on the Item that the Item doesn't care about.
- Do the same as in (1), but use the STAC Layer Extension layers the Item belongs to. Rejected because it overloads Layers.
- Do nothing: users define catalogs and collections outside this method, filter their primary dataframe to subsets for each collection, and write each individually. Rejected because it requires too much boilerplate, which eliminates the advantage of using pandas in the first place.
- Write Catalog and Collection objects into the DataFrame to be written. Rejected because, while iterating over each item to write, you can't know whether you've already visited the catalog the item belongs to.
- Write your catalogs and collections to a separate dataframe and use join operations. Rejected because it required the user to write more complicated code.
The selected strategy ended up requiring relatively little extra code, provides a very simple API, and can be extended in the future to let the caller customize the catalog ID for each of these generated sub-catalogs by defining a custom function.
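A minimal sketch of the selected strategy, with plain dicts standing in for pystac Catalog objects (the `add` signature and the get-or-create behavior here are illustrative assumptions, not the actual stacframes code):

```python
import pandas as pd

def add(df, catalog, parents=None):
    """Nest each row under sub-catalogs derived from the values of the
    columns named in `parents` (plain dicts stand in for pystac objects)."""
    for _, row in df.iterrows():
        node = catalog
        for col in parents or []:
            key = str(row[col])
            # get-or-create a child catalog keyed by the column value
            node = node.setdefault("children", {}).setdefault(key, {"id": key})
        node.setdefault("items", []).append(row["id"])
    return catalog

df = pd.DataFrame({
    "id": ["item-1", "item-2"],
    "Year": [2015, 2015],
    "Flight": ["f1", "f2"],
})
root = {"id": "root"}
add(df, root, parents=["Year", "Flight"])
# item-1 is nested under root/2015/f1, item-2 under root/2015/f2
```

The catalog structure only materializes at write time; the DataFrame itself stays flat.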
I'd like to suggest handling the parents relationship at the per-row level rather than on import. The current implementation requires that all information used to determine the parents relationship be part of the item properties. This may not always be desirable.
I'm thinking something like this may be a better alternative:
df = df.apply(map_series_to_item, axis=1)
# manually specify
df['parents'] = ['aviris', 'aviris-demo']
# something handles nesting for you
df['parents'] = stacframes.utils.prefix_parents(['aviris', 'demo'])
# something handles pulling values from properties and nesting for you
df['parents'] = df.apply(stacframes.utils.parents_from_props_fn('Year', 'Flight'), axis=1)
# user writes his own lambda using prefix_parents
df['parents'] = df.apply(lambda r: prefix_parents(r['Year'], r['Flight'], 'extra'), axis=1)
# parents_col can be default
stacframes.add(df, catalog, parents_col='parents')
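Hypothetical implementations of the two helpers proposed above (`prefix_parents` and `parents_from_props_fn` are names from this suggestion; their semantics here are my assumptions):

```python
import pandas as pd

def prefix_parents(*values):
    """Assumed semantics: return the given values as a parents list."""
    return [str(v) for v in values]

def parents_from_props_fn(*columns):
    """Assumed semantics: return a function mapping a row (pd.Series)
    to a parents list built from the named property columns."""
    def fn(row):
        return [str(row[c]) for c in columns]
    return fn

df = pd.DataFrame({"Year": [2015], "Flight": ["f1"]})
# pull parents from property columns
df["parents"] = df.apply(parents_from_props_fn("Year", "Flight"), axis=1)
# or build them in a user-written lambda
df["parents2"] = df.apply(
    lambda r: prefix_parents(r["Year"], r["Flight"], "extra"), axis=1)
```

With default `result_type`, `df.apply(..., axis=1)` returns list results as a Series of lists, so the column assignment works row-wise.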
I think collection membership should be explicit. In reality it's just as valid for a user to want to add items to the closest-to-root collection as to the closest-to-leaf. I think we can add parameters to the add method to cover these use cases, but on the face of it I don't see why not just make it manual.
Good suggestion. This feels way more dataframe-y and is much more flexible. Happy to make the code changes necessary to trial this.
Just to clarify... your comment here:
I think collection membership should be explicit. In reality it's just as valid for a user to want to add items to the closest-to-root collection as to the closest-to-leaf. I think we can add parameters to the add method to cover these use cases, but on the face of it I don't see why not just make it manual.
Is just additional support for your previous comment:
I'd like to suggest handling the parents relationship on per-row level
In that, setting parents as a df column on a per-item basis is more flexible and allows the user greater control over how the item is attached to its parents.
Implementation for the above discussion is in 228086e
Note that the helper methods operate on the entire dataframe rather than a single column; this is so the user doesn't always have to write the boilerplate df.apply(helper_method, axis=1).
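As a sketch of that trade-off (the whole-DataFrame signature below is my assumption about the shape of the helpers, not the actual code in 228086e):

```python
import pandas as pd

def parents_from_props(df, *columns):
    """Whole-DataFrame form: hides the df.apply(..., axis=1) boilerplate
    by returning a Series of parents lists (assumed signature)."""
    return df.apply(lambda row: [str(row[c]) for c in columns], axis=1)

df = pd.DataFrame({"Year": [2015, 2016], "Flight": ["f1", "f2"]})

# whole-DataFrame helper: one call, boilerplate hidden
df["parents"] = parents_from_props(df, "Year", "Flight")

# equivalent per-row form: the apply is visible in the calling code
df["parents_explicit"] = df.apply(
    lambda row: [str(row["Year"]), str(row["Flight"])], axis=1)
```

Both produce the same column; the difference is whether the per-row nature of the operation is visible at the call site.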
Hmm, I softly disagree with the choice to cut down the df.apply boilerplate. To my addled mind it's better to have the clarity that these are per-row dataframe operations from the shape of the code, and have no doubt, rather than having to learn a whole new function. But I think this area, generating the parents column, is something we're going to iterate on either way, so I'm not insisting on code changes. LGTM
I think collection membership should be explicit. In reality it's just as valid for a user to want to add items to the closest-to-root collection as to the closest-to-leaf. I think we can add parameters to the add method to cover these use cases, but on the face of it I don't see why not just make it manual.
To clarify this comment:
Membership of a collection and membership of a catalog are not the same thing. I'm thinking about how to tease those apart. As a mental test case, I'm thinking about how I'd generate the Landsat 8 STAC catalog using stacframes.
There, all items are part of the lc8 collection:
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/catalog.json
Note that all items have a collection property of landsat-8-l1:
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/122/2015-01-02/LC81791222015002LGN00.json
However, each item is part of Row/Path catalogs:
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/catalog.json
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/122/catalog.json
So in stacframes parlance the parents column would be like ['landsat-8-l1', '179', '122'], but the collection column would be landsat-8-l1.
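In DataFrame terms, the Landsat 8 case might look like this (the `collection` column is the hypothetical addition being discussed; `parents` follows the existing convention):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["LC81791222015002LGN00"],
    # catalog nesting: landsat-8-l1/179/122/<item>
    "parents": [["landsat-8-l1", "179", "122"]],
    # Collection membership, independent of the nesting path
    "collection": ["landsat-8-l1"],
})
```

The item's nesting path and its Collection id coincide only at the root here, which is why the two concepts need separate columns.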
Also, for humor, please note that the Landsat STAC catalog breaks the global uniqueness "requirement": https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/122/catalog.json has an id of 122.
I don't necessarily want to drag this PR out, so if it feels like there's more discussion to be had here, or this isn't a PR review change, I think it can be broken out into a new issue.
Hmm, I softly disagree with the choice to cut down the df.apply boilerplate
Noted. I don't have a strong preference, and it was difficult to write a docstring explaining what these methods do. I think it would be easier to describe what they do if these methods operated on Series. I think "it's better to have the clarity that these are per-row dataframe operations from the shape of the code" gets at this same idea.
Membership of collection and membership of catalog is not the same thing
Ah. Your landsat-8 example is instructive. To support weaving catalogs and collections on write, we'd have to add some way for the parents column to indicate whether a particular entry is to be a catalog or a collection. Added #2 to track this.
Landsat STAC catalog breaks global uniqueness "requirement"
👀 That's not good, oops. We shouldn't support writing invalid catalogs. AFAIK, though, stacframes will read that invalid catalog just fine; it would be up to the user to fix it in a DataFrame or pystac before re-writing.
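One way a user might detect and repair non-unique ids in a DataFrame before re-writing (a sketch, not a stacframes feature; the thread only says the fix is up to the user):

```python
import pandas as pd

# ids read from an invalid catalog, e.g. colliding sub-catalog ids
df = pd.DataFrame({"id": ["122", "122", "123"]})

# find rows whose id collides with another row's id
dupes = df[df.duplicated("id", keep=False)]

# make ids unique, e.g. by suffixing a running count per id
df["id"] = df["id"] + "-" + df.groupby("id").cumcount().astype(str)
# ids become 122-0, 122-1, 123-0
```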
echeipesh left a comment
Added comments on stacframes.add API for discussion
Allows more user control over generating parent collection ids to nest the items under. Adds two helper methods for generating parents column values in `stacframes.parents` and updates the AVIRIS example to use them.
e54d78d to 228086e
echeipesh left a comment
There is some ongoing discussion about API but overall I think this is an amazing initial commit and I'm OK with merging it as is.
At least the "NASA Log" column also has NaN values.
Initial implementation allows users to load a STAC catalog into a DataFrame via pystac, convert pystac.Item to pd.Series and back, and write a DataFrame to a STAC Catalog, with optional nesting of Items in Collections based on Item properties.
Complete with: