Skip to content

Initialize STACFrames project#1

Merged
CloudNiner merged 6 commits into
masterfrom
feature/awf/init
Nov 23, 2020
Merged

Initialize STACFrames project#1
CloudNiner merged 6 commits into
masterfrom
feature/awf/init

Conversation

@CloudNiner
Copy link
Copy Markdown
Contributor

@CloudNiner CloudNiner commented Nov 6, 2020

Initial implementation allows users to load a STAC catalog into a DataFrame via pystac, convert pystac.Item to pd.Series and back, and write a DataFrame to a STAC Catalog, with optional nesting of Items in Collections based on Item properties.

Complete with:

  • An example!
  • Docstrings!
  • CI and Release workflows via GitHub Actions!
  • CHANGELOG!
  • RELEASE checklist!
  • Linting!
  • Code formatting!

- Update README with docs
- Add CHANGELOG
- Add RELEASE checklist
- Add test and cibuild STRTA scripts
Example generates a catalog for the JPL
AVIRIS dataset.
Comment thread stacframes/__init__.py Outdated
pd_dataframe (pandas.DataFrame): A DataFrame of rows structured as described
by stacframes.item_from.
catalog (pystac.Catalog): The Catalog to add the items in pd_dataframe to.
parents (list): An ordered list of column names on the passed dataframe.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@echeipesh and I discussed a few alternatives to the strategy implemented here where we only encode the catalog structure when its written:

  1. A STAC Extension that would allow a user to encode the heirarchy on an Item. Rejected because it duplicates implicit flexible structure on the Item that the Item doesn't care about.
  2. Do the same as in (1) but use the STAC Layer Extension layers the Item belongs to. Rejected because it overloads Layers.
  3. Do nothing and users define catalogs and collections outside this method, filter their primary dataframe to subsets for each collection, and write individually. Rejected because too much boilerplate which eliminated the advantage of using pandas in the first place.
  4. Write Catalog and Collection objects into the DataFrame to be written. Rejected because you can't know whether you've visited the catalog an item belongs to already when you're iterating over each item to write.
  5. Write your catalogs and collections to a separate dataframe and use join operations. Rejected because it required the user to write more complicated code.

The selected strategy ended up being relatively little extra code, provides a very simple API, and can be extended in the future to allow the caller to customize the catalog ID for each of these generated sub catalogs by defining a custom function.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to suggest handling the parents relationship on per-row level rather than on import. The current implementation requires that all information that is used to determine parents relationship is part of item properties. This may not always be desirable.

I'm thinking something like this may be a better alternative:

    df = df.apply(map_series_to_item, axis=1)
    # manually specify
    df['parents'] = ['aviris', 'aviris-demo']
    # something handles nesting for you
    df['parents'] = stacframes.utils.prefix_parents(['avris', 'demo'])
    # something handles pulling values from properties and testing for you
    df['parents'] = df.appy(stacframes.utils.parents_from_props_fn('Year', 'Flight'), axis=1)
    # user writes his own lambda using prefix_parents
    df['parents'] = df.apply(lambda r: prefix_parents(r['Year'], r['Flight'], 'extra'), axis=1)

    # parents_col can be default
    stacframes.add(df, catalog, parents_col='parents')

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think collection membership should be explicit. In reality its just as valid for a user to want to add items to closests-to-root collection as well as closests-to-leaf. I think we can add parameters to add method to cover these user cases but on the face of it I don't see why not just make it manual.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. This feels way more dataframe-y and is much more flexible. Happy to make the code changes necessary to trial this.

Just to clarify... your comment here:

I think collection membership should be explicit. In reality its just as valid for a user to want to add items to closests-to-root collection as well as closests-to-leaf. I think we can add parameters to add method to cover these user cases but on the face of it I don't see why not just make it manual.

Is just additional support for your previous comment:

I'd like to suggest handling the parents relationship on per-row level

In that setting the parents as a df column on a per-item basis is more flexible and allows the user greater control over how the item is attached to parents.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation for the above discussion is in 228086e

Note that the helper methods operate on the entire dataframe rather than a single column, this is so the user doesn't have to always write the boilerplate df.apply(helper_method, axis=1).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I softly disagree with the choice to cut-down the 'df.apply' boilerplate. To my addled mind its better to have the clarity that this is per/row dataframe operations by the shape of the code and have no doubt rather than having to learn a WHOLE new function. But, I think this area, generating parents column, is something we're going to iterate on either way so I'm not insisting on code changes. LGTM

Copy link
Copy Markdown

@echeipesh echeipesh Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think collection membership should be explicit. In reality its just as valid for a user to want to add items to closests-to-root collection as well as closests-to-leaf. I think we can add parameters to add method to cover these user cases but on the face of it I don't see why not just make it manual.

To clarify this comment:
Membership of collection and membership of catalog is not the same thing. I'm thinking about how to tease those out. As a mental test case I'm thinking how I'd generate Landsat 8 STAC catalog using this stacframes.

There all items are part of lc8 collection:
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/catalog.json
Note that all items has collection property as: landsat-8-l1
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/122/2015-01-02/LC81791222015002LGN00.json
However each item is is part of Row/Path catalogs:
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/catalog.json
https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/122/catalog.json

So in stacframe parlance the parents column would be like ['landsat-8-l1', '179', '122'] but collection column would be landsat-8-l1

Also for humor please note that Landsat STAC catalog breaks global uniqueness "requirement" and https://landsat-stac.s3.amazonaws.com/landsat-8-l1/179/122/catalog.json has id of 122.

I don't necessarily want to drag this PR out. So if it feels like there is more discussion to be had here or this is not PR review change I think it can be broken out into new issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I softly disagree with the choice to cut-down the 'df.apply' boilerplate

Noted. I don't have a strong preference, and it was difficult to write a docstring explaining what these methods do. I think it would be easier to describe what this does if these methods were to operate on Series. I think "its better to have the clarity that this is per/row dataframe operations by the shape of the code" gets at this same idea.

Membership of collection and membership of catalog is not the same thing

Ah. Your landsat-8 example is instructive. To support weaving catalogs and collections on write we'd have to add some way for the parents column to indicate whether a particular entry is to be a catalog or collection. Added #2 to track this

Landsat STAC catalog breaks global uniqueness "requirement"

👀 that's not good oops. We shouldn't support writing invalid catalogs. AFAIK though, stacframes will read that invalid catalog just fine, it would be up to the user to fix in a DataFrame or pystac before re-writing.

@CloudNiner CloudNiner self-assigned this Nov 6, 2020
Copy link
Copy Markdown

@echeipesh echeipesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments on stacframes.add API for discussion

Comment thread examples/aviris/main.py Outdated
Comment thread stacframes/__init__.py Outdated
pd_dataframe (pandas.DataFrame): A DataFrame of rows structured as described
by stacframes.item_from.
catalog (pystac.Catalog): The Catalog to add the items in pd_dataframe to.
parents (list): An ordered list of column names on the passed dataframe.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to suggest handling the parents relationship on per-row level rather than on import. The current implementation requires that all information that is used to determine parents relationship is part of item properties. This may not always be desirable.

I'm thinking something like this may be a better alternative:

    df = df.apply(map_series_to_item, axis=1)
    # manually specify
    df['parents'] = ['aviris', 'aviris-demo']
    # something handles nesting for you
    df['parents'] = stacframes.utils.prefix_parents(['avris', 'demo'])
    # something handles pulling values from properties and testing for you
    df['parents'] = df.appy(stacframes.utils.parents_from_props_fn('Year', 'Flight'), axis=1)
    # user writes his own lambda using prefix_parents
    df['parents'] = df.apply(lambda r: prefix_parents(r['Year'], r['Flight'], 'extra'), axis=1)

    # parents_col can be default
    stacframes.add(df, catalog, parents_col='parents')

Comment thread stacframes/__init__.py Outdated
pd_dataframe (pandas.DataFrame): A DataFrame of rows structured as described
by stacframes.item_from.
catalog (pystac.Catalog): The Catalog to add the items in pd_dataframe to.
parents (list): An ordered list of column names on the passed dataframe.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think collection membership should be explicit. In reality its just as valid for a user to want to add items to closests-to-root collection as well as closests-to-leaf. I think we can add parameters to add method to cover these user cases but on the face of it I don't see why not just make it manual.

Allows more user control over generating
parent collection ids to nest the items under.

Adds two helper methods for generating parents column
values in `stacframes.parents` and updates the AVIRIS
example to use them.
Copy link
Copy Markdown

@echeipesh echeipesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some ongoing discussion about API but overall I think this is an amazing initial commit and I'm OK with merging it as is.

At least "NASA Log" column also has NaN values
@CloudNiner CloudNiner merged commit dd7959d into master Nov 23, 2020
@CloudNiner CloudNiner deleted the feature/awf/init branch November 23, 2020 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants