262 saas connector pagination strategies #286

galvana · 2022-03-14T19:38:57Z

Purpose

To add pagination strategies for SaaS connectors

Changes

Added offset, link, and cursor pagination strategies
Added pagination to retrieve_data function in SaaS Connector
Updated documentation with pagination strategies
Updated Postman collection
Changed saas_config spec to define the unwrap filter as a standalone data_path on the request to reduce verbosity and to prevent pagination strategies (which need the data_path) from depending on postprocessor configurations.

Checklist

Applicable documentation updated (guides, quickstart, postman collections, tutorial, fidesdemo, database diagram)
Good unit test/integration test coverage
This PR contains a DB migration. If checked, the reviewer should confirm with the author that the down_revision correctly references the previous migration before merging
The Run Unsafe PR Checks label has been applied, and checks have passed, if this PR touches any external services

Ticket

Closes #262

Updating documentation and Postman collection

eastandwestwind · 2022-03-14T20:41:38Z

I have some concerns around whether we can always assume that our data path for pagination will be the exact same as that needed for unwrap post-processing. For example, if pagination data is within a diff data path than the actual identity data we need to extract?

Further, what if we have a use case where we need to unwrap with data path a, filter, then unwrap again, using data path b? What do you think @galvana ?

galvana · 2022-03-14T20:58:46Z

I have some concerns around whether we can always assume that our data path for pagination will be the exact same as that needed for unwrap post-processing. For example, if pagination data is within a diff data path than the actual identity data we need to extract?

Further, what if we have a use case where we need to unwrap with data path a, filter, then unwrap again, using data path b? What do you think @galvana ?

Hey @eastandwestwind these are valid concerns, here are my thoughts on these two points:

With regards to pagination, the data_path is just used to see if there is any data at all in that location (before filtering or other post-processing). This is how we determine if we should generate a subsequent request for more data. It could be the case that a specific endpoint call doesn't result in any valid data after postprocessing but if the raw response still had data at the data_path location, we want to keep paginating until nothing is returned (an empty list).
The unwrap filter is still available for more complex scenarios like this one. We would just have to unwrap based on where the data is now after the default data_path unwrap and any filter postprocessors.

eastandwestwind · 2022-03-14T21:54:22Z

To elaborate a bit, for clarity:

Scenario 1

If API response:

 [
    {"email": {"personal": "me@email.com"}},
    {"email": {"personal": "someone@email.com"}}
  ]

And we first needed a filter processor, then an unwrap with data_path = email, to retrieve the resulting data: {"personal": "me@email.com"}

In this scenario, if we also needed pagination, what would our data_path be? I don't think pydash.get(response.json(), data_path) supports data_paths that are within arrays? Is this a valid scenario?

Scenario 2:

If we have the following API response:

{
"results":
  [
    {"info": {"email": "who@email.com", "preferences": {}}},
    {"info": {"other@email.com", "preferences": {}}}
  ]
}

Where we first implement an unwrap processor such that data_path = results, then a filter to get the relevant data, then another unwrap such that data_path = info, to retrieve the resulting data: {"email": "who@email.com", "preferences": {}}. Do you think we'd ever run into this situation where we'd need a diff data_path per unwrap processor?

galvana · 2022-03-14T22:23:43Z

Thanks for the examples, it helped uncover the scenario where the response data can be found at the root level of the response and does not need a data_path. If we make the data_path optional for these scenarios, then this is how we can handle the scenarios you specified.

Scenario 1:

[
  {"email": {"personal": "me@email.com"}},
  {"email": {"personal": "someone@email.com"}}
]

In this case we can omit the data_path from the request and use this configuration:

postprocessors:
  - strategy: unwrap
    configuration
      data_path: email
  - strategy: filter
     configuration:
       field: personal
       value:
         identity: email
pagination:
  strategy: offset
  configuration:
    incremental_param: page
    increment_by: 1
    limit: 10

The expected results from this would be:

[
  {"personal": "me@email.com"}
]

And the pagination strategy would see 2 results in the list (because of the empty data_path) and would continue to the next page.

Scenario 2:

{
"results":
  [
    {"info": {"email": "who@email.com", "preferences": {}}},
    {"info": {"email": "other@email.com", "preferences": {}}}
  ]
}

For this scenario we could do this:

data_path: results
postprocessors:
  - strategy: unwrap
    configuration
      data_path: info
  - strategy: filter
     configuration:
       field: personal
       value:
         identity: email
pagination:
  strategy: offset
  configuration:
    incremental_param: page
    increment_by: 1
    limit: 10

The result would be:

{"email": "who@email.com", "preferences": {}}

And the pagination strategy would also see the two results after the implicit data_path unwrap and continue paging.

eastandwestwind · 2022-03-14T22:31:42Z

This helps a lot @galvana ! So to summarize, we'll keep data_path as optional both on the postprocessor, AND at the more generic endpoint level.

So, a postprocessor will first look within the postprocessor config, and, if not found there, will look at the endpoint level. Is that right?

galvana · 2022-03-14T22:48:34Z

This helps a lot @galvana ! So to summarize, we'll keep data_path as optional both on the postprocessor, AND at the more generic endpoint level.

So, a postprocessor will first look within the postprocessor config, and, if not found there, will look at the endpoint level. Is that right?

To the first point, yes, and empty data_path can be the way we denote "unwrapping" an array at both the endpoint and postprocessor level. For the second point, the postprocessor will only look in its config for the data_path. The only way the endpoint data_path will impact postprocessors is that it will affect what gets passed into the postprocssors (the data found at data_path instead of the raw response). I'll at some test cases for the scenarios we discussed here.

Accounting for the use case where the list of objects is at the root level of the response and does not need a data_path

eastandwestwind

Just a couple comments, otherwise looking good!

eastandwestwind · 2022-03-15T15:56:55Z

src/fidesops/service/pagination/pagination_strategy_cursor.py

-        object_list = pydash.get(response.json(), data_path)
-        if object_list:
+        object_list = (
+            pydash.get(response.json(), data_path) if data_path else response.json()


do we need to pass in the entire response from execute_prepared_request? Seems like we can just use response_data to reduce duplicate logic in all the strategies? (20245aa#diff-23646bca49075c0841fb934d0e2c0db46cb46242f3deeba83fb13f50432f7198R176)

There's a few reasons I went with this approach:

The link paging strategy uses the header information from the raw response, so I wanted to keep that info.

I didn't want to pass in the raw response and the unwrapped response, these seemed too similar.

I wanted to pass in as much raw data as possible into the paging strategies and not make any assumptions about what's needed.

I didn't want to make the assumption outside of the paging strategy if there should be a next page, since it may vary per strategy.

This does lead to some duplicate logic in the paging strategies but I'd rather be explicit and offload as much as possible to each strategy.

this makes sense to me, thanks for the detailed explanation!

src/fidesops/service/processors/post_processor_strategy/post_processor_strategy_unwrap.py

eastandwestwind

Looks great, thanks for all the hard work addressing the additional use cases @galvana !

* Implementations of offset, link, and cursor pagination * Adding pagination to SaaS connector workflow Updating documentation and Postman collection * Fixing Pylint warning * Updating unwrap postprocessor to accepts lists in addition to dicts Accounting for the use case where the list of objects is at the root level of the response and does not need a data_path * Adding missing test case Co-authored-by: Adrian Galvan <adrian@ethyca.com>

galvana added 2 commits March 11, 2022 14:52

Implementations of offset, link, and cursor pagination

fc4d30a

Adding pagination to SaaS connector workflow

df98175

Updating documentation and Postman collection

galvana linked an issue Mar 14, 2022 that may be closed by this pull request

[SaaS Connector] Pagination strategies #262

Closed

Fixing Pylint warning

d01d2bb

seanpreston assigned eastandwestwind Mar 15, 2022

Updating unwrap postprocessor to accepts lists in addition to dicts

20245aa

Accounting for the use case where the list of objects is at the root level of the response and does not need a data_path

eastandwestwind reviewed Mar 15, 2022

View reviewed changes

Adding missing test case

b05131d

eastandwestwind added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Mar 16, 2022

galvana added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Mar 16, 2022

eastandwestwind approved these changes Mar 17, 2022

View reviewed changes

eastandwestwind merged commit 96df91d into main Mar 17, 2022

eastandwestwind deleted the 262-saas-connector-pagination-strategies branch March 17, 2022 14:00

262 saas connector pagination strategies #286

262 saas connector pagination strategies #286

Uh oh!

Conversation

galvana commented Mar 14, 2022

Purpose

Changes

Checklist

Ticket

Uh oh!

eastandwestwind commented Mar 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

galvana commented Mar 14, 2022

Uh oh!

eastandwestwind commented Mar 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Scenario 1

Scenario 2:

Uh oh!

galvana commented Mar 14, 2022

Scenario 1:

Scenario 2:

Uh oh!

eastandwestwind commented Mar 14, 2022

Uh oh!

galvana commented Mar 14, 2022

Uh oh!

eastandwestwind left a comment

Choose a reason for hiding this comment

Uh oh!

eastandwestwind Mar 15, 2022

Choose a reason for hiding this comment

Uh oh!

galvana Mar 15, 2022

Choose a reason for hiding this comment

Uh oh!

eastandwestwind Mar 16, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

eastandwestwind left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

eastandwestwind commented Mar 14, 2022 •

edited

Loading

eastandwestwind commented Mar 14, 2022 •

edited

Loading