Skip to content
This repository was archived by the owner on Nov 30, 2022. It is now read-only.

Conversation

@galvana
Copy link
Collaborator

@galvana galvana commented Mar 14, 2022

Purpose

To add pagination strategies for SaaS connectors

Changes

  • Added offset, link, and cursor pagination strategies
  • Added pagination to retrieve_data function in SaaS Connector
  • Updated documentation with pagination strategies
  • Updated Postman collection
  • Changed saas_config spec to define the unwrap filter as a standalone data_path on the request to reduce verbosity and to prevent pagination strategies (which need the data_path) from depending on postprocessor configurations.

Checklist

  • Applicable documentation updated (guides, quickstart, postman collections, tutorial, fidesdemo, database diagram)
  • Good unit test/integration test coverage
  • This PR contains a DB migration. If checked, the reviewer should confirm with the author that the down_revision correctly references the previous migration before merging
  • The Run Unsafe PR Checks label has been applied, and checks have passed, if this PR touches any external services

Ticket

Closes #262

@galvana galvana linked an issue Mar 14, 2022 that may be closed by this pull request
@eastandwestwind
Copy link
Contributor

eastandwestwind commented Mar 14, 2022

I have some concerns around whether we can always assume that our data path for pagination will be the exact same as that needed for unwrap post-processing. For example, if pagination data is within a diff data path than the actual identity data we need to extract?

Further, what if we have a use case where we need to unwrap with data path a, filter, then unwrap again, using data path b? What do you think @galvana ?

@galvana
Copy link
Collaborator Author

galvana commented Mar 14, 2022

I have some concerns around whether we can always assume that our data path for pagination will be the exact same as that needed for unwrap post-processing. For example, if pagination data is within a diff data path than the actual identity data we need to extract?

Further, what if we have a use case where we need to unwrap with data path a, filter, then unwrap again, using data path b? What do you think @galvana ?

Hey @eastandwestwind these are valid concerns, here are my thoughts on these two points:

  1. With regards to pagination, the data_path is just used to see if there is any data at all in that location (before filtering or other post-processing). This is how we determine if we should generate a subsequent request for more data. It could be the case that a specific endpoint call doesn't result in any valid data after postprocessing but if the raw response still had data at the data_path location, we want to keep paginating until nothing is returned (an empty list).

  2. The unwrap filter is still available for more complex scenarios like this one. We would just have to unwrap based on where the data is now after the default data_path unwrap and any filter postprocessors.

@eastandwestwind
Copy link
Contributor

eastandwestwind commented Mar 14, 2022

To elaborate a bit, for clarity:

Scenario 1

If API response:

 [
    {"email": {"personal": "me@email.com"}},
    {"email": {"personal": "someone@email.com"}}
  ]

And we first needed a filter processor, then an unwrap with data_path = email, to retrieve the resulting data: {"personal": "me@email.com"}

In this scenario, if we also needed pagination, what would our data_path be? I don't think pydash.get(response.json(), data_path) supports data_paths that are within arrays? Is this a valid scenario?

Scenario 2:

If we have the following API response:

{
"results":
  [
    {"info": {"email": "who@email.com", "preferences": {}}},
    {"info": {"other@email.com", "preferences": {}}}
  ]
}

Where we first implement an unwrap processor such that data_path = results, then a filter to get the relevant data, then another unwrap such that data_path = info, to retrieve the resulting data: {"email": "who@email.com", "preferences": {}}. Do you think we'd ever run into this situation where we'd need a diff data_path per unwrap processor?

@galvana
Copy link
Collaborator Author

galvana commented Mar 14, 2022

Thanks for the examples, it helped uncover the scenario where the response data can be found at the root level of the response and does not need a data_path. If we make the data_path optional for these scenarios, then this is how we can handle the scenarios you specified.

Scenario 1:

[
  {"email": {"personal": "me@email.com"}},
  {"email": {"personal": "someone@email.com"}}
]

In this case we can omit the data_path from the request and use this configuration:

postprocessors:
  - strategy: unwrap
    configuration
      data_path: email
  - strategy: filter
     configuration:
       field: personal
       value:
         identity: email
pagination:
  strategy: offset
  configuration:
    incremental_param: page
    increment_by: 1
    limit: 10

The expected results from this would be:

[
  {"personal": "me@email.com"}
]

And the pagination strategy would see 2 results in the list (because of the empty data_path) and would continue to the next page.

Scenario 2:

{
"results":
  [
    {"info": {"email": "who@email.com", "preferences": {}}},
    {"info": {"email": "other@email.com", "preferences": {}}}
  ]
}

For this scenario we could do this:

data_path: results
postprocessors:
  - strategy: unwrap
    configuration
      data_path: info
  - strategy: filter
     configuration:
       field: personal
       value:
         identity: email
pagination:
  strategy: offset
  configuration:
    incremental_param: page
    increment_by: 1
    limit: 10

The result would be:

{"email": "who@email.com", "preferences": {}}

And the pagination strategy would also see the two results after the implicit data_path unwrap and continue paging.

@eastandwestwind
Copy link
Contributor

This helps a lot @galvana ! So to summarize, we'll keep data_path as optional both on the postprocessor, AND at the more generic endpoint level.

So, a postprocessor will first look within the postprocessor config, and, if not found there, will look at the endpoint level. Is that right?

@galvana
Copy link
Collaborator Author

galvana commented Mar 14, 2022

This helps a lot @galvana ! So to summarize, we'll keep data_path as optional both on the postprocessor, AND at the more generic endpoint level.

So, a postprocessor will first look within the postprocessor config, and, if not found there, will look at the endpoint level. Is that right?

To the first point, yes, and empty data_path can be the way we denote "unwrapping" an array at both the endpoint and postprocessor level. For the second point, the postprocessor will only look in its config for the data_path. The only way the endpoint data_path will impact postprocessors is that it will affect what gets passed into the postprocssors (the data found at data_path instead of the raw response). I'll at some test cases for the scenarios we discussed here.

Accounting for the use case where the list of objects is at the root level of the response and does not need a data_path
Copy link
Contributor

@eastandwestwind eastandwestwind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple comments, otherwise looking good!

object_list = pydash.get(response.json(), data_path)
if object_list:
object_list = (
pydash.get(response.json(), data_path) if data_path else response.json()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to pass in the entire response from execute_prepared_request? Seems like we can just use response_data to reduce duplicate logic in all the strategies? (20245aa#diff-23646bca49075c0841fb934d0e2c0db46cb46242f3deeba83fb13f50432f7198R176)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few reasons I went with this approach:

  • The link paging strategy uses the header information from the raw response, so I wanted to keep that info.
  • I didn't want to pass in the raw response and the unwrapped response, these seemed too similar.
  • I wanted to pass in as much raw data as possible into the paging strategies and not make any assumptions about what's needed.
  • I didn't want to make the assumption outside of the paging strategy if there should be a next page, since it may vary per strategy.

This does lead to some duplicate logic in the paging strategies but I'd rather be explicit and offload as much as possible to each strategy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense to me, thanks for the detailed explanation!

@eastandwestwind eastandwestwind added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Mar 16, 2022
@galvana galvana added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Mar 16, 2022
Copy link
Contributor

@eastandwestwind eastandwestwind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for all the hard work addressing the additional use cases @galvana !

@eastandwestwind eastandwestwind merged commit 96df91d into main Mar 17, 2022
@eastandwestwind eastandwestwind deleted the 262-saas-connector-pagination-strategies branch March 17, 2022 14:00
sanders41 pushed a commit that referenced this pull request Sep 22, 2022
* Implementations of offset, link, and cursor pagination

* Adding pagination to SaaS connector workflow
Updating documentation and Postman collection

* Fixing Pylint warning

* Updating unwrap postprocessor to accepts lists in addition to dicts
Accounting for the use case where the list of objects is at the root level of the response and does not need a data_path

* Adding missing test case

Co-authored-by: Adrian Galvan <adrian@ethyca.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

run unsafe ci checks Triggers running of unsafe CI checks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[SaaS Connector] Pagination strategies

3 participants