
Conversation

@SteveDMurphy
Contributor

@SteveDMurphy SteveDMurphy commented Dec 13, 2022

Closes #10

  • Include boto3
  • Instantiate boto3 session
  • Replace db writing with dumping to s3
  • Replace test location with proper config
  • Update deployment and add testing (as required)

To test:
  1. modify the endpoint of the SDK to use localhost instead
  2. install locally in a virtual env
  3. write some fake events (example below)

```python
from datetime import datetime, timezone

from fideslog.sdk.python.client import AnalyticsClient
from fideslog.sdk.python.event import AnalyticsEvent
from fideslog.sdk.python.exceptions import AnalyticsError

try:
    # client details for a fake "internal" install
    client_id = "internal"
    os = "darwin"
    product_name = "fideslog"
    production_version = "1.1.1"
    analytics_client = AnalyticsClient(client_id, os, product_name, production_version)

    # build and send a single test event
    fideslog_event = AnalyticsEvent(
        event="test_event",
        event_created_at=datetime.now(timezone.utc),
        flags=["--dry", "--local"],
        extra_data={"some_key": "some_value"},
    )
    analytics_client.send(fideslog_event)
except AnalyticsError:
    pass
```

You should then be able to see the events you create listed in the S3 bucket.

This change replaces writing event data to the Snowflake database with dumping it to an S3 bucket. Each file gets a random uuid1 filename and is organized into directories keyed by UTC date.
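
For illustration only, a minimal sketch of that write path, assuming a hypothetical bucket name and helper function rather than the actual handler code in this PR:

```python
import json
from datetime import datetime, timezone
from uuid import uuid1

import boto3


def write_event_to_s3(event: dict, bucket: str = "fideslog-events") -> str:
    """Dump a single analytics event to S3 under a UTC-date prefix with a uuid1 filename."""
    session = boto3.Session()  # credentials come from the environment / task role
    s3 = session.client("s3")
    key = f"{datetime.now(timezone.utc).date().isoformat()}/{uuid1()}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(event).encode("utf-8"))
    return key
```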
@SteveDMurphy SteveDMurphy self-assigned this Dec 13, 2022
@SteveDMurphy
Contributor Author

Heyo @RobertKeyser! Wanted to get this in front of you a little early (just doing some further validation) so you could flag anything that looks a little off as far as the deployment goes. Let me know if you need any info or have any questions - thanks!

@SteveDMurphy
Contributor Author

@seanpreston I'll need some help in here writing some secrets to the repo, I think you might have the power 🪄

@RobertKeyser I think the S3 env vars are what I am most concerned about as far as naming etc. for the deployment to work 🤞🏽

@SteveDMurphy SteveDMurphy marked this pull request as ready for review December 21, 2022 05:23
@ThomasLaPiana
Contributor

@SteveDMurphy I understand the changes here are to write out to S3 instead of Snowflake directly, but where is the updated code that describes when/how Snowflake is pulling this data in?

@SteveDMurphy
Contributor Author

SteveDMurphy commented Dec 21, 2022

> @SteveDMurphy I understand the changes here are to write out to S3 instead of Snowflake directly, but where is the updated code that describes when/how Snowflake is pulling this data in?

Planning on doing that as part of ethyca-analytics - was also briefly thinking about using Snowpipe until remembering about the analytics setup.

Initial focus was on stopping the bleeding of a constantly running Snowflake warehouse tho

@ThomasLaPiana
Contributor

> @SteveDMurphy I understand the changes here are to write out to S3 instead of Snowflake directly, but where is the updated code that describes when/how Snowflake is pulling this data in?
>
> Planning on doing that as part of ethyca-analytics - was also briefly thinking about using Snowpipe until remembering about the analytics setup.
>
> Initial focus was on stopping the bleeding of a constantly running Snowflake warehouse tho

good point, the event itself will contain the metadata we need to sort by time so it won't matter if it sits around for a bit.

@SteveDMurphy
Contributor Author

> @SteveDMurphy I understand the changes here are to write out to S3 instead of Snowflake directly, but where is the updated code that describes when/how Snowflake is pulling this data in?
>
> Planning on doing that as part of ethyca-analytics - was also briefly thinking about using Snowpipe until remembering about the analytics setup.
> Initial focus was on stopping the bleeding of a constantly running Snowflake warehouse tho
>
> good point, the event itself will contain the metadata we need to sort by time so it won't matter if it sits around for a bit.

That's the plan! I'm also planning on having the events in a subdirectory by UTC date so we can load once per day and delete the directory 🤞🏽 it sounds good in my head at least 😅
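
For illustration, a rough sketch of what that once-per-day cleanup could look like, assuming a hypothetical bucket name and that the actual load into Snowflake happens elsewhere (e.g. in ethyca-analytics):

```python
from datetime import datetime, timedelta, timezone

import boto3


def delete_day_prefix(bucket: str, day: str) -> None:
    """Remove a UTC-date directory from S3 once that day's events have been loaded."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{day}/"):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})


# e.g. clean up yesterday's directory after it has been loaded
yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).date().isoformat()
delete_day_prefix("fideslog-events", yesterday)
```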

@ThomasLaPiana
Contributor

> @SteveDMurphy I understand the changes here are to write out to S3 instead of Snowflake directly, but where is the updated code that describes when/how Snowflake is pulling this data in?
>
> Planning on doing that as part of ethyca-analytics - was also briefly thinking about using Snowpipe until remembering about the analytics setup.
> Initial focus was on stopping the bleeding of a constantly running Snowflake warehouse tho
>
> good point, the event itself will contain the metadata we need to sort by time so it won't matter if it sits around for a bit.
>
> That's the plan! I'm also planning on having the events in a subdirectory by UTC date so we can load once per day and delete the directory 🤞🏽 it sounds good in my head at least 😅

Makes sense to me! Although S3 is cheap enough that I don't think we'd need to worry about that for a long time :)

@ThomasLaPiana
Contributor

@SteveDMurphy I'm running the fides release in a bit but will circle back and fully review this later today :)

@SteveDMurphy
Contributor Author

@ThomasLaPiana I'm planning on getting back into this with your first pass review comments today, thanks! 💥

@RobertKeyser I think we might still need the prod s3 info for deployment but would love any help reviewing the deployment changes as well 🙌🏽

Ops ticket for reference -> https://ethyca.atlassian.net/browse/OPS-208

ThomasLaPiana previously approved these changes Jan 10, 2023
Contributor

@ThomasLaPiana ThomasLaPiana left a comment


LGTM!


@RobertKeyser RobertKeyser left a comment


I think we'll need to change a few things:

  • Remove FIDESLOG__STORAGE_AWS_SECRET_ACCESS_KEY and FIDESLOG__STORAGE_AWS_ACCESS_KEY_ID from being set.
  • Add a taskRoleArn that's the same as the executionRoleArn.

The credentials should then get set automatically: https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-access-aws-services/
More specifically, https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html

I think what that means is that you need to make the request (the curl in the docs) to get the AWS creds on the fly, since it uses a role instead of a full user.
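
For what it's worth, a minimal sketch of the application side under that setup, assuming boto3's default credential chain is left to resolve the ECS task role on its own (no explicit keys passed anywhere):

```python
import boto3

# Assumption: the ECS task definition has a taskRoleArn attached and the
# FIDESLOG__STORAGE_AWS_* access key variables are no longer set. boto3's
# default credential chain then falls through to the container credentials
# endpoint exposed to the task, so app code would not need the curl step itself.
session = boto3.Session()
credentials = session.get_credentials()  # resolved from the task role at runtime
print("resolved credentials" if credentials else "no credentials resolved")
```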

@SteveDMurphy SteveDMurphy merged commit 591f3db into main Jan 23, 2023
@SteveDMurphy SteveDMurphy deleted the SteveDMurphy-10-write-to-s3 branch January 23, 2023 21:11