Subset Demo Generator

This package facilitates high-quality synthetic data ETLs. Tables are declared as dataclasses, which can then be serialized to CSV, pushed to GCS, and loaded into multiple databases. It offers methods for generating values that stay unique run over run, as well as generation based on Poisson distributions and other methods.
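
As a rough illustration of Poisson-based generation, the sketch below draws counts with Knuth's multiplication method using only the standard library. The function name `poisson_sample` and the "rentals per customer" scenario are assumptions made for this example; they are not taken from the package, whose own generators may work differently.

```python
import math
import random


def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    # Keep multiplying uniform draws until the product drops below e^-lam.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(42)  # seeded so repeated runs give the same data
# Hypothetical use: how many rentals each of 5 customers makes in a month.
rentals_per_customer = [poisson_sample(3.0, rng) for _ in range(5)]
print(rentals_per_customer)
```

This method is only practical for small rates, since `math.exp(-lam)` underflows to zero for very large `lam`.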

To create a new table:

import itertools
import random
from dataclasses import InitVar, dataclass, field

# Table is the metaclass provided by this package (import it from the package).

@dataclass
class Person(metaclass=Table):
    # Results in a file named person.csv and a table named person.
    # Dataclass fields become CSV columns and table columns with the correct types.
    # This column increments smoothly run over run.
    id: int = field(default_factory=itertools.count(1).__next__)
    # If more logic is needed than a simple default factory function,
    # handle it in __post_init__.
    gender: str = field(init=False)
    # If you have a parent object, you can pass it in for conditional logic.
    parent_object: InitVar[type] = None

    def __post_init__(self, parent_object: type):
        self.gender = random.choice(['M', 'F'])
        if self.gender == 'M':
            ...
        else:
            ...

        # Conditional logic based on the parent (skipped when none is passed).
        if parent_object is not None and parent_object.foo > 10:
            self.gender = 'N/A'

# ids keep incrementing smoothly, even when more rows are added

    id: int = field(default_factory=itertools.count(1).__next__)
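
To make the "fields become CSV columns" idea concrete, here is a minimal sketch that writes plain dataclass instances to CSV text with the standard library. It deliberately does not use the package's `Table` metaclass, whose internals are not shown in this README; the helper `rows_to_csv` is a hypothetical name introduced only for this example.

```python
import csv
import io
import itertools
import random
from dataclasses import asdict, dataclass, field, fields


@dataclass
class Person:  # plain dataclass; the package's Table metaclass is not used here
    id: int = field(default_factory=itertools.count(1).__next__)
    gender: str = field(init=False)

    def __post_init__(self):
        self.gender = random.choice(["M", "F"])


def rows_to_csv(rows: list) -> str:
    """Write dataclass instances as CSV text, one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(rows[0])])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


people = [Person() for _ in range(3)]
csv_text = rows_to_csv(people)
print(csv_text)
```

Because the counter lives in the field's default factory, each new `Person()` receives the next id within a single process. Continuing the sequence across separate runs would need persisted state, which this sketch does not attempt.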

Snowflake Staging Setup

First, in Snowflake, switch to the correct destination schema and create the stage from GCS:

create stage my_gcs_stage
  url = 'gcs://name_of_gcs_bucket'
  storage_integration = gcp_int;

Snowflake generates a GCP service account, which needs read privileges on the bucket. You can see the service account by running the following command:

DESC STORAGE INTEGRATION GCP_INT;

The service account to grant GCP storage reader privileges is listed in the STORAGE_GCP_SERVICE_ACCOUNT property.

The name of this stage must also be set in your .env environment variables file:

SF_STAGE_NAME=my_gcs_stage
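
For orientation, the sketch below shows one way the stage name could be read from the environment and used to build a Snowflake `COPY INTO` statement. The helper `build_copy_statement`, the `<table>.csv` path layout inside the stage, and the file format options are assumptions for illustration; the package's actual loading logic is not shown in this README. It also assumes the variables from `.env` have already been exported into the process environment.

```python
import os


def build_copy_statement(table: str, stage: str) -> str:
    """Build a COPY INTO statement loading <table>.csv from the stage."""
    return (
        f"COPY INTO {table} "
        f"FROM @{stage}/{table}.csv "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )


# Falls back to the example stage name if the variable is not set.
stage_name = os.environ.get("SF_STAGE_NAME", "my_gcs_stage")
print(build_copy_statement("person", stage_name))
```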

Licensing Considerations

Considered the licensable data from https://www.themoviedb.org/

Currently using RapidAPI instead, which provides clips from IMDb videos.

Also uses only the population column from https://simplemaps.com/data/us-zips to generate realistic US addresses based on population.
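
Population-based address generation can be sketched as weighted sampling, where each zip code is picked with probability proportional to its population. The zip codes and population figures below are made-up placeholders, not values from the simplemaps dataset, and `sample_zips` is a hypothetical helper rather than a function from this package.

```python
import random

# Placeholder rows of (zip code, population). Real values would come from
# the population column of the simplemaps dataset.
zip_populations = [("00001", 50_000), ("00002", 5_000), ("00003", 500)]


def sample_zips(rows, k, rng):
    """Pick k zip codes, each weighted by its population."""
    zips = [zip_code for zip_code, _ in rows]
    weights = [population for _, population in rows]
    return rng.choices(zips, weights=weights, k=k)


rng = random.Random(7)  # seeded for reproducible output
picked = sample_zips(zip_populations, 1000, rng)
print({zip_code: picked.count(zip_code) for zip_code, _ in zip_populations})
```

More populous zip codes show up proportionally more often, so generated addresses cluster where people actually live.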

