feat: Baseline Data Cleaning #840

@zogomii

Description

Is your feature request related to a problem?

Subtask of #710

Desired solution

Create a method Baseline._clean(table: Table, target_column: str) -> TabularDataset for baseline data cleaning. It should perform the following steps:

  1. Remove columns with high idness or high stability (either above 90%), excluding the target column
  2. Remove columns with a high missing value ratio (above 60%)
  3. Impute missing values in all remaining columns using the column with the highest absolute correlation
  4. One-hot encode all non-numerical columns with fewer than 20 distinct values and remove all other non-numerical columns
  5. Remove outliers
  6. Normalise columns with values greater than 100
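
A minimal sketch of steps 1, 2 and 4 may help make the thresholds concrete. It uses plain Python dictionaries (column name to list of values, with None marking a missing value) rather than the library's Table and TabularDataset classes, so every function and constant name below is hypothetical. It also assumes that idness means the share of distinct values among all rows, that stability means the share of the most common non-missing value, and that the target column is never dropped. Steps 3, 5 and 6 are left out because the issue does not specify the imputation model, the outlier criterion or the normalisation method.

```python
from collections import Counter

# Thresholds taken from the issue text; the constant names are hypothetical.
MAX_IDNESS = 0.9
MAX_STABILITY = 0.9
MAX_MISSING_RATIO = 0.6
MAX_CATEGORIES = 20


def idness(values):
    """Share of distinct values among all rows (assumed definition)."""
    return len(set(values)) / len(values) if values else 0.0


def stability(values):
    """Share of the most common non-missing value (assumed definition)."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return Counter(present).most_common(1)[0][1] / len(present)


def missing_ratio(values):
    """Share of rows whose value is missing (None)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def is_numeric(values):
    """True if every non-missing value is an int or float (bools excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in values
        if v is not None
    )


def drop_uninformative(columns, target_column):
    """Steps 1 and 2. Assumption: the target column is always kept."""
    kept = {}
    for name, values in columns.items():
        if name != target_column:
            if idness(values) > MAX_IDNESS or stability(values) > MAX_STABILITY:
                continue
            if missing_ratio(values) > MAX_MISSING_RATIO:
                continue
        kept[name] = values
    return kept


def one_hot_encode(columns, target_column):
    """Step 4: encode small non-numerical columns, drop the other ones."""
    result = {}
    for name, values in columns.items():
        if name == target_column or is_numeric(values):
            result[name] = values
            continue
        categories = sorted({v for v in values if v is not None}, key=str)
        if len(categories) >= MAX_CATEGORIES:
            continue  # 20 or more distinct values: remove the column
        for category in categories:
            result[f"{name}__{category}"] = [int(v == category) for v in values]
    return result


# Made-up example data, only for illustration.
table = {
    "id": list(range(1, 11)),                     # idness 1.0, removed
    "country": ["DE"] * 10,                       # stability 1.0, removed
    "colour": ["red", "blue"] * 5,                # kept, then one-hot encoded
    "notes": [None] * 7 + ["a", "b", "a"],        # 70% missing, removed
    "age": [30, 41, 25, 30, 52, 47, 30, 25, 41, 38],
    "label": [0, 1] * 5,                          # target column
}

cleaned = one_hot_encode(drop_uninformative(table, "label"), "label")
print(list(cleaned))
```

In this sketch the comparisons are strict, so a column with exactly 90% idness or exactly 60% missing values is kept. Whether the boundary values should count as "above" is a choice the issue leaves open.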

Possible alternatives (optional)

No response

Screenshots (optional)

No response

Additional Context (optional)

No response

Metadata

Assignees

No one assigned

    Labels

    wontfix: This will not be worked on

    Type

    No type

    Projects

    Status

    ✔️ Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
