Extraction function
To provide extraction capabilities, the user has to implement an extraction function. This function is responsible for transforming the raw data set (probably in .csv format) into a Caddo-compatible Pandas DataFrame.
The name of the file is not important, because it can be set in the configuration file. Caddo Data Factory will look in this file for a single function named extract. The file may contain other functions, but only one of them may be named extract.
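A lookup of this kind could work roughly like the sketch below, which loads a Python file by path with the standard importlib machinery and picks out its extract function. This is an assumption made for illustration, not Caddo Data Factory's actual loader; the helper name load_extract_function and the module name user_extraction are hypothetical.

```python
import importlib.util


def load_extract_function(path):
    """Load the Python file at `path` and return its `extract` function.

    Hypothetical helper: the real Caddo Data Factory lookup may differ.
    """
    spec = importlib.util.spec_from_file_location("user_extraction", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file so its functions exist

    extract = getattr(module, "extract", None)
    if not callable(extract):
        raise AttributeError(f"{path} does not define a function named 'extract'")
    return extract
```

Helper functions defined next to extract in the same file are loaded along with it, so extract can call them freely.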
As mentioned, the extraction function should be named extract. It should accept one argument, a Pandas DataFrame containing the raw data already read from the source file, and it should return one object, also a Pandas DataFrame. The returned object will be serialized to a csv file and saved into the .caddo file, so choose the separators in the output file wisely.
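To illustrate the separator concern, here is a minimal sketch that writes a small frame to CSV with an explicit separator and reads it back. How Caddo itself writes the file is not shown in this document, so the semicolon and the in-memory buffer are assumptions; the sample rows are made up.

```python
import io

import pandas as pd

# Made-up sample: the free-text column contains commas.
extracted = pd.DataFrame({
    "x__contents": ["select a, b from t", "plain text"],
    "y__class_value": [1, 0],
    "idx": [0, 1],
})

# Write with a separator that does not occur in the data ...
buffer = io.StringIO()
extracted.to_csv(buffer, sep=";", index=False)

# ... and read it back with the same separator.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
```

By default pandas quotes any field that contains the separator, but a separator that never appears in the data keeps the file easy to read with other tools as well.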
The resulting .csv file has one rule: all x columns must be prefixed with x__ and all y columns with y__, so the user has to take care of this during extraction. For example, if a column x is named contents in the original dataset, then after extraction it should be named x__contents. This is important because it is how the Benchmark Tool recognizes whether a column is x or y.
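A consumer of the extracted data could tell x and y columns apart by prefix along the lines of the sketch below. This is only an illustration of the naming rule, not the Benchmark Tool's actual code; split_x_y and the sample rows are hypothetical.

```python
import pandas as pd


def split_x_y(data_frame):
    """Split columns into x and y groups by their name prefix."""
    x_columns = [c for c in data_frame.columns if c.startswith("x__")]
    y_columns = [c for c in data_frame.columns if c.startswith("y__")]
    return data_frame[x_columns], data_frame[y_columns]


extracted = pd.DataFrame({
    "x__contents": ["select * from users", "hello world"],
    "y__class_value": [1, 0],
    "idx": [0, 1],
})
x_part, y_part = split_x_y(extracted)
```

A column without either prefix, such as idx here, ends up in neither group, which is why a missing prefix would make a column invisible to this kind of check.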
Here is an example of an extraction function for a dataset that contains one x column, contents, and one y column, class_value. The x column is split into 9 new columns, each of which checks whether a given keyword is part of x. We build the list of names for these new x columns, each with the x__ prefix (this is important and cannot be skipped). Then we give every row of the x column an id, because the id identifies the row during training and testing; this step cannot be skipped either. At the end we create the dataframe with the new x columns, y__class_value and idx.
import numpy as np
import pandas as pd


def extract(dataset):
    # Keywords to look for in each row of the x column.
    keywords = ['"', "sql", "statement", "select", "insert", "delete", "update", "drop", "execute"]
    # Every new x column name must carry the x__ prefix.
    columns = []
    for keyword in keywords:
        columns.append("x__" + keyword)
    result = []
    indexes = []
    index = 0
    for x in dataset["contents"].astype('str'):
        # One 0/1 flag per keyword: 1 if the keyword occurs in the text.
        result.append(np.array([1 if keyword in x else 0 for keyword in keywords]))
        indexes.append(index)
        index += 1
    data_frame = pd.DataFrame(data=result, columns=columns)
    data_frame["y__class_value"] = dataset["class_value"]
    # idx identifies each row during training and testing.
    data_frame["idx"] = indexes
    return data_frame

Click here to get the function for the second example, which shows a dataset with multiple y columns. In that case the only thing the extraction function does is rename all x and y columns with the prefixes: even when nothing needs to be extracted in this step, the prefixes are still required.
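A prefix-only extraction function for a dataset with several y columns might look like the sketch below. It is not the linked example; the column names text, is_spam and is_urgent, and the X_COLUMNS and Y_COLUMNS constants, are hypothetical and stand in for whatever the real dataset contains.

```python
import pandas as pd

# Hypothetical column names of the raw dataset.
X_COLUMNS = ["text"]
Y_COLUMNS = ["is_spam", "is_urgent"]


def extract(dataset):
    # Map every raw column name to its prefixed name.
    renames = {name: "x__" + name for name in X_COLUMNS}
    renames.update({name: "y__" + name for name in Y_COLUMNS})

    # Selecting with a list returns a new frame, so the input is not modified.
    data_frame = dataset[X_COLUMNS + Y_COLUMNS].rename(columns=renames)
    # One id per row, as in the first example.
    data_frame["idx"] = range(len(data_frame))
    return data_frame


raw = pd.DataFrame({
    "text": ["win a prize now", "meeting at noon"],
    "is_spam": [1, 0],
    "is_urgent": [1, 1],
})
extracted = extract(raw)
```

The values are passed through unchanged; only the column names differ between raw and extracted, plus the added idx column.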