synthesized3.data_interface package#
- class synthesized3.data_interface.DataInterface#
Bases:
ABC
Abstract data interface class.
Implementations provide a common interface for querying dtype metadata from various data sources.
For the entire dataset it is possible to query:
- total number of rows
- the column names
- a tf_dataset yielding values from the datasource
For each column it is possible to query:
- unique value count
- NaN/missing value count
- unique values
- quantiles
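As an illustration of the dataset-level queries, here is a hypothetical sketch using the concrete PandasDataInterface documented below (the example DataFrame is invented for illustration):
>>> import pandas as pd
>>> from synthesized3.data_interface import PandasDataInterface
>>> df = pd.DataFrame({"age": [25, 32, 47], "income": [40000, 55000, 61000]})
>>> data_interface = PandasDataInterface(df)
>>> data_interface.num_rows
3
>>> data_interface.columns
['age', 'income']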
- key: str#
- __init__(df=None)#
Verify that the number of columns is allowed under the current license.
- abstract property raw_dataframe#
- abstract property columns: Sequence[str]#
The column names of the connected dataset.
- abstract property num_rows: int#
The total number of rows of the connected dataset.
- get_tf_dataset(meta_collection: MetaCollection) → DatasetV2#
Return a tf.data.Dataset yielding values from the connected data source.
- assert_column_names_exist(column_names: Collection[str], optional_text: str | None = '')#
Assert that all names in the input collection exist in the data source.
- get_safe_name(col: str, name_postfix: str) → str#
Find and return a unique column name with a given postfix, not already in the dataset.
Tabular datasets have named columns. Because the product must work generically with any possible column names, whenever we add columns (e.g. augmented columns or columns storing intermediate processing state) we must be sure the new name is unique so that we do not accidentally overwrite an existing column. This function returns such a name: one that does not already exist in the dataset.
E.g. if a dataset already has columns “col1” and “col1_nan”, we cannot add a new column called “col1_nan” without overwriting the original data in that column. Instead, this function would return something like “col1_nan_1”, which is unique; a minimal sketch of this uniquifying logic follows the parameter list below.
- Parameters:
col (str) – Name of the column to be created, without any postfixes
name_postfix (str) – Postfix for the column name (e.g. “nan”)
- Returns:
Safe name containing the input column name and postfix, possibly with further augmentations to ensure it is unique within the dataset
- Return type:
str
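The exact uniquifying scheme is an implementation detail of the SDK; the following is a minimal sketch of the logic described above (the helper name and counter-suffix scheme are assumptions, not the actual implementation):

from typing import Collection

def get_safe_name_sketch(columns: Collection[str], col: str, name_postfix: str) -> str:
    # Illustrative only: start from "<col>_<postfix>" and append a counter
    # until the candidate does not clash with an existing column.
    candidate = f"{col}_{name_postfix}"
    counter = 1
    while candidate in columns:
        candidate = f"{col}_{name_postfix}_{counter}"
        counter += 1
    return candidate

# With columns "col1" and "col1_nan" already present:
assert get_safe_name_sketch({"col1", "col1_nan"}, "col1", "nan") == "col1_nan_1"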
- static impute_missing_values(values: List[Any], missing_value_reprs: List[Any], possible_missing_values: List[Any]) → List[Any]#
- static decode_byte_strings(values: List[Any]) → List[Any]#
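Neither static helper is documented in this reference. For decode_byte_strings, one plausible reading (an assumption, not confirmed here) is that byte strings, e.g. values coming back from a tf.data pipeline, are decoded to Python str while other values pass through unchanged:

# Assumed behaviour of decode_byte_strings, for illustration only:
values = [b"alice", b"bob", "carol", 42]
decoded = [v.decode("utf-8") if isinstance(v, bytes) else v for v in values]
# decoded == ['alice', 'bob', 'carol', 42]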
- class synthesized3.data_interface.PandasDataInterface#
Bases:
DataInterface
Data interface for pandas DataFrames.
Allows the SDK to work with Pandas DataFrames.
Example
>>> from synthesized3.utils.docs import get_example_pandas_df
>>> # Util method to get an example pandas DataFrame
>>> df = get_example_pandas_df()
>>> data_interface = PandasDataInterface(df)
>>> data_interface.columns
['x_nans', 'age', 'gender', 'income', 'DOB']
- __init__(df: DataFrame)#
Verify that the number of columns is allowed under the current license.
- property raw_dataframe: DataFrame#
- property columns: Sequence[str]#
The column names of the connected dataset.
- property num_rows: int#
The total number of rows of the connected dataset.
- class synthesized3.data_interface.SparkDataInterface#
Bases:
DataInterface
Data interface for spark dataframes.
Allows the SDK to work with large datasets that do not fit entirely into memory.
Example
>>> from pyspark.sql import SparkSession
>>> import synthesized_datasets
>>> df = synthesized_datasets.REGRESSION.biased_data.load()
>>> spark = SparkSession.builder.master("local[4]").appName("sdk-spark").getOrCreate()
>>> df = spark.createDataFrame(df)
>>> data = SparkDataInterface(df, buffer_size=10_000)
>>> data.columns
['age', 'gender', 'income']
- __init__(df: DataFrame, buffer_size: int = 0, quantile_error: float = 0.1)#
Initialize a data interface using a spark dataframe.
- Parameters:
df (pyspark.sql.DataFrame) – The spark dataframe.
buffer_size (int) – Maximum number of rows to use as an in-memory buffer. This is used to determine the number of partitions for the spark dataframe. A value of zero implies buffer_size = num_rows.
quantile_error (float) – The acceptable fractional error in the bounds of the calculated quantiles. This is a trade-off between algorithm speed and accuracy.
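For intuition, Spark's built-in approximate quantile computation exposes the same speed/accuracy trade-off through a relative-error argument; whether SparkDataInterface delegates to it is not stated here, so treat the mapping as an assumption:

>>> # Larger relative error is faster but less precise; quantile_error
>>> # plausibly plays the same role for this interface. df is the spark
>>> # DataFrame from the example above.
>>> quartiles = df.approxQuantile("income", [0.25, 0.5, 0.75], 0.1)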
- property raw_dataframe: DataFrame#
- property columns: List[str]#
The column names of the connected dataset.
- property num_rows: int#
The total number of rows of the connected dataset.
- class synthesized3.data_interface.DataInterfaceFactory#
Bases:
object
Factory class for creating DataInterface objects
- static get_data_interface_from_df(df: DataFrame) → DataInterface#
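A hypothetical usage sketch: the factory inspects the type of the given DataFrame and returns the matching interface (pandas in, PandasDataInterface out is the natural reading, though the dispatch rules are not documented here):

>>> import pandas as pd
>>> from synthesized3.data_interface import DataInterfaceFactory, PandasDataInterface
>>> df = pd.DataFrame({"age": [25, 32, 47]})
>>> data_interface = DataInterfaceFactory.get_data_interface_from_df(df)
>>> isinstance(data_interface, PandasDataInterface)  # assumed dispatch
True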
Subpackages#
- synthesized3.data_interface.data_interfaces package
Submodules#
synthesized3.data_interface.data_interface module#
- synthesized3.data_interface.data_interface.pass_through(x)#
- class synthesized3.data_interface.data_interface.DataInterface#
Bases:
ABC
Abstract data interface class.
Implementations provide a common interface for querying dtype metadata from various data sources.
For the entire dataset it is possible to query:
- total number of rows
- the column names
- a tf_dataset yielding values from the datasource
For each column it is possible to query:
- unique value count
- NaN/missing value count
- unique values
- quantiles
- key: str#
- __init__(df=None)#
Verify that the number of columns is allowed under the current license.
- abstract property raw_dataframe#
- abstract property columns: Sequence[str]#
The column names of the connected dataset.
- abstract property num_rows: int#
The total number of rows of the connected dataset.
- get_tf_dataset(meta_collection: MetaCollection) → DatasetV2#
Return a tf.data.Dataset yielding values from the connected data source.
- assert_column_names_exist(column_names: Collection[str], optional_text: str | None = '')#
Assert that all names in the input collection exist in the data source.
- get_safe_name(col: str, name_postfix: str) → str#
Find and return a unique column name with a given postfix, not already in the dataset.
Tabular datasets have named columns. Because the product must work generically with any possible column names, whenever we add columns (e.g. augmented columns or columns storing intermediate processing state) we must be sure the new name is unique so that we do not accidentally overwrite an existing column. This function returns such a name: one that does not already exist in the dataset.
E.g. if a dataset already has columns “col1” and “col1_nan”, we cannot add a new column called “col1_nan” without overwriting the original data in that column. Instead, this function would return something like “col1_nan_1”, which is unique.
- Parameters:
col (str) – Name of the column to be created, without any postfixes
name_postfix (str) – Postfix for the column name (e.g. “nan”)
- Returns:
Safe name containing the input column name and postfix, possibly with further augmentations to ensure it is unique within the dataset
- Return type:
str
- static impute_missing_values(values: List[Any], missing_value_reprs: List[Any], possible_missing_values: List[Any]) → List[Any]#
- static decode_byte_strings(values: List[Any]) → List[Any]#
synthesized3.data_interface.data_interface_factory module#
- class synthesized3.data_interface.data_interface_factory.DataInterfaceFactory#
Bases:
object
Factory class for creating DataInterface objects
- static get_data_interface_from_df(df: DataFrame) → DataInterface#