synthesized3.data_interface package#

class synthesized3.data_interface.DataInterface#

Bases: ABC

Abstract data interface class.

Implementations provide a common interface for querying dtype metadata from various data sources.

For the entire dataset it is possible to query:
  • the total number of rows

  • the column names

  • a tf_dataset yielding values from the data source

For each column it is possible to query:
  • unique value count

  • NaN/missing value count

  • unique values

  • quantiles
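
A minimal sketch of the dataset-level queries, using the concrete PandasDataInterface documented below (the example DataFrame comes from the same docs utility used in that class's example):

>>> from synthesized3.data_interface import PandasDataInterface
>>> from synthesized3.utils.docs import get_example_pandas_df
>>> di = PandasDataInterface(get_example_pandas_df())
>>> di.columns
['x_nans', 'age', 'gender', 'income', 'DOB']
>>> isinstance(di.num_rows, int)  # exact row count depends on the example data
True
>>> # get_tf_dataset(meta_collection) additionally requires a MetaCollection describing the columns.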

key: str#
__init__(df=None)#

Verify that the number of columns is permitted by the current license.

abstract property raw_dataframe#
abstract property columns: Sequence[str]#

The column names of the connected dataset.

abstract property num_rows: int#

The total number of rows of the connected dataset.

get_tf_dataset(meta_collection: MetaCollection) → DatasetV2#
assert_column_names_exist(column_names: Collection[str], optional_text: str | None = '')#

Assert that all names in the input collection exist in the data source.
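
A hedged usage sketch, continuing from the interface di built in the example above; the failure behaviour and the handling of optional_text are assumptions read off the signature:

>>> di.assert_column_names_exist(["age", "income"])  # all names present: passes silently
>>> # A missing name is expected to fail the assertion, with optional_text appended to the message:
>>> # di.assert_column_names_exist(["not_a_column"], "while preparing inputs")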

get_safe_name(col: str, name_postfix: str) → str#

Find and return a unique column name with a given postfix, not already in the dataset.

Tabular datasets have named columns. Because the product must work generically with any possible column name inputs, when we add columns (e.g. augmented columns, or columns storing information mid-processing) we must be sure the new column name is unique and does not accidentally override an existing column. This function returns such a name: one that does not already exist in the dataset.

E.g. if a dataset already has columns “col1” and “col1_nan”, we cannot add a new column called “col1_nan”, because we would override the original data in that column. Instead, this function would return something like “col1_nan_1”, which is unique.

Parameters:
  • col (str) – Name of the column to be created, without any postfixes

  • name_postfix (str) – Postfix for the column name (e.g. “nan”)

Returns:

Safe name containing the input column name and postfix, possibly with other augmentations to the string to ensure it is unique for the dataset.

Return type:

str
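
A short sketch of the behaviour described above, reusing the docs helper from the earlier example; the exact suffix strategy is an implementation detail, so the returned name shown is illustrative (“col1_nan_1”-style, per the description):

>>> df = get_example_pandas_df()
>>> df["x_nans_nan"] = 0  # create a hypothetical clash: the postfixed name already exists
>>> di = PandasDataInterface(df)
>>> di.get_safe_name("x_nans", "nan")  # doctest: +SKIP
'x_nans_nan_1'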

static impute_missing_values(values: List[Any], missing_value_reprs: List[Any], possible_missing_values: List[Any]) → List[Any]#
static decode_byte_strings(values: List[Any]) → List[Any]#
class synthesized3.data_interface.PandasDataInterface#

Bases: DataInterface

Data interface for pandas DataFrames.

Allows the SDK to work with Pandas DataFrames.

Example

>>> from synthesized3.data_interface import PandasDataInterface
>>> from synthesized3.utils.docs import get_example_pandas_df
>>> # Util method to get an example pandas DataFrame
>>> df = get_example_pandas_df()
>>> data_interface = PandasDataInterface(df)
>>> data_interface.columns
['x_nans', 'age', 'gender', 'income', 'DOB']
__init__(df: DataFrame)#

Verify that the number of columns is permitted by the current license.

property raw_dataframe: DataFrame#
property columns: Sequence[str]#

The column names of the connected dataset.

property num_rows: int#

The total number of rows of the connected dataset.

class synthesized3.data_interface.SparkDataInterface#

Bases: DataInterface

Data interface for Spark DataFrames.

Allows the SDK to work with large datasets that do not fit entirely into memory.

Example

>>> from pyspark.sql import SparkSession
>>> from synthesized3.data_interface import SparkDataInterface
>>> import synthesized_datasets
>>> df = synthesized_datasets.REGRESSION.biased_data.load()
>>> spark = SparkSession.builder.master("local[4]").appName("sdk-spark").getOrCreate()
>>> df = spark.createDataFrame(df)
>>> data = SparkDataInterface(df, buffer_size=10_000)
>>> data.columns
['age', 'gender', 'income']
__init__(df: DataFrame, buffer_size: int = 0, quantile_error: float = 0.1)#

Initialize a data interface using a Spark DataFrame.

Parameters:
  • df (pyspark.sql.DataFrame) – The Spark DataFrame.

  • buffer_size (int) – Maximum number of rows to use as an in-memory buffer. This is used to determine the number of partitions of the Spark DataFrame. A value of zero implies buffer_size = num_rows.

  • quantile_error (float) – The acceptable fractional error in the bounds of the calculated quantiles. This is a trade-off between algorithm speed and accuracy.
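
A construction sketch showing both tuning knobs together; the values are illustrative, and df is the Spark DataFrame created in the class example above:

>>> # Larger buffer_size: fewer, larger in-memory partitions;
>>> # smaller quantile_error: tighter quantile bounds at the cost of speed.
>>> data = SparkDataInterface(df, buffer_size=50_000, quantile_error=0.05)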

property raw_dataframe: DataFrame#
property columns: List[str]#

The column names of the connected dataset.

property num_rows: int#

The total number of rows of the connected dataset.

class synthesized3.data_interface.DataInterfaceFactory#

Bases: object

Factory class for creating DataInterface objects.

static get_data_interface_from_df(df: DataFrame) → DataInterface#
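
A hedged dispatch sketch: the factory is assumed to inspect the DataFrame's type and return the matching concrete interface (the docs helper is the one from the PandasDataInterface example):

>>> from synthesized3.data_interface import DataInterfaceFactory
>>> from synthesized3.utils.docs import get_example_pandas_df
>>> di = DataInterfaceFactory.get_data_interface_from_df(get_example_pandas_df())
>>> type(di).__name__  # a pandas DataFrame is expected to map to the pandas interface
'PandasDataInterface'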

Submodules#

synthesized3.data_interface.data_interface module#

synthesized3.data_interface.data_interface.pass_through(x)#

synthesized3.data_interface.data_interface_factory module#


synthesized3.data_interface.data_interface_factory_test module#

synthesized3.data_interface.data_interface_test module#