synthesized3.synthesizer package#

class synthesized3.synthesizer.TableSynthesizer#

Bases: object

Synthesizer of tabular data

__init__(meta_collection: MetaCollection, transformer_collection: TransformerCollection, inverse_transformer_collection: TransformerCollection, model_collection: ModelCollection)#
classmethod from_data_interface(data_interface: DataInterface, meta_overrides: Mapping[str, Type[Meta]] | None = None, model_overrides: Mapping[Tuple[str] | str, Dict[str, Type[Model] | Dict[str, Any]]] | None = None) → TableSynthesizer#

The primary method for creating TableSynthesizer objects

Parameters:
  • data_interface – DataInterface object.

  • meta_overrides – Mapping of column names to Meta subclasses to override the default Meta for that column. Defaults to None.

  • model_overrides – Mapping of column names (or tuples of column names) to model override dictionaries. Defaults to None. Each entry should have the following structure:

Model overrides structure example:

{
    <column name or tuple of column names>: {
        'model_type': <model class name>,
        'model_kwargs': <kwargs for the model>
    }
}
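A concrete override for a single column might look like the following sketch (the empty 'model_kwargs' is illustrative; the kwargs accepted by a given model are not documented on this page):

{
    'gender': {
        'model_type': 'SamplingModel',
        'model_kwargs': {}
    }
}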
Returns:

TableSynthesizer object.

Example:
>>> # Basic usage:
>>> from synthesized3.utils.docs import (
...    get_example_spark_df, get_example_spark_session)
>>> from synthesized3 import SparkDataInterface
>>> # Get an example spark dataframe and spark session using util methods
>>> spark = get_example_spark_session()
>>> df = get_example_spark_df(spark)
>>> data_interface = SparkDataInterface(df)
>>> # Build synthesizer from the data interface
>>> synth = TableSynthesizer.from_data_interface(data_interface)
>>> synth.fit(df=df, epochs=1, steps_per_epoch=1) 
Training epoch ...
>>> synth.sample(10, spark=spark)
DataFrame[x_nans: double, age: double, gender: string, income: double, DOB: string]
>>> # With meta overrides:
>>> from synthesized3.meta.metas import IntegerMeta
>>> # 'age' column is a double but will be treated as integer
>>> meta_overrides = {'age': IntegerMeta}
>>> synth = TableSynthesizer.from_data_interface(
...     data_interface, meta_overrides=meta_overrides
... )
>>> synth.fit(df=df, epochs=1, steps_per_epoch=1) 
Training epoch ...
>>> synth.sample(10, spark=spark)
DataFrame[x_nans: double, age: double, gender: string, income: double, DOB: string]
>>> # With model overrides:
>>> model_overrides = {
...     'gender': {
...         'model_type': 'SamplingModel',
...     }
... }
>>> synth = TableSynthesizer.from_data_interface(
...     data_interface, model_overrides=model_overrides
... )
>>> synth.fit(df=df, epochs=1, steps_per_epoch=1) 
Training epoch ...
>>> synth.sample(10, spark=spark)
DataFrame[x_nans: double, age: double, gender: string, income: double, DOB: string]
classmethod from_meta_collection(meta_collection: MetaCollection, model_overrides: Mapping[Tuple[str] | str, Dict[str, Type[Model] | Dict[str, Any]]] | None = None)#

Method for creating TableSynthesizer objects from a MetaCollection

Parameters:
  • meta_collection – A MetaCollection object.

  • model_overrides – Optional mapping of column names (or tuples of column names) to model override dictionaries, with the same structure as in from_data_interface(). Defaults to None.

Returns:

TableSynthesizer object.
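
Example (a minimal sketch; obtaining the MetaCollection from an existing synthesizer's meta_collection attribute is an assumption, not confirmed by this page):
>>> meta_collection = synth.meta_collection  # hypothetical attribute; assumption
>>> new_synth = TableSynthesizer.from_meta_collection(meta_collection)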

fit(df: DataFrame, batch_size: int = 1024, epochs: int = 400, steps_per_epoch: int = 50, num_workers: int = 1, callbacks: Callback | None = None, verbose: int = 1)#

Train the synthesizer on the given dataframe.

Parameters:
  • df – The dataframe to train on.

  • batch_size – The batch size to use for training. Defaults to 1024.

  • epochs – The maximum number of epochs to run. Defaults to 400.

  • steps_per_epoch – The number of steps to run per epoch. Defaults to 50.

  • num_workers – The number of workers to use for distributed training. Defaults to 1.

  • callbacks – A list of callbacks to use for training. Defaults to None.

  • verbose – The verbosity level to use for training. Defaults to 1.
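
Example (a hedged sketch; it assumes the Callback type is Keras-compatible, which this page does not confirm):
>>> import tensorflow as tf
>>> # Assumption: stop training early once the loss stops improving
>>> early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5)
>>> synth.fit(df=df, batch_size=512, epochs=10, callbacks=[early_stop])
Training epoch ...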

sample(num_rows: int | None = None, seed: int | None = None, **kwargs)#

Synthesize a given number of rows of data

Parameters:
  • num_rows – The number of rows to synthesize.

  • seed – The random seed to use for sampling. Defaults to None.

  • **kwargs – Additional keyword arguments for the underlying data interface, e.g. spark=spark when sampling through a SparkDataInterface (as in the examples above).

Returns:

A dataframe (type defined by the data interface) of the synthesized data.
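
Example (reusing the synthesizer and Spark session from the examples above):
>>> # Fix the seed for reproducible output; pass the Spark session via **kwargs
>>> df_synth = synth.sample(num_rows=100, seed=42, spark=spark)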

classmethod from_df(df: DataFrame)#

Helper method for creating TableSynthesizer objects directly from a dataframe instead of a DataInterface
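
Example (a minimal sketch, reusing the example dataframe from above):
>>> synth = TableSynthesizer.from_df(df)
>>> synth.fit(df=df, epochs=1, steps_per_epoch=1)
Training epoch ...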

Submodules#

synthesized3.synthesizer.table_synthesizer module#

class synthesized3.synthesizer.table_synthesizer.TableSynthesizer#

Bases: object

Synthesizer of tabular data

This is the same class re-exported at the package level as synthesized3.synthesizer.TableSynthesizer; see that entry above for the full method documentation and examples.

synthesized3.synthesizer.table_synthesizer_test module#