synthesized3.synthesizer package#
- class synthesized3.synthesizer.TableSynthesizer#
Bases: object
Synthesizer of tabular data
- __init__(meta_collection: MetaCollection, transformer_collection: TransformerCollection, inverse_transformer_collection: TransformerCollection, model_collection: ModelCollection)#
- classmethod from_data_interface(data_interface: DataInterface, meta_overrides: Mapping[str, Type[Meta]] | None = None, model_overrides: Mapping[Tuple[str] | str, Dict[str, Type[Model] | Dict[str, Any]]] | None = None) → TableSynthesizer#
The primary method for creating TableSynthesizer objects.
- Parameters:
data_interface – DataInterface object.
meta_overrides – Mapping of column names to Meta subclasses to override the default Meta for that column. Defaults to None.
model_overrides – Mapping of column names (or tuples of column names) to model specifications, used to override the default Model for those columns. Defaults to None. Each value is a dictionary with the following structure:
Model overrides structure example:
{<column name or tuple of column names>: { 'model_type': <model class name>, 'model_kwargs': <kwargs for the model> } }
- Returns:
TableSynthesizer object.
- Example:
>>> # Basic usage:
>>> from synthesized3.utils.docs import (
...     get_example_spark_df, get_example_spark_session)
>>> from synthesized3 import SparkDataInterface
>>> # Get an example spark dataframe and spark session using util methods
>>> spark = get_example_spark_session()
>>> df = get_example_spark_df(spark)
>>> data_interface = SparkDataInterface(df)
>>> # Build synthesizer from the data interface
>>> synth = TableSynthesizer.from_data_interface(data_interface)
>>> synth.fit(df=df, epochs=1, steps_per_epoch=1)
Training epoch ...
>>> synth.sample(10, spark=spark)
DataFrame[x_nans: double, age: double, gender: string, income: double, DOB: string]
>>> # With meta overrides:
>>> from synthesized3.meta.metas import IntegerMeta
>>> # 'age' column is a double but will be treated as integer
>>> meta_overrides = {'age': IntegerMeta}
>>> synth = TableSynthesizer.from_data_interface(
...     data_interface, meta_overrides=meta_overrides
... )
>>> synth.fit(df=df, epochs=1, steps_per_epoch=1)
Training epoch ...
>>> synth.sample(10, spark=spark)
DataFrame[x_nans: double, age: double, gender: string, income: double, DOB: string]
>>> # With model overrides:
>>> model_overrides = {
...     'gender': {
...         'model_type': 'SamplingModel',
...     }
... }
>>> synth = TableSynthesizer.from_data_interface(
...     data_interface, model_overrides=model_overrides
... )
>>> synth.fit(df=df, epochs=1, steps_per_epoch=1)
Training epoch ...
>>> synth.sample(10, spark=spark)
DataFrame[x_nans: double, age: double, gender: string, income: double, DOB: string]
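The built-in examples override a single column. Per the structure above, a key may also be a tuple of column names, and 'model_kwargs' can carry arguments for the model. The sketch below is illustrative only: the tuple grouping and the 'seed' kwarg are assumptions, not documented behaviour.
>>> # Hypothetical override mapping (sketch, not documented behaviour):
>>> model_overrides = {
...     'gender': {'model_type': 'SamplingModel'},
...     ('age', 'income'): {
...         'model_type': 'SamplingModel',
...         'model_kwargs': {'seed': 42},  # assumed kwarg, illustration only
...     },
... }
>>> synth = TableSynthesizer.from_data_interface(
...     data_interface, model_overrides=model_overrides
... )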
- classmethod from_meta_collection(meta_collection: MetaCollection, model_overrides: Mapping[Tuple[str] | str, Dict[str, Type[Model] | Dict[str, Any]]] | None = None)#
Create a TableSynthesizer from an existing MetaCollection rather than a DataInterface.
- Parameters:
meta_collection – A MetaCollection object.
model_overrides – Mapping of column names (or tuples of column names) to model specifications, as in from_data_interface(). Optional; defaults to None.
- Returns:
TableSynthesizer object.
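- Example:
A minimal sketch, assuming a MetaCollection is already at hand; taking it from an existing synthesizer's meta_collection attribute is an assumption about the API, not something this page documents.
>>> # Reuse existing metas with a model override (the meta_collection
>>> # attribute access is assumed, for illustration only).
>>> metas = synth.meta_collection
>>> synth2 = TableSynthesizer.from_meta_collection(
...     metas, model_overrides={'gender': {'model_type': 'SamplingModel'}}
... )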
- fit(df: pandas.DataFrame | pyspark.sql.DataFrame, batch_size: int = 1024, epochs: int = 400, steps_per_epoch: int = 50, num_workers: int = 1, callbacks: Callback | None = None, verbose: int = 1)#
Train the synthesizer, pulling the data from the data_interface.
- Parameters:
df – The dataframe to train on.
batch_size – The batch size to use for training. Defaults to 1024.
epochs – The maximum number of epochs to run. Defaults to 400.
steps_per_epoch – The number of steps to run per epoch. Defaults to 50.
num_workers – The number of workers to use for distributed training. Defaults to 1.
callbacks – A list of callbacks to use for training. Defaults to None.
verbose – The verbosity level to use for training. Defaults to 1.
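- Example:
A hedged sketch of a fuller training call; the EarlyStopping callback assumes Keras-style callbacks, inferred from the Callback type hint rather than confirmed by this page.
>>> # Shorter training run with early stopping (Keras-style callbacks
>>> # are an assumption based on the Callback type hint).
>>> from tensorflow.keras.callbacks import EarlyStopping
>>> synth.fit(
...     df=df,
...     batch_size=512,
...     epochs=100,
...     steps_per_epoch=50,
...     callbacks=[EarlyStopping(monitor='loss', patience=5)],
... )
Training epoch ...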
- sample(num_rows: int | None = None, seed: int | None = None, **kwargs)#
Synthesize the requested number of rows of data.
- Parameters:
num_rows – The number of rows to synthesize.
seed – The random seed to use for sampling. Defaults to None.
**kwargs – Additional keyword arguments, e.g. the spark session (spark=...) when sampling through a SparkDataInterface, as in the examples above.
- Returns:
A dataframe (type defined by the data interface) of the synthesized data.
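- Example:
A sketch of reproducible sampling; the spark keyword follows the Spark-based examples above, and equal seeds yielding equal samples is the intended reading of the seed parameter.
>>> # Same seed, same fitted synthesizer: the two samples should match.
>>> sample_a = synth.sample(num_rows=100, seed=7, spark=spark)
>>> sample_b = synth.sample(num_rows=100, seed=7, spark=spark)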
- classmethod from_df(df: pandas.DataFrame | pyspark.sql.DataFrame)#
The primary helper method for creating TableSynthesizer objects; allows passing a dataframe directly instead of a data_interface.
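- Example:
A minimal sketch of this shortcut, reusing the example dataframe and spark session from above.
>>> # Build, fit, and sample without constructing a DataInterface explicitly.
>>> synth = TableSynthesizer.from_df(df)
>>> synth.fit(df=df, epochs=1, steps_per_epoch=1)
Training epoch ...
>>> synth.sample(10, spark=spark)
DataFrame[x_nans: double, age: double, gender: string, income: double, DOB: string]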
Submodules#
synthesized3.synthesizer.table_synthesizer module#
- class synthesized3.synthesizer.table_synthesizer.TableSynthesizer#
See synthesized3.synthesizer.TableSynthesizer above; the class is re-exported at the package level and its documentation is identical.