Environment Details
Please indicate the following details about the environment in which you found the bug:
SDV: 1.19.0
python: 3.9.22
'linux-x86_64' for WSL2 Ubuntu 24.04.2
Error Description
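When sampling a single categorical column with GaussianCopulaSynthesizer, the synthetic value counts are extremely unlikely given the training data: the input is roughly uniform over four categories, but the output is heavily skewed and one category never appears. See the example below.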
Steps to reproduce
from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer
from IPython.display import display
import numpy as np
import pandas as pd
# size of sample
nsamp = 10_000
# the SDV API does not expose its RNG, so only the input sample can be seeded
rng = np.random.default_rng(seed=4711)
# create a one-column DataFrame holding a categorical with four levels
df = pd.DataFrame(pd.Categorical([0, 1, 2, 3]))
# draw the input sample with equal weights for all four levels
smp_orig = df.sample(n=nsamp, weights=[0.25, 0.25, 0.25, 0.25],
                     replace=True, random_state=rng, ignore_index=True)
# single "category" variable
display(smp_orig.dtypes)
# value counts look credible ~2500 each
display(smp_orig.value_counts())
# create synthetic sample
metadata = Metadata.detect_from_dataframe(
    data=smp_orig,
    table_name="smp_orig")
# single categorical variable detected => looks OK
print(metadata)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(smp_orig)
smp_synth = synthesizer.sample(num_rows=nsamp)
# looks formally OK at first sight
display(smp_synth.head())
# but the value counts are far from uniform, e.g. 1: 5691, 3: 3725, 2: 584, 0: 0
display(smp_synth.value_counts())
# version information
from sdv import version
print(version.public) # SDV: 1.19.0
import sys
print(sys.version) # python: 3.9.22
import sysconfig
print(sysconfig.get_platform()) # 'linux-x86_64' for WSL2 Ubuntu 24.04.2
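To put a number on "extremely unlikely", the synthetic counts can be checked against the uniform weights used to build the input sample with a Pearson chi-square goodness-of-fit test. The following is only a minimal sketch using the standard library: the counts are the ones quoted in the comment above (they will differ between runs), and the helper `chi2_sf_df3` is a hypothetical name introduced here for the closed-form tail probability with 3 degrees of freedom.

```python
import math

# Synthetic value counts quoted in the report, for categories 0, 1, 2, 3.
# These are illustrative; a rerun of the synthesizer will give other numbers.
observed = [0, 5691, 584, 3725]
n = sum(observed)
# Expected count per category if the output were uniform like the input weights.
expected = [n / len(observed)] * len(observed)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Hypothetical helper: closed form of the regularized upper incomplete
    gamma function Q(3/2, x/2), valid only for 3 degrees of freedom.
    """
    z = x / 2.0
    return math.erfc(math.sqrt(z)) + 2.0 * math.sqrt(z / math.pi) * math.exp(-z)


p_value = chi2_sf_df3(chi2)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3g}")
```

With four categories there are 3 degrees of freedom, so the 5% critical value is about 7.81. A statistic in the thousands means the synthetic counts are not compatible with uniform sampling in any practical sense. If SciPy is installed, `scipy.stats.chisquare(observed)` should give the same statistic.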