fix: improvements to the doppelganger model #302

Merged
merged 1 commit into from
Sep 7, 2023
33,601 changes: 33,601 additions & 0 deletions data/fcc_mba.csv

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/examples/doppelganger_example.md
@@ -10,11 +10,11 @@ DoppelGANger is a model that uses a Generative Adversarial Network (GAN) framewo

- 📑 **Paper:** [Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions](https://dl.acm.org/doi/pdf/10.1145/3419394.3423643)

-Here’s an example of how to synthesize time-series data with DoppelGANger using the [Yahoo Stock Price](https://www.kaggle.com/datasets/arashnic/time-series-forecasting-with-yahoo-stock-price) dataset:
+Here’s an example of how to synthesize time-series data with DoppelGANger using the [Measuring Broadband America](https://www.fcc.gov/reports-research/reports/measuring-broadband-america/raw-data-measuring-broadband-america-seventh) dataset:


```python
---8<-- "examples/timeseries/stock_doppelganger.py"
+--8<-- "examples/timeseries/mba_doppelganger.py"
```
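
The example script referenced above (`examples/timeseries/mba_doppelganger.py`) sets `sequence_length=56` and `sample_length=8`. The DoppelGANger paper describes a generator that emits a small batch of consecutive records per recurrent step instead of one record at a time. Assuming `sample_length` plays that role here (an assumption, not something this PR confirms), each series would be produced in 56 / 8 = 7 steps. A minimal sketch of that arithmetic follows; `generation_steps` is a hypothetical helper and not part of ydata-synthetic.

```python
def generation_steps(sequence_length: int, sample_length: int) -> int:
    """Number of generator steps needed to emit one full sequence.

    Hypothetical helper: assumes each step emits `sample_length` records,
    so the sequence length must be a whole multiple of it.
    """
    if sequence_length <= 0 or sample_length <= 0:
        raise ValueError("both lengths must be positive")
    if sequence_length % sample_length != 0:
        raise ValueError("sequence_length must be a multiple of sample_length")
    return sequence_length // sample_length


# Values taken from the TrainParameters call in the example script.
print(generation_steps(sequence_length=56, sample_length=8))  # 7
```

Under this reading, picking a `sample_length` that does not divide `sequence_length` would leave a partial final step, which is why the sketch rejects it.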


35 changes: 21 additions & 14 deletions docs/getting-started/quickstart.md
@@ -35,23 +35,30 @@ The following example showcases how to synthesize the [Yahoo Stock Price](https:
```python
# Import the necessary modules
import pandas as pd
-from ydata_synthetic.synthesizers import ModelParameters
-from ydata_synthetic.synthesizers.timeseries import TimeGAN
-from ydata_synthetic.preprocessing.timeseries.utils import real_data_loading
+from ydata_synthetic.synthesizers.timeseries import TimeSeriesSynthesizer
+from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

-# Load and preprocess data
-stock_data_df = pd.read_csv("stock_data.csv")
-processed_data = real_data_loading(stock_data_df.values, seq_len=24)
-
-# Define model and training parameters
-gan_args = ModelParameters(batch_size=128, lr=5e-4, noise_dim=128, layers_dim=128)
-synth = TimeGAN(model_parameters=gan_args, hidden_dim=24, seq_len=24, n_seq=6, gamma=1)
+# Define model parameters
+gan_args = ModelParameters(batch_size=128,
+                           lr=5e-4,
+                           noise_dim=32,
+                           layers_dim=128,
+                           latent_dim=24,
+                           gamma=1)

-# Train the generator model
-synth.train(data=processed_data, train_steps=50000)
+train_args = TrainParameters(epochs=50000,
+                             sequence_length=24,
+                             number_sequences=6)
+
+# Read the data
+stock_data = pd.read_csv("stock_data.csv")
+
+# Training the TimeGAN synthesizer
+synth = TimeSeriesSynthesizer(modelname='timegan', model_parameters=gan_args)
+synth.fit(stock_data, train_args, num_cols=list(stock_data.columns))

-# Generate new synthetic data
-synth_data = synth.sample(len(stock_data_df))
+# Generating new synthetic samples
+synth_data = synth.sample(n_samples=500)
```
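
The old quickstart windowed the data explicitly with `real_data_loading(..., seq_len=24)`, while the new one passes `sequence_length=24` through `TrainParameters`. Both refer to cutting the series into fixed-length windows. The sketch below illustrates that idea in plain Python; `sliding_windows` is a hypothetical helper written for this note, not the library's implementation, and the library's own preprocessing may do more (for example scaling or shuffling).

```python
from typing import List, Sequence


def sliding_windows(rows: Sequence[Sequence[float]],
                    seq_len: int) -> List[List[Sequence[float]]]:
    """Cut a table of rows into overlapping windows of `seq_len` consecutive rows."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    # A series shorter than seq_len yields no windows at all.
    return [list(rows[i:i + seq_len]) for i in range(len(rows) - seq_len + 1)]


# Toy series: 30 time steps with 6 features each (mirroring number_sequences=6).
toy_rows = [[float(t + f) for f in range(6)] for t in range(30)]
windows = sliding_windows(toy_rows, seq_len=24)

print(len(windows))        # 7 windows (30 - 24 + 1)
print(len(windows[0]))     # 24 time steps per window
print(len(windows[0][0]))  # 6 features per time step
```

With the new API this step happens inside `fit`, so the caller passes the raw DataFrame and only states the window length.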

## Running the Streamlit App
63 changes: 63 additions & 0 deletions examples/timeseries/mba_doppelganger.py
@@ -0,0 +1,63 @@
"""
DoppelGANger architecture example file
"""

# Importing necessary libraries
import pandas as pd
from os import path
import matplotlib.pyplot as plt
from ydata_synthetic.synthesizers.timeseries import TimeSeriesSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# Read the data
mba_data = pd.read_csv("../../data/fcc_mba.csv")
numerical_cols = ["traffic_byte_counter", "ping_loss_rate"]
categorical_cols = [col for col in mba_data.columns if col not in numerical_cols]

# Define model parameters
model_args = ModelParameters(batch_size=100,
                             lr=0.001,
                             betas=(0.2, 0.9),
                             latent_dim=20,
                             gp_lambda=2,
                             pac=1)

train_args = TrainParameters(epochs=400, sequence_length=56,
                             sample_length=8, rounds=1,
                             measurement_cols=["traffic_byte_counter", "ping_loss_rate"])

# Training the DoppelGANger synthesizer
if path.exists('doppelganger_mba'):
    model_dop_gan = TimeSeriesSynthesizer.load('doppelganger_mba')
else:
    model_dop_gan = TimeSeriesSynthesizer(modelname='doppelganger', model_parameters=model_args)
    model_dop_gan.fit(mba_data, train_args, num_cols=numerical_cols, cat_cols=categorical_cols)
    model_dop_gan.save('doppelganger_mba')

# Generate synthetic data
synth_data = model_dop_gan.sample(n_samples=600)
synth_df = pd.concat(synth_data, axis=0)

# Create a plot for each measurement column
plt.figure(figsize=(10, 6))

plt.subplot(2, 1, 1)
plt.plot(mba_data['traffic_byte_counter'].reset_index(drop=True), label='Real Traffic')
plt.plot(synth_df['traffic_byte_counter'].reset_index(drop=True), label='Synthetic Traffic', alpha=0.7)
plt.xlabel('Index')
plt.ylabel('Value')
plt.title('Traffic Comparison')
plt.legend()
plt.grid(True)

plt.subplot(2, 1, 2)
plt.plot(mba_data['ping_loss_rate'].reset_index(drop=True), label='Real Ping')
plt.plot(synth_df['ping_loss_rate'].reset_index(drop=True), label='Synthetic Ping', alpha=0.7)
plt.xlabel('Index')
plt.ylabel('Value')
plt.title('Ping Comparison')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()
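
The plots above compare real and synthetic measurements visually. A numeric check on simple summary statistics can complement them. The sketch below uses only the standard library; `compare_columns` is a hypothetical helper, and the toy values are made up for illustration and are not taken from `fcc_mba.csv` or from any model output.

```python
from statistics import mean, pstdev
from typing import Dict, Mapping, Sequence


def compare_columns(real: Mapping[str, Sequence[float]],
                    synth: Mapping[str, Sequence[float]],
                    columns: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Return the absolute gap in mean and population std for each column."""
    report = {}
    for col in columns:
        report[col] = {
            "mean_gap": abs(mean(real[col]) - mean(synth[col])),
            "std_gap": abs(pstdev(real[col]) - pstdev(synth[col])),
        }
    return report


# Toy values for illustration only.
toy_real = {"traffic_byte_counter": [1.0, 2.0, 3.0, 4.0],
            "ping_loss_rate": [0.0, 0.0, 0.5, 0.5]}
toy_synth = {"traffic_byte_counter": [2.0, 2.0, 3.0, 3.0],
             "ping_loss_rate": [0.0, 0.5, 0.0, 0.5]}

report = compare_columns(toy_real, toy_synth,
                         ["traffic_byte_counter", "ping_loss_rate"])
for col, gaps in report.items():
    print(col, gaps)
```

To apply this to the script's data, one option is to pass plain lists, for example `{col: mba_data[col].tolist() for col in numerical_cols}` for the real data and the same construction over `synth_df` for the synthetic data. Matching means and standard deviations do not prove the temporal structure is preserved, so this is a sanity check rather than a full evaluation.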
35 changes: 0 additions & 35 deletions examples/timeseries/stock_doppelganger.py

This file was deleted.
