A highly customizable event data generator, created by the team at Imply.
Run the generator.py script from the command line to create synthetic data in JSON format.
```bash
python generator.py \
  -c <generator specification file> \
  -t <target specification file> \
  -f <pattern specification file> \
  -s <start timestamp> \
  -m <generator workers limit> \
  -n <record limit> \
  -r <duration limit in ISO 8601 format>
```

| Argument | Description |
|---|---|
| `-c` | The name of the file in the config_file folder containing the generator specification. |
| `-t` | The name of the file that contains the target definition. This overrides any target specified in the generator specification. If neither is provided, stdout is used. |
| `-f` | A file that contains a pattern used to format the output records. If not specified, JSON is used. |
| `-s` | Use a simulated clock starting at the specified ISO time, rather than the system clock. This causes records to be produced instantaneously (batch) rather than with a real clock (real time). |
| `-m` | The maximum number of workers to create. Defaults to 100. |
| `-n` | The number of records to generate. Must not be used in combination with `-r`. |
| `-r` | The length of time to create records for, expressed as an ISO 8601 duration. Must not be used in combination with `-n`. |
The data generator requires Python 3.
```bash
apt-get install python3
apt-get update
apt-get install -y python3-pip
```

Install dependencies using the requirements.txt file:

```bash
pip install -r requirements.txt
```

Run the following example to test the generator script:

```bash
python3 generator.py -c conf/gen/apache_access_combined.json -m 1 -n 10 -t conf/tar/stdout.json
```

This command generates logs in the Apache `access_combined` format.
It uses a single worker to generate 10 records and writes the results to the standard output stream (for example, the terminal window).
For more examples and test cases, see test.sh.
For additional configurations, see the following directories:
- `./conf/gen`: Type of the generated data, such as Apache logs.
- `./conf/tar`: Target for the output, such as `file` or `stdout`.
- `./conf/form`: Format of the generated data, such as TSV.
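For instance, a single run can combine one file from each directory. The pattern file name below is left as a placeholder, since the contents of `./conf/form` are not listed here:

```bash
python3 generator.py \
  -c conf/gen/apache_access_combined.json \
  -f conf/form/<pattern file> \
  -t conf/tar/stdout.json \
  -n 10
```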
The generator specification is a JSON document that defines how the data generator executes. When the `-c` option is used, the generator specification is read from the named file; otherwise it is read from stdin.
A generator specification follows this structure:
```json
{
  "states": [ ... ],
  "emitters": [ ... ],
  "target": { },
  "interarrival": { }
}
```

The sections of the JSON document describe what each data generator worker will do:
- A list of `states` that a worker can transition through.
- A list of `emitters`, listing the dimensions that a worker will output and the data they will contain.
- A `target` definition (optional), stating where records should be written. When not provided inside a generator specification, a separate JSON file can be specified using the `-t` argument. This allows the same generator to be used with different targets.
- The `interarrival` time, controlling how often a new worker is spawned. The default maximum number of workers is 100, unless the `-m` argument is used.
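As noted above, the generator specification is read from stdin when `-c` is omitted, so an existing specification can also be piped in. A minimal sketch, reusing the files from the earlier test command:

```bash
cat conf/gen/apache_access_combined.json | python3 generator.py -t conf/tar/stdout.json -n 5
```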
Set the output of the data generator by defining the target object.
Use the `-t` option to designate a target definition file name. The target defines where the generated messages are sent.
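For illustration only, a stdout target definition might look like the sketch below. The field name is an assumption rather than a documented schema; the files in `./conf/tar` (such as stdout.json) are the authoritative reference:

```json
{ "type": "stdout" }
```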
The pattern specification file, passed with `-f`, is a text file with key names in double braces (`{{` and `}}`) marking where emitter dimensions will be inserted.
This allows formats other than JSON to be generated, such as CSV or TSV.
When the key refers to a dimension containing a datetime type, such as `clock` or `timestamp`, you can apply a strftime pattern by using a `|` symbol. For example, the following applies an "access_combined"-style date and time format to the `time` dimension:

```
[{{time|%d/%b/%Y:%H:%M:%S %z}}]
```

This becomes:

```
[23/Sep/2023:14:30:00 +0000]
```
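Because any number of keys can appear on a line, a pattern file can also lay out whole records. For example, a hypothetical tab-separated pattern using the dimension names from the sample records later in this document might look like this (fields separated by tab characters):

```
{{time|%Y-%m-%dT%H:%M:%S}}	{{server}}	{{client}}	{{endpoint}}	{{response_time_ms}}
```

Each generated record would then be rendered as one TSV row containing those five fields.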
Use either `-n` or `-r` to limit how long generation runs. If neither option is present, the script runs indefinitely.
Time durations are specified in ISO 8601 format.
For example, specify 30 seconds as follows:
```bash
python generator.py -c generator_spec.json -t target_spec.json -r PT30S
```

Specify 10 minutes as follows:

```bash
python generator.py -c generator_spec.json -t target_spec.json -r PT10M
```

Or, specify 1 hour as follows:

```bash
python generator.py -c generator_spec.json -t target_spec.json -r PT1H
```

Use `-n` to limit generation to a number of records:

```bash
python generator.py -c generator_spec.json -t target_spec.json -n 1000
```

Specify a start time in ISO format with `-s` to instruct the driver to use simulated time instead of the system clock (the default).
In the following example, the constraint is the number of records.
```bash
python3 generator.py -c conf/gen/example.json -t conf/tar/stdout.json -n 20 -s "2001-12-20T13:13"
```

- The `example.json` generator specification is used.
- The `target` in `stdout.json` determines where the JSON records will be output.
- `-n` requires that only 20 rows are output.
- The synthetic `time` clock starts on 20 December 2001 at 13:13.
This results in:
{"time":"2001-12-20T13:13:12.132","server":"127.0.0.5","client":"63.211.68.115","endpoint":"GET /api/users/73/contributions","response_time_ms":326}
{"time":"2001-12-20T13:13:17.464","server":"127.0.0.3","client":"79.58.216.203","endpoint":"GET /api/search?q=quantum-mechanics","response_time_ms":262}
{"time":"2001-12-20T13:13:20.776","server":"127.0.0.4","client":"96.54.85.35","endpoint":"GET /api/categories","response_time_ms":75}
{"time":"2001-12-20T13:13:28.023","server":"127.0.0.4","client":"96.54.85.35","endpoint":"GET /api/articles/56/contributors","response_time_ms":41}
{"time":"2001-12-20T13:13:28.077","server":"127.0.0.5","client":"18.202.244.47","endpoint":"POST /api/feedback","response_time_ms":179194}In the next example, the constraint is duration. This will cause the generator to create as many JSON records as would fit into a given duration (see -t below).
```bash
python3 generator.py -c conf/gen/example.json -t conf/tar/stdout.json -r PT1H -s "2027-03-12"
```

- The `-s` flag sets a synthetic clock start of 12 March 2027.
- Since `-r` is set to `PT1H`, the generator creates an hour's worth of data.

The result is a list of events spanning one hour from the time given in `-s`. This approach is therefore recommended when generating large volumes of data.
{"time":"2027-03-12T00:00","server":"127.0.0.6","client":"60.138.23.232","endpoint":"GET /api/articles/102/history","response_time_ms":405}
{"time":"2027-03-12T00:00:06.157","server":"127.0.0.6","client":"73.198.96.12","endpoint":"GET /api/articles","response_time_ms":210}
{"time":"2027-03-12T00:00:06.623","server":"127.0.0.4","client":"87.21.26.43","endpoint":"GET /api/articles/42","response_time_ms":445}
:
:
{"time":"2027-03-12T00:59:59.961","server":"127.0.0.4","client":"87.21.26.43","endpoint":"GET /api/users/73/contributions","response_time_ms":489}
{"time":"2027-03-12T00:59:59.965","server":"127.0.0.4","client":"62.155.215.104","endpoint":"POST /api/users/login","response_time_ms":97521}
{"time":"2027-03-12T00:59:59.973","server":"127.0.0.5","client":"87.21.26.43","endpoint":"GET /api/articles/56/contributors","response_time_ms":118}