Overview

So you need some synthetic data? Maybe ShadowTraffic can help. This page explains in detail what ShadowTraffic is and how it works. We'll progressively build up a simple example to introduce each of the major features.


Let's dive in: ShadowTraffic is a product that helps you rapidly simulate production traffic to your backend—primarily to Apache Kafka, Postgres, S3, webhooks, and a few others.

Specifically, it's a container that you deploy on your own infrastructure. You give it a single JSON configuration file which tells it what you want your data to look like, and it uses that to produce streams of data to your backend. You don't write any code to make this work.

Architectural diagram


The API

Functions

So what does that JSON configuration file look like? Let's illustrate it with an example.

Imagine that you want to generate a stream of sensor readings to a Kafka topic. Perhaps the data is as simple as this:

{
"sensorId": "a94729b9-c375-45a3-8a1b-58b96f6b77dc",
"reading": 60.53,
"timestamp": 1716321759
}

The guiding principle of ShadowTraffic is that you replace concrete values with functions. A function is a map with a key named _gen.

With that in mind, here's how you'd generate more data like the above:

{
"sensorId": { "_gen": "uuid" },
"reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5 },
"timestamp": { "_gen": "now" }
}

When you run that (with a fully assembled configuration, which we'll get to in a second), it will produce data like this:

[
{
"sensorId" : "c67473fb-46bf-92f3-1323-691ff7307e98",
"reading" : 54.62383104172507,
"timestamp" : 1716322323316
},
{
"sensorId" : "18883008-9cc7-318a-5f59-d964cc3ba36e",
"reading" : 61.93954024611097,
"timestamp" : 1716322323323
},
{
"sensorId" : "7f25b88f-71c0-1bcc-7f9d-c2069b9bd54e",
"reading" : 58.074918200513636,
"timestamp" : 1716322323342
}
]

What is really going on here?

ShadowTraffic scans your configuration file and looks for any maps with a key named _gen. If it finds one, it looks at the function name and compiles code at that location. Then at runtime, ShadowTraffic repeatedly invokes those functions to produce synthetic data.

This approach is powerful because the API is a mirror image of your data. You can put functions anywhere you like, and ShadowTraffic will mimic the structure. For example, if the sensor data was more nested like:

{
"sensorPayload": {
"identifiers": {
"v4": "a94729b9-c375-45a3-8a1b-58b96f6b77dc"
},
"value": [60.53, "degrees"],
"time": {
"recordedAt": 1716321759
}
}
}

Then all you'd need to do is shape your function calls in the same way:

{
"sensorPayload": {
"identifiers": {
"v4": { "_gen": "uuid" }
},
"value": [
{ "_gen": "normalDistribution", "mean": 60, "sd": 5 },
"degrees"
],
"time": {
"recordedAt": { "_gen": "now" }
}
}
}

Funny looking, isn't it? Just remember: replace concrete values with functions. That is all that's happening here.

Before we move on, one thing you might notice is that each event represents a brand new sensor. Those UUIDs will almost never repeat, so each sensor appears to emit only a single reading. Bear with this for a moment: we'll show a little later how to model individual sensors over time.

Function modifiers

Did you notice something a little off about the example above? When we started, we wanted sensor readings that looked neat with two decimal places, like 60.53, but our generator spits out long values like 54.62383104172507. How can you control that?

Many functions take specific parameters to control their behavior, but there are a few general parameters that can be used on almost any function. These are called function modifiers.

For instance, on any numeric function, you can use the decimals function modifier to trim the number of decimal places. Using decimals, we adjust our function call like so:

{
"_gen": "normalDistribution",
"mean": 60,
"sd": 5,
"decimals": 2
}

If we rerun ShadowTraffic, we'll now get reading values in exactly the shape we want them:

[
{
"sensorId" : "5a5be9a6-6245-e6c3-9d9d-a67a0121607e",
"reading" : 65.0,
"timestamp" : 1716393799097
},
{
"sensorId" : "1ef26028-4e5d-e57f-171a-ae256a1c498f",
"reading" : 69.44,
"timestamp" : 1716393799106
},
{
"sensorId" : "5a6798a9-3792-6b1f-b888-74c87170d00a",
"reading" : 56.41,
"timestamp" : 1716393799107
}
]

decimals works with other functions like uniformDistribution, divide, and anything else that returns numbers.
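
For instance, here's the same modifier applied to a uniformDistribution call (the bounds are illustrative):

```json
{
  "_gen": "uniformDistribution",
  "bounds": [50, 70],
  "decimals": 2
}
```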

Other useful function modifiers include path, selectKeys, and keyNames. You might want to spend a minute or two browsing these since they come in handy so often.

Generators

So far, we've introduced just enough abstraction to make new data. But we haven't actually run ShadowTraffic to see it producing that data to a particular backend like Kafka or Postgres. To do that, we'll need another concept: generators.

A generator describes the backend-specific attributes of the data. Continuing our example, we want to get our sensor data into a Kafka topic. When you send an event to Kafka, you need to supply a topic, key, value, and perhaps other information.

Slightly reshaping our function calls from above, the generator for Kafka looks like this:

{
"topic": "sensorReadings",
"key": {
"sensorId": { "_gen": "uuid" }
},
"value": {
"reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5 },
"timestamp": { "_gen": "now" }
}
}

Where did this JSON structure come from? Each backend type has a schema for what its generator must look like. Here's the Kafka generator schema, and here's the schema for Postgres if that's what you're working with.

Moving our example along, a ShadowTraffic configuration file requires an array of generators, so the nearly-complete configuration file looks like:

{
"generators": [
{
"topic": "sensorReadings",
"key": {
"sensorId": { "_gen": "uuid" }
},
"value": {
"reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5 },
"timestamp": { "_gen": "now" }
}
}
]
}

When you have more than one generator, ShadowTraffic executes them in a round-robin fashion unless otherwise specified. But we'll get to that soon!
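
To make round-robin concrete, here's a sketch with two generators; the second topic, sensorHeartbeats, is purely illustrative. ShadowTraffic will alternate between them, producing one event from each in turn:

```json
{
  "generators": [
    {
      "topic": "sensorReadings",
      "value": {
        "reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5 }
      }
    },
    {
      "topic": "sensorHeartbeats",
      "value": { "status": "ok" }
    }
  ]
}
```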

Connections

By now, you can probably guess the last concept we need to tie this all together: connections. Connections describe where, specifically, the data will go. They're a top-level construct in the configuration file, mapping a connection name to its details.

To send our sensor data to a local Kafka cluster, our connection will look something like this:

{
"generators": [
{
"topic": "sensorReadings",
"key": {
"sensorId": { "_gen": "uuid" }
},
"value": {
"reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5, "decimals": 2 },
"timestamp": { "_gen": "now" }
}
}
],
"connections": {
"dev-kafka": {
"kind": "kafka",
"producerConfigs": {
"bootstrap.servers": "localhost:9092",
"key.serializer": "io.shadowtraffic.kafka.serdes.JsonSerializer",
"value.serializer": "io.shadowtraffic.kafka.serdes.JsonSerializer"
}
}
}
}

A few things to call out:

  1. Each connection needs a name, dev-kafka in this case. When you have just one connection, ShadowTraffic automatically binds your generators to it, so you don't need to specify anything else, as in this example.
  2. If you have multiple connections, each generator has to specify a connection field to tell it what connection name it should bind to.
  3. The kind field specifies which connection type this is, and therefore what fields are required for a valid connection. For instance, if you set it to postgres, ShadowTraffic would require fields like host and port instead of bootstrap.servers.
  4. For you Kafka readers with a keen eye, you'll have noticed the JSON serializers with the ShadowTraffic package name. ShadowTraffic ships this out of the box for your convenience. You're welcome.
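
For example, with a second connection defined (the Postgres connection here is illustrative, with its details elided), each generator names its target via the connection field:

```json
{
  "generators": [
    {
      "connection": "dev-kafka",
      "topic": "sensorReadings",
      "value": {
        "reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5 }
      }
    }
  ],
  "connections": {
    "dev-kafka": { "kind": "kafka", "producerConfigs": { ... } },
    "dev-postgres": { "kind": "postgres", ... }
  }
}
```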

See the reference pages for Kafka, Postgres, and S3 to learn more about each connection type.

Now, with that much configuration, you can run ShadowTraffic to send sensor data to Kafka. Read on to the next section.

Dry run

Before you actually send data off to Kafka, you might want to check that the data looks how you'd expect.

ShadowTraffic lets you perform a dry run: you can see exactly what it's going to do, but instead of having the data sent to your connection, it gets printed to standard output on your terminal. Invoke ShadowTraffic with this command, noting --stdout and --sample.

docker run --env-file license.env -v $(pwd)/your-config-file.json:/home/config.json shadowtraffic/shadowtraffic:latest --config /home/config.json --stdout --sample 10

You should see it print 10 sensor readings similar to the ones below, and then exit.

{
"topic" : "sensorReadings",
"key" : {
"sensorId" : "ddc789f7-c9be-7883-da97-ce5759e526f6"
},
"value" : {
"reading" : 64.14,
"timestamp" : 1716909643832
}
}
{
"topic" : "sensorReadings",
"key" : {
"sensorId" : "b46da8cb-6cf4-c10a-2426-70e1a10e366d"
},
"value" : {
"reading" : 57.65,
"timestamp" : 1716909643840
}
}
{
"topic" : "sensorReadings",
"key" : {
"sensorId" : "e4ce8f14-6b42-36db-c467-6428173f49ff"
},
"value" : {
"reading" : 65.36,
"timestamp" : 1716909643841
}
}

What's even more useful during development is the --watch flag. By running ShadowTraffic with --stdout --sample 10 --watch, ShadowTraffic will print 10 sample events to standard output every time your configuration file changes. This is incredibly useful for iterating on your configuration.

Runtime

When all looks well, you can drop the development flags and run ShadowTraffic like so:

docker run --env-file license.env -v $(pwd)/your-config-file.json:/home/config.json shadowtraffic/shadowtraffic:latest --config /home/config.json

It should print some logs about connecting to Kafka and block, producing data indefinitely. You can stop it with Control-C or stop the container with the Docker CLI.

If you want a quick way to verify how ShadowTraffic is behaving, you can peek at its Prometheus metrics.

Generator configuration

If you checked Kafka after running ShadowTraffic, the first thing you probably noticed is that it produced quite a lot of messages in a short span of time. How can you slow it down? Or better yet, how can you change an entire generator's behavior in general?

There are two ways to configure a generator: either locally for a specific generator, or globally for all of them.

For example, to make a generator wait at least 200 milliseconds between events, you could configure its local throttleMs parameter:

{
"topic": "sensorReadings",
"key": {
"sensorId": { "_gen": "uuid" }
},
"value": {
"reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5, "decimals": 2 },
"timestamp": { "_gen": "now" }
},
"localConfigs": {
"throttleMs": 200
}
}

If you had many generators and you wanted all of them to wait at least 500 milliseconds between events, you could instead use the top-level global configuration field:

{
"generators": [
{
"topic": "sensorReadings",
"key": {
"sensorId": { "_gen": "uuid" }
},
"value": {
"reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5, "decimals": 2 },
"timestamp": { "_gen": "now" }
}
}
],
"globalConfigs": {
"throttleMs": 500
},
"connections": {
"dev-kafka": { ... }
}
}

If you supply the same configuration in both global and local, the local parameter takes precedence.

Similarly, you can use generator configuration to cap the number of generated events (maxEvents), delay a portion of events from being written to the connection (delay), among other things.
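
For example, here's a sketch that caps this generator at 1,000 events. It assumes maxEvents sits alongside throttleMs in localConfigs; check the generator configuration reference for the exact placement:

```json
{
  "topic": "sensorReadings",
  "value": {
    "reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5, "decimals": 2 }
  },
  "localConfigs": {
    "throttleMs": 200,
    "maxEvents": 1000
  }
}
```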

Notably, you can use functions inside the configuration. If you wanted each event to be generated anywhere between 100 and 200 milliseconds apart, you could use a function to create that variance:

{
"globalConfigs": {
"throttleMs": {
"_gen": "uniformDistribution",
"bounds": [100, 200]
}
}
}

Lookups

Until now, we've generated data to only one Kafka topic. But it's often the case that you have multiple streams of data which share a common identifier - in other words, a join key.

Imagine, for example, customer and order data. Both data sets usually share a common customerId field that links rows in both together.

How would you do that with ShadowTraffic? There's a function just for this purpose: lookup.

When ShadowTraffic runs a generator, it retains a window of history about the events it recently produced. lookup is a function that queries that history, picking out a random event to use.

Let's continue our sensor example and imagine there's another stream of data for maintenance requests. Every ~10 seconds, an event is generated which requests that a random sensor get checked for repairs:

{
"generators": [
{
"topic": "sensorReadings",
"key": {
"sensorId": { "_gen": "uuid" }
},
"value": {
"reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5, "decimals": 2 },
"timestamp": { "_gen": "now" }
},
"localConfigs": {
"throttle": {
"ms": 200
}
}
},
{
"topic": "maintainenceNotifications",
"value": {
"sensorId": {
"_gen": "lookup",
"topic": "sensorReadings",
"path": [ "key", "sensorId" ]
},
"status": "needs repair"
},
"localConfigs": {
"throttle": {
"ms": 10000
}
}
}
],
"connections": {
"dev-kafka": { ... }
}
}

The generator for maintainenceNotifications calls the lookup function, asking it for events previously generated to the sensorReadings topic. Notice the path function modifier. lookup returns an entire event that was previously generated, but we only want the sensorId. Using path lets us drill directly to the value we want.

When you run ShadowTraffic, you'll see common identifiers link up, like the following. Notice how the id a2c57eea-0589-70f8-b557-25e7ebb399c4 is shared in both events.

{
"topic" : "sensorReadings",
"key" : {
"sensorId" : "a2c57eea-0589-70f8-b557-25e7ebb399c4"
},
"value" : {
"reading" : 57.91,
"timestamp" : 1716330021188
}
}
{
"topic" : "maintainenceNotifications",
"value" : {
"sensorId" : "a2c57eea-0589-70f8-b557-25e7ebb399c4",
"status" : "needs repair"
}
}

By default, the last 1,000,000 generated events are available for lookups before being purged from memory, but you can raise or lower this limit with the history generator configuration.

Variables

When your generators get complex enough, you'll probably want to share data across multiple fields. Variables are an obvious abstraction for this.

To show how that works, let's extend our example and imagine that each sensor event also contains a URL to see a chart about its recent activity. Perhaps that URL contains the sensor ID itself, so we'll need to reference it in two places.

{
"topic": "sensorReadings",
"vars": {
"sensorId": { "_gen": "uuid" }
},
"key": {
"sensorId": { "_gen": "var", "var": "sensorId" }
},
"value": {
"reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5, "decimals": 2 },
"timestamp": { "_gen": "now" },
"url": {
"_gen": "string",
"expr": "http://mydomain.com/charts/#{sensorId}"
}
}
}

You declare variables in a top-level vars field, mapping names to expressions. To reference a variable, you simply use the var function with the name you want to bind. Notice further how the string function can reference variables through #{} templating.

When you run it, each event will share the same identifier in its sensorId and url fields:

{
"topic" : "sensorReadings",
"key" : {
"sensorId" : "6d95332c-fcc2-ff43-d48f-67053ecb5609"
},
"value" : {
"reading" : 60.02,
"timestamp" : 1716330583941,
"url" : "http://mydomain.com/charts/6d95332c-fcc2-ff43-d48f-67053ecb5609"
}
}

There might be some cases where you want to randomly generate a variable, but only do so once and lock its value for the lifetime of the generator. To do that, you can use the top-level varsOnce field, which works exactly like vars, but only evaluates once:

{
"topic": "sensorReadings",
"varsOnce": {
"originalHardware": {
"_gen": "boolean"
}
},
"vars": {
"sensorId": { "_gen": "uuid" }
},
"key": {
"sensorId": { "_gen": "var", "var": "sensorId" }
},
"value": {
"reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5, "decimals": 2 },
"timestamp": { "_gen": "now" },
"url": {
"_gen": "string",
"expr": "http://mydomain.com/charts/#{sensorId}"
},
"original": {
"_gen": "var",
"var": "originalHardware"
}
}
}

In the output data, original will be locked to either true or false for the lifetime of the run, varying only from one run to the next.

And yes, variables can reference other variables.
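
As a sketch of that, a derived variable can be built from another with string templating; here chartUrl is a hypothetical variable derived from sensorId:

```json
{
  "vars": {
    "sensorId": { "_gen": "uuid" },
    "chartUrl": {
      "_gen": "string",
      "expr": "http://mydomain.com/charts/#{sensorId}"
    }
  },
  "value": {
    "url": { "_gen": "var", "var": "chartUrl" }
  }
}
```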

Seeding

There are many cases, like correctness testing, where you'll want to generate the exact same data every time. ShadowTraffic can do this with the --seed parameter:

docker run --env-file license.env -v $(pwd)/your-config-file.json:/home/config.json shadowtraffic/shadowtraffic:latest --config /home/config.json --seed 42

When you set a seed, ShadowTraffic will not only generate the same data, but use the same throttle values, delay values, or any other configuration you've provided.

If you want to replicate a particular run, look at the top of ShadowTraffic's logs to see what the seed was:

✔ Running with seed 720838425. You can repeat this run by setting --seed 720838425.

For example, in the previous section, we discussed how varsOnce can lock a boolean value for the lifetime of one ShadowTraffic run, though it may be true or false the next time you run it. Using --seed will lock it to one or the other forever.

State machines

Earlier, we mentioned something a little weird about this data: every event comes from a new sensor ID. That isn't very realistic. What we want to create is a set of sensors, each of which sends updates over time. In the following two sections, we'll introduce the constructs you need to do it, starting with state machines.

To complete our example, let's imagine that we want 10 sensors sending updates, each at 1 second intervals. Each sensor's reading will start with a value of about 60, and each subsequent reading will be the previous reading plus a random value between -1 and 1. This will create a nice drift effect for each sensor.

A state machine is the perfect construct for modeling this. It's what it sounds like: you have a set of states and transitions. Each state describes how to override the base generator.

The best way to understand this is to just dive into the example. Notice how sensorId has been temporarily set to a specific UUID. We'll come back and fix this in the next section, but for now this makes sense: we're modeling the lifecycle of a single sensor.

{
"generators": [
{
"topic": "sensorReadings",
"key": {
"sensorId": "2d9549cc-ac0f-b899-0b8b-cca3fc0691d3"
},
"value": {
"timestamp": { "_gen": "now" }
},
"stateMachine": {
"_gen": "stateMachine",
"initial": "start",
"transitions": {
"start": "update",
"update": "update"
},
"states": {
"start": {
"value": {
"reading": {"_gen": "normalDistribution", "mean": 60, "sd": 5, "decimals": 2 }
}
},
"update": {
"value": {
"reading": {
"_gen": "add",
"args": [
{ "_gen": "uniformDistribution", "bounds": [-1, 1] },
{ "_gen": "previousEvent", "path": [ "value", "reading" ] }
],
"decimals": 2
}
}
}
}
},
"localConfigs": { "throttle": { "ms": 1000 } }
}
],
"connections": {
"dev-kafka": { ... }
}
}

Notice a few things:

  1. If you run it, you'll see one event written per second. The first event will be from the start state, and subsequent events will be from the update state.
  2. The reading field has been moved from the top-level value field into each state. The states merge, or override, its configuration into the base generator. You can observe this by adding more fields besides reading to the update state. All events but the first will contain your new field.
  3. previousEvent is a function that grabs the latest event from this generator's history. The path function modifier drills into the last sensor reading so it can be added to a random value as described.

We're almost there. In the next section, we'll remove that hardcoded sensor ID, 2d9549cc-ac0f-b899-0b8b-cca3fc0691d3, and generate data for 10 distinct sensors.

Forks

If you think about the ways you could take the previous example and generalize it to 10 sensors, one thing you could do is copy and paste that generator 9 more times, altering sensorId each time. That could work, but at best it's clumsy. What if you need 1,000 sensors?

Fork is a construct that dynamically clones a generator, running many copies of it in parallel. You create a top-level field, fork, and provide at least a key field. key describes the "identity" of each fork: is this sensor 1, sensor 2, or sensor 3? This is easiest to understand by looking at the configuration:

{
"generators": [
{
"topic": "sensorReadings",
"fork": {
"key": { "_gen": "uuid" },
"maxForks": 10
},
"key": {
"sensorId": { "_gen": "var", "var": "forkKey" }
},
"value": {
"timestamp": { "_gen": "now" }
},
"stateMachine": { ... },
"localConfigs": { "throttle": { "ms": 1000 } }
}
],
"connections": {
"dev-kafka": { ... }
}
}

Notice how:

  1. The key field in fork has been set to the uuid function, which is what it originally was at the start of this page. This will create random sensor IDs for us.
  2. There's also a maxForks field that's been set to 10. When unset, fork will generate as many instances as possible. Since we want only 10 sensors, maxForks puts an upper bound on it.
  3. The hardcoded sensor ID has been removed and replaced with a reference to a variable called forkKey. When you use fork, you can use this variable to identify which fork this is.

If you run this, you'll now see 10 different UUIDs, each of which updates every 1 second.

You might also notice that all the sensors appear to update at nearly the same time: a burst of 10 updates, then nothing for 1 second, and repeat. This is because by default, forks are spawned as fast as possible. You can stagger how quickly forks start with the aptly named stagger field.
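
A sketch of what staggered startup might look like, assuming stagger takes a millisecond value inside fork (see the fork reference for the exact shape):

```json
{
  "fork": {
    "key": { "_gen": "uuid" },
    "maxForks": 10,
    "stagger": { "ms": 100 }
  }
}
```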

Take care to either put an upper bound on the number of forks you start, with maxForks, or ensure that forks eventually stop running through either maxEvents or a state machine with a terminal state. Each additional fork consumes memory, so if you start an unbounded number of forks, you'll eventually run out of memory.
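
For instance, here's a sketch that bounds both the number of forks and each fork's lifetime, assuming maxEvents applies per fork rather than to the generator as a whole:

```json
{
  "topic": "sensorReadings",
  "fork": {
    "key": { "_gen": "uuid" },
    "maxForks": 10
  },
  "value": { ... },
  "localConfigs": {
    "throttleMs": 1000,
    "maxEvents": 500
  }
}
```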

Intervals

You may want to make your synthetic streams a little more realistic by controlling how they behave depending on the day or time. ShadowTraffic provides a construct for this: intervals.

intervals is a function that maps Cron strings to expressions. When the current wallclock time overlaps with one of the Cron strings, the mapped expression is used in its place.

For example, imagine that you wanted sensors to normally emit updates once per 4 seconds. But on every 5th minute of the hour, you want updates every 50 milliseconds, and on every 2nd minute of the hour, you want updates every 1000 milliseconds. You can adjust your throttle to look like this:

{
"topic": "sensorReadings",
"value": { ... },
"localConfigs": {
"throttle": {
"ms": {
"_gen": "intervals",
"intervals": [
[ "*/5 * * * *", 50 ],
[ "*/2 * * * *", 1000 ]
],
"defaultValue": 4000
}
}
}
}

Stateful functions

As we get towards the end of this overview, let's revisit what we started with: functions. Throughout this tour, we've used stateless functions like uuid and normalDistribution. Each time you invoke them, they return a completely random value.

But sometimes, you may find yourself in situations where you need to generate a series of values, where each one is a progression of the last. ShadowTraffic ships a few functions that behave this way. They're called stateful functions because they retain state between calls. You'll know you're working with a stateful function because it's indicated on the function's reference page.

Let's build on our example using the stateful sequentialInteger and sequentialString functions:

{
"topic": "sensorReadings",
"key": {
"sensorId": { "_gen": "sequentialString", "expr": "sensor-~d" }
},
"value": {
"i": { "_gen": "sequentialInteger", "startingFrom": 50 },
"reading": { "_gen": "normalDistribution", "mean": 60, "sd": 5, "decimals": 2 },
"timestamp": { "_gen": "now" }
}
}

Each time this generator runs, the internal state for these functions advances, and the values automatically progress in the output:

{
"topic" : "sensorReadings",
"key" : {
"sensorId" : "sensor-0"
},
"value" : {
"i" : 50,
"reading" : 63.31,
"timestamp" : 1716909120807
}
}
{
"topic" : "sensorReadings",
"key" : {
"sensorId" : "sensor-1"
},
"value" : {
"i" : 51,
"reading" : 61.9,
"timestamp" : 1716909120816
}
}
{
"topic" : "sensorReadings",
"key" : {
"sensorId" : "sensor-2"
},
"value" : {
"i" : 52,
"reading" : 64.75,
"timestamp" : 1716909120816
}
}

Preprocessors

We'll round the overview out by focusing on a concept that becomes useful the more you use ShadowTraffic: preprocessors. As your configuration files grow, you'll probably want some modularity, and even the ability to parameterize them at launch-time.

Preprocessors are special functions that help you do these things. As their name suggests, they run before all other functions, transforming your configuration first.

We can use them, for example, to put your connection information in a file that can be shared across other ShadowTraffic configurations. If you were to make a file called connections.json:

{
"dev-kafka": {
"kind": "kafka",
"producerConfigs": {
"bootstrap.servers": "localhost:9092",
"key.serializer": "io.shadowtraffic.kafka.serdes.JsonSerializer",
"value.serializer": "io.shadowtraffic.kafka.serdes.JsonSerializer"
}
}
}

You could include the connection block in your main configuration like so:

{
...
"connections": {
"_gen": "loadJsonFile",
"file": "connections.json"
}
}

ShadowTraffic will expand the contents of connections.json and inline them into the spot where you called loadJsonFile, and then proceed with normal validation.

Another thing you might want to do is inject variables from your environment. You can use the env function to do that, perhaps to parameterize your bootstrap server URL:

{
"_gen": "env",
"var": "BOOTSTRAP_SERVERS"
}

While this example showed using preprocessors in your connection settings, you can use them anywhere in your configuration file.
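
Combining the two, here's a sketch of connections.json with the bootstrap servers pulled from the environment:

```json
{
  "dev-kafka": {
    "kind": "kafka",
    "producerConfigs": {
      "bootstrap.servers": { "_gen": "env", "var": "BOOTSTRAP_SERVERS" },
      "key.serializer": "io.shadowtraffic.kafka.serdes.JsonSerializer",
      "value.serializer": "io.shadowtraffic.kafka.serdes.JsonSerializer"
    }
  }
}
```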


And with that, you have a solid understanding of the main constructs in ShadowTraffic.

Want to learn more? Try the video guides or the cheat sheet.