Overview
fodder is a command line tool to generate data according to a supplied schema. It is aimed primarily at rapid prototyping and testing, letting you generate data quickly and easily so you can focus on the important stuff! Its feature set is not comprehensive, but it is sufficient for most use cases.
It follows the Unix philosophy of doing just one thing, and doing it well. Output always goes to standard output so it can be piped to another command or redirected to a file.
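For example (a quick sketch; the schema file name is illustrative, and the second command assumes jq is installed):
# Redirect to a file
fodder --schema schema.fodder.yaml --format csv --nrows 100 > data.csv
# Pipe to another command
fodder --schema schema.fodder.yaml --format json | jq '.[0]'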
Features
- Supports CSV, JSON and NDJSON output formats
- Field values can be a lookup in an external file, see Maps
- Field values can be selected from a list defined in an external file, see Categories
- Schema files can be dynamically linted when using Visual Studio Code with a little setup, see Linting and Autocompletion
- Single binary distribution, easily installed on any platform, see Installation
Use cases
- Prototyping (POCs)
- Supplying data to development environments
- End-to-end pipeline testing
Goals
The eventual goal is to make it so easy to generate fake data that it becomes a standard part of the development process and it is no longer necessary to use real data outside of production environments.
Why fodder?
There are a number of data generation tools and options out there, but nothing quite fit. So here's the user story:
As a data engineer, I'd like to quickly generate data for my tables or message stream.
Here are some key drivers for fodder...
- A single binary. Easy to install and update.
- Scales with hardware so it can utilise bigger and/or more machines.
- Straightforward tool focused on just data generation.
- Compose your specific solution with other tools.
- Fairly fast and efficient so useful for generating large data sets.
- No exposing your schemas to third parties.
- No external service so can run in secured, isolated environments.
- Get going quickly. Minimal dependencies mean minimal impediments.
You can use it in your shell as part of your local development workflow. You can use it in your CI/CD pipelines for testing. Remember to bake in failure modes, not just the happy path ;) You can put it in a container and run it at scale for integration and performance testing.
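For instance, a CI job might pipe generated events straight into the system under test (a sketch; the schema path and the my-pipeline command are placeholders for your own):
fodder -s schemas/EVENTS.fodder.yaml -n 10000 -f ndjson | ./my-pipeline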
Other tools
Faker (Javascript, Python, Ruby etc)
These libraries are great, but you have to write your custom generator as its own piece of software. They're powerful, flexible and full featured, but there is a lot of boilerplate just to get up and running, and using and maintaining them effectively takes time and expertise that can be daunting. This doesn't scale well across projects, and the time and software development expertise required is a barrier for many teams and organisations. We wanted something that is easier to get going with and addresses the common cases.
Mockaroo, GenRocket etc
These require commercial arrangements. They use a service outside our environment, which can be a complete showstopper for some organisations. Using them at scale, especially for performance testing, can be problematic or expensive.
JSONPlaceholder, MockServer, Mockbin
These are designed for mocking out an API, not flexible data generation at scale.
Installation
Pre-compiled binaries (macOS only)
- Download the binary for your system from the GitLab releases page.
- Place the binary in your `PATH` and make it executable:
  cp fodder /usr/local/bin
  chmod +x /usr/local/bin/fodder
- Run the `fodder` command to verify that it is working. The first time you run it you will get a warning about the binary being from an unidentified developer. You will need to go to `System Preferences > Security & Privacy > General` and click `Open Anyway` to allow the binary to run.
Install from source
This should work for any system that Rust supports. You will need to have Rust installed. See the Rust installation guide for more information.
- Clone the repository.
- Build the binary:
  cargo build --release
- Copy the binary to your `PATH`:
  cp target/release/fodder /usr/local/bin
- Run the `fodder` command to verify that it is working.
Usage
This section outlines the basic features of fodder and how to use them.
CLI
fodder --help
A data generation tool
Usage: fodder [OPTIONS] --schema <SCHEMA>
Options:
-s, --schema <SCHEMA> Path to the schema file
-n, --nrows <NROWS> Number of rows to generate [default: 3]
-f, --format <FORMAT> Output format [default: json] [possible values: csv, json, ndjson]
-d, --definition Show the JSON schema definition for allowed inputs, useful for autocompletion
-h, --help Print help
-V, --version Print version
Quickstart
Create a schema file
Here's an example schema file demonstrating some of the features (there are many more).
cat > schema.fodder.yaml <<EOF
fields:
# Generate a number between 0 and 10
- name: A
type: IntegerInRange
args:
min: 0
max: 10
# Generate a number between 0 and 20
- name: B
type: IntegerInRange
args:
max: 20
# That is greater than A
constraints:
- type: GreaterThan
name: A
# Generate a datetime
- name: C
type: DateTime
# With a 90% probability of being null
null_probability: 0.9
# Generate a sentence if C is null
- name: D
type: String
args:
subtype: Sentences
constraints:
- type: IfNull
name: C
EOF
Generate some data
fodder --schema schema.fodder.yaml --format csv --nrows 3
| A | B | C | D |
|---|---|---|---|
| 9 | 18 | | aut nostrum quod vero ratione in numquam qui temporibus. |
| 7 | 14 | 2592-10-28T02:43:00+00:00 | |
| 4 | 14 | | accusantium omnis aperiam velit est ea in et. |
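The same schema can also be rendered as NDJSON, one object per line, which suits streaming into other tools (illustrative output reusing the rows above; actual values are random):
fodder --schema schema.fodder.yaml --format ndjson --nrows 2
{"A":9,"B":18,"C":null,"D":"aut nostrum quod vero ratione in numquam qui temporibus."}
{"A":7,"B":14,"C":"2592-10-28T02:43:00+00:00","D":null}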
Schema
The schema is a YAML file that defines the structure of your data. It is used to control the generation of the data.
There are three major components to the schema:
- fields, which define the structure of the generated data
- categories, external lists of values to select from
- maps, external key/value lookups
A complete example
This is a mostly complete example of a schema; it is missing some field-specific arguments. For more information on the arguments for each field, see the Fields documentation.
It is also missing external data. For more information on external data, see the External Data documentation.
As with all schemas, this can be used to output JSON or CSV data. However, in this case JSON is used as it is easier to show the complexity of the nested fields in the output.
Definition
fields:
# Generate a number between 0 and 10.
- name: A
type: IntegerInRange
args:
min: 0
max: 10
# Generate a number between 0 and 20 that is greater than A.
- name: B
type: IntegerInRange
args:
max: 20
constraints:
- type: GreaterThan
name: A
# Generate a sentence if E is null.
- name: C
type: String
args:
subtype: Sentences
constraints:
- type: IfNull
name: E
# Generate a random boolean.
- name: D
type: Boolean
# Generate a datetime with a 10% probability (90% chance of being null).
- name: E
type: DateTime
null_probability: 0.9
# Generate a string from a list of choices with unequal probability.
- name: F
type: WeightedCategory
args:
choices:
- ["FOO", 2]
- ["BAR", 1]
# Generate a string from a list of choices with equal probability.
- name: G
type: WeightedCategory
args:
choices:
- "FOO"
- "BAR"
# Generate a string with random substitutions.
- name: H
type: Bothify
args:
format: "RANDOM_ID: ??-##-??"
# Generate a nested object.
- name: I
type: Nested
fields:
- name: J
type: IntegerInRange
args:
min: 0
max: 10
- name: K
type: IntegerInRange
args:
min: 0
max: 11
constraints:
- type: GreaterThan
name: A
- name: L
type: IntegerInRange
args:
min: 0
max: 13
- name: M
type: Nested
fields:
- name: N
type: IntegerInRange
args:
min: 0
max: 12
- name: O
type: IntegerInRange
args:
min: 0
max: 13
# Reference a nested field.
constraints:
- type: GreaterThan
name: I.M.N
# Generate a list of nested objects.
- name: P
type: Nested
args:
subtype: List
fields:
- name: Q
type: Nested
fields:
- name: R
type: IntegerInRange
args:
min: 0
max: 10
- name: S
type: Nested
fields:
- name: T
type: IntegerInRange
args:
min: 0
max: 10
# Demonstrate templating and referencing other fields.
- name: U
type: FormattedString
args:
format: |
This is the value of A: {{ refs['A'].raw }}
This is the value of A + 10: {{ refs['A'].raw + 10 }}
# Generate an array of integers of varying length.
- name: V
type: Array
args:
min: 0
max: 5
field:
name: W
type: IntegerInRange
args:
min: 0
max: 10
# Generate a "FreeEmail" by deferring to the `fake-rs` library.
- name: X
type: Fakers
args:
subtype: FreeEmail
# Calculate a duration between two dates, generally used by referencing other fields.
- name: Y
type: Duration
args:
start: "2010-01-01T00:00:00Z"
end: "2020-01-01T00:00:00Z"
component: Years
# Look up a value in a map. Used by referencing other fields. Maps are
# defined in the `maps` section and are referenced by name.
# - name: Z
# type: Map
# args:
# from_map: ID_TO_NAME
# key: A
# default: "No value found"
Output
fodder -s schema.yaml -n 1 -f json
[
{
"A": 3,
"B": 4,
"C": "dolorem praesentium rerum vel ipsum dolorum veritatis.",
"D": false,
"E": null,
"F": "BAR",
"G": "BAR",
"H": "RANDOM_ID: zS-98-TO",
"I": {
"J": 3,
"K": 4,
"L": 1,
"M": {
"N": 3,
"O": 9
}
},
"P": [
{
"R": 3
},
{
"T": 1
}
],
"U": "This is the value of A: 3\nThis is the value of A + 10: 13\n",
"V": [],
"X": "[email protected]",
"Y": "10"
}
]
Fields
Fields are the core of the schema: they define the structure of the data that will be generated. They are defined in the fields section of the schema.
A simple example
Below is a simple example of a schema with a single field. This field contains most of the possible options that can be set for a field, excluding Constraints and some features that are only available for certain field types, e.g. Nested fields and fields that allow Reference constraints.
Some details about this example:
- This field is named `ID` and this is how it will be named in the output.
- It is of type `IntegerInRange`, meaning it will generate a random integer within a range.
- It has two (optional) arguments, `min` and `max`. These are used to define the range of the integers that will be generated. In this case it is showing the default values of `0` and `9223372036854775807` (`i64::MAX`).
- It has a `null_probability` of `0`, meaning that it will never generate a `null` value.
fields:
- name: ID
type: IntegerInRange
args:
min: 0
max: 9223372036854775807
null_probability: 0
Running the above schema through fodder will generate the following output (in JSON format):
fodder -s schema.yaml
[
{
"ID": 4350876185243800642
},
{
"ID": 3998117975203216203
},
{
"ID": 2709943470313341799
}
]
A more complex example (with constraints)
Below is a more complex example of a schema with multiple fields. This example shows how to use Constraints to ensure that the generated data is representative of the real world.
Some details about this example:
- All fields will generate a random `DateTime`.
- The `CreatedAt` field will generate a random `DateTime` between 3 days ago and today.
- The `ModifiedAt` field will generate a random `DateTime` that is greater than `CreatedAt` and between 3 days ago and today.
- The `DeletedAt` field will generate a random `DateTime` that is greater than `ModifiedAt` and between 3 days ago and today. It will also have a `null_probability` of `0.9`, meaning that it will have a 90% chance of being `null`.
fields:
- name: CreatedAt
type: DateTime
args:
start: -3d
end: today
format: "%Y-%m-%d %H:%M:%S"
- name: ModifiedAt
type: DateTime
args:
start: -3d
end: today
format: "%Y-%m-%d %H:%M:%S"
constraints:
- type: GreaterThan
name: CreatedAt
- name: DeletedAt
type: DateTime
args:
start: -3d
end: today
format: "%Y-%m-%d %H:%M:%S"
null_probability: 0.9
constraints:
- type: GreaterThan
name: ModifiedAt
Running the above schema through fodder will generate the following output (this time in CSV format):
fodder -f csv -s schema.yaml
| CreatedAt | ModifiedAt | DeletedAt |
|---|---|---|
| 2023-01-30 12:49:54 | 2023-01-31 19:45:54 | 2023-02-01 22:50:54 |
| 2023-01-30 11:57:54 | 2023-01-31 22:59:54 | |
| 2023-02-01 10:27:54 | 2023-02-01 23:22:54 | |
IntegerInRange
The IntegerInRange field generates a random integer between a minimum and maximum value.
Schema
fields:
- name: Zero
type: IntegerInRange
args:
min: -20
max: 20
null_probability: 0
- name: One
type: IntegerInRange
null_probability: 0
args:
min: 0
max: 500
constraints:
- type: GreaterThan
name: Zero
Output
| Zero | One |
|---|---|
| -20 | 56 |
| 1 | 66 |
| -13 | 294 |
Arguments
| Name | Type | Description | Default |
|---|---|---|---|
| min | int | The minimum value | 0 |
| max | int | The maximum value | i64::MAX |
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |
Supported constraints
| Name | Description |
|---|---|
| GreaterThan | The value must be greater than the value of another field. |
| IfNull | The value must only be non-null if another field is null. |
String
The String field generates a random string of various types and lengths.
Schema
fields:
- name: Words
type: String
args:
subtype: Words
range:
start: 0
end: 10
null_probability: 0.5
- name: Paragraph
type: String
args:
      subtype: Paragraphs
range:
start: 1
end: 3
constraints:
- type: IfNull
        name: Words
Output
| Words | Paragraph |
|---|---|
| | quo repellat qui voluptatem dolor. perspiciatis sapiente aut voluptatibus molestias qui a placeat. dicta animi distinctio est. consequuntur fugit praesentium vero. natus omnis reiciendis officia. quia sequi esse qui. est animi voluptas deleniti id. sint quia cumque eum. illo incidunt quo adipisci recusandae. temporibus molestiae rerum culpa. |
| perspiciatis voluptas qui | |
| sint illum itaque totam | |
Arguments
| Name | Type | Description | Default |
|---|---|---|---|
| subtype | string | The type of string to generate. One of Words, Sentences, Paragraphs. | Words |
| range.start | Range | The minimum number of 'things' (words, sentences or paragraphs) to generate. | 1 |
| range.end | Range | The maximum number of 'things' (words, sentences or paragraphs) to generate. | 2 |
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |
Supported constraints
| Name | Description |
|---|---|
| IfNull | The value must only be non-null if another field is null. |
Boolean
The Boolean field is used to generate a random boolean value.
Schema
fields:
- name: BoolMaybeNull
type: Boolean
null_probability: 0.5
- name: Bool
type: Boolean
constraints:
- type: IfNull
name: BoolMaybeNull
Output
| BoolMaybeNull | Bool |
|---|---|
| true | |
| true | |
| true | |
| true | |
| | false |
Arguments
This field takes no arguments.
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |
Supported constraints
| Name | Description |
|---|---|
| IfNull | The value must only be non-null if another field is null. |
DateTime
The DateTime field generates a random date and time between a start and end.
Schema
fields:
- name: Created
type: DateTime
args:
start: "-10d"
end: "yesterday"
timezone: Australia/Perth
format: "%+"
- name: Deleted
type: DateTime
args:
start: "-10d"
end: "today"
timezone: Australia/Sydney
format: "%+"
constraints:
- type: GreaterThan
name: Created
- name: Refs
type: DateTime
args:
# This is how you reference another field.
start: "{{ refs['Choice'].raw }}"
end: "2010-01-01T00:00:00Z"
format: "%A, %B %e, %Y"
# You must define the fields that you want to reference.
- name: Choice
type: WeightedCategory
args:
choices:
- "2000-01-01T00:00:00Z"
- "1900-01-01T00:00:00Z"
Output
| Created | Deleted | Refs | Choice |
|---|---|---|---|
| 2023-02-01T13:29:35.622865+08:00 | 2023-02-02T08:37:35.622865+11:00 | Saturday, August 12, 1911 | 1900-01-01T00:00:00Z |
| 2023-01-27T21:37:35.622865+08:00 | 2023-02-02T01:53:35.622865+11:00 | Saturday, July 26, 1986 | 1900-01-01T00:00:00Z |
| 2023-01-26T03:38:35.622865+08:00 | 2023-01-30T09:13:35.622865+11:00 | Wednesday, September 26, 2007 | 2000-01-01T00:00:00Z |
Arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| start | string | The start date. This will accept a variety of inputs, details can be found here. | 2000-01-01T00:00:00Z |
| end | string | The end date. This will accept a variety of inputs, details can be found here. | 3000-01-01T00:00:00Z |
| timezone | string | The timezone to use. | UTC |
| format | string | The format to use. The complete list of specifiers can be found here. | "%+" |
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |
Supported constraints
| Name | Description |
|---|---|
| GreaterThan | The value must be greater than the value of another field. |
| IfNull | The value must only be non-null if another field is null. |
Digit
The Digit field type generates a random digit.
Schema
fields:
- name: Digit One
type: Digit
null_probability: 0.5
- name: Digit Two
type: Digit
constraints:
- type: IfNull
name: Digit One
Output
| Digit One | Digit Two |
|---|---|
| 1 | |
| 3 | |
| 8 | |
| 6 | |
| 4 |
Arguments
This field takes no arguments.
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |
Supported constraints
| Name | Description |
|---|---|
| IfNull | The value must only be non-null if another field is null. |
WeightedCategory
The WeightedCategory field type allows you to specify a list of choices, and an optional weight for each choice.
You can specify categories in-line or externally in a CSV file, with the CSV file being the preferred option.
Schema
# Load categories from CSV files.
categories:
- name: LETTERS
file: "data/LETTERS.csv"
- name: LETTERS_WEIGHTED
file: "data/LETTERS_WEIGHTED.csv"
fields:
# Simple choices, with equal probability.
- name: simple
type: WeightedCategory
null_probability: 0.5
args:
choices:
- "FOO"
- "bar"
# Simple choices, with weighted probability.
- name: simple_weighted
type: WeightedCategory
args:
choices:
- ["FOO", 1.0]
- ["bar", 0.5]
# Choices from a file, with equal probability.
- name: file
type: WeightedCategory
args:
from_category: LETTERS
# Choices from a file, with weighted probability.
- name: file_weighted
type: WeightedCategory
args:
from_category: LETTERS_WEIGHTED
# data/LETTERS.csv
LETTER
H
E
L
L
O
# data/LETTERS_WEIGHTED.csv
LETTER,WEIGHT
H,4
E,1
L,1
L,1
O,1
Output
| simple | simple_weighted | file | file_weighted |
|---|---|---|---|
| FOO | bar | L | |
| bar | FOO | O | |
| bar | H | H | |
| bar | FOO | E | |
| bar | H | L |
Arguments
| Name | Type | Description | Default |
|---|---|---|---|
| choices | list | A list of choices to select from. They can be specified both with and without weights as per the schema above. | [] |
| from_category | string | The name of a category to use. This is a reference to a name defined in the categories section. See Categories for more information. | "" |
Note:
choicesandfrom_categoryare mutually exclusive.
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |
Supported constraints
| Name | Description |
|---|---|
| IfNull | The value must only be non-null if another field is null. |
Bothify
The Bothify field generates a string by replacing selected symbols with random characters.
Schema
fields:
- name: Bothify One
type: Bothify
args:
format: "^^ ## ??"
null_probability: 0.5
- name: Bothify Two
type: Bothify
args:
format: "^^ ## ??"
constraints:
- type: IfNull
name: Bothify One
- name: Bothify Three
type: Bothify
args:
format: |
{%- set one = refs['Bothify One']['raw'] -%}
{%- set two = refs['Bothify Two']['raw'] -%}
{%- if one -%}
Bothify One: {{ one }}
{%- elif two -%}
Bothify Two: {{ two }}
{%- else -%}
{%- endif -%}
Output
| Bothify One | Bothify Two | Bothify Three |
|---|---|---|
| 53 09 UR | | Bothify One: 53 09 UR |
| 74 91 aR | | Bothify One: 74 91 aR |
| 35 79 jq | | Bothify One: 35 79 jq |
| | 85 89 bv | Bothify Two: 85 89 bv |
| 92 55 tY | | Bothify One: 92 55 tY |
| | 24 60 xq | Bothify Two: 24 60 xq |
Arguments
| Name | Type | Description | Default |
|---|---|---|---|
| format | string | The format of the string to generate. | |
Format
The format string is a string that contains symbols that will be replaced with random characters. The following symbols are supported:
| Symbol | Description |
|---|---|
| ^ | A random digit [1-9] |
| # | A random digit [0-9] |
| ? | A random letter [a-zA-Z] |
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |
Supported constraints
| Name | Description |
|---|---|
| IfNull | The value must only be non-null if another field is null. |
Nested
The Nested field allows you to generate data that is nested within other data. This is particularly useful when outputting data in JSON format.
Schema
fields:
- name: A Top
type: Nested
constraints:
- type: IfNull
name: B Top.B Nested
- name: B Top
type: Nested
fields:
- name: B Nested
type: Nested
null_probability: 0.5
- name: C List
type: Nested
args:
subtype: List
fields:
- name: C Nested 1
type: Digit
- name: C Nested 2
type: String
Output
[
{
"A Top": null,
"B Top": {
"B Nested": {}
},
"C List": [1, "reiciendis"]
}
]
Arguments
| Name | Type | Description | Default |
|---|---|---|---|
| subtype | string | The type of the nested field. One of Object, List. | Object |
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| fields | list | A list of child fields of any type. | [] |
Supported constraints
| Name | Description |
|---|---|
| IfNull | The value must only be non-null if another field is null. |
FormattedString
The FormattedString field allows you to generate a string that is formatted according to a template, see Templating. Templates are quite powerful and allow you to generate a wide variety of data. For example, you could generate a string that is a combination of a first name, last name and a random number, e.g. John Smith 1234.
Schema
fields:
- name: A
type: Digit
- name: B
type: FormattedString
args:
format: |
Doing math (A + 4): {{ refs['A'].raw + 4}}!
Accessing values: {{ refs['C'].raw }}!
Formatting dates: {{ refs['D'].raw | date(format="%Y-%m-%d") }}
constraints:
- type: IfNull
name: C
- name: C
type: Digit
null_probability: 0.5
- name: D
type: DateTime
Output
| A | B | C | D |
|---|---|---|---|
| 4 | | 4 | 2667-11-04T03:14:00+00:00 |
| 9 | Doing math (A + 4): 13! Accessing values: ! Formatting dates: 2603-05-24 | | 2603-05-24T15:36:00+00:00 |
| 0 | | 4 | 2312-12-06T02:40:00+00:00 |
| 0 | Doing math (A + 4): 4! Accessing values: ! Formatting dates: 2910-06-24 | | 2910-06-24T13:18:00+00:00 |
| 0 | Doing math (A + 4): 4! Accessing values: ! Formatting dates: 2807-02-22 | | 2807-02-22T18:16:00+00:00 |
Arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| format | string | The format of the string. See Templating for more info. | |
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |
Supported constraints
| Name | Description |
|---|---|
| IfNull | The value must only be non-null if another field is null. |
Array
The Array field type allows you to generate an array of a particular field. The array can be of varying length, but being an array, all values will be of the same type.
Schema
fields:
- name: A
type: Array
args:
min: 0
max: 10
field:
name: B
type: IntegerInRange
args:
min: 0
max: 10
- name: field_a
type: IntegerInRange
null_probability: 0.5
- name: array
type: Array
args:
min: 1
max: 4
field:
name: num
type: Digit
constraints:
- type: IfNull
name: field_a
Output
[
  { "A": [4, 9, 5, 2], "field_a": 5185946464695284972, "array": [null] },
  {
    "A": [5, 0, 2, 7, 0, 4, 3],
    "field_a": 7503849539415973306,
    "array": [null, null, null]
  },
  { "A": [1, 9, 0], "field_a": null, "array": [5, 6] }
]
Arguments
| Name | Type | Description | Default |
|---|---|---|---|
| min | int | The minimum length of the array. | 0 |
| max | int | The maximum length of the array. | 10 |
| field | Field | The field to generate the array of. |
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |
Supported constraints
| Name | Description |
|---|---|
| IfNull | The value must only be non-null if another field is null. |
Fakers
The Fakers field type generates fake data by passing directly through to the fake-rs crate.
This is mostly useful for generating your own categories if you don't want to hand craft the content.
Schema
fields:
- name: Fakers IP
type: Fakers
args:
subtype: IP
null_probability: 0.5
- name: Fakers Buzzword
type: Fakers
args:
subtype: Buzzword
constraints:
- type: IfNull
name: Fakers IP
- name: Fakers CC Number
type: Fakers
args:
subtype: CreditCardNumber
Output
| Fakers IP | Fakers Buzzword | Fakers CC Number |
|---|---|---|
| | Distributed | 5156398210874490 |
| 188.123.216.220 | | 372329793829177 |
| 155.111.76.88 | | 4374470709955 |
Arguments
| Name | Type | Description | Default |
|---|---|---|---|
| subtype | string | The type of fake data to generate. See Subtypes below. | |
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that this field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to this field. | [] |
Supported constraints
| Name | Description |
|---|---|
| IfNull | The value must only be non-null if another field is null. |
Subtypes
This uses the fake-rs crate under the hood. The following values are supported.
| Subtype |
|---|
| FirstName |
| LastName |
| Title |
| Suffix |
| Name |
| NameWithTitle |
| FreeEmailProvider |
| DomainSuffix |
| FreeEmail |
| SafeEmail |
| Username |
| IPv4 |
| IPv6 |
| IP |
| MACAddress |
| UserAgent |
| RfcStatusCode |
| ValidStatusCode |
| HexColor |
| RgbColor |
| RgbaColor |
| HslColor |
| HslaColor |
| Color |
| CompanySuffix |
| CompanyName |
| Buzzword |
| BuzzwordMiddle |
| BuzzwordTail |
| CatchPhase |
| BsVerb |
| BsAdj |
| BsNoun |
| Bs |
| Profession |
| Industry |
| CurrencyCode |
| CurrencyName |
| CurrencySymbol |
| CreditCardNumber |
| CityPrefix |
| CitySuffix |
| CityName |
| CountryName |
| CountryCode |
| StreetSuffix |
| StreetName |
| TimeZone |
| StateName |
| StateAbbr |
| SecondaryAddressType |
| SecondaryAddress |
| ZipCode |
| PostCode |
| BuildingNumber |
| Latitude |
| Longitude |
| Isbn |
| Isbn13 |
| Isbn10 |
| PhoneNumber |
| CellNumber |
| Time |
| Date |
| DateTime |
| FilePath |
| FileName |
| FileExtension |
| DirPath |
| Bic |
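Since output goes to standard output, Fakers pairs nicely with Categories: you can generate a category file once and reference it in later schemas. A quick sketch (the file names are illustrative):
cat > COMPANY.fodder.yaml <<EOF
fields:
  - name: NAME
    type: Fakers
    args:
      subtype: CompanyName
EOF
fodder -s COMPANY.fodder.yaml -n 50 -f csv > data/COMPANY.csv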
Duration
The Duration field calculates a duration between a start and an end value, with the output being in the specified unit (component of the duration). As this field does not generate a random value, it is most useful when referencing other fields.
Schema
fields:
- name: Age
type: Duration
args:
start: "{{ refs['Birthdate'].raw }}"
end: now
component: Years
constraints:
- type: IfNull
name: Seconds
- name: Birthdate
type: DateTime
args:
format: "%Y-%m-%d"
start: "1900-01-01T00:00:00Z"
end: "2010-01-01T00:00:00Z"
- name: Seconds
type: Duration
null_probability: .5
args:
start: now
end: 120s
Output
| Age | Birthdate | Seconds |
|---|---|---|
| 36 | 1986-09-11 | |
| 21 | 2001-11-24 | |
| | 1995-12-06 | 120 |
| | 1906-07-29 | 120 |
| | 1907-05-14 | 120 |
Arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| start | string | The start of the duration. This will accept a variety of inputs, details can be found here. | |
| end | string | The end of the duration. This will accept a variety of inputs, details can be found here. | |
| component | string | The component to use. This can be one of: Years, Months, Weeks, Days, Hours, Minutes, Seconds. | Seconds |
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |
Supported constraints
| Name | Description |
|---|---|
| IfNull | The value must only be non-null if another field is null. |
Map
The Map field is used to select a value from a map of key/value pairs. The maps are defined in the maps section of the schema and referenced by name in the Map field. They always refer to externally defined data.
Schema
categories:
- name: POSTCODE
file: "data/POSTCODE.csv"
maps:
- name: POSTCODE_SUBURB_MAP
file: "data/POSTCODE_SUBURB.csv"
fields:
- name: Suburb
type: Map
args:
key: Postcode
from_map: POSTCODE_SUBURB_MAP
null_probability: 0.5
- name: Postcode
type: WeightedCategory
args:
from_category: POSTCODE
- name: Suburb2
type: Map
args:
key: Postcode2
from_map: POSTCODE_SUBURB_MAP
default: "N/A"
constraints:
- type: IfNull
name: Suburb
- name: Postcode2
type: Bothify
args:
format: "####"
# data/POSTCODE.csv
PCODE
6157
6000
6100
6101
6530
# data/POSTCODE_SUBURB.csv
PCODE,SUBURB
6157,Palmyra
6000,Perth
6100,Victoria Park
6101,East Victoria Park
6530,Geraldton
Output
| Suburb | Postcode | Suburb2 | Postcode2 |
|---|---|---|---|
| East Victoria Park | 6101 | | 4128 |
| Perth | 6000 | | 4642 |
| | 6000 | N/A | 4847 |
| | 6101 | N/A | 9223 |
| Victoria Park | 6100 | | 1988 |
Arguments
| Name | Type | Description | Default |
|---|---|---|---|
| key | string | The name of the field to use as the key in the map. | |
| from_map | string | The name of the map to use. This is a reference to a name defined in the maps section. See Maps for more information. | |
| default | string | The value to use when the key is not found in the map. | |
Field arguments
| Name | Type | Description | Default Value |
|---|---|---|---|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |
Supported constraints
| Name | Description |
|---|---|
| IfNull | The value must only be non-null if another field is null. |
Constraints
At some point, you will want to generate data that is representative of the real world. For example, you may want to ensure that the CreatedAt field is always less than the ModifiedAt field. This is where constraints come in.
There are a number of constraints that can be applied to fields; note that not all field types support all constraints.
At present, the following constraints are available:
- GreaterThan
- IfNull
Usage
Constraints are applied directly to fields in the schema. The field that they are defined on is the field that the constraint will be applied to.
GreaterThan
The GreaterThan constraint ensures that the value of the field is greater than the value of another field.
The following will result in the B field being greater than the A field.
fields:
- name: A
type: Digit
- name: B
type: Digit
constraints:
- type: GreaterThan
name: A
IfNull
The IfNull constraint ensures that the value of the field is only generated if the value of another field is null. This has the effect of making two fields mutually exclusive.
The following will result in the B field being generated only if the A field is null, which should result in each field being populated 50% of the time.
fields:
- name: A
type: Digit
null_probability: 0.5
- name: B
type: Digit
constraints:
- type: IfNull
name: A
External Data
There are multiple types of external data that can be used to assist in the generation of data. These include:
- Categories, lists of values to select from
- Maps, key/value lookups
You can generate data using one schema and feed it into another schema. This is useful for creating multiple tables of data that are related to each other. For example, if you reference the same list of IDs in multiple tables, you can generate the list of IDs once and then use it in each of those tables, as sketched below.
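A minimal sketch of that workflow (the schema and file names are illustrative):
# Generate the shared list of IDs once.
fodder -s schemas/ID.fodder.yaml -n 100 -f csv > data/ID.csv
# Each related schema references data/ID.csv as a category,
# so the tables share the same ID space.
fodder -s schemas/ORDERS.fodder.yaml -n 1000 -f csv > tables/ORDERS.csv
fodder -s schemas/SHIPMENTS.fodder.yaml -n 1000 -f csv > tables/SHIPMENTS.csv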
Categories
Categories are a way to define a list of things that can be referenced by name in some fields. They are defined in the categories section of the schema.
The contents of the file are expected to be in CSV format. The first row is expected to be a header row by default. The first column is expected to be the value to be used. The second column is expected to be the weight to be used. If the second column is not present, the weight is assumed to be 1.0.
Schema
The below schema will make the contents of data/LETTERS.csv available as a category called LETTERS.
categories:
- name: LETTERS
file: "data/LETTERS.csv"
header: true
Arguments
| Name | Type | Description | Default |
|---|---|---|---|
| name | string | The name of the category. This is the name by which it can be referenced in other parts of the schema. | |
| file | string | The path to the file containing the category data. | |
| header | bool | Whether the first row of the file is a header row. | true |
Example field using a category
categories:
- name: LETTERS
file: "data/LETTERS.csv"
fields:
- name: letter
type: WeightedCategory
args:
from_category: LETTERS
Example category file
# data/LETTERS.csv
LETTER
H
E
L
L
O
Example category file with weights
# data/LETTERS_WEIGHTED.csv
LETTER,WEIGHT
H,4
E,1
L,1
L,1
O,1
Maps
Maps are a way to define a list of things that can be referenced by name in some fields. They are defined in the maps section of the schema.
The contents of the file are expected to be in CSV format. The first row is expected to be a header row by default. The first column is expected to be the key to be used. The second column is expected to be the value to be used.
This allows for interesting use cases such as mapping consistently between randomly selected categories and their corresponding attributes.
Note: There is a small amount of awkwardness here in that you will probably want to select a category from the first column of the map and then later map that category to the value in a separate column.
The limitation we currently have is that to do this you will need to copy the first column from the map file and create a new category file with the same contents.
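Until that limitation is lifted, a standard Unix tool can do the copying for you. For example (a sketch, assuming the map's first column, header included, is what you want as the category):
cut -d, -f1 data/POSTCODE_SUBURB.csv > data/POSTCODE.csv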
Schema
The below schema will make the contents of data/POSTCODE_SUBURB.csv available as a map called POSTCODE_SUBURB_MAP.
maps:
- name: POSTCODE_SUBURB_MAP
file: "data/POSTCODE_SUBURB.csv"
header: true
Arguments
| Name | Type | Description | Default |
|---|---|---|---|
| name | string | The name of the map. This is the name by which it can be referenced in other parts of the schema. | |
| file | string | The path to the file containing the map data. | |
| header | bool | Whether the first row of the file is a header row. | true |
Example field using a map
categories:
- name: POSTCODE
file: "data/POSTCODE.csv"
maps:
- name: POSTCODE_SUBURB_MAP
file: "data/POSTCODE_SUBURB.csv"
fields:
- name: Suburb
type: Map
args:
key: Postcode
from_map: POSTCODE_SUBURB_MAP
- name: Postcode
type: WeightedCategory
args:
from_category: POSTCODE
Example map file
# data/POSTCODE_SUBURB.csv
POSTCODE,SUBURB
7000,Hobart
6000,Perth
5000,Adelaide
4000,Brisbane
3000,Melbourne
2000,Sydney
1000,Canberra
Examples
Note: This section attempts to document examples in the repository. It in-lines the code where relevant, but the code is the source of truth, so this section may be out of date.
Basic schema examples
The field documentation is a good source of example schemas; these are located in the fields section.
There are also more examples in the repository. These are located in the schemas/ directory.
'Real' usage examples
A simple example
This is a simple example of how to use fodder to generate some data. It contains a schema.yaml file and some hand crafted input data in data/.
Schema
maps:
- name: BUYER_NAME
file: data/BUYER_NAME.csv
- name: SELLER_NAME
file: data/SELLER_NAME.csv
categories:
- name: BUYER
file: data/BUYER.csv
- name: SELLER
file: data/SELLER.csv
fields:
# SALESID INTEGER Primary key, a unique ID value for each row. Each row represents a sale of one or more tickets for a specific event, as offered in a specific listing.
- name: SALESID
type: Bothify
args:
format: "#########"
# SELLERID INTEGER Foreign-key reference to the USERS table (the user who listed the tickets).
- name: SELLERID
type: WeightedCategory
args:
from_category: SELLER
# SELLERNAME VARCHAR(50) The name of the user who listed the tickets.
- name: SELLERNAME
type: Map
args:
from_map: SELLER_NAME
key: SELLERID
# BUYERID INTEGER Foreign-key reference to the USERS table (the user who bought the tickets).
- name: BUYERID
type: WeightedCategory
args:
from_category: BUYER
# BUYERNAME VARCHAR(50) The name of the user who bought the tickets.
- name: BUYERNAME
type: Map
args:
from_map: BUYER_NAME
key: BUYERID
# QTYSOLD SMALLINT The number of tickets that were sold, from 0 to 9. (A maximum of 8 tickets can be sold in a single transaction.)
- name: QTYSOLD
type: Digit
# PRICEPAID DECIMAL(8,2) The total price paid for the tickets, such as 75.00 or 488.00. The individual price of a ticket is PRICEPAID/QTYSOLD.
- name: PRICEPAID
type: Bothify
args:
format: "^##.##"
# SALETIME TIMESTAMP The full date and time when the sale was completed, such as 2008-05-24 06:21:47.
- name: SALETIME
type: DateTime
args:
start: 2023-01-01T00:00:00Z
end: 2023-01-31T00:00:00Z
format: "%Y-%m-%d %H:%M:%S"
Data
# data/BUYER.csv
ID
0000001
0000002
0000003
0000004
0000005
# data/BUYER_NAME.csv
ID,NAME
0000001,JOHN DOE
0000002,JOHN SMITH
0000003,ALICE DOE
0000004,SALLY SMITH
0000005,MARY JONES
# data/SELLER.csv
ID
0000001
0000002
0000003
0000004
0000005
# data/SELLER_NAME.csv
ID,NAME
0000001,COMPANY INC.
0000002,LOL INC.
0000003,123 INC.
0000004,ANOTHER INC.
0000005,ZZZ PTY LTD.
Output
# tables/SALES.csv
SALESID,SELLERID,SELLERNAME,BUYERID,BUYERNAME,QTYSOLD,PRICEPAID,SALETIME
541725862,0000002,JOHN SMITH,0000003,123 INC.,0,563.54,2023-01-09 21:50:00
751725001,0000004,SALLY SMITH,0000002,LOL INC.,3,889.20,2023-01-05 18:58:00
868507369,0000004,SALLY SMITH,0000004,ANOTHER INC.,7,764.86,2023-01-06 05:00:00
306917643,0000002,JOHN SMITH,0000005,ZZZ PTY LTD.,9,553.21,2023-01-21 17:47:00
731805330,0000005,MARY JONES,0000005,ZZZ PTY LTD.,6,183.24,2023-01-23 19:52:00
A more complex example
This is a more complex example of how to use fodder to generate some data. It contains multiple schema files in the schemas/ directory, uses the data/ directory as a temporary location for data that is generated by one schema and consumed by another, and finally outputs the data to the tables/ directory.
The above is all controlled by two short bash scripts, initialise and gen. These two scripts utilise the fodder CLI along with some standard Unix tools to generate the data.
initialise
This script is used to initialise the data by generating the categories that will remain static in future runs. In a way this can be thought of as generating our DIM tables in a data warehouse.
#!/usr/bin/env bash
#
# Generate our primary tables
set -eo pipefail
if [ -z "$1" ]; then
ROWS=5;
else
ROWS="$1";
fi
echo "GENERATING SELLER DATA"
fodder -s schemas/ID.fodder.yaml -n "$ROWS" -f csv > data/SELLER_ID.csv
fodder -s schemas/COMPANY.fodder.yaml -n "$ROWS" -f csv > data/SELLER_COMPANY.csv
paste -d "," data/SELLER_ID.csv data/SELLER_COMPANY.csv > tables/SELLER_ID_COMPANY.csv
cp data/SELLER_ID.csv tables/SELLER_ID.csv
rm data/SELLER_ID.csv data/SELLER_COMPANY.csv
echo "GENERATING BUYER DATA"
fodder -s schemas/ID.fodder.yaml -n "$ROWS" -f csv > data/BUYER_ID.csv
fodder -s schemas/COMPANY.fodder.yaml -n "$ROWS" -f csv > data/BUYER_COMPANY.csv
paste -d "," data/BUYER_ID.csv data/BUYER_COMPANY.csv > tables/BUYER_ID_COMPANY.csv
cp data/BUYER_ID.csv tables/BUYER_ID.csv
rm data/BUYER_ID.csv data/BUYER_COMPANY.csv
gen
The gen script is used to generate the main data. This is the data that will change each time the script is run. In a way this can be thought of as generating our FACT tables in a data warehouse.
In the scenario that we are generating data for a data warehouse, we would run the initialise script once and then run the gen script each time we want to generate new data.
This could potentially be automated by running the gen script as a cron job, or similar (more complicated) mechanism.
#!/usr/bin/env bash
#
# Generate latest sales data
set -eo pipefail
if [ -z "$1" ]; then
ROWS=20;
else
ROWS="$1";
fi
echo "GENERATING SALES"
fodder -s schemas/SALES.fodder.yaml -f csv -n "$ROWS" > tables/SALES.csv
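As noted above, scheduling gen with cron is one way to automate this. A sketch of a crontab entry that generates 1000 fresh rows every hour (the project path is a placeholder):
0 * * * * cd /path/to/project && ./gen 1000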
Templating
Templating language
Where templating has been made available in fodder, it uses Tera as the templating language. Tera is a template engine for Rust that is based on Jinja2. It is a full-featured template engine with a lot of functionality. For more information on Tera, please see the Tera documentation.
References
When we make reference to a field in a template, we are referring to the name of the field. For example, if we have a field called Name and we want to reference it in a template, we would use {{ refs['Name'].raw }}. This will return the raw value of the field. If we wanted to use the formatted value, we would use {{ refs['Name'].formatted }}. Depending on your particular use-case, you may want to use one or the other.
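A short sketch of the difference (the field names are illustrative; presumably .raw is the unformatted underlying value, as used for date arithmetic elsewhere in these docs, while .formatted honours the field's format argument):
fields:
  - name: When
    type: DateTime
    args:
      format: "%Y-%m-%d"
  - name: Message
    type: FormattedString
    args:
      format: "raw {{ refs['When'].raw }} vs formatted {{ refs['When'].formatted }}"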
Linting and Autocompletion
This is a way to validate your fodder schema files and to provide helpful hints to you as you write them, such as what fields are available and what arguments they take.
Setting up linting in your IDE
Output the JSON schema to the root of your project; this is what is used to provide linting and autocompletion.
fodder -d > lint.json
VSCode
Configure your IDE to use this JSON schema. For VSCode, ensure you have the correct extension installed (YAML) and add the following to your settings.json:
{
"yaml.schemas": {
"./lint.json": "*.fodder.yaml"
}
}
Neovim
If you are using Neovim with nvim-lspconfig and lazy.nvim you can use the below snippet. If you aren't, you should still be able to use the JSON schema to configure your IDE - but you will have to go on that adventure by yourself!
return {
{
"neovim/nvim-lspconfig",
opts = {
servers = {
yamlls = {
settings = {
yaml = {
schemas = {
["./lint.json"] = "*.fodder.yaml",
},
},
},
},
},
},
},
}
Provided you have set everything up as detailed above and named your fodder schema file with the .fodder.yaml suffix, you should get linting and autocompletion!