Dataset configuration can be provided in a YAML file or as a nested list. Below you can find a detailed description of possible options.
A single YAML file can include multiple data frames. The entry name of each one is used as the name of the data frame when generating data.
    first_data_frame:
      ...
    second_data_frame:
      ...
    third_data_frame:
      ...
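The same structure can also be written as a nested list. The sketch below is a base R illustration only: the column definitions are borrowed from examples later in this document, and how the list is handed to the generator is not covered here.

```r
# Nested-list counterpart of a YAML configuration with two data frames.
# The column definitions are illustrative, taken from later examples.
configuration <- list(
  first_data_frame = list(
    columns = list(
      id_column = list(type = "id", start = 2),
      integer_column = list(type = "integer", min = 2, max = 10)
    ),
    default_size = 10
  ),
  second_data_frame = list(
    columns = list(
      boolean_column = list(type = "boolean")
    )
  )
)

# Top-level names act as data frame names, column entry names as column names.
names(configuration)
names(configuration$first_data_frame$columns)
```

Each level of YAML nesting becomes one level of `list()`, and each key becomes a list element name.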
Data frame configuration includes two sections:
- columns - where you can describe columns of your data frame.
- default_size - optional value that describes default size of the data frame.

Each column of your data frame should be described in a separate entry in the columns section. The entry name will be used as the column name.
Currently there are three major types of columns implemented: built-in columns (basic and special predefined types), custom columns, and calculated columns.
The type of a column is set by choosing a proper type value in the column description. Check the following sections for more details.
The order of columns will be the same as the order of entries in the configuration.
Basic column types. For an example YAML configuration check this
Random integers from a range
Parameters:
- type: integer - column type
- unique (optional, default: FALSE) - boolean, should values be unique
- min (optional, default: 0) - integer, minimum value to occur in the column
- max (optional, default: 999999) - integer, maximum value to occur in the column

Example:

    data_frame:
      columns:
        integer_column:
          type: integer
          min: 2
          max: 10
Random float numbers from a range
Parameters:
- type: numeric - column type
- unique (optional, default: FALSE) - boolean, should values be unique
- min (optional, default: 0) - numeric, minimum value to occur in the column
- max (optional, default: 999999) - numeric, maximum value to occur in the column

Example:

    data_frame:
      columns:
        numeric_column:
          type: numeric
          min: 2.12
          max: 10.3
Random string that follows given pattern
Parameters:
- type: string - column type
- unique (optional, default: FALSE) - boolean, should values be unique
- length (optional, default: NULL) - integer, string length. If NULL, string length will be random (see next parameters)
- min_length (optional, default: 1) - integer, minimum length if length is random
- max_length (optional, default: 15) - integer, maximum length if length is random
- pattern (optional, default: "[A-Za-z0-9]") - string pattern, for details check this

Example:

    data_frame:
      columns:
        string_column:
          type: string
          length: 3
          pattern: "[ACGT]"
Random boolean
Parameters:
- type: boolean - column type

Example:

    data_frame:
      columns:
        boolean_column:
          type: boolean
Column with elements from a set
Parameters:
- type: set - column type
- set (optional, default: NULL) - set of possible values. If NULL, a random set will be used
- set_type (optional, default: NULL) - type of random set, can be "integer", "numeric" or "string"
- set_size (optional, default: NULL) - integer, size of random set

Example:

    data_frame:
      columns:
        set_column_one:
          type: set
          set: ["aardvark", "elephant", "hedgehog"]
        set_column_two:
          type: set
          set_type: integer
          set_size: 3
          min: 2
          max: 10
Column with dates
Parameters:
- type: date - column type
- min_date - beginning of the time interval to sample from
- max_date - end of the time interval to sample from
- format (optional, default: NULL) - date format, for details check this

Example:

    data_frame:
      columns:
        date_column:
          type: date
          min_date: 2012-03-31
          max_date: 2015-12-23
Column with times
Parameters:
- type: time - column type
- min_time (optional, default: "00:00:00") - beginning of the time interval to sample from
- max_time (optional, default: "23:59:59") - end of the time interval to sample from
- resolution (optional, default: "seconds") - time resolution, one of "seconds", "minutes", "hours"

Example:

    data_frame:
      columns:
        time_column:
          type: time
          min_time: "12:23:00"
          max_time: "15:48:32"
          resolution: "seconds"
Column with datetimes
Parameters:
- type: datetime - column type
- min_date - beginning of the time interval to sample from
- max_date - end of the time interval to sample from
- date_format (optional, default: NULL) - date format, for details check this
- min_time (optional, default: "00:00:00") - beginning of the time interval to sample from
- max_time (optional, default: "23:59:59") - end of the time interval to sample from
- time_resolution (optional, default: "seconds") - time resolution, one of "seconds", "minutes", "hours"
- tz (optional, default: "UTC") - time zone name

Example:

    data_frame:
      columns:
        time_column:
          type: datetime
          min_date: 2012-03-31
          max_date: 2015-12-23
          min_time: "12:23:00"
          max_time: "15:48:32"
          time_resolution: "seconds"
Special predefined types of columns. For an example YAML configuration check this
Id column - ordered integer that starts from defined value (default: 1).
Parameters:
- type: id - column type
- start (optional, default: 1) - first value

Example:

    data_frame:
      columns:
        id_column:
          type: id
          start: 2
Column filled with values that follow a given statistical distribution. You can use one of the distributions available here. You can use a function name (e.g. rnorm) or a regular distribution name (e.g. "Normal"). For available names, check this file.
Parameters:
- type: distribution - column type
- distribution_type - distribution name
- ... - all arguments required by the distribution function

Example:

    data_frame:
      columns:
        normal_distribution:
          type: distribution
          distribution_type: Gaussian
        bernoulli_distribution:
          type: distribution
          distribution_type: binomial
          size: 1
          prob: 0.5
        poisson_distribution:
          type: distribution
          distribution_type: Poisson
          lambda: 3
        beta_distribution:
          type: distribution
          distribution_type: rbeta
          shape1: 20
          shape2: 30
        cauchy_distribution:
          type: distribution
          distribution_type: Cauchy-Lorentz
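The extra keys in each entry line up with the arguments of R's random generation functions from the stats package. As a rough illustration, assuming the keys are forwarded unchanged and the number of rows is supplied as the first argument (the forwarding mechanism and the name-to-function mapping are assumptions, not shown in this document), the entries in the example would correspond to calls like these:

```r
set.seed(42)  # only to make the sketch reproducible
n <- 5        # stands in for the requested number of rows

normal_values    <- rnorm(n)                           # no extra arguments: mean = 0, sd = 1
bernoulli_values <- rbinom(n, size = 1, prob = 0.5)    # binomial with a single trial
poisson_values   <- rpois(n, lambda = 3)
beta_values      <- rbeta(n, shape1 = 20, shape2 = 30)
cauchy_values    <- rcauchy(n)                         # defaults: location = 0, scale = 1
```

Arguments that have defaults in the underlying function, such as mean and sd for rnorm, can be left out of the configuration entry.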
There are two levels of custom generator that can be used. You can provide a function that generates a single value or a function that provides a whole column. For examples check this configuration and this R script with functions.
Generate column values using custom function available in your environment. Function should return a single value.
Parameters:
- type: custom - column type
- custom_generator - name of the function that will provide values

Example:

    # Draw two values from the input vector and join them with "_".
    return_sample_paste <- function(vector_of_values) {
      values <- sample(vector_of_values, 2)
      paste(values, collapse = "_")
    }

    data_frame:
      columns:
        custom_column:
          type: custom
          custom_generator: return_sample_paste
          vector_of_values: ["a", "b", "c", "d"]
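To see what such a column could look like, the single-value generator can be called directly. This is a minimal sketch that assumes the generator is invoked once per row; the loop is illustrative and not the package's own code.

```r
# Single-value generator from the example: two sampled values joined by "_".
return_sample_paste <- function(vector_of_values) {
  values <- sample(vector_of_values, 2)
  paste(values, collapse = "_")
}

# Assumption: one call per row; size stands in for the requested number of rows.
size <- 4
custom_column <- vapply(
  seq_len(size),
  function(i) return_sample_paste(vector_of_values = c("a", "b", "c", "d")),
  character(1)
)
custom_column
```

Each element is an independent draw, so values may repeat across rows.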
Generate a column using a custom function available in your environment. The function should accept the argument size and return a vector of length equal to it.
Parameters:
- type: custom_column - column type
- custom_column_generator - name of the function that will generate the column. The function must accept the argument size.

Example:

    # Repeat a single value size times to fill the whole column.
    return_repeated_value <- function(size, value) {
      rep(value, times = size)
    }

    data_frame:
      columns:
        custom_column:
          type: custom_column
          custom_column_generator: return_repeated_value
          value: "Ask me about trilobites!"
Calculate columns that depend on other columns. For examples check this configuration and this R script with functions.
Parameters:
- type: calculated - column type
- formula - calculation that has to be performed to obtain the column

In general, a formula can be a simple expression or a call to a more complex function. In both cases the formula has to include the names of the columns required for the calculation. When using a function, make sure that it returns a vector of the same size as its inputs.

Example:

    # Return TRUE for every element that is at least 10.
    check_column <- function(column) {
      purrr::map_lgl(column, ~.x >= 10)
    }

    data_frame:
      columns:
        basic_column:
          type: integer
          min: 1
          max: 10
        second_basic_column:
          type: integer
          min: 1
          max: 10
        calculated_column:
          type: calculated
          formula: basic_column + second_basic_column
        second_calculated_column:
          type: calculated
          formula: check_column(calculated_column)
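The two formulas from the example can be traced by hand on a few values. The sketch below uses hypothetical input vectors in place of the generated integer columns and a base R stand-in for the purrr-based check_column; it illustrates the arithmetic only, not how formulas are evaluated internally.

```r
# Hypothetical values standing in for the two generated integer columns.
basic_column        <- c(1L, 4L, 9L, 10L)
second_basic_column <- c(2L, 6L, 1L, 10L)

# Base R stand-in for check_column: vectorised comparison, same length as input.
check_column <- function(column) {
  column >= 10
}

calculated_column        <- basic_column + second_basic_column
second_calculated_column <- check_column(calculated_column)

calculated_column         # 3 10 10 20
second_calculated_column  # FALSE TRUE TRUE TRUE
```

Note that second_calculated_column depends on calculated_column, so the order of entries in the configuration matters for this example.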
A data frame can have a default number of rows that is returned if the size argument is not provided. The default size can be either a fixed integer or a set of arguments for the random_integer function.
Example:
    data_frame:
      columns:
        ...
      default_size: 10
Alternatively, default_size can hold arguments for the random_integer function. The result can be a static value (if static: TRUE is provided) or a random number generator. A static value is drawn just once, and that number is reused whenever data is refreshed without providing a specific size.

Example:
    random_number_of_rows:
      columns:
        ...
      default_size:
        arguments:
          min: 10
          max: 20
    static_random_number_of_rows:
      columns:
        ...
      default_size:
        arguments:
          min: 5
          max: 10
        static: TRUE
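The difference between a static and a non-static random default size can be sketched in base R. The helper make_default_size below is hypothetical and written only for illustration; it is not the random_integer function and does not claim to match the actual implementation.

```r
# Hypothetical helper: returns a function that yields the default number of rows.
make_default_size <- function(min, max, static = FALSE) {
  # Draw one integer uniformly from min..max (inclusive).
  draw <- function() min + sample.int(max - min + 1, 1) - 1
  if (static) {
    fixed <- draw()   # drawn once, then reused on every call
    function() fixed
  } else {
    draw              # drawn again on every call
  }
}

random_size <- make_default_size(min = 10, max = 20)
static_size <- make_default_size(min = 5, max = 10, static = TRUE)

random_size()                    # a value between 10 and 20, may differ per call
static_size() == static_size()   # TRUE
```

With static set to TRUE, every refresh without an explicit size yields the same number of rows; without it, the number of rows may change on each refresh.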
For sample YAML configuration check this.
A data frame can be arranged by columns by providing a list of column names in the arrange field.
Example:
    data_frame:
      columns:
        a:
          ...
        b:
          ...
        c:
          ...
        d:
          ...
      arrange: [a, c]
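As a rough picture of what arranging by a and then c means, here is a base R sketch on a small hand-made data frame. It assumes ascending order and uses order() purely for illustration; how the sorting is actually performed is not shown in this document.

```r
# Hand-made data frame with the same column names as in the example.
df <- data.frame(
  a = c(2, 1, 2, 1),
  b = c("w", "x", "y", "z"),
  c = c(1, 2, 0, 1),
  d = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Sort by a first, then by c within equal values of a.
arranged <- df[order(df$a, df$c), ]
arranged$b  # "z" "x" "y" "w"
```

Columns not listed in arrange (here b and d) do not influence the row order.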