Documenting and Exchanging Simulation Specifications: A Language-Agnostic Approach

author:

Alan G. Isaac

organization:

American University

date:

2023-06-29

abstract:

The documentation of simulation experiments should ensure their replicability. Two crucial components common to many simulation experiments require particularly detailed documentation: baseline parameterization and experimental design. This paper explores how well the TOML file format meets these needs. We demonstrate how to use this format to produce easily human-readable documentation of baseline parameterizations that additionally support automated, programming-language agnostic information exchange. With some qualifications, the TOML format also proves useful for documenting certain experimental designs.

Introduction and Background

The production of adequate documentation for simulation models and their associated experimental designs is an evolving challenge. In principle, documentation should expedite the sharing of models and facilitate the replication of simulation experiments. This paper focuses on two components of simulation research that require particularly careful documentation: baseline parameterization and experimental design. It shows how to provide this documentation in the TOML file format, which is a language-agnostic format for information exchange. The format proves particularly useful for the documentation of a baseline parameterization, but it may also help with the documentation of experimental designs.

The TOML Format

The TOML file format was developed for use in configuration files. Despite its youth—the formal specification was finalized in 2021—it is already widely used. The format of simple TOML files resembles the older and much more familiar INI format—another human-readable plain-text file format. In contrast to the INI format, however, a formal specification underpins TOML. The format is also more flexible, and it supports type inference [TOML-2021]. These features have made the TOML format a standard replacement for INI files, particularly in the role of project configuration files.

As argued in the next section, these same features make TOML suitable for documenting the baseline parameterizations of simulation models and—to a more limited extent—the accompanying experimental designs. As a particular virtue, since TOML libraries already exist in dozens of programming languages, TOML facilitates the language-agnostic exchange of detailed simulation documentation.

Background: The INI Format

The benefits of the TOML format are perhaps most easily understood in the context of the INI format, a long-standing, widely used, plain-text file format. This format is associated with the configuration files of early Microsoft operating systems and applications. It was quickly adapted to other configuration needs, due to its stark simplicity for both humans and computers to read and write. Unfortunately, Microsoft did not provide a formal specification. Indeed, current Microsoft documentation refers to the Wikipedia entry for the format, and that page discusses many variants.

Due to the lack of a formal specification, characterizations of the INI format refer to standard practices rather than to a standards document. Nevertheless, generalizations about the format are possible. Most fundamentally, an INI file is a collection of keys and values, divided into sections. At the most basic level, a key and its value occur on a single line, separated by an equal sign.

key01=foo
key02=10
key03=10.0

An obvious attraction of the INI format is its simplicity and readability. When confronted with such a file, a programmer will tend to read the values as representing different data types: a string (foo), an integer (10), and a float (10.0). However, INI parsers conventionally treat all values as strings. While in principle a parser might provide some support for inferring the type of the values, there is no INI specification that requires it. [1]
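
The string-only behavior is easy to observe with the configparser module from the Python standard library. The following is a minimal sketch rather than a statement about every INI parser. It assumes a [params] section header, because configparser rejects keys that appear outside a section, and the names ini_text and ini_values are purely illustrative.

```python
import configparser

# The INI listing above, placed under an assumed section header.
ini_text = """
[params]
key01=foo
key02=10
key03=10.0
"""

parser = configparser.ConfigParser()
parser.read_string(ini_text)

# Every value comes back as a string, whatever it looks like.
ini_values = dict(parser["params"])
for name, val in ini_values.items():
    print(name, repr(val), type(val).__name__)

# Any conversion must be requested explicitly by the programmer.
as_int = parser.getint("params", "key02")
print(as_int, type(as_int).__name__)
```

With this parser, the burden of knowing that key02 is meant to be an integer falls on whoever reads the file.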

TOML Goals

Three major features of the INI format are ease of creation, human readability, and simplicity for parsing. Common alternatives to the INI format, such as JSON or YAML, are substantially more complicated and typically more difficult to read. (See the appendix for a comparison.) One goal of the TOML format is to be nearly as easy as the INI format for humans to read and write. Yet it additionally strives to be unambiguously specified, extremely flexible, and supportive of type inference. The careful formal specification along with the emergence of TOML test suites supports the goal of an unambiguous specification [TOML-2021].

TOML vs INI

For the purposes of this paper, the two core deviations of the TOML format from the INI format are the case sensitivity of keys and the support for type inference. While INI parsers typically assume that keys are not case sensitive, the TOML specification requires case sensitivity. Additionally, the TOML specification requires the use of TOML syntax to indicate the intended type of values. For example, to support type inference, TOML strings must be quoted. Very roughly speaking, an uncommented INI file that uses only bare keys and section names can be parsed as a TOML file once string values are quoted.

One additional deviation of TOML from INI, which might affect some users, is that it removes any ambiguity about the file encoding by specifying it. All TOML files must be UTF-8 encoded Unicode [Unicode-2019-v12.1]. There are a few additional noticeable deviations. Most obvious is the choice of comment character and the support of half-line comments. While the conventional INI comment-line marker is a semicolon, in TOML files it is specified to be an octothorpe (#). INI parsers typically treat a beginning-of-line semicolon as marking a comment line, although some support the octothorpe as a comment symbol. INI parsers generally do not support half-line comments. Finally, some INI parsers allow use of a colon instead of an equal sign to associate keys with values. This is not allowed by the TOML specification.

Type Inference

To conform with the TOML specification, a parser must infer the types of values from their syntax. The supported TOML types are String, Integer, Float, Boolean (true or false), Datetime, Array, and Table. The specification describes the inference rules, and they are unsurprising. For example, compare the following example to the INI example above.

key01='foo'
key02=10
key03=10.0
key04=['foo', true, [10, 10.0]]

If we apply a typical INI parser to this example, the resulting keys and values are all strings. A TOML parser produces very different results. The value 'foo' represents a literal String, since quoted text is always parsed as a string in the TOML format. The value true represents a literal Boolean, which must be lower case in the TOML format. The value 10 is parsed as a literal Integer and the value 10.0 is parsed as a literal Float. [2] Additionally, TOML is capable of representing flexible arrays explicitly. Items in an array are bracketed and comma-separated, and they may be of mixed types (including Array and Table). So ['foo', true, [10, 10.0]] is parsed as an array containing a string, a boolean, and an array, where the nested array contains an integer and a float. (The datatype used to represent this will of course be language-specific.)

TOML for Baseline Documentation and Information Exchange

This section demonstrates that for many simulation parameterizations, the TOML format easily supports a simple yet language-agnostic exchange of information. Furthermore, appropriately structured TOML files render this documentation programmatically accessible.

The Concept of a Baseline Parameterization

Researchers recognize that a computational simulation model is essentially a somewhat complicated computational function. Running the simulation transforms inputs, in the form of values for the model parameters, into outputs, in the form of recorded data. In the absence of deliberate randomization or variations in exogenous input data, this transformation should be fully deterministic. When combined with hardware and software specifications, a specified parameterization should permit exact replication. Ideally, the focal simulation outputs will be exactly reproducible even when there is substantial variation in the hardware chosen, the implementation language, or the simulation toolkit. This desirable replicability is facilitated by sharing the model parameterization in a format that is portable across platforms.

Documenting a Parameterization

All too often, however, the parameterization of a published simulation model must be prised from the body of the publication. Often the tables in published papers contain only partial details, and sometimes crucial details are pushed into footnotes or figure notes. This is one important reason why access to the original source code is so often needed for exact replication of a simulation [Wilensky.Rand-2007-JASSS].

Desirable Features of Documentation

Even access to source code is not a panacea. For one thing, code is not a lingua franca: few people know dozens of programming languages or simulation toolkits. A researcher may be interested in a model parameterization yet have no familiarity with the code chosen for implementation. Ideally, the documentation of a model’s parameterization would have the following key features.

Expressive:

Documentation should be easily written, even by nonprogrammers, not just by machines.

Understandable:

Documentation should be easily read, even by nonprogrammers, not just by trained programmers or machines.

Unambiguous:

The meaning of the documentation should be constrained by a publicly available specification.

Technically informative:

The specification should determine the intended types of parameter values.

Maintainable:

The parameterization should be maintained only where it is documented and not in additional locations.

Portable:

Documentation should facilitate information exchange by being agnostic about the programming language used to access the parameterization.

Maintainability is a crucial consideration. This feature is sometimes referred to as having a single source of truth or as implementing the DRY principle [Hunt.Thomas-1999-AWP]. It might seem that maintainability requires that the single source of truth for a parameterization be in the source code, which in turn would seem to conflict with the goal of portability. However, the following discussion demonstrates that there need not be any conflict.

Illustration via a Simple ABM

For the purpose of illustration, consider developing a baseline parameterization for a completely trivial multi-agent model. This will be a simplified variant of a well-known textbook model, which this paper calls Gift World. [3] Although the model has only introductory pedagogical uses, its utter simplicity renders transparent the broad utility of TOML for documentation of baseline parameterizations.

In the conceptual model, a population of agents begins with identical initial wealth. The population remains fixed over time, but individual wealth is changed by gift-giving behavior. Each period, each agent picks a completely random agent to be a gift recipient. Then each agent who has positive wealth gives one unit of wealth to the recipient. A computational implementation of this conceptual model must specify two model parameters: the number of agents (nAgents), and the initial wealth of each agent (initialWealth). These parameters are positive integers.

Digression on Simulation Parameters

The boundary between a core model specification and model parameters is porous. For example, the gift size might reasonably be considered a model parameter. Similarly, the initial wealth distribution could be parameterized. To reduce presentational clutter, the present example treats these as nonparametric parts of the model specification. Furthermore, this paper focuses only on the documentation of the baseline parameterization and experiments, not of the core model. There already exists a large literature on the latter, notably including the ODD literature in ecological modeling [Grimm.etal-2010-EcolModel].

The boundary between simulation parameters and model parameters is potentially porous. The illustrative specification of this simple Gift World example will intermix model parameters and simulation parameters. For example, as part of the specification of the simulation model, the simulation will stop after a specified maximum number of iterations. [4] Since the conceptual model does not imply a natural stopping point, we will consider the maximum number of iterations (maxiter) to be a simulation parameter.

Simple, deterministic simulations are often exactly reproducible across hardware and software platforms, especially if restricted to integer computations. Stochastic simulations present additional challenges. The complete parameterization should suffice for platform-specific reproducibility, and with enough knowledge of the pseudo-random number generators involved, may suffice for exact cross-platform reproducibility. In a stochastic simulation, a complete parameterization must therefore specify an initial seed for the pseudo-random number generator. (The present example does not specify the rule for updating the seed across replicates, but see below.)

The inclusion of randomness also means that a single run is usually inadequately informative about the model results. We therefore include the number of replicates (nReplicates) as a simulation parameter, in addition to an initial seed (seed). Here the three simulation parameters are restricted to be positive integers.
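
To show how the five parameters enter a computation, here is a minimal Python sketch of Gift World. It is one possible reading of the conceptual model, not a reference implementation. It assumes that agents may pick themselves as recipients, that all gifts in a period are decided before any wealth moves, and that the seed is incremented by one for each replicate. The function names are hypothetical, and the parameter values are deliberately small so that the example runs quickly.

```python
import random

def run_gift_world(nAgents, initialWealth, maxiter, seed):
    """Run one replicate and return the final list of agent wealths."""
    rng = random.Random(seed)            # PRNG local to this replicate
    wealth = [initialWealth] * nAgents   # identical initial wealth
    for _ in range(maxiter):
        # Assumption: each agent picks uniformly at random, possibly itself.
        recipients = [rng.randrange(nAgents) for _ in range(nAgents)]
        # Assumption: who can give is decided before any wealth moves.
        givers = [i for i in range(nAgents) if wealth[i] > 0]
        for i in givers:
            wealth[i] -= 1
            wealth[recipients[i]] += 1
    return wealth

def run_replicates(nAgents, initialWealth, maxiter, seed, nReplicates):
    """Assumption: replicate r uses seed + r."""
    return [run_gift_world(nAgents, initialWealth, maxiter, seed + r)
            for r in range(nReplicates)]

# Small illustrative values, not the baseline parameterization.
results = run_replicates(nAgents=50, initialWealth=10, maxiter=200,
                         seed=314159, nReplicates=3)
for wealth in results:
    print(min(wealth), max(wealth), sum(wealth))
```

Because each replicate draws only from its own seeded generator, rerunning with the same parameters reproduces the same wealth lists on the same platform.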

Code as Documentation

It is certainly possible for well organized and fully commented source code to provide good, accessible documentation of the baseline parameterization. For example, a Python or Ruby implementation of Gift World might include lines like the following. (In these languages, program comments begin with an octothorpe, and half-line comments are supported.)

# WR2015ch2 references Wilensky and Rand (2015, ch.2)
# model parameters:
nAgents=1000       #The number of agents.  (source:  WR2015ch2)
initialWealth=100  #Initial agent wealth.  (source:  WR2015ch2)
# simulation parameters:
maxiter=40000      #Maximum iterations.    (source: pretest)
seed=314159        #Initial seed for PRNG. (source: arbitrary)
nReplicates=100    #Number of replicates.  (source: conventional)

Sharing Parameterizations

This simple example demonstrates that in specific cases a parameterization can be usefully self-documenting even when it is actually source code. Such documentation efforts are good practice. Furthermore, the code in this listing may be placed in a separate file to be shared with those interested in replicating the results, either in an independent implementation or for docking [Axtell.Axelrod.Epstein.Cohen-1996-CMOT].

Nevertheless, as a basis for sharing a baseline parameterization among researchers, this example appears to have a clear drawback: it is not language agnostic. This has two large consequences. First of all, readability is heavily affected by the choice of language, especially for non-programmers. In this sense, the simple example code illustrates a least challenging case. Even so, not all languages use the octothorpe for comments, and not all languages use the equal sign for assignment. Language-agnostic documentation of the baseline parameterization would be preferable if it could still meet the other criteria listed above.

TOML to the Rescue

Particularly problematic is the tension between the maintainability and portability criteria. When implementation languages are manifold, parameterizations that exist only in source code cannot provide a single source of truth. For a skilled programmer, the challenge of converting the example parameterization to a preferred language is perhaps a small barrier. For others, the barrier is larger. However, reliance on a TOML parser largely eliminates this barrier.

Accessing the Parameterization

In the above example, the code listing is already valid TOML syntax. A file with this content can be imported by any of the dozens of programming languages for which a TOML parser already exists. For example, the May 2023 PYPL Popularity of Programming Language index lists the following as the fifteen most popular languages (in order): Python, Java, JavaScript, C#, C/C++, PHP, R, TypeScript, Swift, Objective-C, Rust, Go, Kotlin, Matlab, and Ruby. The official TOML wiki lists implementations for all of these programming languages (and many more). [5]

A Language-Specific Illustration

To illustrate the use of TOML to access a parameterization, actual code is needed. This paper chooses Python for two reasons: it is arguably the most popular programming language (see above), and the resulting source code is almost as readable as pseudocode. Reflecting the batteries-included philosophy of the Python standard library, Python includes the tomllib TOML parser in the standard library. [6]

Using this package to parse the simple example file above is very simple. In Python, the result is a dict object (i.e., an associative array), mapping keys to values. To illustrate parameter access, consider the following example. (Replace fnameTOML with the path to your file containing the Gift World parameterization in the listing above.)

import tomllib                           # access parser package
#load the file:
with open(fnameTOML, 'rb') as fin:       # open the TOML file for reading
    baseline = tomllib.load(fin)         # parse file to dict
#use the result:
for (name, val) in baseline.items():     # for each key-value pair ...
    print(f"{name}, {val}, {type(val)}") # print name, value, and type

Although this example uses language-specific facilities, it should be understandable with minimal guidance to anyone with basic programming experience. Since the items method of a Python dict produces an iterator over the key-value pairs, this example shows how to iterate through the resulting associative array and print the keys and values along the way. Just for illustration, it also prints the type of each value. The type is ensured by the TOML specification: conformant parsers do type inference, so the parameter values are sure to have the anticipated type. (In this simplified example, the type is int every time.) The point is not merely that this is easy to do in Python. Rather, the point is that the use of TOML makes access to a language-agnostic parameterization almost trivial.

Better Documentation of the Parameterization

The foregoing discussion demonstrates that TOML files offer a simple format for documenting and sharing baseline parameterizations in a natural, human-readable format that is agnostic about the implementation language of the simulation. This is already quite helpful. However, one somewhat disappointing aspect of the illustrated approach is that the comments in the file contain useful information that remains inaccessible. This section demonstrates one way to use TOML to provide fuller documentation of a baseline parameterization.

Inaccessible Information

In the simple example above, all of the documentation is in comments, which the TOML parser ignores. With this particularly simplified approach, it follows that a TOML parser cannot provide the first step in automated documentation production. For example, we cannot put the documentation comments into automated tables. However, this shortcoming may be remedied with minimal additional work by taking advantage of TOML tables.

Sections

The TOML specification supports tables, which hierarchically associate keys to arbitrary values. Tables naturally provide support for hierarchical configuration. The basic table syntax in the TOML format closely resembles the syntax for sections in the INI format: section names are surrounded by brackets. By means of such section names, the INI format supports a single level of hierarchy. An INI file with sections represents a collection of key-value pairs, where some values may again be a collection of key-value pairs. Effectively, an INI file is a human-readable textual representation of a hierarchical associative array.

TOML tables can use a similar syntax and thereby fulfill the role of INI sections. A table name can be bracketed on a line by itself and then followed by a collection of key-value pairs. In addition, TOML tables are more flexible than INI sections, since they may be inlined or nested to any level. The following example illustrates the syntax for an inline table and a standard table.

sectionA={key01='foo', key02=10} #inline table
[sectionB]                       #standard table
key01='foo'
key02=10

The Simple Gift World Example Redux

Using TOML’s standard table syntax, we can add documentation data for each parameter in a very readable way. Of course, the structure of such a table is determined only by convention. One reasonable convention is that each parameter specification must include value, description, and source fields. (Call this the value-description-source convention.) This VDS convention mandates the specification not just of the value but also a meaningful description as well as a source for the baseline value. Although it is perhaps a bit verbose, this convention is nicely explicit.

Applying the VDS convention produces an easy to read, portable, language-agnostic, and self-documenting parameterization. Additionally, the entire documentation becomes programmatically accessible. (For example, it might be used to produce formatted tables of documentation, possibly for a specified subset of parameters.)

The following listing illustrates the Gift World baseline parameterization in value-description-source format. Just for illustration, one of the descriptions is a multiline string. (Multiline strings must be triple quoted; an unescaped backslash at the end of a line suppresses the end-of-line, allowing uninterrupted line continuation.) Also purely for illustration, this example includes underscores in large integers to make them easier to read. (Some programming languages allow this syntax for integer literals to enhance readability, but all conformant TOML parsers support it.)

# File: xmplBaselineGW.toml
# 'WR2015ch2' references Wilensky and Rand (2015, ch.2).

[nAgents]
value = 1_000
description = 'The number of agents.'
source = 'WR2015ch2'

[initialWealth]
value = 100
description = 'Initial agent wealth.'
source = 'WR2015ch2'

[maxiter]
value = 40_000
description = 'Maximum number of iterations.'
source = 'pretest'

[seed]
value = 314_159
description = """Initial seed for PRNG; \
  increment this seed for each replicate."""
source = 'arbitrary'

[nReplicates]
value = 100
description = 'Number of replicates.'
source = 'convention'

Limitations

The VDS convention has several virtues. It still satisfies the initial goals of a documentation format, listed above. In particular, it is still highly readable and language-agnostic. The meaning of the documentation is clear to a human reader, and the entire documentation becomes available programmatically when the file is parsed. Against that, the availability of the value, description, and source fields is enforced only by convention. This must be considered a barrier when sharing the parameterization. (However, the barrier is often small, as illustrated by the language-specific example below.)

As another possible downside, the VDS style is considerably more verbose. One source of this verbosity is the need to repeat the source. This may appear irksome when all the model parameters derive from a single source. When parameters derive from multiple sources, however, this approach is admirably explicit. In addition, multiline documentation is easy to add and resists becoming cluttered.

Accessing the Parameter Values

Applying a TOML parser to this information will again produce an associative array where the keys are the parameter names. [7] However, this time each value in turn is an associative array, one of whose keys is "value". As a result, access to a parameter's value requires a very modest indirection.

Language-Specific Illustration

Although the TOML specification is language-agnostic, programmatic access to a TOML file is of course language specific. To illustrate parameter access in the new example, this paper once again chooses Python for its readability. This time the TOML file uses the VDS format, described above. After importing the TOML parser, the code required to produce the baseline changes only slightly. Once again, open the TOML file for reading and then parse it to produce an associative array. This time, however, the hash table maps parameter names to another associative array that holds all of the parameter documentation: not just its value, but also its description and its source. This necessitates a small extra step in order to associate the parameter name with the parameter value. In this example, that extra step is handled by a for loop, using concepts from the previous example. [8]

with open(fnameTOML, 'rb') as fin: # open the TOML file for reading
    docs = tomllib.load(fin)       # parse file to dictionary named `docs`
baseline = dict()                  # new, empty dictionary named `baseline`
for (key, info) in docs.items():   # for each parameter name and its info
    baseline[key] = info['value']  # ... map name to value

Assessment of TOML for Baseline Parameterization

This example shows that the use of TOML files along with the VDS convention has multiple virtues. The documentation of the baseline parameterization becomes not only feasible but also attractive. The plain-text format is readily human-readable, making it easy to understand, create, maintain, and share. The format is also readily machine readable, facilitating error free reuse and information exchange. The requirement that TOML parsers perform type inference according to a formal specification is a major advance over the INI format for this application.

The TOML format has wide support across programming languages, so it allows the easy development, maintenance, and sharing of the parameterization and its documentation in a format that is agnostic about the simulation language. Most importantly, this approach to documentation of a baseline parameterization promotes the goal of facilitating simulation replicability among researchers and across platforms. In addition, if researchers follow the VDS convention, programmatic access to the full documentation becomes trivial.

Use of TOML for Experiment Documentation

With somewhat less elegance, it is possible to use TOML files to document and share the specifications of certain kinds of simulation experiments. This section offers a few brief illustrations.

Scenario Contrast Experiments and Rectangular Grid Specifications

In the context of simulation experiments, the term scenario is a rough synonym for parameterization. A scenario specifies a value for each model parameter. A typical simulation experiment compares baseline results to the results produced by one or more alternative scenarios.

Scenario-contrast experiments are particularly simple. The contrast is between the baseline results and those from a very small number of alternative parameter configurations, perhaps only one. Contrasting scenarios are chosen for their salience in the context of the simulation model’s target application. A scenario-contrast experiment does not systematically explore the parameter space [Railsback.Grimm-2019-PrincetonUP]. (The parameter space comprises all of the combinations of parameter values considered by the researcher to be potentially relevant to the model.) The entire parameterization of a contrasting scenario is typically specified in terms of its deviation from a baseline scenario. It thereby retains most of the baseline parameterization, while specifying new values for one or more of the model’s parameters.

A parameter sweep more systematically attempts to explore some region of the parameter space. A univariate sweep systematically varies a single model parameter. A multivariate sweep systematically varies some subset of the parameters—the set of focal parameters for the experiment. A deterministic sweep once again specifies a discrete set of parameter values for consideration. However, the set of values may be large.

This paper uses the term parameter sweep to denote deterministic grid-based explorations of the parameter space. [9] A typical sweep specification begins by specifying, for each focal parameter, a subset of candidate values.

The term ‘grid’ need not imply any uniformity in the subset of values considered for a focal variable. Nevertheless, for a continuous variable, the subset often derives from a regular (linear or log-linear) subdivision of the variable’s allowed range. Regardless of the sampling strategy, the grid will be constructed from a finite set of values for each focal parameter. The completed grid for the parameter space includes every possible combination of these values. A complete parameter sweep considers every point in this grid. A selective parameter sweep adopts some strategy to sample this grid. This section focuses on documenting the grid specification, not the process of choosing it or the selection of a sweep strategy.
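
The distinction between a complete and a selective sweep can be sketched in a few lines of Python. The grid values below are illustrative, and uniform random sampling is only one of many possible selection strategies. It is used here as an assumption for the sake of a short example.

```python
from itertools import product
import random

# Illustrative finite value sets for two focal parameters.
grid_spec = {
    "nAgents": [100, 1_000, 10_000],
    "initialWealth": [50, 100, 150],
}

# Complete grid: every combination of the candidate values.
names = list(grid_spec)
complete_grid = [dict(zip(names, vals))
                 for vals in product(*grid_spec.values())]
print(len(complete_grid))   # 3 * 3 grid points

# Selective sweep (assumed strategy): sample grid points without replacement.
rng = random.Random(314159)
selective = rng.sample(complete_grid, k=4)
for point in selective:
    print(point)
```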

Arrays for a Grid Specification

Recall that the TOML specification supports arrays of values. An array may have a single member, so arrays can specify both simple scenario contrast experiments and very large grids for parameter sweeps. The following listing of the contents of a very simple TOML file illustrates this possibility. This file documents four different experiments for the Gift World model: simpleContrast, uniSweep, contrastSweep, and multiSweep. Note that these experiments depend on the baseline parameterization, documented as in the previous section. Therefore the documentation of the experiments does not need to (and should not) repeat the documentation of the model parameters.

# Experiments as deviations from xmplBaselineGW00.toml
[simpleContrast]             # scenario contrast for small population
nAgents=[100]
[uniSweep]                   # vary the initial wealth of agents
initialWealth=[50, 100, 150]
[contrastSweep]              # vary the initial wealth in a small world
nAgents=[100]
initialWealth=[50, 100, 150]
[multiSweep]                 # interact world size and initial wealth
nAgents=[100, 1_000, 10_000]
initialWealth=[50, 100, 150]

A parameter sweep typically embeds the baseline value (or one nearby) for each swept parameter. A scenario contrast experiment, however, need document only the deviation from the baseline. For maximum uniformity, in this example, the documentation of both types of experiment uses TOML arrays. This is a grid format for experiment specification. This format allows nearly trivial generation of all of the scenarios for an experiment, even for a multivariate parameter sweep.

Language-Specific Illustration of Parameter Sets

Once again, the specification is language-agnostic. Its practical use can be illustrated with some simple code. For researchers with even a little coding experience, reading simple Python code is often as easy as reading pseudocode. The following illustration therefore demonstrates how simply Python code can parse the above listing of experiments and produce all of the scenarios for any experiment. Somewhat arbitrarily, this illustration considers the multiSweep experiment (above). Recall that the previous section illustrated the import of tomllib and the extraction of the baseline parameterization. These two steps remain in the background of the present example.

from itertools import product         # import Cartesian product utility
#get the experiment specification:
with open(fnameXpmts, 'rb') as fin:   # open the experiments file for reading
    xpmts = tomllib.load(fin)         # load all experiments into dict
xpmt = xpmts['multiSweep']            # extract the `multiSweep` experiment
#create the experiment's scenarios:
names, lists = zip(*xpmt.items())     # prm names & their lists of sweep values
scenarios = list()                    # an empty list to hold the scenarios
for vals in product(*lists):          # for each item in Cartesian product ...
    scen = dict(zip(names,vals))      # new values for one scenario
    scenarios.append(baseline | scen) # merge new values into baseline

This time the source code provides a bit less guidance to readers unfamiliar with Python. Extensive comments on the right should offset that. [10] Although the example code is necessarily language specific, the steps involved are simple and are easily achievable in most programming languages. As before, the initial step is to read and parse the language-agnostic TOML file, thereby producing a language-specific representation of the experiments. Then, the researcher can programmatically transform the specification of any particular experiment into a collection of scenario specifications. This particular code illustrates the production of a complete sweep, so it produces one scenario specification for each point in the experimental grid. The entire grid is the direct product (or “Cartesian” product) of the individual value sets provided for each focal variable. The baseline is retained for all other parameters.
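The steps just described can be collected into a small, self-contained sketch. To avoid any dependence on file names, the experiments and the baseline values appear here as the Python dictionaries that a conformant TOML parser would produce from the listings in this paper. The function name make_scenarios is a hypothetical helper introduced only for this illustration.

```python
from itertools import product

# Baseline values, as extracted from the baseline parameterization.
baseline = {"nAgents": 1000, "initialWealth": 100, "maxiter": 40000,
            "seed": 314159, "nReplicates": 100}

# The experiments listing, as a TOML parser would represent it.
xpmts = {
    "simpleContrast": {"nAgents": [100]},
    "uniSweep": {"initialWealth": [50, 100, 150]},
    "contrastSweep": {"nAgents": [100], "initialWealth": [50, 100, 150]},
    "multiSweep": {"nAgents": [100, 1_000, 10_000],
                   "initialWealth": [50, 100, 150]},
}

def make_scenarios(baseline, xpmt):
    """Return one complete parameterization per grid point (hypothetical helper)."""
    names = list(xpmt)                      # names of the focal parameters
    grids = [xpmt[name] for name in names]  # their lists of values
    # Merge each grid point into a copy of the baseline.
    return [{**baseline, **dict(zip(names, vals))} for vals in product(*grids)]

counts = {name: len(make_scenarios(baseline, xpmt))
          for name, xpmt in xpmts.items()}
print(counts)
```

The scenario counts are 1, 3, 3, and 9 respectively: a single-member array contributes a factor of one to the product, which is why the same code serves both scenario contrasts and sweeps.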

Advantages of the Grid Format

As illustrated by this example, in addition to being useful for documenting a baseline parameterization, TOML can provide documentation support to grid-based approaches to experimental design. Using widely available libraries, it is simple to parse the language-agnostic TOML file and retrieve the grid specifications stored in TOML arrays. Therefore, the TOML format can be useful for configuring some common types of simulation experiments, such as scenario-contrast or parameter-sweep experiments.

It is worth reiterating that the TOML format is completely agnostic about the implementation language of the simulation. This facilitates sharing experimental configurations across languages and platforms. Such grid-based specifications are readily human-readable and practically self-documenting. Since TOML supports comments, additional documentation comments may be added as needed. [11] This approach to documenting and sharing experimental designs therefore has great practicability as well as wide applicability. Nevertheless it has some important limitations.

Limitation: Absence of a Range Type

For an easy configuration of simple parameter-sweep experiments, the most evident shortcoming of TOML is the absence of a range literal. In contrast, many simulation frameworks include a simple method for providing a start-stop-step range specification. (See for example the BehaviorSpace dialog in NetLogo [Tisue.Wilensky-2004a-ICCS], or the ParameterSweep window in Repast Simphony [North.etal-2013-CASM].) This indicates the desirability of such functionality. This shortcoming is shared by more complex and less readable configuration formats such as YAML and JSON, and the situation with INI files is much worse since these are most often parsed without any type inference at all. (See the appendix for details.) Nevertheless, this remains a significant shortcoming. Perhaps TOML will eventually include some kind of range literal, or at least introduce schema that support their specification. Currently this does not seem likely.

Fairly simple workarounds are possible, but with associated sacrifices. As one example, TOML files may be computer generated rather than handwritten. With this approach, the researcher programmatically transforms an experimental design into a machine generated TOML configuration file. The generated TOML file can then be used as the documentation of simulation experiments, the input for conducting the experiments, and the language-agnostic format for sharing experimental designs. If the enumerated ranges are long, however, the resulting TOML files will be less readable than if a range literal existed.
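As a minimal sketch of this workaround, the following Python code expands a start-stop-step design into an enumerated TOML array and produces the text of one experiment table. The helper name toml_experiment and the experiment name wealthSweep are hypothetical, and the formatting handles only integer values. A real project would more likely rely on a dedicated TOML-writing library.

```python
def toml_experiment(name, grid):
    """Format one experiment as a TOML table of integer arrays (hypothetical helper)."""
    lines = [f"[{name}]"]
    for prm, values in grid.items():
        items = ", ".join(str(int(v)) for v in values)
        lines.append(f"{prm} = [{items}]")
    return "\n".join(lines) + "\n"

# Enumerate the range programmatically instead of typing it by hand.
design = {"initialWealth": range(50, 501, 50)}
text = toml_experiment("wealthSweep", design)
print(text)
```

Even this short range produces a ten-element array, which hints at how quickly enumerated ranges can erode readability.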

An alternative workaround is to introduce a convention to represent ranges with TOML’s inline tables, with say start, stop and step keys. This is fine for development within a single small project group. It even serves reasonably well when the documentation is shared more broadly. However, at the parsing stage, it requires each user of the TOML file to know the convention and accordingly implement code to convert the parsed values. While this is certainly feasible, it undermines an important feature of the TOML format, which is to rely on any conformant parser to produce directly an experiment’s specification. Furthermore, the proliferation of such conventions undermines the easy interpretation across researchers of these specifications. Proposals to include schema in the TOML specification would ameliorate this issue, but they remain speculative.
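To make the cost of such a convention concrete, here is a minimal sketch of the extra conversion step it imposes. Suppose, hypothetically, that an experiment contains the line initialWealth = {start=50, stop=150, step=50}. A conformant parser returns a table (in Python, a dict) rather than a list of values, so each user must add code like the following. The key names and the choice to treat stop as inclusive are assumptions of this sketch, not part of TOML.

```python
def expand_ranges(xpmt):
    """Replace start-stop-step tables with explicit lists (assumed convention)."""
    expanded = {}
    for prm, spec in xpmt.items():
        if isinstance(spec, dict):          # an inline table, not an array
            # assumption: integer values, positive step, inclusive stop
            spec = list(range(spec["start"], spec["stop"] + 1, spec["step"]))
        expanded[prm] = spec
    return expanded

# What a TOML parser would return for the hypothetical experiment:
parsed = {"nAgents": [100],
          "initialWealth": {"start": 50, "stop": 150, "step": 50}}
grid = expand_ranges(parsed)
print(grid)
```

A reader who instead assumed an exclusive stop would silently generate a different grid, which is exactly the kind of ambiguity that a shared formal specification is meant to prevent.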

Limitation: Fractional Factorial Designs

Another limitation of the grid format for the TOML documentation of experimental designs is that it cannot directly support sampled grids. This is not always a relative limitation, since it is shared by the toolkit facilities mentioned above. Nevertheless, it is not ideal, since sampled designs are often mandated by the curse of dimensionality [Bellman-1957-PrincetonUP].

Complete sweeps quickly become computationally prohibitive. For example, consider a grid involving a mere three parameters taking on just 10 values each. This already implies 1000 simulations, even before considering replicates. For these reasons, multivariate sweeps are often filtered or sampled [Bach-2019-arXiv]. Specifying such filtering or sampling is a crucial part of experimental design and so requires careful documentation. This can be partially addressed by helpful comments or possibly by introducing a conventionally defined constraints parameter for each experiment. For now, however, there is no adequate language-agnostic workaround for this limitation. [12] The sampling technique must be documented in the simulation model’s source code and in any reports on the results.
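A minimal sketch of one simple approach, uniform random sampling of grid points, shows what would need to be documented. The parameter names, the sample size, and the sampling seed are all hypothetical. The point is that the sampling rule and its seed, not just the grid, determine which scenarios are run.

```python
import random
from itertools import product

# Hypothetical grid: three parameters taking on 10 values each.
grid_spec = {"prmA": list(range(10)),
             "prmB": list(range(10)),
             "prmC": list(range(10))}

names = list(grid_spec)
full_grid = [dict(zip(names, vals)) for vals in product(*grid_spec.values())]

rng = random.Random(2023)               # hypothetical seed; document it
sampled = rng.sample(full_grid, k=50)   # 50 distinct grid points

print(len(full_grid), len(sampled))
```

The grid itself fits naturally in a TOML file, but the sampling rule in the last two statements lives only in the source code.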

Conclusion

The TOML format is a plain-text format that was developed for configuration files. It is intended to be as trivially human-readable as the venerable INI format but more useful and flexible. The formal specification of the TOML format, and in particular its support for type inference, has made it the configuration format of choice in a variety of computational settings. This paper considers whether TOML might prove useful for supporting the documentation and configuration of two key components of simulation models: baseline parameterization, and experimental design.

Simulation experiments should be documented in ways that facilitate understanding and support replication. This necessitates easy ways to share baseline parameterizations and experimental designs. Core goals for a documentation format include simplicity of generation, easy readability by humans, simple automation of information sharing, and lack of ambiguity. Core goals for a configuration format include ready portability across programming languages and platforms.

This paper provides a very simple demonstration of the utility of TOML as a documentation and configuration format for a baseline parameterization. TOML files provide a useful, easy to write, easy to read, language-agnostic information-exchange format for the documentation and sharing of baseline parameterizations. Reliance on a value-description-source convention can make this documentation particularly explicit and amenable to programmatic access.

We also show that TOML files can provide an extremely simple, language-agnostic information-exchange format for the documentation and sharing of the grid-based specifications required by some common experimental designs. However, this paper also describes key shortcomings of this format for the configuration of such experimental designs: the lack of a range literal (or formal schema), and the inability to easily specify sampled grids. The latter limitation is shared by common simulation toolkits, and partial workarounds for these limitations are possible. Finally, this paper does not propose the use of TOML to document and share experimental designs that are not grid-based. TOML is clearly useful for the documentation of baseline parameterizations, and it additionally holds some attraction for the documentation of scenario-contrast experiments and grid-based sweep experiments.

References

[Angle-1986-SocialForces]

Angle, John. (1986) The Surplus Theory of Social Stratification and the Size Distribution of Personal Wealth. Social Forces 65, 293--326. http://www.jstor.org/stable/2578675

[Axtell.Axelrod.Epstein.Cohen-1996-CMOT]

Axtell, Robert, et al. (1996) Aligning Simulation Models: A Case Study and Results. Computational and Mathematical Organization Theory 1, 123--141. https://doi.org/10.1007/BF01299065

[Bach-2019-arXiv]

Bach, Eviatar. (2019) parasweep: A Template-based Utility for Generating, Dispatching, and Post-processing of Parameter Sweeps. arXiv [cs.DC]. http://arxiv.org/abs/1905.03448

[Bellman-1957-PrincetonUP]

Bellman, Richard Ernest. (1957) Dynamic Programming. Princeton, NJ: Princeton University Press.

[Oren.Evans.Net-2009-YAML]

Ben-Kiki, Oren, Clark Evans, and Ingy dot Net. (2009) YAML Ain't Markup Language Version 1.2.

[Dragulescu.Yakovenko-2000-EurPhysJB]

Dragulescu, Adrian A., and Victor M. Yakovenko. (2000) Statistical Mechanics of Money. The European Physical Journal B 17, 723--729. https://arxiv.org/abs/cond-mat/0001432

[Gramacy-2020-CRC]

Gramacy, Robert B. (2020) Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences. Boca Raton, Florida: Chapman Hall/CRC.

[Grimm.etal-2010-EcolModel]

Grimm, V., et al. (2010) The ODD Protocol: A Review and First Update. Ecological Modelling 221, 2760--2768.

[Hunt.Thomas-1999-AWP]

Hunt, Andrew, and David Thomas. (1999) The Pragmatic Programmer: From Journeyman to Master. Boston, MA: Addison-Wesley Professional.

[North.etal-2013-CASM]

North, M.J., N.T. Collier, and J. Ozik. (2013) Complex Adaptive Systems Modeling with Repast Simphony. Complex Adaptive Systems Modeling 1, Article 3.

[TOML-2021]

Preston-Werner, Tom, and Pradyun Gedam. (2021) TOML version 1.0.0.

[Railsback.Grimm-2019-PrincetonUP]

Railsback, Steven F., and Volker Grimm. (2019) Agent-Based and Individual-Based Modeling: A Practical Introduction. Princeton, NJ: Princeton University Press.

[Tisue.Wilensky-2004a-ICCS]

Tisue, Seth, and Uri Wilensky. (2004) "NetLogo: A Simple Environment for Modeling Complexity". In International Conference on Complex Systems, May 16--21.

[Wilensky.Rand-2007-JASSS]

Wilensky, Uri, and William Rand. (2007) Making Models Match: Replicating an Agent-Based Model. Journal of Artificial Societies and Social Simulation 10, Article 2. http://jasss.soc.surrey.ac.uk/10/4/2.html

[Wilensky.Rand-2015-MIT]

Wilensky, Uri, and William Rand. (2015) An Introduction to Agent-Based Modeling: Modeling Natural, Social, and Engineered Complex Systems with NetLogo. Cambridge, MA: MIT Press.

[ECMA-2017-JSON]

ECMA International. (2017) The JSON Data Interchange Syntax.

[PSF-2019-StandardLibrary]

Python Software Foundation. (2019) The Python Standard Library 3.8.

[Unicode-2019-v12.1]

The Unicode Consortium. (2019) The Unicode Standard: Version 12.0 - Core Specification. Mountain View, CA: Unicode Consortium. http://www.unicode.org/versions/Unicode12.0.0/UnicodeStandard-12.0.pdf

Appendix: INI, JSON, and YAML

This appendix briefly describes three other formats that might plausibly aspire to the role proposed in this paper for the TOML format. To allow easy comparison, the baseline parameterization presented above in TOML format is converted to INI, JSON, and YAML formats.

Further Comparison with the INI Format

First, consider the closely related INI format. The following listing provides a possible representation of the baseline parameterization in the INI format.

; File: xmplBaselineGW.ini
[nAgents]
value=1000
description=The number of agents.
source=Wilensky and Rand (2015, ch.2)
type=Integer
[initialWealth]
value=100
description=The value of initial wealth for each agent.
source=Wilensky and Rand (2015, ch.2)
type=Integer
[maxiter]
value=40000
description=Maximum number of iterations.
source=pretest
type=Integer
[seed]
value=314159
description=Seed for PRNG; increment this seed for each replicate.
source=arbitrary
type=Integer
[nReplicates]
value=100
description=Number of replicates.
source=conventional
type=Integer

At first glance, this appears to very closely resemble the TOML format. The INI format is approximately as easy to read and to write by hand. Visual clutter is slightly reduced because string values need not be quoted. However, this is because all values are parsed as strings. So---more than offsetting this gain---clutter is somewhat increased by the general need to state the type of each parameter. Although this type field has no effect on the parsing of the value, we might subsequently use it for conditional casting of a value to the needed type. Accordingly, support for this type field depends on establishing a convention among those who make use of the INI file, undermining the goal of relying on the parser for type inference.

import configparser
cfg = configparser.ConfigParser()                   #create a parser
cfg.read(fnameINI)                                  #parse the file
baseline = dict()                                   #create empty hashtable
for section in cfg.sections():                      #for each parameter ...
    baseline[section] = cfg.get(section,'value')    #... add key-value pair

Consider a language-specific illustration, using the configparser module in the Python standard library. This module, which is already familiar to many Python programmers, was designed to parse INI files (and some common variants). The configparser module provides a pretty good solution, but it has some serious drawbacks. Most importantly, there is no type inference: all the values are stored as strings. In the current simple example, this is a fairly small consideration: since all the values are integers, they may easily be converted. (Indeed, one may simply replace get with getint.) However, when values may have more varied types, workarounds are needed. If we adopt a convention that every parameter definition includes a type attribute, then we can do type conversion based on this field. In contrast, the TOML specification ensures type inference, obviating this difficulty.
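A minimal sketch of such convention-based conversion follows. Only the Integer type name appears in the INI listing above. The Float and String names, the growthRate parameter, and the casters lookup table are hypothetical additions for illustration.

```python
import configparser

# A small INI fragment; growthRate is a hypothetical non-integer parameter.
ini_text = """
[nAgents]
value=1000
type=Integer
[growthRate]
value=0.05
type=Float
"""

# Assumed convention: map each declared type name to a Python converter.
casters = {"Integer": int, "Float": float, "String": str}

cfg = configparser.ConfigParser()
cfg.read_string(ini_text)                      # parse the INI text
baseline = {}
for section in cfg.sections():
    raw = cfg.get(section, "value")            # always parsed as a string
    cast = casters[cfg.get(section, "type")]   # look up the declared type
    baseline[section] = cast(raw)
print(baseline)
```

Every consumer of the INI file, in every language, would need an equivalent lookup table that agrees on the type names.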

Another important limitation implied by the lack of type inference is a lack of support for arrays of values, which proves important in the documentation of experiments. In contrast, the TOML specification supports arrays and does type inference on the elements, obviating this difficulty.

In sum, the INI format cannot compete with TOML for the documentation and exchange of simulation parameterizations. The remaining two formats, JSON and YAML, offer more competition. Each has a formal specification that provides for type inference and includes support for arrays.

Comparison with the JSON Format

The JSON format was explicitly developed for lowest common denominator data exchange. There is a very well documented formal standard [ECMA-2017-JSON]. This is the most widely used format for platform-independent data exchange over the internet, so it is a particularly obvious candidate for our purposes. The following listing provides a possible representation of the baseline parameterization in the JSON format.

{
    "nAgents": {
        "value": 1000,
        "description": "The number of agents.",
        "source": "WR2015ch2"
    },
    "initialWealth": {
        "value": 100,
        "description": "Initial agent wealth.",
        "source": "WR2015ch2"
    },
    "maxiter": {
        "value": 40000,
        "description": "Maximum number of iterations.",
        "source": "pretest"
    },
    "seed": {
        "value": 314159,
        "description": "Seed for PRNG; increment this seed for each replicate.",
        "source": "arbitrary"
    },
    "nReplicates": {
        "value": 100,
        "description": "Number of replicates.",
        "source": "conventional"
    }
}

Support across programming languages for JSON is extremely good, so it is plausible as a language-agnostic format. It would be difficult to find a widely used programming language that lacks support for this format. For example, the Python standard library includes the json module [PSF-2019-StandardLibrary]. This module makes programmatic access to the JSON version of the parameterization about as simple as with TOML. The following listing provides a Python-based illustration of JSON parsing and access. (It is structured to mirror the TOML illustration in the paper. As before, in order to keep the code simple, there is no error handling.)

import json                            # import the parser module
with open(fnameJSON,'r') as fin:       # open the JSON file for reading
    info = json.load(fin)              # parse the file
baseline = dict((key, fields['value'])
                for (key, fields) in info.items()) # create baseline

Nevertheless, the JSON format is visually more cluttered than the TOML format. Four key reasons for this are the need for nested braces, the corresponding need for commas as separators, the need to quote keys (which incidentally must be strings), and the lack of clear visual clues (aside from optional indentation) about the extent of object nesting. The JSON format was not designed for humans to easily read or write, and it shows.

This example includes no comments because JSON does not support comments; it is a data-only format. Although a very simple JSON file can be written (as this one was) to minimize the visual challenges to a reader, this formatting is entirely optional, and the result is still not quite as easy to read. More importantly, it is much more challenging to create such a file by hand without introducing errors such as missing commas, quotes, or braces. So as a single source of truth for a baseline parameterization, a JSON file is inferior to a TOML file.
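The lack of comment support is easy to confirm. In the following minimal sketch, the json module of the Python standard library rejects a document containing a JavaScript-style comment. The comment text and the abbreviated document are invented for this illustration.

```python
import json

commented = """
{
    // scenario contrast for small population
    "nAgents": {"value": 100}
}
"""
try:
    json.loads(commented)        # strict parsing: comments are not JSON
    accepted = True
except json.JSONDecodeError:
    accepted = False
print(accepted)
```

Some tools accept comment-bearing JSON dialects, but files that rely on such extensions are no longer portable across conformant parsers.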

Comparison with the YAML Format

Last but not least, the YAML format is the most plausible contender for the role that this paper proposes for the TOML format. Like TOML but in contrast with JSON, YAML files may contain comments. YAML is formally a superset of JSON, which implies substantial complexity. Correspondingly, YAML can be much harder than JSON to generate and parse.

However, the YAML specification adds a readable syntax for easy creation by hand [Oren.Evans.Net-2009-YAML]. With this syntax, a simple YAML file becomes almost as readable as a simple TOML file. However, it achieves this by introducing significant white space. To illustrate the potential readability of YAML, the following listing provides one possible representation of our baseline parameterization of Gift World in the YAML format.

initialWealth:
    description: Initial agent wealth.
    source: WR2015ch2
    value: 100
maxiter:
    description: Maximum number of iterations.
    source: pretest
    value: 40000
nAgents:
    description: The number of agents.
    source: WR2015ch2
    value: 1000
nReplicates:
    description: Number of replicates.
    source: conventional
    value: 100
seed:
    description: Seed for PRNG; increment this seed for each replicate.
    source: arbitrary
    value: 314159

Although the standard libraries of programming languages generally do not support YAML parsing, third-party support for YAML across programming languages is quite good. (See https://yaml.org/ for a list of parsers for many languages.) As a language-specific example, the yaml package for Python makes programmatic access to the YAML version of the parameterization about as simple as with TOML. (Most users simply use pip to install this package; e.g., python -m pip install pyyaml.) The following listing provides a Python-based illustration of YAML parsing and access, written to mirror the TOML illustration in the present paper. (As before, to keep the code simple, there is no error handling.) One important drawback of the YAML format is that loading a YAML file may make it possible to execute arbitrary code. This parsing example therefore uses the safe_load function, which avoids this problem.

import yaml                                        # import the parser package
with open(fnameYAML,'r') as fin:                   # open the YAML file for reading
    info = yaml.safe_load(fin)                     # parse the file (safely)
baseline = dict((key, fields['value'])
                for (key, fields) in info.items()) # create baseline

The YAML format is a reasonable choice for the goals of this paper. However, one may argue that the reliance on significant indentation makes this format slightly visually more cluttered and, for human writers, more error prone than the TOML format. The specification is also much more complicated than the TOML specification. For example, the YAML illustration above eschews multiline strings in order to avoid discussing the several different YAML syntaxes for them. Nevertheless, a very simple YAML file can be handwritten (as this one was) to minimize the challenges to readability and easy understanding. And as with TOML, some of this visual convenience is encouraged by the specification. As a single source of truth for a baseline parameterization, a YAML file is nearly as simple, readable, maintainable, and portable as a TOML file.

In sum, YAML is a reasonably good option. It can be almost as easy to read as TOML, particularly given the limited needs addressed in the present paper. However, YAML is a bit harder to write by hand, may have greater security concerns, and involves a specification that is quite complicated and difficult to master. This suggests that TOML has an edge in readability, writability, maintainability, and simplicity.


version:

2023-06-29