Plain Text Data IO in Python

Creating and reading data files is a common need in ABMS. The following discussion provides a basic introduction to file input and output in Python, focusing on storing and reading data in plain-text files.

Basic File I/O

File IO with Python begins with the builtin open function. This function opens a file and returns a handle to the file. The handle may be used for reading or writing, as shown below. A typical usage provides three arguments: the file path, the mode, and the encoding. The file mode can be write (w), append (a), or read (r). Be cautious if you use the following example code: if /temp/temp.txt already exists, writing to that path replaces it without warning.

#open temp.txt for writing:
fout = open('/temp/temp.txt', mode='w', encoding='utf8')
fout.write("Write first string to file.\n")        #write to the file
fout.close()                                       #close the file

#open temp.txt for appending:
fout = open('/temp/temp.txt', mode='a', encoding='utf8')
fout.write("Write a second string to file.\n")     #write to the file
fout.close()                                       #close the file

#open temp.txt for reading:
fin = open('/temp/temp.txt', mode='r', encoding='utf8')
text = fin.read()                #read the entire file into a string variable
fin.close()                      #close the file

#print the string that we read from the file:
print(text)

Sometimes it is convenient to write a list of strings to a file. You can apply the writelines method of the file handle directly to the list. However, as with the write method, remember that Python does not add an end-of-line (\n) to each string. That is left to the user, as already illustrated above.
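As a sketch of this, the following writes a short list of strings with writelines, supplying the end-of-line characters ourselves. (To keep the example self-contained, it writes to temp.txt in the current directory rather than /temp/temp.txt.)

```python
#a list of strings, each given an explicit end-of-line:
strings = ["Write first string to file.\n", "Write a second string to file.\n"]

#writelines writes the strings as-is; it does not add newlines:
fout = open('temp.txt', mode='w', encoding='utf8')
fout.writelines(strings)
fout.close()

#read the file back to confirm its contents:
fin = open('temp.txt', mode='r', encoding='utf8')
text = fin.read()
fin.close()
print(text)
```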

Automate File Cleanup Using with

One problem with the IO approach shown above is that it leaves cleanup to the user. Most obviously, the user must remember to close each file when finished with it. We can turn this file management over to Python by using open inside a with statement. Here is an example:

#using `with` for file management:
with open('/temp/temp.txt', mode='r', encoding='utf8') as fin:
   text = fin.read()
print(text)

read vs readlines

Often you may wish to control the reading of a file. Use read(n) to read at most n characters from a text file. (By omitting the size argument in our earlier example, we read the entire file.) Use readline to read one line at a time. The size can be restricted as with read(n), but then you may read an incomplete line. Use readlines() to read all the lines of a file, which are returned as a list of strings.

with open('/temp/temp.txt', mode='r', encoding='utf8') as fin:
   lines = fin.readlines()    #read the file lines into a list
print(lines)

Note that Python does not strip the end-of-line characters. When they are not wanted, you can remove them (and any adjacent white space) with the rstrip method of strings.
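To illustrate readline and rstrip together, the following sketch first creates a small file (temp.txt in the current directory, to keep the example self-contained) and then reads it back one line at a time.

```python
#create a small file to read:
with open('temp.txt', mode='w', encoding='utf8') as fout:
    fout.write("first line\nsecond line\n")

#read one line at a time; readline returns '' at end of file:
lines = []
with open('temp.txt', mode='r', encoding='utf8') as fin:
    line = fin.readline()
    while line:
        lines.append(line.rstrip())   #strip the end-of-line characters
        line = fin.readline()
print(lines)
```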

Writing a Numerical Sequence as Text

When numerical datasets are small in size, it is common to store the data in text files. Consider the goal of writing numerical data to a plain-text file as a single column. Here are a few ways to do that. In every case, work with the following data.

xs = [1.0, 2.0, 3.0]
  1. Use Python’s builtin print function, which converts each number to a string. Use its sep option to separate the numbers with end-of-line characters. Use its file option to print to a file instead of the console.

    with open('/temp/temp.txt', mode='w', encoding='utf8') as fout:
       print(*xs, sep='\n', file=fout)
  2. Make the string conversion up front and then write to file. The following illustration additionally makes use of the pathlib module.

    import pathlib
    mypth = pathlib.Path('/temp/temp.txt')
    mystr = '\n'.join(str(x) for x in xs)
    mypth.write_text(mystr, encoding='utf8')
  3. Use the savetxt function of NumPy. This illustration uses the fmt option to specify a floating-point format for the output.

    import numpy as np
    np.savetxt(mypth, xs, fmt='%f')
  4. Use the write_csv method of a Polars dataframe. Although this is perhaps overkill for a single sequence of numbers, it makes it particularly simple to add a column header.

    import polars as pl
    df = pl.DataFrame([xs], schema=["myheader"])
    df.write_csv(mypth)

Retrieving the Data

If we read data from a text file, the data is initially held in strings. We therefore often need to convert strings to numbers. Python has three builtin numeric types: int, float, and complex. When converting strings to numbers, we must decide which type of number we want. Here are some example conversions:

>>> int('012')
12
>>> float('012')
12.0
>>> complex('012')
(12+0j)

Note that the type conversion functions int, float, and complex produce the expected result: leading zeros in your data will not create a problem. Leading white space is not a problem either. Naturally, leading zeros or white space are not required.
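A quick check confirms this tolerance of surrounding white space and leading zeros:

```python
#surrounding white space and leading zeros are both tolerated:
print(int('  012  '))     #12
print(float(' 3.5 '))     #3.5
```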

In a text file, each number receives a string representation, so the data must be converted back into numbers upon import. Here are a few ways to read a single column of numerical data into a sequence type, assuming the first line is a header.

  1. Use ordinary Python file reading, with the programmer handling the conversion. Note that Python can iterate over the lines of an open text file. This example illustrates the use of next to skip the header line.

    with open(mypth, mode='r', encoding='utf8') as fin:
        _ = next(fin) #discard header line
        xs = tuple(float(line) for line in fin)
  2. Use the loadtxt function in NumPy, letting it handle the conversion. This example uses the skiprows option to skip the header line.

    xs = np.loadtxt(mypth, skiprows=1)
  3. Use the read_csv function in Polars, letting it handle the conversion. The result is a Polars dataframe, and the header row automatically becomes the name of the single column.

    xs = pl.read_csv(mypth)

Columnar CSV Files

When possible, separate the analysis of your model from the running of the model. This requires writing output files at runtime. A common file format for the exchange of simulation data is the comma-separated values (CSV) filetype, which is a specially formatted plain-text file. Such a file typically begins with a header line, which provides names for the fields of the subsequent records. Each subsequent line is a single record, providing a value for each field.

CSV files store data as plain text. A general standard for the CSV file format does not exist, but a widely accepted description is given by RFC 4180. The first line is often a header line, containing the field names, separated by commas. Each subsequent line records a value for each field, with the values separated by commas. By convention, CSV file names include a .csv extension. For example, temp.csv might contain the following data.

name,age,wage
Dave,19,10.50
Jean,21,11.75
Chris,20,7.85

The intended types of values must be inferred. Here one might guess that the name field holds a string, the age field holds an integer, and the wage field holds a floating point number. Many applications perform such inference when reading a CSV file.

Writing CSV Files

Consider the goal of writing multiple equal-length sequences of numerical data to a plain-text file in CSV format, with one sequence in each column. Here are a few ways to do that. Note that NumPy and Polars must be installed before use (e.g., with pip). In every case, work with the following data.

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
seqs = [xs, ys]
  1. Use Python’s builtin print function, which converts each number to a string. Use its sep option to separate the values with commas. Use its file option to print to a file instead of the console.

    with open('/temp/temp.txt', mode='w', encoding='utf8') as fout:
       for row in zip(*seqs):
          print(*row, sep=',', file=fout)
  2. Make the string conversion up front and then write to file. The following illustration additionally makes use of the pathlib module.

    import pathlib
    mypth = pathlib.Path('/temp/temp.txt')
    mystr = '\n'.join(','.join(map(str,row)) for row in zip(*seqs))
    mypth.write_text(mystr, encoding='utf8')
  3. Use Python’s csv module, and let it handle the conversion. An important detail here is the handling of newlines: the writerows method provides them, and so does open unless suppressed with newline=''.

    import csv
    mypth = pathlib.Path('/temp/temp.txt')
    with mypth.open(mode='w', encoding='utf8', newline='') as fout:
       writer = csv.writer(fout)
       writer.writerows(zip(*seqs))
  4. Use the savetxt function of NumPy. This illustration uses the fmt option to specify a floating-point format for the output.

    import numpy as np
    np.savetxt(mypth, np.transpose(seqs), fmt='%f', delimiter=',')
  5. Use the write_csv method of a Polars dataframe. Here we provide only the data, so Polars will automatically generate column headers. (Alternatively, use the schema option to provide headers.)

    import polars as pl
    df = pl.DataFrame(seqs)
    df.write_csv(mypth)

Incremental Writing of CSV Files

In order to write data as a CSV file, one may handle the details or rely on a library. Here is one common approach.

  1. Open the file for writing, write a header line, and close the file.

  2. Repeat the following as needed:

    1. Produce the data for a new record (i.e., row).

    2. Open the CSV file for appending, write the data, and close the file.

For example, suppose we want to write the temp.csv file mentioned above, one line at a time. Given the desired field names, proceed as follows. This approach uses a with statement, which will ensure that the file is properly closed. (In fact, as a nice bonus, the with statement will ensure the file is properly closed even if our attempt to write to the file fails for some reason.)

fields = ["name","age","wage"]
header = ",".join(fields)
with open('/temp/temp.csv', mode='w', encoding='utf8') as fout:
    fout.write(header)

The open function opens the file temp.csv for writing (since the mode argument is w). Be careful: this completely overwrites any existing temp.csv. The open function returns a file object, which can be used to write to the file; here, its name is fout. Create the string to write by joining the field names into a single string of comma-separated headers. Use the write method of the file object to write this string to the file. When the with block ends, the file is closed for us.

Next, each time we receive a new data record (i.e., row), we need to append it to the same file. Suppose, for example, we have record=["Dave",19,10.50]. Using Python’s builtin map function, we can proceed as follows. (Note that we must provide the line endings.)

record=["Dave",19,10.50]
with open('/temp/temp.csv', mode='a', encoding='utf8') as fout:
    row = ','.join(map(str,record))  #interlace commas
    fout.write(f"\n{row}")

Note that our record was initially a list containing diverse data types. We first convert the fields to strings, which we then combine into a single string and write to our file. Note also that we write an end-of-line (\n) before each record, since the records of a CSV file are on separate lines.

In sum, it is simple to write data to a CSV file. If you want, you can open this file in your favorite spreadsheet. However, we had to handle stringification ourselves, and we had to remember to start a new line each time we wrote a record to the file. We can use Python’s csv module to handle such details for us. Here is a quick illustration, writing one row at a time. This example additionally illustrates a nice feature of the CSV writer: with the quoting option set to QUOTE_NONNUMERIC, the writer adds quotation marks around non-numeric values.

import csv
data=[["name","age","wage"], ["Dave",19,10.50], ["Jean",21,11.75]]
with open('/temp/temp.csv', mode='w', encoding='utf8', newline='') as fout:
    writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
    for row in data:
        writer.writerow(row)

Reading Columnar CSV Files

Consider the goal of reading multiple equal-length sequences of numerical data from a columnar CSV file, with one sequence in each column. Here are a few ways to do that.

  1. Use the reader function from the csv module. This produces an iterator over the lines of the file, returning each line as a list of strings. As a nice feature, if your non-numeric data is quoted, you may assign QUOTE_NONNUMERIC to the quoting option of the CSV reader. This coerces the unquoted fields to floating-point values.

    import csv
    with open('/temp/temp.csv', mode='r', encoding='utf8') as fin:
        reader = csv.reader(fin, delimiter=",", quoting=csv.QUOTE_NONNUMERIC)
        for line in reader:
            print(*map(type, line))
  2. Use the DictReader class from the csv module. This produces an iterator over the lines of the file, returning each line as a dict, where the field names are determined from the header. (Alternatively, field names may be specified with the fieldnames parameter.)

    import csv
    with open('/temp/temp.csv', mode='r', encoding='utf8') as fin:
        reader = csv.DictReader(fin, delimiter=",", quoting=csv.QUOTE_NONNUMERIC)
        for record in reader:  #a record is a dict
            print(record)
  3. If the data is entirely numeric, NumPy’s loadtxt function can import it as an array of floating-point numbers. (Here the usecols option restricts the import to the numeric columns, skipping the name column.)

    import numpy as np
    a2d = np.loadtxt('/temp/temp.csv', skiprows=1, delimiter=',', usecols=(1,2))
  4. Probably the best option for reading a columnar CSV file is the read_csv function of Polars. It provides automatic type inference. (It also accepts hints to aid that inference.) The result is a Polars dataframe, a particularly useful datatype for data science.

    import polars as pl
    df = pl.read_csv('/temp/temp.csv')