Plain Text Data IO in Python
Creating and reading data files is a common need in ABMS. The following discussion provides a basic introduction to file input and output in Python. The discussion focuses on storing and reading data in plain-text files.
Basic File I/O
File IO with Python begins with the builtin open function.
This function opens a file and returns a handle to the file.
The handle may be used for reading or writing, as shown below.
A typical usage provides three arguments:
the file path, the mode, and the encoding.
The file mode can be write (w), append (a), or read (r).
Be cautious if you use the following example code:
if /temp/temp.txt
already exists,
writing to that path replaces it without warning.
#open temp.txt for writing:
fout = open('/temp/temp.txt', mode='w', encoding='utf8')
fout.write("Write first string to file.\n") #write to the file
fout.close() #close the file
#open temp.txt for appending:
fout = open('/temp/temp.txt', mode='a', encoding='utf8')
fout.write("Write a second string to file.\n") #write to the file
fout.close() #close the file
#open temp.txt for reading:
fin = open('/temp/temp.txt', mode='r', encoding='utf8')
text = fin.read() #read the entire file into a string variable
fin.close() #close the file
#print the string that we read from the file:
print(text)
Sometimes it is convenient to write a list of strings to file.
In that case, you can apply the writelines method
of the file handle directly to the list.
However, as with the write method,
remember that Python does not add an end-of-line (\n) to each string.
That is left to the user, as already illustrated above.
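For example, here is a minimal sketch of writelines in the style above (using a relative path temp.txt for illustration):

```python
lines = ["first line", "second line", "third line"]
fout = open('temp.txt', mode='w', encoding='utf8')
#writelines does not add end-of-line characters; supply them yourself:
fout.writelines(line + '\n' for line in lines)
fout.close()
```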
Automate File Cleanup Using with
One problem with the IO method shown above
is that it leaves cleanup to the user.
Most obviously, the user must remember
to close each file when finished with it.
We can turn this file management over to Python
by using open
inside a with
statement.
Here is an example:
#using `with` for file management:
with open('/temp/temp.txt', mode='r', encoding='utf8') as fin:
    text = fin.read()
print(text)
read vs readlines
Often you may wish to control reading of the file.
Use read(n) to read at most n characters from a file opened in text mode.
(By omitting the size argument in our example, we read the entire file.)
Use readline to read one line at a time.
Size can be restricted as with read(n),
but then you may read an incomplete line.
Use readlines() to read all lines from a file,
which are returned as a list of lines.
with open('/temp/temp.txt', mode='r', encoding='utf8') as fin:
    lines = fin.readlines() #read the file lines into a list
print(lines)
Note that Python does not strip the end-of-line characters.
When they are not wanted,
you can remove them (and any adjacent white space)
with the rstrip method of strings.
Writing a Numerical Sequence as Text
When numerical datasets are small in size, it is common to store the data in text files. Consider the goal of writing numerical data to a plain-text file as a single column. Here are a few ways to do that. In every case, work with the following data.
xs = [1.0, 2.0, 3.0]
Use Python’s builtin print function, which converts each number to a string. Use its sep option to separate the numbers with end-of-line characters. Use its file option to print to a file instead of the console.

with open('/temp/temp.txt', mode='w', encoding='utf8') as fout:
    print(*xs, sep='\n', file=fout)
Make the string conversion up front and then write to file. The following illustration additionally makes use of the pathlib module.

import pathlib
mypth = pathlib.Path('/temp/temp.txt')
mystr = '\n'.join(str(x) for x in xs)
mypth.write_text(mystr, encoding='utf8')
Use the savetxt function of NumPy. This illustration uses the fmt option to specify a floating-point format for the output.

import numpy as np
np.savetxt(mypth, xs, fmt='%f')
Use the write_csv method of a Polars dataframe. Although this is perhaps overkill for a single sequence of numbers, it makes it particularly simple to add a column header.

import polars as pl
df = pl.DataFrame([xs], schema=["myheader"])
df.write_csv(mypth)
Retrieving the Data
If we read data from a text file, the data is initially held in strings. We therefore often need to convert strings to numbers. Python has three builtin numeric types: int, float, and complex. When we convert strings to numbers, we have to decide which type of number we want. Here are some example conversions:
>>> int('012')
12
>>> float('012')
12.0
>>> complex('012')
(12+0j)
Note that the type conversion functions int, float, and complex
produce the expected result:
leading zeros in your data will not create a problem.
Leading white space is not a problem either.
Naturally, leading zeros or white space are not required.
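For instance, each of the following conversions succeeds:

```python
print(int(' 012'))    #leading white space and zero: prints 12
print(float('  3.5')) #prints 3.5
```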
In a text file, each number receives a string representation, so the data must be converted into numbers upon import. Here are a few ways to read a single column of numerical data into a sequence type, assuming the first line is a header.
Use ordinary Python file reading, with the programmer handling the conversion. Note that Python can iterate over the lines of an open text file. This example illustrates the use of next to skip the header line.

with open(mypth, mode='r', encoding='utf8') as fin:
    _ = next(fin) #discard header line
    xs = tuple(float(line) for line in fin)
Use the loadtxt function in NumPy, letting it handle the conversion. This example uses the skiprows option to skip the header line.

xs = np.loadtxt(mypth, skiprows=1)
Use the read_csv function in Polars, letting it handle the conversion. The result is a Polars dataframe, and the header row automatically becomes the name of the single column.

xs = pl.read_csv(mypth)
Columnar CSV Files
When possible, separate the analysis of your model from the running of the model. This requires writing output files at runtime. A common format for the exchange of simulation data is the comma-separated values (CSV) filetype, which is a specially formatted plain-text file. A CSV file typically begins with a header line, which provides names for the fields of the subsequent records. Each subsequent line is a single record, providing a value for each field.
CSV files store data as plain text.
A general standard for the CSV file format does not exist,
but a widely accepted description is given by RFC 4180.
The first line is often a header line,
containing field names, separated by commas.
Each subsequent line records a value for each field,
with the values separated by commas.
By convention, CSV file names include a .csv
extension.
For example, temp.csv
might contain the following data.
name, age, wage
Dave,19,10.50
Jean,21,11.75
Chris,20,7.85
The intended types of values must be inferred. Here one might guess that the name field holds a string, the age field holds an integer, and the wage field holds a floating point number. Many applications perform such inference when reading a CSV file.
Writing CSV Files
Consider the goal of writing multiple equal-length sequences of numerical data
to a plain text file in CSV format,
with one sequence in each column.
Here are a few ways to do that.
Note that NumPy and Polars must be installed before use
(e.g., with pip).
In every case, work with the following data.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
seqs = [xs, ys]
Use Python’s builtin print function, which converts each number to a string. Use its sep option to place a comma between the numbers. Use its file option to print to a file instead of the console.

with open('/temp/temp.txt', mode='w', encoding='utf8') as fout:
    for row in zip(*seqs):
        print(*row, sep=',', file=fout)
Make the string conversion up front and then write to file. The following illustration additionally makes use of the pathlib module.

import pathlib
mypth = pathlib.Path('/temp/temp.txt')
mystr = '\n'.join(','.join(map(str,row)) for row in zip(*seqs))
mypth.write_text(mystr, encoding='utf8')
Use Python’s csv module, and let it handle the conversion. An important detail here is the handling of newlines: the writerows procedure provides them, but unless suppressed, so does open.

import csv
mypth = pathlib.Path('/temp/temp.txt')
with mypth.open(mode='w', encoding='utf8', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerows(zip(*seqs))
Use the savetxt function of NumPy. This illustration uses the fmt option to specify a floating-point format for the output.

import numpy as np
np.savetxt(mypth, np.transpose(seqs), fmt='%f', delimiter=',')
Use the write_csv method of a Polars dataframe. Here we provide only the data, so Polars will automatically generate column headers. (Alternatively, use the schema option to provide headers.)

import polars as pl
df = pl.DataFrame(seqs)
df.write_csv(mypth)
Incremental Writing of CSV Files
In order to write data as a CSV file, one may handle the details or rely on a library. Here is one common approach.
Open the file for writing, write a header line, and close the file.
Repeat the following as needed:
Produce the data for a new record (i.e., row).
Open the CSV file for appending, write the data, and close the file.
For example, suppose we want to write the temp.csv
file
mentioned above, one line at a time.
Given the desired field names, proceed as follows.
This approach uses a with
statement,
which will ensure that the file is properly closed.
(In fact, as a nice bonus, the with statement will
ensure the file is properly closed even if our
attempt to write to the file fails for some reason.)
fields = ["name","age","wage"]
header = ",".join(fields)
with open('/temp/temp.csv', mode='w', encoding='utf8') as fout:
fout.write(header)
The open function opens the file temp.csv for writing
(since the mode argument is w).
Be careful: this completely overwrites any existing temp.csv.
The open function returns a file object,
which can be used to write to the file.
Here, the name is fout.
Create the string to write by joining the field names
into a single string of comma-separated headers.
Use the write method of this file object
to write the header string to the file.
When the with block ends, the file is closed automatically.
Next, each time we receive a new data record (i.e., row),
we need to append it to the same file.
Suppose for example we have record=["Dave",19,10.50].
Using Python’s builtin map function,
we can proceed as follows.
(Note that we must provide the line endings.)
record=["Dave",19,10.50]
with open('/temp/temp.csv', mode='a', encoding='utf8') as fout:
row = ','.join(map(str,record)) #interlace commas
fout.write(f"\n{row}")
Note that our record was initially a list containing
diverse data types. We first convert the fields to strings,
which we can then combine into a single string and write to our file.
Note that we also write an end-of-line (\n) before
each record, since the records of CSV files are on
separate lines.
In sum, it is simple to write data to a CSV file.
If you want, you can open this file in your favorite spreadsheet.
However, we did have to handle stringification ourselves,
and we had to remember to start a newline each time we wrote a record to file.
We could use Python’s csv
module to handle such details for us.
Here is a quick illustration, writing one row at a time.
This example additionally illustrates a nice feature
of the CSV writer.
With the quoting
option set to QUOTE_NONNUMERIC
,
the writer adds quotation marks around non-numeric values.
import csv
data=[["name","age","wage"], ["Dave",19,10.50], ["Jean",21,11.75]]
with open('/temp/temp.csv', mode='w', encoding='utf8', newline='') as fout:
writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
for row in data:
writer.writerow(row)
Reading Columnar CSV Files
Consider the goal of reading multiple equal-length sequences of numerical data from a columnar CSV file, with one sequence in each column. Here are a few ways to do that.
Use the reader function from the csv module. This produces an iterator over the lines of the file, returning each line as a list of strings. As a nice feature, if your non-numeric data is quoted, one may optionally assign QUOTE_NONNUMERIC to the quoting option of the CSV reader. This coerces unquoted fields to floating point values.

import csv
with open('/temp/temp.csv', mode='r', encoding='utf8') as fin:
    reader = csv.reader(fin, delimiter=",", quoting=csv.QUOTE_NONNUMERIC)
    for line in reader:
        print(*map(type, line))
Use the DictReader class from the csv module. This produces an iterator over the lines of the file, returning each line as a dict, where the field names are determined from the header. (Alternatively, field names may be specified with the fieldnames parameter.)

import csv
with open('/temp/temp.csv', mode='r', encoding='utf8') as fin:
    reader = csv.DictReader(fin, delimiter=",", quoting=csv.QUOTE_NONNUMERIC)
    for record in reader: #a record is a dict
        print(record)
If the data is entirely numeric (or if we select only the numeric columns with the usecols option), NumPy’s loadtxt function can import it as an array of floating-point numbers.

import numpy as np
a2d = np.loadtxt('/temp/temp.csv', skiprows=1, delimiter=',', usecols=(1,2))
Probably the best option for reading a columnar CSV file is the read_csv function of Polars. It cleverly provides automatic type inference. (It also accepts help with that inference.) The result is a Polars dataframe, a particularly useful datatype for data science.

import polars as pl
df = pl.read_csv('/temp/temp.csv')