The tsvio package provides a fast, simple, read-only interface for accessing potentially very large (many gigabytes) data matrices, or subsets of them, stored in plain text files.
Data files must be plain text files in which each line is divided into logical columns (fields) by tab characters.
The first line must contain a unique label for each data column. It may contain one fewer field than the remaining lines; such files are often produced by R. Alternatively, it may contain the same number of fields as the remaining lines, in which case its first field is ignored; such files are often produced by tools other than R.
Every line (row) after the first must contain the same number of fields. The first field of each line must be a unique row label. (Row and column labels are treated separately and can have labels in common.)
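The two accepted header layouts can be reproduced with base R alone. The sketch below is illustrative only: the matrix, its labels, and the temporary file names are made up for the example. It writes the same matrix once with the default header (one fewer field on the first line) and once with col.names = NA (an empty leading field on the first line), then counts the fields per line.

```r
# Small example matrix with unique row and column labels (made-up data).
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

r_style   <- tempfile(fileext = ".tsv")
full_head <- tempfile(fileext = ".tsv")

# Default header: the first line has one fewer field than the data lines.
write.table(m, r_style, sep = "\t", quote = FALSE)

# col.names = NA writes an empty leading field in the header, so every
# line has the same number of fields.
write.table(m, full_head, sep = "\t", quote = FALSE, col.names = NA)

# Count the tab-separated fields on each line of both files.
fields_r    <- lengths(strsplit(readLines(r_style), "\t", fixed = TRUE))
fields_full <- lengths(strsplit(readLines(full_head), "\t", fixed = TRUE))
```

Here fields_r holds the per-line field counts for the R-style file and fields_full those for the file with a full-width header.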
Before data can be read from a data file, an index file containing the starting position of the data line for each row label must be generated.
The index file can be generated explicitly by calling tsvGenIndex.
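A call might look like the following sketch. It assumes that tsvGenIndex takes the data file path followed by the index file path, in that order; the file names and the example matrix are hypothetical, and the package reference manual should be checked for the exact signature.

```r
library(tsvio)

# Hypothetical example data; any file in the format described above works.
datafile  <- tempfile(fileext = ".tsv")
indexfile <- paste0(datafile, ".idx")   # the index file name is our choice

m <- matrix(1:6, nrow = 2,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))
write.table(m, datafile, sep = "\t", quote = FALSE)

# Assumed argument order: data file first, then the index file to create.
tsvGenIndex(datafile, indexfile)

index_created <- file.exists(indexfile)
```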
tsvio assumes that the data file is static and does not change during an R session. Hence, an index file, once created, does not change during an R session either.
The index file must be regenerated by the user whenever the data file changes. The tsvio package cannot detect that the data file has changed. Using an outdated index file can result in erroneous results or a run-time error.
The data access functions described below can generate the index file automatically on first access. Depending on file permissions, this may allow the user to simply remove the index file whenever the data file is modified. A new index file will be generated on the next access (which will thus be slower than normal).
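One way to automate the removal is to compare file modification times before accessing the data. The helper below is a hypothetical sketch in base R, not part of tsvio: it deletes the index file when the data file is newer, so the next access regenerates it. Modification times are only a heuristic (coarse timestamps or copied files can defeat it), so it does not replace regenerating the index deliberately after a known change.

```r
# Hypothetical helper (not part of tsvio): remove the index file if the
# data file has a later modification time. Returns TRUE if it was removed.
drop_stale_index <- function(datafile, indexfile) {
  if (!file.exists(indexfile)) return(FALSE)
  # isTRUE() guards against NA when a modification time is unavailable.
  stale <- isTRUE(file.mtime(datafile) > file.mtime(indexfile))
  if (stale) file.remove(indexfile)
  stale
}
```

Calling drop_stale_index(datafile, indexfile) just before a data access keeps the explicit regeneration step out of user code, at the cost of one slower access after each change.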
The function tsvGetData is used to read data as a matrix.
The rowpatterns parameter is either NULL or a vector of row labels. If NULL, data from all lines in the file is returned. Otherwise, only data from rows matching an entry in rowpatterns is returned. Only exact matches are supported.
Similarly, colpatterns specifies which columns to return data for. Thus, the entire data matrix can be returned by specifying NULL for both rowpatterns and colpatterns.
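The following sketch shows both kinds of call. It assumes the positional argument order data file, index file, rowpatterns, colpatterns, dtype; the file names and the example matrix are hypothetical. The value 0 passed as dtype requests a numeric result.

```r
library(tsvio)

# Hypothetical example data.
datafile  <- tempfile(fileext = ".tsv")
indexfile <- paste0(datafile, ".idx")
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))
write.table(m, datafile, sep = "\t", quote = FALSE)
tsvGenIndex(datafile, indexfile)

# Entire matrix: NULL for both rowpatterns and colpatterns.
all_data <- tsvGetData(datafile, indexfile, NULL, NULL, 0)

# One row and two columns, selected by exact label.
part <- tsvGetData(datafile, indexfile, "geneB", c("s1", "s3"), 0)
```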
The return value is always a data matrix with two dimensions. If rowpatterns or colpatterns is a single element, the corresponding axis of the returned matrix is not ‘dropped’. The standard R function drop can be used to delete any dimensions of length one if desired.
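The effect of drop can be seen without reading any file. In this base R sketch, a hand-built 1 x 3 matrix stands in for a single-row result from tsvGetData; the labels and values are made up.

```r
# A 1 x 3 matrix standing in for a single-row result (made-up values).
one_row <- matrix(c(1, 3, 5), nrow = 1,
                  dimnames = list("geneA", c("s1", "s2", "s3")))

dims_before <- dim(one_row)   # still two dimensions

# drop() removes the length-one dimension, leaving a named vector.
v <- drop(one_row)
dims_after <- dim(v)          # NULL: v is a plain vector
```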
By default, if rowpatterns or colpatterns is not NULL, any specified labels not in the data file will be silently ignored and not included in the result. However, if there are no matching rows or no matching columns, tsvGetData will throw an error.
Setting the optional parameter findany to FALSE will cause tsvGetData to throw an error if any specified label is not in the data file.
Rows and columns in the returned matrix will occur in the same order as they appear in rowpatterns and colpatterns respectively. Duplicate entries in rowpatterns or colpatterns will never match any label (and always result in an error if findany is FALSE).
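The ordering and findany behaviour described above can be exercised as in the sketch below. It assumes the positional argument order data file, index file, rowpatterns, colpatterns, dtype, with findany passed by name; the file names, the example matrix, and the label "geneX" (deliberately absent from the file) are hypothetical.

```r
library(tsvio)

# Hypothetical example data.
datafile  <- tempfile(fileext = ".tsv")
indexfile <- paste0(datafile, ".idx")
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))
write.table(m, datafile, sep = "\t", quote = FALSE)
tsvGenIndex(datafile, indexfile)

# Rows are returned in the requested order, not in file order.
rev_rows <- tsvGetData(datafile, indexfile, c("geneB", "geneA"), NULL, 0)

# Default findany = TRUE: the unknown label "geneX" is silently skipped.
lenient <- tsvGetData(datafile, indexfile, c("geneA", "geneX"), NULL, 0)

# findany = FALSE: the unknown label is an error, caught here as a message.
strict <- tryCatch(
  tsvGetData(datafile, indexfile, c("geneA", "geneX"), NULL, 0,
             findany = FALSE),
  error = function(e) conditionMessage(e)
)
```

Wrapping the strict call in tryCatch lets calling code decide whether a missing label should stop the analysis or merely be reported.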
The returned matrix will have the same mode as the dtype parameter, which can be a string, a numeric, or an integer. Only the type of dtype matters; its value is ignored. Returning a numeric or integer matrix can be much faster than returning a character matrix and then converting it. However, this requires all data elements in the data file to conform to that type; otherwise tsvGetData will throw an error.
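As a sketch of how dtype selects the result type, the calls below pass a numeric, an integer, and a string in turn. The positional argument order (data file, index file, rowpatterns, colpatterns, dtype) is an assumption, and the file names and example matrix are hypothetical.

```r
library(tsvio)

# Hypothetical example data containing only whole numbers.
datafile  <- tempfile(fileext = ".tsv")
indexfile <- paste0(datafile, ".idx")
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))
write.table(m, datafile, sep = "\t", quote = FALSE)
tsvGenIndex(datafile, indexfile)

# Only the type of the last argument matters, not its value.
as_num <- tsvGetData(datafile, indexfile, NULL, NULL, 0)    # numeric
as_int <- tsvGetData(datafile, indexfile, NULL, NULL, 0L)   # integer
as_chr <- tsvGetData(datafile, indexfile, NULL, NULL, "")   # character
```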
The function tsvGetLines returns a subset of the lines in the data file as a string vector.
The string vector returned by tsvGetLines consists of the entire first line in the data file, followed by the entirety of every line whose row label occurs in patterns. Unlike with tsvGetData, patterns cannot be NULL, and matching lines are ordered by their position in the data file, not by the order of their labels in patterns. If findany is TRUE, labels in patterns that do not occur in the file are ignored, but an error is thrown if no labels match. If findany is FALSE, an error is thrown if any label in patterns has no matching row.
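A minimal sketch of a tsvGetLines call follows, assuming the positional argument order data file, index file, patterns. The file names and the example matrix are hypothetical. The labels are deliberately requested in reverse order to show that the result follows file order.

```r
library(tsvio)

# Hypothetical example data.
datafile  <- tempfile(fileext = ".tsv")
indexfile <- paste0(datafile, ".idx")
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))
write.table(m, datafile, sep = "\t", quote = FALSE)
tsvGenIndex(datafile, indexfile)

# Header line first, then matching lines in the order they occur in the
# file, regardless of the order of the requested labels.
txt <- tsvGetLines(datafile, indexfile, c("geneB", "geneA"))
```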