Introduction
magmaR, magma, and the UCSF Data Library
This vignette focuses on how to explore, query, and retrieve data from magma via its R-client, magmaR.
Magma is the data warehouse of the UCSF Data Library.
The Data Library holds various research data sets, broadly broken up into “projects”, and provides tools for adding to, organizing, viewing and analyzing these data sets.
Internally, the system is composed of a set of applications that each provides a different piece of the Data Library pie. Through the Magma application, one can query and retrieve data from, or update data within, “projects” that exist in the library.
We provide some more detail below, but for an even deeper overview of the structure of magma and the Data Library system than is provided here, you can refer to the main source of documentation, https://mountetna.github.io/magma.html.
Organization of data within magma
Data types within magma projects are organized into models, and individual data then make up the records of those models.
For example, information & data for 3 tubes run on a flow cytometer might make up 3 individual records of a flow cytometry model
Each record might have multiple attributes, such as the “gene_counts” matrix, the “cell_number”, or sorted cell “fraction” attributes of records that are part of an “rna_seq” model.
The set of attributes which a record might possess, are defined separately for each model. Thus, records of a “flow” model might have an “fcs_file” attribute, but records of the “rna_seq” model likely would not.
Hierarchically, the root of a project is always the “project” model, and every other model must have a single parent
model. Thus, the data graph is like a tree.
(Technically, link-type attributes may be used to indicate additional one-to-one or one-to-many relationships between models other than the tree-like parent <- model <- “children” relationships, which allows the graph to be more like a directed acyclic graph (DAG) than a tree… but imagining projects as trees is certainly easier than as an abstract blob.)
Here is a sketch of what an example project might look like. Quite literally though, it is in fact the layout of the “example” project which we will be playing with later on in this vignette:
example_project_map
This “example” project has 6 different models, including the project model itself. Each model holds different chunks of information (attributes) about data in the “example” project. For example, the subject
model contains information about individuals (records) for whom biospecimens exist; the biospecimen
model would then contain information about specific specimens obtained from each subject; and the flow
and rna_seq
models contain data and information from individual flow cytometry or rna_seq assays that were run on an individual biospecimen.
Each project in the data library system might have its own distinct modeling layout – as where to split up the information sharing scheme is highly dependent on a project’s data collection and experimental plans. However, in general, one can think of records of a model at the bottom of the tree as inheriting attributes from the parent records of their parent models. So for example, although we can also think of each model as it’s own independent set of data, “rna_seq”-model records are ultimately linked to individual “subject”-model records. Thus, even though attributes of the “subject”-model are not directly included in the “rna_seq”-model, all “subject”-model attributes of “subject”-model records do apply to linked “rna_seq”-model records. In magmaR, we include a function retrieveMetadata() for retrieving such linked data. More on that later.
At this point, you should know enough about the structure of magma projects to start using magmaR. But more information exists within magma’s own documentation: https://mountetna.github.io/magma.html.
How magmaR functions work
In general, magmaR functions will:
- Take in inputs from the user.
- Make a curl request that calls on a magma function to either send or receive desired data.
- Restructure the received data, typically minimally, to be more accessible for downstream analyses.
- Return the output.
Data Restructuring Details
The goal of magmaR is to allow users as direct as possible, yet also as ready-to-analyze as possible, access to data that exists within magma. Thus, some minor restructuring is performed by magmaR functions which does not change the underlying data, but does reorganize that data into more efficient formats for downstream analysis within R.
The two main output structures of magma returns
There are two main output structures for returns from magma:
- Tab Separated Value (tsv) tables
- JavaScript Object Notation (json) objects
Both formats are received as character strings, but then:
- tsv format returns are converted to data.frames.
- json format returns are converted to a nested lists.
The data.frame format tends to be easier to work with, but both of these can be fit quite readily into downstream applications.
Authorization process, a.k.a. janus token utilization:
In order to access data in magma, a user needs to be authorized to do so. How this is achieved is via provision of a user-specific, temporary, string which we call a token. This token can be obtained from https://janus.ucsf.edu/.
Providing a token
Within magmaR, the token is provided as part of the target
input which can be constructed with the magmaRset()
function.
To this function, a user’s token can be provided in one of two ways.
1) Via an interactive prompt (Recommended when coding interactively):
When not provided explicitly, as is the other method, the user will be prompted to provide their token via the interactive console. It is recommended that you store the output of your call to magmaRset()
as a variable, and then provide this variable within each subsequent call to a magmaR function, as below.
# Method1: User will be prompted to give their token in the R console
prod <- magmaRset()
ids_subject <- retrieveProjects(
# Now, we give the output of magmaRset() to the 'target' input of any
# other magmaR function.
target = prod)
If you run the above code, you should be prompted to
Enter your Janus TOKEN (without quotes):
To fill this in, navigate to Janus via your favorite browser, click the Copy Token
button, then paste the value into your console.
2) Give your token explicitly
Users can alternatively fill their token in by providing it explicitly to the token
input of magmaRset()
. This is not the generally recommended method because it is not ideal to have authorization values saved within potentially share-able locations. However the tokens are short-lived and methods of mitigating risk of such token exposure exist, see below.
prod <- magmaRset(token = "<your-token-here>")
ids_subject <- retrieveProjects(
# Now, we give the output of magmaRset() to the 'target' input of any
# other magmaR function.
target = prod)
NOTE: Instead of adding your token directly to any file which you might save, it is recommended that you utilize your .Renviron
file to store your token. To do so, you can:
- Utilize the convenient
usethis::edit_r_environ()
function to open your .Renviron
file. (Install usethis
with install.packages("usethis")
first.)
- Then add this line to the opened file:
TOKEN="<your_token>"
.
- Save the file & restart your R session.
- Now you can provide
magmaRset(token = Sys.getenv("TOKEN"))
, but when you save your script or .Rmd, the token itself will not be included.
- Repeat these steps whenever your token refreshes. Tokens normally refresh every 24 hours, but you’ll know when this happens because you will get the error message below.
When magma thinks you are unauthorized
If a request to magma returns that “You are unauthorized”, magmaR will provide extra info so that users can fix this issue:
# Error message when magma sends back that user is unauthorized:
You are unauthorized. If you think this is a mistake, re-run `?magmaRset` to update your 'token' input, then retry.
Helper functions
These functions allow exploration of what data exists within a given project.
Although it is possible to rely on timur.ucsf.edu/<projectName>/map, or on Timur’s search functionality, in order to determine options for projectName
, modelName
, recordNames
or attributeName(s)
inputs, magmaR provides these helper functions to allow users to achieve these goals without leaving R.
# projectName options:
retrieveProjects(
target = prod)
## project_name project_name_full role privileged
## 1 mvir1 COMET editor FALSE
## 2 ipi Immunoprofiler Initiative editor FALSE
## 3 dscolab Data Science CoLab editor FALSE
## 4 coprojects_template Coprojects Template administrator FALSE
## 5 example Example project administrator FALSE
## 6 xcrs1 Human-Mouse Cancer Translator administrator FALSE
# modelName options:
retrieveModels(
target = prod,
projectName = "example")
## [1] "subject" "rna_seq" "population" "project" "flow"
## [6] "biospecimen"
# recordNames options:
retrieveIds(
target = prod,
projectName = "example",
modelName = "subject")
## [1] "EXAMPLE-HS1" "EXAMPLE-HS10" "EXAMPLE-HS11" "EXAMPLE-HS12" "EXAMPLE-HS2"
## [6] "EXAMPLE-HS3" "EXAMPLE-HS4" "EXAMPLE-HS5" "EXAMPLE-HS6" "EXAMPLE-HS7"
## [11] "EXAMPLE-HS8" "EXAMPLE-HS9"
# attributeName(s) options:
retrieveAttributes(
target = prod,
projectName = "example",
modelName = "subject")
## [1] "name" "project" "biospecimen" "group"
For more complex needs like a complicated query()
request, you might require accessing the project’s template itself. That can be achieved via the retrieveTemplate()
function:
# To retrieve the project template:
temp <- retrieveTemplate(
target = prod,
projectName = "example")
To explore the return, I recommend starting with the str()
function looking
only a few levels in. You should see something like this:
str(temp, max.level = 3)
## List of 1
## $ models:List of 6
## ..$ subject :List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 4
## ..$ rna_seq :List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 4
## ..$ population :List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 4
## ..$ project :List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 3
## ..$ flow :List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 4
## ..$ biospecimen:List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 4
Then, followup by looking into the $template
of individual models further, perhaps as below:
# For the "subject" model:
str(temp$models$subject$template)
## List of 4
## $ name : chr "subject"
## $ attributes:List of 6
## ..$ created_at :List of 8
## .. ..$ name : chr "created_at"
## .. ..$ attribute_name: chr "created_at"
## .. ..$ display_name : chr "Created At"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi TRUE
## .. ..$ validation : NULL
## .. ..$ attribute_type: chr "date_time"
## ..$ updated_at :List of 8
## .. ..$ name : chr "updated_at"
## .. ..$ attribute_name: chr "updated_at"
## .. ..$ display_name : chr "Updated At"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi TRUE
## .. ..$ validation : NULL
## .. ..$ attribute_type: chr "date_time"
## ..$ name :List of 8
## .. ..$ name : chr "name"
## .. ..$ attribute_name: chr "name"
## .. ..$ display_name : chr "Name"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi FALSE
## .. ..$ validation : NULL
## .. ..$ attribute_type: chr "identifier"
## ..$ project :List of 10
## .. ..$ name : chr "project"
## .. ..$ attribute_name : chr "project"
## .. ..$ model_name : chr "project"
## .. ..$ link_model_name: chr "project"
## .. ..$ display_name : chr "Project"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi FALSE
## .. ..$ validation : NULL
## .. ..$ attribute_type : chr "parent"
## ..$ biospecimen:List of 10
## .. ..$ name : chr "biospecimen"
## .. ..$ attribute_name : chr "biospecimen"
## .. ..$ model_name : chr "biospecimen"
## .. ..$ link_model_name: chr "biospecimen"
## .. ..$ display_name : chr "Biospecimen"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi FALSE
## .. ..$ validation : NULL
## .. ..$ attribute_type : chr "collection"
## ..$ group :List of 8
## .. ..$ name : chr "group"
## .. ..$ attribute_name: chr "group"
## .. ..$ display_name : chr "Group"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi FALSE
## .. ..$ validation : NULL
## .. ..$ attribute_type: chr "string"
## $ identifier: chr "name"
## $ parent : chr "project"
Main data download functions:
Finally, the meat of why we’re here.
magma has two main data output functions, /retrieve
and /query
.
magmaR provides methods for both.
retrieve() & retrieveJSON()
retrieve()
is probably the main workhorse function of magmaR
. If your goal is to download “subject” data for a specific patient of a project, or for all patients of the project, this is the function to start with.
The basic structure is to provide which project, projectName
and which model, modelName
, that you want data for.
df <- retrieve(
target = prod,
projectName = "example",
modelName = "subject")
head(df)
## name project biospecimen group
## 1 EXAMPLE-HS1 example EXAMPLE-HS1-WB1 g1
## 2 EXAMPLE-HS10 example EXAMPLE-HS10-WB1 g3
## 3 EXAMPLE-HS11 example EXAMPLE-HS11-WB1 g3
## 4 EXAMPLE-HS12 example EXAMPLE-HS12-WB1 g3
## 5 EXAMPLE-HS2 example EXAMPLE-HS2-WB1 g1
## 6 EXAMPLE-HS3 example EXAMPLE-HS3-WB1 g1
Optionally, a set of recordNames
or attributeNames
can be given as well to grab a more specific subset of data from the given project-model pair.
df <- retrieve(
target = prod,
projectName = "example",
modelName = "subject",
recordNames = c("EXAMPLE-HS1", "EXAMPLE-HS2"),
attributeNames = "group")
head(df)
## name group
## 1 EXAMPLE-HS1 g1
## 2 EXAMPLE-HS2 g1
(You can use the retrieveIDs()
and retrieveAttributes()
functions described above in the Helper functions section to determine options for the recordNames
and attributeNames
inputs, respectively.)
Unfortunately, for certain attribute data types, matrix
and table
, the literal data are not actually given via magma/retrieve when format = "tsv"
. Instead only a pointer is returned. For such attributes, the retrieveJSON()
function can retrieve such data (via a magma/retrieve call with format = "json"
) and a wrapper that makes efficient use of retrieveJSON()
specifically for matrix data retrieval is also included. Users should not typically need to make use of retrieveJSON()
directly, as when the desired data is a matrix, retrieveMatrix()
is recommended instead. More details on that function follow.
json <- retrieveJSON(
target = prod,
projectName = "example",
modelName = "rna_seq",
recordNames = c("EXAMPLE-HS1-WB1-RSQ1", "EXAMPLE-HS2-WB1-RSQ1"),
attributeNames = "gene_counts")
retrieveMatrix()
Because matrices are a very common and important data structure, but are not accessible via retrieve()
, we provide this function. For a single matrix-type attribute, it will obtain data from magma in the required json structure, and then automatically reorganize said data into the matrix structure that a user would typically expect.
In the example below, we obtain the transcripts-per-million(-reads) normalized counts data for all records/samples of the example project. In this matrix, columns will be the individual records, and rows will be features. Specifically, for the example data here, those row names are “gene1”, “gene2”, and so on, but for real rna_seq data, those row names would typically be the Ensembl gene ids that each row of the matrix represents.
mat <- retrieveMatrix(
target = prod,
projectName = "example",
modelName = "rna_seq",
recordNames = "all",
attributeNames = "gene_tpm")
head(mat, n = c(6,3))
## EXAMPLE-HS10-WB1-RSQ1 EXAMPLE-HS11-WB1-RSQ1 EXAMPLE-HS12-WB1-RSQ1
## gene1 0.5187 7.9960 4.8278
## gene2 29.9572 42.9785 31.2540
## gene3 111.6587 154.9225 114.0897
## gene4 269.3555 426.7866 302.6299
## gene5 0.3891 1.9990 0.0000
## gene6 0.0000 0.0000 0.0000
Most user need not worry about the internal method, but for those that are curious: Under the hood, data is grabbed via retrieveJSON()
for 10 records at a time. The relevant data are then extracted from the complex list output of this retrieval route, then they are converted into a matrix structure where column names are the recordNames
. Row names are then grabbed from the model’s template for what this data should represent.
query()
The Magma Query API lets you pull data out of Magma through an expressive query interface. Often, if you want a specific set of data from model-X, but only, say, for records where linked records of model-Y have data for attribute-Z, then this is the endpoint you want.
But note: the format of query()
calls can be a bit complicated, so it is recommended to check if retreiveMetadata()
might better serve your purposes first. We’ll describe that function a bit later.
For guidance on how to format query()
calls, see ?query
and https://mountetna.github.io/magma.html#query.
query_out <- query(
target = prod,
projectName = "example",
queryTerms =
list('rna_seq',
'::all',
'biospecimen',
'::identifier')
)
Details: The default output of this function is a list conversion of the direct json output returned by magma/query. This list will contain either 2 or 3 parts:
names(query_out)
## [1] "answer" "type" "format"
answer, type (optional), and format.
Alternatively, the output can be reformatted as a dataframe if format = "df"
is given.
subject_ids_of_rnaseq_records <- query(
target = prod,
projectName = "example",
queryTerms =
list('rna_seq',
'::all',
'biospecimen',
'::identifier'),
format = "df"
)
head(subject_ids_of_rnaseq_records)
## example::rna_seq#tube_name example::biospecimen#name
## 1 EXAMPLE-HS10-WB1-RSQ1 EXAMPLE-HS10-WB1
## 2 EXAMPLE-HS11-WB1-RSQ1 EXAMPLE-HS11-WB1
## 3 EXAMPLE-HS12-WB1-RSQ1 EXAMPLE-HS12-WB1
## 4 EXAMPLE-HS1-WB1-RSQ1 EXAMPLE-HS1-WB1
## 5 EXAMPLE-HS2-WB1-RSQ1 EXAMPLE-HS2-WB1
## 6 EXAMPLE-HS3-WB1-RSQ1 EXAMPLE-HS3-WB1
Details: When format = "df"
is added, the list output will be converted to a data.frame where data comes from the answer
and column names come from the format
pieces.