rentrez
The NCBI shares a lot of data. At the time this document was compiled, there were 31.7 million papers in PubMed, including 6.6 million full-text records available in PubMed Central. The NCBI Nucleotide Database (which includes GenBank) has data for 432 million different sequences, and dbSNP describes 702 million different genetic variants. All of these records can be cross-referenced with the 1.86 million species in the NCBI taxonomy or 27 thousand disease-associated records in OMIM.
The NCBI makes this data available through a web interface, an FTP server, and a REST API called the Entrez Utilities (EUtils for short). This package provides functions to use that API, allowing users to gather and combine data from multiple NCBI databases in the comfort of an R session or script.
To make the most of all the data the NCBI shares you need to know a little about their databases, the records they contain and the ways you can find those records. The NCBI provides extensive documentation for each of their databases and for the EUtils API that rentrez
takes advantage of. There are also some helper functions in rentrez
that help users learn their way around the NCBI's databases.
First, you can use entrez_dbs()
to find the list of available databases:
entrez_dbs()
## [1] "pubmed" "protein" "nuccore" "ipg"
## [5] "nucleotide" "structure" "sparcle" "protfam"
## [9] "genome" "annotinfo" "assembly" "bioproject"
## [13] "biosample" "blastdbinfo" "books" "cdd"
## [17] "clinvar" "gap" "gapplus" "grasp"
## [21] "dbvar" "gene" "gds" "geoprofiles"
## [25] "homologene" "medgen" "mesh" "ncbisearch"
## [29] "nlmcatalog" "omim" "orgtrack" "pmc"
## [33] "popset" "proteinclusters" "pcassay" "biosystems"
## [37] "pccompound" "pcsubstance" "seqannot" "snp"
## [41] "sra" "taxonomy" "biocollections" "gtr"
There is a set of functions with names starting entrez_db_
that can be used to gather more information about each of these databases:
Functions that help you learn about NCBI databases
Function name | Return |
---|---|
entrez_db_summary() | Brief description of what the database is |
entrez_db_searchable() | Set of search terms that can be used with this database |
entrez_db_links() | Set of databases that might contain linked records |
For instance, we can get a description of the somewhat cryptically named database 'cdd'...
entrez_db_summary("cdd")
## DbName: cdd
## MenuName: Conserved Domains
## Description: Conserved Domain Database
## DbBuild: Build200915-1834.1
## Count: 59951
## LastUpdate: 2020/09/15 23:43
... or find out which search terms can be used with the Sequence Read Archive (SRA) database (which contains raw data from sequencing projects):
entrez_db_searchable("sra")
## Searchable fields for database 'sra'
## ALL All terms from all searchable fields
## UID Unique number assigned to publication
## FILT Limits the records
## ACCN Accession number of sequence
## TITL Words in definition line
## PROP Classification by source qualifiers and molecule type
## WORD Free text associated with record
## ORGN Scientific and common names of organism, and all higher levels of taxonomy
## AUTH Author(s) of publication
## PDAT Date sequence added to GenBank
## MDAT Date of last update
## GPRJ BioProject
## BSPL BioSample
## PLAT Platform
## STRA Strategy
## SRC Source
## SEL Selection
## LAY Layout
## RLEN Percent of aligned reads
## ACS Access is public or controlled
## ALN Percent of aligned reads
## MBS Size in megabases
Just how these 'helper' functions might be useful will become clearer once you've started using rentrez
, so let's get started.
entrez_search()
Very often, the first thing you'll want to do with rentrez
is search a given NCBI database to find records that match some keywords. You can do this using the function entrez_search()
. In the simplest case you just need to provide a database name (db
) and a search term (term
) so let's search PubMed for articles about the R language
:
r_search <- entrez_search(db="pubmed", term="R Language")
The object returned by a search acts like a list, and you can get a summary of its contents by printing it.
r_search
## Entrez search result with 13887 hits (object contains 20 IDs and no web_history object)
## Search term (as translated): R[All Fields] AND ("programming languages"[MeSH Te ...
There are a few things to note here. First, the NCBI's server has worked out that we meant R as a programming language, and so included the 'MeSH' term associated with programming languages. We'll worry about MeSH terms and other special queries later; for now, just note that you can use this feature to check that your search term was interpreted in the way you intended. Second, there are many more 'hits' for this search than there are unique IDs contained in this object. That's because the optional argument retmax
, which controls the maximum number of returned values, has a default value of 20.
The IDs are the most important thing returned here. They allow us to fetch records matching those IDs, gather summary data about them or find cross-referenced records in other databases. We access the IDs as a vector using the $
operator:
r_search$ids
## [1] "33162761" "33159508" "33159265" "33157941" "33156672" "33152986"
## [7] "33151654" "33149734" "33148210" "33146627" "33144470" "33140307"
## [13] "33138840" "33137091" "33134808" "33132805" "33130251" "33129117"
## [19] "33126397" "33124328"
If we want to get more than 20 IDs we can do so by increasing the retmax
argument.
another_r_search <- entrez_search(db="pubmed", term="R Language", retmax=40)
another_r_search
## Entrez search result with 13887 hits (object contains 40 IDs and no web_history object)
## Search term (as translated): R[All Fields] AND ("programming languages"[MeSH Te ...
If we want to get IDs for all of the thousands of records that match this search, we can use the NCBI's web history feature described below.
The EUtils API uses a special syntax to build search terms. You can search a database against a specific term using the format query[SEARCH FIELD]
, and combine multiple such searches using the boolean operators AND
, OR
and NOT
.
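Since these search terms are just character strings, you can also assemble them programmatically before handing them to entrez_search(). A minimal sketch (make_term() is a made-up helper, not part of rentrez; the [ORGN] and [PDAT] fields are the real EUtils fields used in the examples below):

```r
# Hypothetical helper: build an EUtils term combining an organism
# field with a publication-date range, joined with AND
make_term <- function(organism, from, to) {
  paste0(organism, "[ORGN] AND ", from, ":", to, "[PDAT]")
}

make_term("Tetrahymena thermophila", 2013, 2015)
# "Tetrahymena thermophila[ORGN] AND 2013:2015[PDAT]"
```

The resulting string can be passed directly as the term argument of entrez_search(), exactly like the hand-written terms below.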
For instance, we can find next generation sequence datasets for the (amazing...) ciliate Tetrahymena thermophila by using the organism ('ORGN') search field:
entrez_search(db="sra",
term="Tetrahymena thermophila[ORGN]",
retmax=0)
## Entrez search result with 896 hits (object contains 0 IDs and no web_history object)
## Search term (as translated): "Tetrahymena thermophila"[Organism]
We can narrow our focus to only those records that have been added recently (using the colon to specify a range of values):
entrez_search(db="sra",
term="Tetrahymena thermophila[ORGN] AND 2013:2015[PDAT]",
retmax=0)
## Entrez search result with 75 hits (object contains 0 IDs and no web_history object)
## Search term (as translated): "Tetrahymena thermophila"[Organism] AND 2013[PDAT] ...
Or include recent records for either T. thermophila or its close relative T. borealis (using parentheses to make ANDs and ORs explicit).
entrez_search(db="sra",
term="(Tetrahymena thermophila[ORGN] OR Tetrahymena borealis[ORGN]) AND 2013:2015[PDAT]",
retmax=0)
## Entrez search result with 75 hits (object contains 0 IDs and no web_history object)
## Search term (as translated): ("Tetrahymena thermophila"[Organism] OR "Tetrahyme ...
The set of search terms available varies between databases. You can get a list of available terms for any given database with entrez_db_searchable()
entrez_db_searchable("sra")
## Searchable fields for database 'sra'
## ALL All terms from all searchable fields
## UID Unique number assigned to publication
## FILT Limits the records
## ACCN Accession number of sequence
## TITL Words in definition line
## PROP Classification by source qualifiers and molecule type
## WORD Free text associated with record
## ORGN Scientific and common names of organism, and all higher levels of taxonomy
## AUTH Author(s) of publication
## PDAT Date sequence added to GenBank
## MDAT Date of last update
## GPRJ BioProject
## BSPL BioSample
## PLAT Platform
## STRA Strategy
## SRC Source
## SEL Selection
## LAY Layout
## RLEN Percent of aligned reads
## ACS Access is public or controlled
## ALN Percent of aligned reads
## MBS Size in megabases
"Filter" is a special field that, as the name suggests, allows you to limit the records returned by a search to a set of filtering criteria. There is no programmatic way to find the particular terms that can be used with the Filter field. However, the NCBI's website provides an "advanced search" tool for some databases that can be used to discover these terms.
For example, you can find all of the terms that can be used to filter searches of the nucleotide database by using the advanced search page for that database. On that page, selecting "Filter" from the first drop-down box then clicking "Show index list" will allow you to scroll through the possible filtering terms.
In addition to the search terms described above, the NCBI allows searches using Medical Subject Heading (MeSH) terms. These terms create a 'controlled vocabulary', and allow users to make very finely controlled queries of databases.
For instance, if you were interested in reviewing studies on how a class of anti-malarial drugs called Folic Acid Antagonists work against Plasmodium vivax (a particular species of malarial parasite), you could use this search:
entrez_search(db = "pubmed",
term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
## Entrez search result with 16 hits (object contains 16 IDs and no web_history object)
## Search term (as translated): "malaria, vivax"[MeSH Terms] AND "folic acid antag ...
The complete set of MeSH terms is available as a database from the NCBI. That means it is possible to download detailed information about each term and find the ways in which terms relate to each other using rentrez
. You can search for specific terms with entrez_search(db="mesh", term =...)
and learn about the results of your search using the tools described below.
As you can see above, the object returned by entrez_search()
includes the number of records matching a given search. This means you can learn a little about the composition of, or trends in, the records stored in the NCBI's databases using only the search utility. For instance, let's track the rise of the scientific buzzword "connectome" in PubMed, programmatically creating search terms for the PDAT
field:
search_year <- function(year, term){
query <- paste(term, "AND (", year, "[PDAT])")
entrez_search(db="pubmed", term=query, retmax=0)$count
}
year <- 2008:2014
papers <- sapply(year, search_year, term="Connectome", USE.NAMES=FALSE)
plot(year, papers, type='b', main="The Rise of the Connectome")
entrez_link()
One of the strengths of the NCBI databases is the degree to which records of one type are connected to other records within the NCBI or to external data sources. The function entrez_link()
allows users to discover these links between records.
To get an idea of the degree to which records in the NCBI are cross-linked we can find all NCBI data associated with a single gene (in this case the Amyloid Beta Precursor gene, the product of which is associated with the plaques that form in the brains of Alzheimer's Disease patients).
The function entrez_link()
can be used to find cross-referenced records. In the most basic case we need to provide an ID (id
), the database from which this ID comes (dbfrom
) and the name of a database in which to find linked records (db
). If we set this last argument to 'all' we can find links in multiple databases:
all_the_links <- entrez_link(dbfrom='gene', id=351, db='all')
all_the_links
## elink object with contents:
## $links: IDs for linked records from NCBI
##
Just as with entrez_search(), the returned object behaves like a list, and we can learn a little about its contents by printing it. In this case, all of the information is in links
(and there are a lot of them!):
all_the_links$links
## elink result with information from 57 databases:
## [1] gene_bioconcepts gene_biosystems
## [3] gene_biosystems_all gene_clinvar
## [5] gene_clinvar_specific gene_dbvar
## [7] gene_genome gene_gtr
## [9] gene_homologene gene_medgen_diseases
## [11] gene_pcassay_alltarget_list gene_pcassay_alltarget_summary
## [13] gene_pcassay_rnai gene_pcassay_rnai_active
## [15] gene_pcassay_target gene_probe
## [17] gene_snp gene_structure
## [19] gene_bioproject gene_cdd
## [21] gene_gene_h3k4me3 gene_gene_neighbors
## [23] gene_genereviews gene_genome2
## [25] gene_geoprofiles gene_nuccore
## [27] gene_nuccore_mgc gene_nuccore_pos
## [29] gene_nuccore_refseqgene gene_nuccore_refseqrna
## [31] gene_nucest_clust gene_nucleotide
## [33] gene_nucleotide_clust gene_nucleotide_mgc
## [35] gene_nucleotide_mgc_url gene_nucleotide_pos
## [37] gene_omim gene_pcassay_proteintarget
## [39] gene_pccompound gene_pcsubstance
## [41] gene_pmc gene_pmc_nucleotide
## [43] gene_protein gene_protein_refseq
## [45] gene_pubmed gene_pubmed_all
## [47] gene_pubmed_citedinomim gene_pubmed_highlycited
## [49] gene_pubmed_latest gene_pubmed_pmc_nucleotide
## [51] gene_pubmed_reviews gene_pubmed_rif
## [53] gene_snp_geneview gene_sparcle
## [55] gene_taxonomy gene_unigene
## [57] gene_varview
The names of the list elements are in the format [source_database]_[linked_database]
and the elements themselves contain a vector of linked-IDs. So, if we want to find open access publications associated with this gene we could get linked records in PubMed Central:
all_the_links$links$gene_pmc[1:10]
## [1] "7490758" "7296003" "7291702" "6991300" "6889931" "6796269" "6778137"
## [8] "6769662" "6745714" "6729714"
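Because every element name follows the [source_database]_[linked_database] convention, the two database names can be recovered from any link name with ordinary string handling. A minimal sketch (link_parts() is a hypothetical helper, not part of rentrez):

```r
# Hypothetical helper: split an elink name into the source database
# and the (possibly subsetted) target database
link_parts <- function(link_name) {
  parts <- strsplit(link_name, "_", fixed = TRUE)[[1]]
  list(from = parts[1], to = paste(parts[-1], collapse = "_"))
}

link_parts("gene_pmc")               # from "gene" to "pmc"
link_parts("gene_nuccore_refseqrna") # from "gene" to "nuccore_refseqrna"
```

Note that for the subsetted links (like gene_nuccore_refseqrna) the 'to' part keeps the subset suffix, so it is not always a bare database name.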
Or, if we were interested in this gene's role in disease, we could find links to ClinVar:
all_the_links$links$gene_clinvar
## [1] "983157" "980228" "967074" "963922" "932452" "899123" "899061" "899060"
## [9] "898004" "898003" "898002" "898001" "898000" "897923" "897922" "897921"
## [17] "897920" "896374" "896373" "896311" "896310" "896309" "896308" "896307"
## [25] "896306" "896305" "894990" "894989" "894932" "894931" "894930" "894929"
## [33] "894858" "894857" "894856" "872801" "870542" "851473" "836798" "816143"
## [41] "809279" "798148" "789927" "784763" "772956" "772393" "771988" "755687"
## [49] "754549" "752834" "746190" "745507" "742020" "740728" "729651" "723763"
## [57] "721913" "717600" "710556" "706432" "706273" "705909" "705875" "705044"
## [65] "704999" "704993" "704956" "704545" "704297" "704136" "703233" "687518"
## [73] "686229" "664060" "662051" "657080" "653874" "649787" "645976" "644611"
## [81] "640843" "638317" "635905" "604878" "604877" "604783" "604782" "604779"
## [89] "604777" "585434" "585433" "585432" "585431" "585430" "585429" "585428"
## [97] "585427" "564661" "546904" "524210" "524209" "524208" "521455" "453313"
## [105] "450260" "446856" "446855" "446854" "443979" "442418" "397432" "396808"
## [113] "396332" "396150" "394309" "391496" "369359" "339659" "339658" "339657"
## [121] "339656" "339655" "339654" "339653" "339652" "339651" "339650" "339649"
## [129] "339648" "339647" "339646" "339645" "339644" "339643" "339642" "339641"
## [137] "339640" "339639" "339638" "339637" "339636" "339635" "339634" "339633"
## [145] "339632" "339631" "339630" "339629" "339628" "339627" "339626" "339625"
## [153] "339624" "339623" "339622" "339621" "339620" "339619" "253512" "253403"
## [161] "236549" "236548" "236547" "221889" "160886" "155682" "155309" "155093"
## [169] "155053" "154360" "154063" "153438" "152839" "151388" "150018" "149551"
## [177] "149418" "149160" "149035" "148411" "148262" "148180" "146125" "145984"
## [185] "145474" "145468" "145332" "145107" "144677" "144194" "127268" "98242"
## [193] "98241" "98240" "98239" "98238" "98237" "98236" "98235" "59247"
## [201] "59246" "59245" "59243" "59226" "59224" "59223" "59222" "59221"
## [209] "59010" "59005" "59004" "37145" "32099" "18106" "18105" "18104"
## [217] "18103" "18102" "18101" "18100" "18099" "18098" "18097" "18096"
## [225] "18095" "18094" "18093" "18091" "18090" "18089" "18088" "18087"
If we know beforehand what sort of links we'd like to find, we can use the db
argument to narrow the focus of a call to entrez_link()
.
For instance, say we are interested in knowing about all of the RNA transcripts associated with the Amyloid Beta Precursor gene in humans. Transcript sequences are stored in the nucleotide database (referred to as nuccore
in EUtils), so to find transcripts associated with a given gene we need to set dbfrom=gene
and db=nuccore
.
nuc_links <- entrez_link(dbfrom='gene', id=351, db='nuccore')
nuc_links
## elink object with contents:
## $links: IDs for linked records from NCBI
##
nuc_links$links
## elink result with information from 5 databases:
## [1] gene_nuccore gene_nuccore_mgc gene_nuccore_pos
## [4] gene_nuccore_refseqgene gene_nuccore_refseqrna
The object we get back contains links to the nucleotide database generally, but also to special subsets of that database like refseq. We can take advantage of this narrower set of links to find IDs that match unique transcripts from our gene of interest.
nuc_links$links$gene_nuccore_refseqrna
## [1] "1889693417" "1869284273" "1676441520" "1676319912" "1675178653"
## [6] "1675118060" "1675113449" "1675055422" "1674986144" "1519241754"
## [11] "1370481385" "324021746"
We can use these ids in calls to entrez_fetch()
or entrez_summary()
to learn more about the transcripts they represent.
In addition to finding data within the NCBI, entrez_link
can turn up connections to external databases. Perhaps the most interesting example is finding links to the full text of papers in PubMed. For example, when I wrote this document the first paper linked to Amyloid Beta Precursor had a unique ID of 25500142
. We can find links to the full text of that paper with entrez_link
by setting the cmd
argument to 'llinks':
paper_links <- entrez_link(dbfrom="pubmed", id=25500142, cmd="llinks")
paper_links
## elink object with contents:
## $linkouts: links to external websites
Each element of the linkouts
object contains information about an external source of data on this paper:
paper_links$linkouts
## $ID_25500142
## $ID_25500142[[1]]
## Linkout from Elsevier Science
## $Url: https://linkinghub.elsevie ...
##
## $ID_25500142[[2]]
## Linkout from Europe PubMed Central
## $Url: http://europepmc.org/abstr ...
##
## $ID_25500142[[3]]
## Linkout from Ovid Technologies, Inc.
## $Url: http://ovidsp.ovid.com/ovi ...
##
## $ID_25500142[[4]]
## Linkout from PubMed Central
## $Url: https://www.ncbi.nlm.nih.g ...
##
## $ID_25500142[[5]]
## Linkout from MedlinePlus Health Information
## $Url: https://medlineplus.gov/al ...
##
## $ID_25500142[[6]]
## Linkout from Mouse Genome Informatics (MGI)
## $Url: http://www.informatics.jax ...
Each of those linkout objects contains quite a lot of information, but the URL is probably the most useful. For that reason, rentrez
provides the function linkout_urls
to make extracting just the URL simple:
linkout_urls(paper_links)
## $ID_25500142
## [1] "https://linkinghub.elsevier.com/retrieve/pii/S0014-4886(14)00393-8"
## [2] "http://europepmc.org/abstract/MED/25500142"
## [3] "http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=25500142.ui"
## [4] "https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/25500142/"
## [5] "https://medlineplus.gov/alzheimersdisease.html"
## [6] "http://www.informatics.jax.org/reference/25500142"
The full list of options for the cmd
argument is given in the in-line documentation (?entrez_link
). If you are interested in finding full text records for a large number of articles, check out the package fulltext, which makes use of multiple sources (including the NCBI) to discover the full text articles.
It is possible to pass more than one ID to entrez_link()
. By default, doing so will give you a single elink object containing the complete set of links for all of the IDs that you specified. So, if you were looking for protein IDs related to specific genes you could do:
all_links_together <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"))
all_links_together
## elink object with contents:
## $links: IDs for linked records from NCBI
##
all_links_together$links$gene_protein
## [1] "1720383952" "1387845369" "1387845338" "1370513171" "1370513169"
## [6] "1034662000" "1034661998" "1034661996" "1034661994" "1034661992"
## [11] "558472750" "545685826" "194394158" "166221824" "154936864"
## [16] "148697547" "148697546" "122346659" "119602646" "119602645"
## [21] "119602644" "119602643" "119602642" "81899807" "74215266"
## [26] "74186774" "37787317" "37787309" "37787307" "37787305"
## [31] "37589273" "33991172" "31982089" "26339824" "26329351"
## [36] "21619615" "10834676"
Although this behaviour might sometimes be useful, it means we've lost track of which protein
ID is linked to which gene
ID. To retain that information we can set by_id
to TRUE
. This gives us a list of elink objects, each one containing links from a single gene
ID:
all_links_sep <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"), by_id=TRUE)
all_links_sep
## List of 2 elink objects,each containing
## $links: IDs for linked records from NCBI
##
lapply(all_links_sep, function(x) x$links$gene_protein)
## [[1]]
## [1] "1387845369" "1387845338" "1370513171" "1370513169" "1034662000"
## [6] "1034661998" "1034661996" "1034661994" "1034661992" "558472750"
## [11] "545685826" "194394158" "166221824" "154936864" "122346659"
## [16] "119602646" "119602645" "119602644" "119602643" "119602642"
## [21] "37787309" "37787307" "37787305" "33991172" "21619615"
## [26] "10834676"
##
## [[2]]
## [1] "1720383952" "148697547" "148697546" "81899807" "74215266"
## [6] "74186774" "37787317" "37589273" "31982089" "26339824"
## [11] "26329351"
entrez_summary()
Having found the unique IDs for some records via entrez_search
or entrez_link()
, you are probably going to want to learn something about them. The EUtils
API has two ways to get information about a record. entrez_fetch()
returns 'full' records in varying formats and entrez_summary()
returns less information about each record, but in a relatively simple format. Very often the summary records have the information you are after, so rentrez
provides functions to parse and summarise summary records.
entrez_summary()
takes a vector of unique IDs for the records you want summary information about. Let's start by finding out something about the paper describing Taxize, using its PubMed ID:
taxize_summ <- entrez_summary(db="pubmed", id=24555091)
taxize_summ
## esummary result with 42 items:
## [1] uid pubdate epubdate source
## [5] authors lastauthor title sorttitle
## [9] volume issue pages lang
## [13] nlmuniqueid issn essn pubtype
## [17] recordstatus pubstatus articleids history
## [21] references attributes pmcrefcount fulljournalname
## [25] elocationid doctype srccontriblist booktitle
## [29] medium edition publisherlocation publishername
## [33] srcdate reportnumber availablefromurl locationlabel
## [37] doccontriblist docdate bookname chapter
## [41] sortpubdate sortfirstauthor
Once again, the object returned by entrez_summary
behaves like a list, so you can extract elements using $
. For instance, we could convert our PubMed ID to another article identifier...
taxize_summ$articleids
## idtype idtypen value
## 1 pubmed 1 24555091
## 2 doi 3 10.12688/f1000research.2-191.v2
## 3 pmc 8 PMC3901538
## 4 rid 8 24563765
## 5 eid 8 24555091
## 6 version 8 2
## 7 version-id 8 2
## 8 pmcid 5 pmc-id: PMC3901538;
...or see how many times the article has been cited in PubMed Central papers:
taxize_summ$pmcrefcount
## [1] 64
If you give entrez_summary()
a vector with more than one ID you'll get a list of summary records back. Let's get those Plasmodium vivax papers we found in the entrez_search()
section back, and fetch some summary data on each paper:
vivax_search <- entrez_search(db = "pubmed",
term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
multi_summs <- entrez_summary(db="pubmed", id=vivax_search$ids)
rentrez
provides a helper function, extract_from_esummary()
, that takes one or more elements from every summary record in one of these lists. Here it is working with one...
extract_from_esummary(multi_summs, "fulljournalname")
## 32933490
## "BMC infectious diseases"
## 32745103
## "PLoS neglected tropical diseases"
## 29016333
## "The American journal of tropical medicine and hygiene"
## 28298235
## "Malaria journal"
## 24861816
## "Infection, genetics and evolution : journal of molecular epidemiology and evolutionary genetics in infectious diseases"
## 24145518
## "Antimicrobial agents and chemotherapy"
## 24007534
## "Malaria journal"
## 23230341
## "The Korean journal of parasitology"
## 23043980
## "Experimental parasitology"
## 20810806
## "The American journal of tropical medicine and hygiene"
## 20412783
## "Acta tropica"
## 19597012
## "Clinical microbiology reviews"
## 17556611
## "The American journal of tropical medicine and hygiene"
## 17519409
## "JAMA"
## 17368986
## "Trends in parasitology"
## 12374849
## "Proceedings of the National Academy of Sciences of the United States of America"
... and several elements:
date_and_cite <- extract_from_esummary(multi_summs, c("pubdate", "pmcrefcount", "title"))
knitr::kable(head(t(date_and_cite)), row.names=FALSE)
pubdate | pmcrefcount | title |
---|---|---|
2020 Sep 15 | | Distribution pattern of amino acid mutations in chloroquine and antifolate drug resistance associated genes in complicated and uncomplicated Plasmodium vivax isolates from Chandigarh, North India. |
2020 Aug | 1 | Population genomics identifies a distinct Plasmodium vivax population on the China-Myanmar border of Southeast Asia. |
2017 Dec | 4 | Distribution of Mutations Associated with Antifolate and Chloroquine Resistance among Imported Plasmodium vivax in the State of Qatar. |
2017 Mar 16 | 15 | Clinical and molecular surveillance of drug resistant vivax malaria in Myanmar (2009-2016). |
2014 Aug | 2 | Prevalence of mutations in the antifolates resistance-associated genes (dhfr and dhps) in Plasmodium vivax parasites from Eastern and Central Sudan. |
2014 | 9 | Prevalence of polymorphisms in antifolate drug resistance molecular marker genes pvdhfr and pvdhps in clinical isolates of Plasmodium vivax from Kolkata, India. |
entrez_fetch()
As useful as the summary records are, sometimes they just don't have the information that you need. If you want a complete representation of a record you can use entrez_fetch
, using the argument rettype
to specify the format you'd like the record in.
Let's extend the example given in the entrez_link()
section about finding transcripts for a given gene. This time we will fetch the cDNA sequences of those transcripts. We can start by repeating the steps in the earlier example to get nucleotide IDs for refseq transcripts of two genes:
gene_ids <- c(351, 11647)
linked_seq_ids <- entrez_link(dbfrom="gene", id=gene_ids, db="nuccore")
linked_transripts <- linked_seq_ids$links$gene_nuccore_refseqrna
head(linked_transripts)
## [1] "1907153913" "1889693417" "1869284273" "1676441520" "1676319912"
## [6] "1675178653"
Now we can get our sequences with entrez_fetch
, setting rettype
to "fasta" (the list of formats available for each database is given in this table):
all_recs <- entrez_fetch(db="nuccore", id=linked_transripts, rettype="fasta")
class(all_recs)
## [1] "character"
nchar(all_recs)
## [1] 57671
Congratulations, now you have a really huge character vector! Rather than printing all those thousands of bases we can take a peek at the top of the file:
cat(strwrap(substr(all_recs, 1, 500)), sep="\n")
## >XM_006538498.4 PREDICTED: Mus musculus alkaline phosphatase,
## liver/bone/kidney (Alpl), transcript variant X2, mRNA
## AGGCCCTGTAACTCCTCCAAGAGAACACATGCCCAGTCCAGAGAAGAGCACAAGGTAGATCTTGTGACCA
## TCATCGGAACAAGCTGCAGTGGTAGCCTGGGTAGAAGCTGGCAGAGGGAGACCATCTGCAAACCAGGAAC
## GCTGTGAGAAGAGAAAGGACAGAGGTCCTGACATACTGTCACAGCCGCTCTGATGTATGGATCGGAACGT
## CAATTAACGTCAATTAACATCTGACGCTGCCCCCCCCCCCCTCTTCCCACCATCTGGGCTCCAGCGAGGG
## ACGAATCTCAGGGTACACCATGATCTCACCATTTTTAGTACTGGCCATCGGCACCTGCCTTACCAACTCT
## TTTGTGCCAGAGAAAGAGAGAGACCCCAG
If we wanted to use these sequences in some other application we could write them to file:
write(all_recs, file="my_transcripts.fasta")
Alternatively, if you want to use them within an R session
you could write them to a temporary file then read that back in. In this case I'm using read.dna()
from the phylogenetics package ape (but not executing the code block in this vignette, so you don't have to install that package):
temp <- tempfile()
write(all_recs, temp)
parsed_recs <- ape::read.dna(temp, format = "fasta")
Most of the NCBI's databases can return records in XML format. In addition to downloading the text representation of these files, entrez_fetch()
can return objects parsed by the XML
package. As an example, we can check out the Taxonomy database's record for (did I mention they are amazing....) Tetrahymena thermophila, specifying we want the result to be parsed by setting parsed=TRUE
:
Tt <- entrez_search(db="taxonomy", term="(Tetrahymena thermophila[ORGN]) AND Species[RANK]")
tax_rec <- entrez_fetch(db="taxonomy", id=Tt$ids, rettype="xml", parsed=TRUE)
class(tax_rec)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
The package XML (which you have if you have installed rentrez
) provides functions to get information from these files. For relatively simple records like this one you can use XML::xmlToList
:
tax_list <- XML::xmlToList(tax_rec)
tax_list$Taxon$GeneticCode
## $GCId
## [1] "6"
##
## $GCName
## [1] "Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear"
For more complex records, which generate deeply-nested lists, you can use XPath expressions along with the function XML::xpathSApply
or the extraction operators [
and [[
to extract specific parts of the file. For instance, we can get the scientific name of each taxon in T. thermophila's lineage by specifying a path through the XML:
tt_lineage <- tax_rec["//LineageEx/Taxon/ScientificName"]
tt_lineage[1:4]
## [[1]]
## <ScientificName>cellular organisms</ScientificName>
##
## [[2]]
## <ScientificName>Eukaryota</ScientificName>
##
## [[3]]
## <ScientificName>Sar</ScientificName>
##
## [[4]]
## <ScientificName>Alveolata</ScientificName>
As the name suggests, XML::xpathSApply()
is a counterpart of base R's sapply
, and can be used to apply a function to nodes in an XML object. A particularly useful function to apply is XML::xmlValue
, which returns the content of the node:
XML::xpathSApply(tax_rec, "//LineageEx/Taxon/ScientificName", XML::xmlValue)
## [1] "cellular organisms" "Eukaryota" "Sar"
## [4] "Alveolata" "Ciliophora" "Intramacronucleata"
## [7] "Oligohymenophorea" "Hymenostomatida" "Tetrahymenina"
## [10] "Tetrahymenidae" "Tetrahymena"
There are a few more complex examples of using XPath on the rentrez wiki.
Using NCBI's Web History features
When you are dealing with very large queries it can be time consuming to pass long vectors of unique IDs to and from the NCBI. To avoid this problem, the NCBI provides a feature called "web history" which allows users to store IDs on the NCBI servers then refer to them in future calls.
entrez_post()
If you have a list of many NCBI IDs that you want to use later on, you can post them to the NCBI's servers. In order to provide a brief example, I'm going to post just one ID, the omim
identifier for asthma:
upload <- entrez_post(db="omim", id=600807)
upload
## Web history object (QueryKey = 1, WebEnv = MCID_5faafb5...)
The NCBI sends you back some information you can use to refer to the posted IDs. In rentrez
, that information is represented as a web_history
object.
Note that if you have a very long list of IDs you may receive a 414 error when you try to upload them. If you have such a list (and the IDs come from an external source rather than a search that can be saved to a web_history
object), you may have to 'chunk' the IDs into smaller sets that can be processed.
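The chunking approach described above can be sketched with base R's split(). In this hypothetical example, ids stands in for a long list of IDs from an external source, and the chunk size of 100 is an arbitrary, conservative choice (this block is illustrative and not executed):

```r
# 'ids' is a stand-in for a long vector of IDs from an external source
ids <- as.character(1:1000)
# Break the vector into chunks of at most 100 IDs each
chunks <- split(ids, ceiling(seq_along(ids) / 100))
# Post each chunk separately, collecting one web_history object per chunk
uploads <- lapply(chunks, function(chunk) entrez_post(db = "omim", id = chunk))
```

Each element of uploads can then be used in later calls just like any other web_history object.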
web_history objects from entrez_search() or entrez_link()
In addition to directly uploading IDs to the NCBI, you can use the web history features with entrez_search
and entrez_link
. For instance, imagine you wanted to find all of the sequences of the widely-studied gene COI from all snails (which are members of the taxonomic group Gastropoda):
entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]")
## Entrez search result with 95077 hits (object contains 20 IDs and no web_history object)
## Search term (as translated): COI[Gene] AND "Gastropoda"[Organism]
That's a lot of sequences! If you really wanted to download all of these it would be a good idea to save all those IDs to the server by setting use_history
to TRUE
(note you now get a web_history
object along with your normal search result):
snail_coi <- entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]", use_history=TRUE)
snail_coi
## Entrez search result with 95077 hits (object contains 20 IDs and a web_history object)
## Search term (as translated): COI[Gene] AND "Gastropoda"[Organism]
snail_coi$web_history
## Web history object (QueryKey = 1, WebEnv = MCID_5faafb5...)
Similarly, entrez_link()
can return web_history
objects when the cmd
argument is set to neighbor_history
. Let's find genetic variants (from the clinvar database) associated with asthma (using the same OMIM ID we identified earlier):
asthma_clinvar <- entrez_link(dbfrom="omim", db="clinvar", cmd="neighbor_history", id=600807)
asthma_clinvar$web_histories
## $omim_clinvar
## Web history object (QueryKey = 1, WebEnv = MCID_5faafb5...)
As you can see, instead of returning lists of IDs for each linked database (as it would by default), entrez_link()
now returns a list of web_histories.
Use a web_history object
Once you have those IDs stored on the NCBI's servers, you are going to want to do something with them. The functions entrez_fetch(),
entrez_summary()
and entrez_link()
can all use web_history
objects in exactly the same way they use IDs.
So, we could repeat the last example (finding variants linked to asthma), but this time using the ID we uploaded earlier:
asthma_variants <- entrez_link(dbfrom="omim", db="clinvar", cmd="neighbor_history", web_history=upload)
asthma_variants
## elink object with contents:
## $web_histories: Objects containing web history information
... if we want to get some genetic information about these variants we need to map our clinvar IDs to SNP IDs:
snp_links <- entrez_link(dbfrom="clinvar", db="snp",
web_history=asthma_variants$web_histories$omim_clinvar,
cmd="neighbor_history")
snp_summ <- entrez_summary(db="snp", web_history=snp_links$web_histories$clinvar_snp)
knitr::kable(extract_from_esummary(snp_summ, c("chr", "fxn_class", "global_maf")))
| | 61816761 | 41364547 | 11558538 | 2303067 | 1805018 | 1805015 | 1051931 | 1042714 | 1042713 | 20541 |
|---|---|---|---|---|---|---|---|---|---|---|
| chr | 1 | 11 | 2 | 5 | 6 | 16 | 6 | 5 | 5 | 5 |
| fxn_class | upstream_transcript_variant,stop_gained,coding_sequence_variant,synonymous_variant | 5_prime_UTR_variant,intron_variant | missense_variant,intron_variant,5_prime_UTR_variant,coding_sequence_variant,non_coding_transcript_variant,genic_downstream_transcript_variant | coding_sequence_variant,missense_variant | coding_sequence_variant,non_coding_transcript_variant,missense_variant | downstream_transcript_variant,missense_variant,genic_downstream_transcript_variant,coding_sequence_variant | coding_sequence_variant,non_coding_transcript_variant,missense_variant | coding_sequence_variant,stop_gained,missense_variant | coding_sequence_variant,missense_variant | coding_sequence_variant,missense_variant |
| global_maf | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
If you really wanted to you could also use web_history
objects to download all those thousands of COI sequences. When downloading large sets of data, it is a good idea to take advantage of the arguments retmax
and restart
to split the request up into smaller chunks. For instance, we could get the first 200 sequences in 50-sequence chunks (note: this code block is not executed as part of the vignette, to save time and bandwidth):
for( seq_start in seq(1,200,50)){
recs <- entrez_fetch(db="nuccore", web_history=snail_coi$web_history,
rettype="fasta", retmax=50, retstart=seq_start)
cat(recs, file="snail_coi.fasta", append=TRUE)
cat(seq_start+49, "sequences downloaded\r")
}
By default, the NCBI limits users to making only 3 requests per second (and rentrez
enforces that limit). Users who register for an "API key" are able to make up to ten requests per second. Getting one of these keys is simple: you just need to register for a "my ncbi" account, then click on a button in the account settings page.
Once you have an API key, rentrez will allow you to take advantage of it. For one-off cases, this is as simple as adding the api_key
argument to a given function call. (Note these examples are not executed as part of this document, as the API key used here is not a real one.)
entrez_link(db="protein", dbfrom="gene", id=93100, api_key ="ABCD123")
In most cases you will want to use your API key for each of several calls to the NCBI. rentrez
makes this easy by allowing you to set an environment variable, ENTREZ_KEY
. Once this value is set to your key, rentrez
will use it for all requests to the NCBI. To set the value for a single R session you can use the function set_entrez_key()
. Here we set the value and confirm it is available:
set_entrez_key("ABCD123")
Sys.getenv("ENTREZ_KEY")
## [1] "ABCD123"
If you use rentrez
often you should edit your .Renviron
file (see help(Startup)
for a description of this file) to include your key. Doing so will mean all requests you send will take advantage of your API key.
ENTREZ_KEY=ABCD123
As long as an API key is set by one of these methods, rentrez
will allow you to make up to ten requests per second.
rentrez
won't let you send requests to the NCBI at a rate higher than the rate limit, but it is sometimes possible that requests will arrive too close together and produce errors. If you are using rentrez
functions in a for loop and find rate-limiting errors are occurring, you may consider adding a call to Sys.sleep(0.1)
before each message sent to the NCBI. This will ensure you stay below the rate limit.
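For instance, a loop that makes one request per iteration could pause briefly before each call. This is a minimal sketch, where snp_ids is a hypothetical vector of IDs gathered earlier (the block is illustrative and not executed):

```r
# 'snp_ids' is a hypothetical vector of dbSNP IDs gathered earlier
for (snp_id in snp_ids) {
    Sys.sleep(0.1)  # brief pause so successive requests don't arrive too close together
    rec <- entrez_summary(db = "snp", id = snp_id)
    # ... process 'rec' here ...
}
```
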
This tutorial has introduced you to the core functions of rentrez
; there are almost limitless ways that you could put them together. Check out the wiki for more specific examples, and be sure to read the inline documentation for each function. If you run into problems with rentrez, or just need help with the package and Eutils,
please contact us by opening an issue at the github repository.