Code here written by Erica Krimmel.
In this demo we will cover how to:
idig_search_media
First, you need to find all the media records for which you are
interested in downloading media files. Do this using the
idig_search_media
function from the ridigbio package, which
allows you to search for media records based on data contained in linked
specimen records, like species or collecting locality. You can learn
more about this function from the iDigBio API
documentation and ridigbio
documentation. In this example, we want to search for images of
herbarium specimens of species in the genus Acer that were
collected in the United States.
# Edit the fields (e.g. `genus`) and values (e.g. "manis") in `list()`
# to adjust your query and the fields (e.g. `uuid`) in `fields` to adjust the
# columns returned in your results; edit the number after `limit` to adjust the
# number of records you will retrieve images for
records <- idig_search_media(rq =
list(genus = "acer",
country = "united states"),
fields = c("uuid",
"accessuri",
"rights",
"format",
"records"),
limit = 10)
records$accessuri <- if_else(grepl("^http://", records$accessuri),
gsub("^http://", "", records$accessuri),
records$accessuri
)
records$accessuri <- if_else(grepl("https://mam.ansp.org", records$accessuri),
gsub("https://mam.ansp.org", "mam.ansp.org", records$accessuri),
records$accessuri
)
records$accessuri <- if_else(grepl("https://ibss-images.calacademy.org", records$accessuri),
gsub("https://ibss-images.calacademy.org", "ibss-images.calacademy.org", records$accessuri),
records$accessuri
)
The result of the code above is a data frame called
records
:
uuid | accessuri | rights | format | records |
---|---|---|---|---|
0000b146-6fd2-4a6a-bf78-9e709cc995e9 | mam.ansp.org/image/CM/Fullsize/345/CM345773.jpg | BY-NC-SA | image/jpeg | 8af932f6-24c1-4c9c-9963-10b9b584b632 |
0000b511-67fd-4485-adea-56df8f7e4c66 | https://sernecportal.org/imglib/seinet/sernec/NCU_VascularPlants/NCU00418/NCU00418780_01.JPG | BY-NC-SA | image/jpeg | 21de8a6f-bb99-4e95-93d9-881fe1b6a73b |
0000d1cd-8211-45c6-8dfa-bd4a9a001aad | mam.ansp.org/image/TAWES/Fullsize/0005/TAWES0005954.jpg | BY-NC-SA | image/jpeg | 0a18cad9-2e85-4ae1-8274-68274b058b61 |
000244a4-6c5c-4e75-9c2e-db585b377935 | ibss-images.calacademy.org:80/static/botany/originals/12/df/12dff2af-8a4b-4845-9964-79152fb7ce71.jpg | CC0 | NA | 4aa31647-7fcd-45f0-b039-d1a0a27637d6 |
000277e9-659b-4e0c-a61b-c5262d33969b | https://www.pnwherbaria.org/images/jpeg.php?Image=WTU-V-023351.jpg | NA | image/jpeg | ac4d39f6-8775-48ea-b1d1-cd14c6f60e08 |
0002acf9-13d4-4318-a53e-4c00e9361a07 | https://cch2.org/imglib/cch2/CHSC_VascularPlants/CHSC021/CHSC021794_lg.jpg | BY-NC-SA | image/jpeg | b0d09b22-e30f-4b89-ab03-d4c863b4727a |
000495be-df5c-4c01-a951-5656a3fe5ef5 | bgbaseserver.eeb.uconn.edu/DATABASEIMAGES/CONN00075691.JPG | BY-NC-SA | image/jpeg | 2642a89e-bcda-4c8c-8b20-3753b37ab990 |
0005475a-c2c9-4908-9a27-ba9e555d0c43 | https://cdn.floridamuseum.ufl.edu/Herbarium/945bc71c-a736-42e2-8b41-7a00bf9529fe | BY-NC | image/jpeg | 197ea7d4-de1f-4c93-ac98-8c81b53fbece |
00056e02-50b6-4c62-a975-306cc870dd83 | https://data.cyverse.org/dav-anon/iplant/projects/magnoliagrandiFLORA/images/specimens/MISS0038464/MISS0038464.JPG | BY-NC-SA | image/jpeg | 04764068-999e-4902-b30b-4e1d0d23d214 |
000654a2-02f3-4c14-b28a-7567cb55aa57 | https://api.idigbio.org/v2/media/793b0495f464a5403db6918811399c1d | BY-NC-SA | image/jpeg | b8f0b19c-2ea7-4f40-971f-4e92f0f6fd98 |
Now that we know what media records are of interest to us, we need to isolate the URLs that link to the actual media files so that we can download them. In this example, we will demonstrate how to download files that are cached on the iDigBio server, as well as the original files hosted externally by the data provider. You likely do not need to download two sets of images, so can choose to comment out the steps related to either “_iDigBio” or “_external” depending on your preference.
# Assemble a vector of iDigBio server download URLs from `records`
mediaurl_idigbio <- records %>%
mutate(mediaURL = paste("https://api.idigbio.org/v2/media/", uuid, sep = "")) %>%
select(mediaURL) %>%
pull()
# Assemble a vector of external server download URLs from `records`
mediaurl_external <- records$accessuri %>%
str_replace("\\?size=fullsize", "")
These vectors look like this:
## [1] "https://api.idigbio.org/v2/media/0000b146-6fd2-4a6a-bf78-9e709cc995e9"
## [2] "https://api.idigbio.org/v2/media/0000b511-67fd-4485-adea-56df8f7e4c66"
## [3] "https://api.idigbio.org/v2/media/0000d1cd-8211-45c6-8dfa-bd4a9a001aad"
## [4] "https://api.idigbio.org/v2/media/000244a4-6c5c-4e75-9c2e-db585b377935"
## [5] "https://api.idigbio.org/v2/media/000277e9-659b-4e0c-a61b-c5262d33969b"
## [6] "https://api.idigbio.org/v2/media/0002acf9-13d4-4318-a53e-4c00e9361a07"
## [7] "https://api.idigbio.org/v2/media/000495be-df5c-4c01-a951-5656a3fe5ef5"
## [8] "https://api.idigbio.org/v2/media/0005475a-c2c9-4908-9a27-ba9e555d0c43"
## [9] "https://api.idigbio.org/v2/media/00056e02-50b6-4c62-a975-306cc870dd83"
## [10] "https://api.idigbio.org/v2/media/000654a2-02f3-4c14-b28a-7567cb55aa57"
## [1] "mam.ansp.org/image/CM/Fullsize/345/CM345773.jpg"
## [2] "https://sernecportal.org/imglib/seinet/sernec/NCU_VascularPlants/NCU00418/NCU00418780_01.JPG"
## [3] "mam.ansp.org/image/TAWES/Fullsize/0005/TAWES0005954.jpg"
## [4] "ibss-images.calacademy.org:80/static/botany/originals/12/df/12dff2af-8a4b-4845-9964-79152fb7ce71.jpg"
## [5] "https://www.pnwherbaria.org/images/jpeg.php?Image=WTU-V-023351.jpg"
## [6] "https://cch2.org/imglib/cch2/CHSC_VascularPlants/CHSC021/CHSC021794_lg.jpg"
## [7] "bgbaseserver.eeb.uconn.edu/DATABASEIMAGES/CONN00075691.JPG"
## [8] "https://cdn.floridamuseum.ufl.edu/Herbarium/945bc71c-a736-42e2-8b41-7a00bf9529fe"
## [9] "https://data.cyverse.org/dav-anon/iplant/projects/magnoliagrandiFLORA/images/specimens/MISS0038464/MISS0038464.JPG"
## [10] "https://api.idigbio.org/v2/media/793b0495f464a5403db6918811399c1d"
We can use the download URLs that we assembled in the step above to go and download each media file. For clarity, we will place files in two different folders, based on whether we downloaded them from the iDigBio server or an external server. We will name each file based on its unique identifier.
# Create new directories to save media files in
dir.create("jpgs_idigbio")
dir.create("jpgs_external")
# Assemble another vector of file paths to use when saving media downloaded
# from iDigBio
mediapath_idigbio <- paste("jpgs_idigbio/", records$uuid, ".jpg", sep = "")
# Assemble another vector of file paths to use when saving media downloaded
# from external servers; please note that it's probably not a great idea to
# assume these files are all jpgs, as we're doing here...
mediapath_external <- paste("jpgs_external/", records$uuid, ".jpg", sep = "")
# Add a check to deal with URLs that are broken links
possibly_download.file = purrr::possibly(download.file,
otherwise = "cannot download")
#"mode" argument (="wb") in the walk function to download.file.
# Iterate through the action of downloading whatever file is at each
# iDigBio URL
purrr::walk2(.x = mediaurl_idigbio,
.y = mediapath_idigbio, possibly_download.file)
# Iterate through the action of downloading whatever file is at each
# external URL
purrr::walk2(.x = mediaurl_external,
.y = mediapath_external, possibly_download.file)
You should now have two folders, each with ten images downloaded from
iDigBio and external servers, respectively. Note that we only downloaded
ten images here for brevity’s sake, but you can increase this using the
limit
argument in the first step. Here is an example of one
of the images we downloaded:
[1] “”