AWS Athena Query Caching

Dyfan Jones

Intro

RAthena now supports caching. This was originally inspired by pyathenato reduce the cost of using AWS Athena. RAthena however has a different caching method and utilities local caching in R environments instead of using AWS list_query_executions. This is down to dbClearResult clearing AWS S3’s AWS Athena output when caching is disabled.

Caching benefits

By caching queries the performance of repeat queries is significantly improved. This is because the query is no longer sent to AWS Athena. Instead the query ID, of the repeating query, is taken from the R environment and the result is returned from AWS S3.

library(DBI)
library(RAthena)

con = dbConnect(athena())

# Start caching queries
RAthena_options(cache_size = 10)

# Upload Data to AWS Athena
dbWriteTable(con, "iris", iris, partition = c("Partition" = "01"))

# initial query to AWS Athena
system.time(df1 = dbGetQuery(con, "select * from iris"))

# Info: (Data scanned: 3.63 KB)
#   user  system elapsed 
#  0.105   0.004   3.397 

# repeat query to AWS Athena
system.time(df2 = dbGetQuery(con, "select * from iris"))

# Info: (Data scanned: 3.63 KB)
#   user  system elapsed 
#  0.072   0.000   0.348 

Here we can see a performance increase of x10 with repeat query execution.

Caching weakness

The weakness in caching occurs when the underlying data is updated. The cache will still only retrieve the previous query ID. This means that the new updated data won’t be return when the caching is enabled:

# Updating iris table
dbWriteTable(con, "iris", iris, append = T, partition = c("Partition" = "02"))

dt5 = dbGetQuery(con, "select * from iris")

# Stop using cache data
RAthena_options()

dt6 = dbGetQuery(con, "select * from iris")

nrow(dt5)
# 150

nrow(dt6)
# 300

Sadly the cached query didn’t pick up the new data from iris.

Cache memory

The caching method in RAthena will remember previous query ids within each R session, even if you stop and start caching in RAthena_options.

# Start caching
RAthena_options(cache_size = 10)
res1 = dbExecute(con, "select * from iris")

# Stop caching
RAthena_options()
res2 = dbExecute(con, "select * from iris")

# Start caching
RAthena_options(cache_size = 10)
res3 = dbExecute(con, "select * from iris")

# Compare Query ID's
res1@info$QueryExecutionId
# 9a9272f5-0632-4774-9aa9-d07f151dabc5

res2@info$QueryExecutionId
# be12fe0-3ec0-4595-b3e6-b3bf67efa266

res3@info$QueryExecutionId
# 9a9272f5-0632-4774-9aa9-d07f151dabc5

We can see that res1 and res3 utilise the same QueryID, even tho caching was stopped and started.

Clear down cache

To clear down the cache, just set the parameter: clear_cache within RAthena_options to TRUE

RAthena_options(clear_cache = T)