Extract text or metadata from over a thousand file types.
Apache Tika is a content detection and analysis framework, written in Java, stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types, and as well as providing a Java library, has server and command-line editions suitable for use from other programming languages …
For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities. (From https://en.wikipedia.org/wiki/Apache_Tika, accessed Jan 18, 2018)
This is an R interface to the Tika software.
To start, you need R and Java 8 or OpenJDK 1.8. Higher versions work.
To check your version, run the command java -version from a terminal.
Get Java installation tips at https://www.java.com/en/download/ or
https://openjdk.org/install/. Because the rJava package is not required,
installation is simple. You can cut and paste the following snippet:
install.packages('rtika', repos = 'https://cloud.r-project.org')
library('rtika')
# You need to install the Apache Tika .jar once.
install_tika()
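The Java check described above can also be run from within R. The following is a minimal sketch using base R only, not a function of rtika; it assumes the java executable is on the system PATH, and the variable names are only illustrative.

```r
# Look for a 'java' executable on the PATH (empty string if none is found).
java_path <- Sys.which("java")

if (nzchar(java_path)) {
  # 'java -version' prints to stderr, so capture both output streams.
  version_info <- system2(java_path, "-version", stdout = TRUE, stderr = TRUE)
  cat(version_info, sep = "\n")
} else {
  message("No 'java' executable found on the PATH; install Java 8 or higher.")
}
```

If the message appears even though Java is installed, the PATH seen by R probably differs from the one seen by the terminal.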
Read an introductory article at https://docs.ropensci.org/rtika/articles/rtika_introduction.html.
tika_text() to extract plain text.
tika_xml() and tika_html() to get a structured XHTML rendition.
tika_json() to get metadata as .json, with XHTML content.
tika_json_text() to get metadata as .json, with plain text content.
tika() is the main function the others above inherit from.
tika_fetch() to download files with a file extension matching the Content-Type.
Tika parses and extracts text or metadata from over one thousand digital formats, including:
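Portable Document Format (.pdf)
Rich Text Format (.rtf)
Electronic Publication Format (.epub)
Image formats (.jpeg, .png, etc.)
Mail formats (.mbox, Outlook)
HyperText Markup Language (.html)
XML and derived formats (.xml, etc.)
Compression and packaging formats (.gzip, .rar, etc.)
For a list of MIME types, look for the “Supported Formats” page here: https://tika.apache.org/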
The rtika package processes batches of documents efficiently, so I
recommend batches. Currently, the tika() parsers take a tiny bit of time
to spin up, and that will get annoying with hundreds of separate calls
to the functions.
# Test files
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "table.docx", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika"),
  system.file("extdata", "R-FAQ.html", package = "rtika"),
  system.file("extdata", "calculator.jpg", package = "rtika"),
  system.file("extdata", "tika.apache.org.zip", package = "rtika")
)
# batches are best, and can also be piped with magrittr.
text <- tika_text(batch)
# text has one string for each document:
length(text)
#> [1] 7
# A snippet:
cat(substr(text[1], 54, 190))
#> lite’
#> June 1, 2017
#>
#> Version 1.5
#>
#> Title A Robust, High Performance JSON Parser and Generator for R
#>
#> License MIT + file LICENSE
#>
#> NeedsCompi
To learn more and find out how to extract structured text and metadata, read the vignette: https://docs.ropensci.org/rtika/articles/rtika_introduction.html.
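As a small taste of metadata extraction, here is a minimal sketch that parses the output of tika_json() for one of the bundled test files. It assumes the jsonlite package is installed (it is not required by rtika itself), and the field names Tika returns depend on the file type, so the sketch only lists them rather than relying on any particular one.

```r
library(rtika)

# A PDF that ships with rtika (also used in the batch example).
pdf <- system.file("extdata", "jsonlite.pdf", package = "rtika")

# tika_json() returns one JSON string per input document.
json <- tika_json(pdf)

# Parse the JSON string into an R object holding the metadata fields.
metadata <- jsonlite::fromJSON(json)

# Field names vary by file type; list the ones Tika returned for this file.
names(metadata)
```

For a batch, the same parsing step can be applied to each element of the returned character vector, for example with lapply().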
Tika can also interact with the Tesseract OCR program on some Linux
variants, to extract plain text from images of text. If tesseract-ocr
is installed, Tika should automatically locate and use it for images and
PDFs that contain images of text. However, this does not seem to work on
OS X or Windows. To try on Linux, first follow the Tesseract installation
instructions. The next time Tika is run, it should work. For a different
approach, I suggest the tesseract package by @jeroen, which is a
specialized R interface.
The Apache Tika community welcomes your feedback. Issues regarding
the R interface should be raised at the rTika GitHub Issue Tracker. If
you are confident the issue concerns Tika or one of its underlying
parsers, use the Tika Bugtracking System.
If your project or package needs to use the Tika App .jar, you can
include rTika as a dependency and call the rtika::tika_jar() function to
get the path to the Tika app installed on the system.
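A minimal sketch of that pattern follows. It makes several assumptions: that java is on the PATH, that tika_jar() returns NA when the .jar has not been installed, and that the Tika App accepts the --version flag of its command line interface. The variable names are only illustrative.

```r
# Path to the Tika App .jar that install_tika() put in place.
jar <- rtika::tika_jar()

if (!is.na(jar) && file.exists(jar)) {
  # Ask the Tika App for its version through its command line interface.
  tika_version <- system2(
    "java",
    c("-jar", shQuote(jar), "--version"),
    stdout = TRUE,
    stderr = TRUE
  )
  print(tika_version)
}
```

A dependent package would replace the --version argument with whatever Tika App options it needs.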
There are a number of specialized parsers that overlap in
functionality. For example, the pdftools package extracts metadata and
text from PDF files, the antiword package extracts text from older
(.doc) versions of Word, and the epubr package by @leonawicz processes
epub files. These packages do not depend on Java, while rTika does.
The big difference between Tika and a specialized parser is that Tika integrates dozens of specialist libraries maintained by the Apache Software Foundation. Apache Tika processes over a thousand file types and multiple versions of each. This eases the processing of digital archives that contain unpredictable files. For example, researchers use Tika to process archives from court cases, governments, or the Internet Archive that span multiple years. These archives frequently contain diverse formats and multiple versions of each format. Because Tika finds the matching parser for each individual file, it is well suited to diverse sets of documents. In general, the parsing quality is good and consistently so. In contrast, specialized parsers may only work with a particular version of a file, or require extra tinkering.
On the other hand, a specialized library can offer more control and
features when it comes to structured data and formatting. For example,
the tabulizer package by @leeper and @tpaskhalis includes bindings to
the ‘Tabula PDF Table Extractor Library’. Because PDF files store tables
as a series of positions with no obvious boundaries between data cells,
extracting a data.frame or matrix requires heuristics and customization,
which that package provides. To be fair to Tika, there are some formats
where rtika will extract data as table-like XML. For example, with Word
and Excel documents, Tika extracts simple tables as XHTML data that can
be turned into a tabular data.frame using the rvest::html_table()
function.
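The Word case can be sketched with the table.docx test file that ships with rtika. This is a minimal sketch, assuming the rvest and xml2 packages are installed; how many tables come back, and how clean they are, depends on the document.

```r
library(rtika)

# A Word document with a simple table that ships with rtika.
docx <- system.file("extdata", "table.docx", package = "rtika")

# Get the XHTML rendition of the document as a single string.
html <- tika_html(docx)

# Parse the XHTML and convert every <table> element it contains.
tables <- rvest::html_table(xml2::read_html(html))

# 'tables' is a list with one element per table found.
length(tables)
```

Each element of the list is a rectangular table that can be inspected or cleaned like any other data.frame.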
In September 2017, GitHub user kyusque released tikaR, which uses the
rJava package to interact with Tika (See: https://github.com/kyusque/tikaR).
As of writing, it provided similar text and metadata extraction, but
only xml output.
Back in March 2012, I started a similar project to interface with
Apache Tika. My code also used low-level functions from the rJava
package. I halted development after discovering that the Tika command
line interface (CLI) was easier to use. My empty repository is at
https://r-forge.r-project.org/projects/r-tika/.
I chose to finally develop this package after getting excited by
Tika’s new ‘batch processor’ module, written in Java. The batch
processor is very efficient when processing tens of thousands of
documents. Further, it is not too slow for a single document either, and
handles errors gracefully. Connecting R to the Tika batch processor
turned out to be relatively simple, because the R code only needs to use
the CLI to point Tika to the files. Simplicity, along with continuous
testing, should ease integration. I anticipate that some researchers
will need plain text output, while others will want json output. Some
will want multiple processing threads to speed things up. These features
are now implemented in rtika, although apparently not in tikaR yet.
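The output formats and thread setting mentioned above might be used as follows. This is a minimal sketch, assuming that the threads argument of tika() is passed through by the tika_text() and tika_json() wrappers; the value 2 is only an example.

```r
library(rtika)

# Three PDF files that ship with rtika.
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika")
)

# Plain text output, with the number of Tika processing threads set explicitly.
text <- tika_text(batch, threads = 2)

# JSON output (metadata plus XHTML content) for the same files.
json <- tika_json(batch, threads = 2)

# Both results have one element per input file.
length(text)
length(json)
```

More threads mainly pay off with large batches, since each call still has the start-up cost described earlier.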
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.