Match Regular Expressions with a Nicer ‘API’
A small wrapper around the regular expression matching functions regexpr and gregexpr that returns the results in tidy data frames.
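For context, here is a minimal sketch of what extracting a named capture group looks like with base R alone. It uses only regexpr and its documented capture.start and capture.length attributes; the sample strings and the variable names are illustrative assumptions, not part of the package.

```r
# Base R only: extract a named capture group by hand.
text <- c("2016-04-20", "not a date")
pattern <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])"

# perl = TRUE enables named groups; group positions come back as attributes.
m <- regexpr(pattern, text, perl = TRUE)
starts <- attr(m, "capture.start")
lens <- attr(m, "capture.length")

# Cut the "year" group out of each string.
year <- substring(text, starts[, "year"], starts[, "year"] + lens[, "year"] - 1)

# regexpr reports -1 for strings that did not match; mark those as missing.
no_match <- as.integer(m) == -1L
year[no_match] <- NA_character_
year
```

The bookkeeping with attributes and offsets is the part that a data frame result is meant to hide.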
install.packages("rematch2")
Note that rematch2 is not compatible with the original rematch package. There are at least three major changes:
* The order of the arguments for the functions is different. In rematch2 the text vector is first, and pattern is second.
* In the result, .match is the last column instead of the first.
* rematch2 returns tibble data frames. See https://github.com/hadley/tibble.
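The three points above can be checked with a short sketch. The input strings and the group names letter and digit are made up for illustration.

```r
library(rematch2)

# Argument order in rematch2: text first, pattern second.
res <- re_match(c("a1", "b2", "zz"), "(?<letter>[a-z])(?<digit>[0-9])")

# Capture groups come first, then .text, and .match is the last column.
names(res)

# The result is a tibble.
inherits(res, "tbl_df")
```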
library(rematch2)
With capture groups:
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
  "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
#> # A tibble: 7 x 5
#> `` `` `` .text .match
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2016 04 20 2016-04-20 2016-04-20
#> 2 1977 08 08 1977-08-08 1977-08-08
#> 3 <NA> <NA> <NA> not a date <NA>
#> 4 <NA> <NA> <NA> 2016 <NA>
#> 5 <NA> <NA> <NA> 76-03-02 <NA>
#> 6 2012 06 30 2012-06-30 2012-06-30
#> 7 2015 01 21 2015-01-21 19:58 2015-01-21
Named capture groups:
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
#> # A tibble: 7 x 5
#> year month day .text .match
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2016 04 20 2016-04-20 2016-04-20
#> 2 1977 08 08 1977-08-08 1977-08-08
#> 3 <NA> <NA> <NA> not a date <NA>
#> 4 <NA> <NA> <NA> 2016 <NA>
#> 5 <NA> <NA> <NA> 76-03-02 <NA>
#> 6 2012 06 30 2012-06-30 2012-06-30
#> 7 2015 01 21 2015-01-21 19:58 2015-01-21
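Because the result is an ordinary data frame of character columns, follow-up processing needs no special tooling. The sketch below, on a shortened version of the dates vector, drops the rows that did not match and converts the captured pieces; the variable names m and matched are illustrative.

```r
library(rematch2)

dates <- c("2016-04-20", "not a date", "2015-01-21 19:58")
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
m <- re_match(text = dates, pattern = isodaten)

# Rows without a match have NA in every capture column and in .match.
matched <- m[!is.na(m$.match), ]

# Captured groups are character vectors; convert them as needed.
years <- as.integer(matched$year)
parsed <- as.Date(matched$.match)
years
parsed
```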
A slightly more complex example:
github_repos <- c(
  "metacran/crandb",
  "jeroenooms/curl@v0.9.3",
  "jimhester/covr#47",
  "hadley/dplyr@*release",
  "r-lib/remotes@550a3c7d3f9e1493a2ba",
  "/$&@R64&3"
)
owner_rx <- "(?:(?<owner>[^/]+)/)?"
repo_rx <- "(?<repo>[^/@#]+)"
subdir_rx <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx <- "(?:@(?<ref>[^*].*))"
pull_rx <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"
subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx <- sprintf(
  "^(?:%s%s%s%s|(?<catchall>.*))$",
  owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
#> # A tibble: 6 x 9
#>   owner      repo    subdir ref                  pull  release  catchall
#>   <chr>      <chr>   <chr>  <chr>                <chr> <chr>    <chr>
#> 1 metacran   crandb
#> 2 jeroenooms curl           v0.9.3
#> 3 jimhester  covr                                47
#> 4 hadley     dplyr                                     *release
#> 5 r-lib      remotes        550a3c7d3f9e1493a2ba
#> 6                                                               /$&@R64&3
#> # ... with 2 more variables: .text <chr>, .match <chr>
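Optional groups that did not take part in a match appear as empty cells in the output above, so rows can be selected by checking which group is filled. The following sketch uses a deliberately simplified, hypothetical pattern (only owner, repo and pull) rather than the full github_rx, and it guards against both NA and empty strings since the exact representation of an unused group is an assumption here.

```r
library(rematch2)

repos <- c("metacran/crandb", "jimhester/covr#47", "hadley/dplyr@*release")

# Simplified pattern for illustration only: owner/repo with an optional #pull.
rx <- "^(?<owner>[^/]+)/(?<repo>[^/@#]+)(?:#(?<pull>[0-9]+))?"
m <- re_match(text = repos, pattern = rx)

# Keep the entries whose pull group actually captured something.
has_pull <- !is.na(m$pull) & nzchar(m$pull)
repos[has_pull]
```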
Extract all names, and also first names and last names:
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
#> # A tibble: 2 x 4
#> first last .text .match
#> <list> <list> <chr> <list>
#> 1 <chr [2]> <chr [2]> Ben Franklin and Jefferson Davis <chr [2]>
#> 2 <chr [1]> <chr [1]> "\tMillard Fillmore" <chr [1]>
not$first
#> [[1]]
#> [1] "Ben" "Jefferson"
#>
#> [[2]]
#> [1] "Millard"
not$last
#> [[1]]
#> [1] "Franklin" "Davis"
#>
#> [[2]]
#> [1] "Fillmore"
not$.match
#> [[1]]
#> [1] "Ben Franklin" "Jefferson Davis"
#>
#> [[2]]
#> [1] "Millard Fillmore"
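The list columns can be flattened into one row per match with base R. This is a minimal sketch, not something the package provides; the names found and long are illustrative.

```r
library(rematch2)

notables <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
name_rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
found <- re_match_all(notables, name_rex)

# One row per match: repeat each input once for every match it produced.
long <- data.frame(
  text = rep(trimws(found$.text), lengths(found$.match)),
  first = unlist(found$first),
  last = unlist(found$last),
  stringsAsFactors = FALSE
)
long
```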
re_exec and re_exec_all are similar to re_match and re_match_all, but they also return match positions. These functions return match records. A match record has three components: match, start, end, and each component can be a vector. It is similar to a data frame in this respect.
pos <- re_exec(notables, name_rex)
pos
#> # A tibble: 2 x 4
#> first last .text .match
#> * <list> <list> <chr> <list>
#> 1 <list [3]> <list [3]> Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]> "\tMillard Fillmore" <list [3]>
Unfortunately R does not allow hierarchical data frames (i.e. a column of a data frame cannot be another data frame), but rematch2 defines some special classes and an $ operator, to make it easier to extract parts of re_exec and re_exec_all matches. You simply query the match, start or end part of a column:
pos$first$match
#> [1] "Ben" "Millard"
pos$first$start
#> [1] 3 2
pos$first$end
#> [1] 5 8
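The positions are 1-based and inclusive, so they can be passed straight to substring. A small sketch that recovers the first names from the start and end values shown above (the name recovered is illustrative):

```r
library(rematch2)

notables <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
name_rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
pos <- re_exec(notables, name_rex)

# Cut each input string at the reported start and end of the "first" group.
recovered <- substring(notables, pos$first$start, pos$first$end)
recovered

# The slices agree with the match component.
all(recovered == pos$first$match)
```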
re_exec_all is very similar, but these queries return lists, with an arbitrary number of matches:
allpos <- re_exec_all(notables, name_rex)
allpos
#> # A tibble: 2 x 4
#> first last .text .match
#> <list> <list> <chr> <list>
#> 1 <list [3]> <list [3]> Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]> "\tMillard Fillmore" <list [3]>
allpos$first$match
#> [[1]]
#> [1] "Ben" "Jefferson"
#>
#> [[2]]
#> [1] "Millard"
allpos$first$start
#> [[1]]
#> [1] 3 20
#>
#> [[2]]
#> [1] 2
allpos$first$end
#> [[1]]
#> [1] 5 28
#>
#> [[2]]
#> [1] 8
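Since each query returns a list with one element per input string, per-string summaries are a matter of lengths and Map. The sketch below counts the matches in each string and computes the width of every first-name match; n_matches and widths are illustrative names.

```r
library(rematch2)

notables <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
name_rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
allpos <- re_exec_all(notables, name_rex)

# Number of matches found in each input string.
n_matches <- lengths(allpos$first$match)
n_matches

# Width of each first-name match, from the inclusive start and end positions.
widths <- Map(function(s, e) e - s + 1, allpos$first$start, allpos$first$end)
widths
```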
MIT © Mango Solutions, Gábor Csárdi