In the world of Wikidata, the most efficient way of getting data is usually through well-crafted SPARQL queries. However, what is theoretically technically most efficient, may be highly impractical if someone is unfamiliar with the structure of such queries or with how they can be best applied to to get the most out of Wikidata.
Most people who turn to Wikidata will however be familiar with its more famous sister project, Wikipedia. As this post will explain, Wikipedia pages can be used as a starting point for getting data out of Wikidata.
An introductory note on caching. It is generally advisable to enable
cache at the beginning of each script. Thanks to caching, when a block
of code based on tidywikidatar
is re-run it gives results
almost instantly. But given that tidywikidatar
queries for
Wikidata items one at a time, it can be relatively time-consuming when
it is run for the first time. So be patient when you run some of the
commands below for the first time, and considering testing code with
only a sample of the data at first (e.g. by adding something such as
slice_sample(n = 100)
to the data frame you use as a
starting point).
Especially if you intend to use Wikidata in different projects, it may make sense to have the cache stored in a folder outside the current project. As the data can be recovered, it is usually unnecessary to backup the cache folder, and unwise to keep it in folder that are constantly synced with online services.
Most users should be perfectly fine using the default SQLite database for caching.
library(dplyr, warn.conflicts = FALSE)
library("tidywikidatar")
tw_enable_cache()
tw_set_cache_folder(path = fs::path(fs::path_home_r(), "R", "tw_data"))
tw_set_language(language = "en")
tw_create_cache_folder(ask = FALSE)
More advanced users, especially those relying on
tidywikidatar
in Shiny apps, may instead prefer relying on
a MySql database, setting up the connection at the beginning of each
session as follows.
library(dplyr, warn.conflicts = FALSE)
library("tidywikidatar")
tw_set_language(language = "en")
tw_enable_cache(SQLite = FALSE)
tw_set_cache_db(driver = "MySQL",
host = "localhost",
port = 3306,
database = "tidywikidatar",
user = "secret_username",
pwd = "secret_password")
# for testing:
# docker run --name tidywikidatar_db -p 3306:3306 -e MYSQL_ROOT_PASSWORD=secret_root_password -e MYSQL_USER=secret_username -e MYSQL_PASSWORD=secret_password -e MYSQL_DATABASE=tidywikidatar mysql:latest
Let’s suppose we are interested in finding details about the European Parliament and its members, and we do not want to approach SPARQL queries. We are, however, familiar with Wikipedia and we are happy to start from a Wikipedia page such as List of members of the European Parliament (2019–2024). It includes links to the Wikipedia pages of all Members of the European Parliament (MEPs), the party list from where they were elected at national level, the elections themselves, etc.
This is how we can take it from there with
tidywikidatar
.
mep_wiki_df <- tw_get_wikipedia_page_links(
url = "https://en.wikipedia.org/wiki/List_of_members_of_the_European_Parliament_(2019–2024)")
str(mep_wiki_df)
#> tibble [1,393 × 8] (S3: tbl_df/tbl/data.frame)
#> $ source_title_url : chr [1:1393] "List_of_members_of_the_European_Parliament_(2019–2024)" "List_of_members_of_the_European_Parliament_(2019–2024)" "List_of_members_of_the_European_Parliament_(2019–2024)" "List_of_members_of_the_European_Parliament_(2019–2024)" ...
#> $ source_wikipedia_title: chr [1:1393] "List of members of the European Parliament (2019–2024)" "List of members of the European Parliament (2019–2024)" "List of members of the European Parliament (2019–2024)" "List of members of the European Parliament (2019–2024)" ...
#> $ source_qid : chr [1:1393] "Q55619306" "Q55619306" "Q55619306" "Q55619306" ...
#> $ wikipedia_title : chr [1:1393] "1979 European Parliament election" "1984 European Parliament election" "1989 European Parliament election" "1994 European Parliament election" ...
#> $ wikipedia_id : int [1:1393] 9737190 9737225 9736812 9736512 294038 433911 5966954 23194184 42869480 60699754 ...
#> $ qid : chr [1:1393] "Q1376068" "Q1376075" "Q1376076" "Q1376071" ...
#> $ description : chr [1:1393] "First election to the European Parliament" "Election to the European Parliament" "Election to the European Parliament" "Election to the European Parliament" ...
#> $ language : chr [1:1393] "en" "en" "en" "en" ...
What kind of things are mentioned in this page?
mep_wiki_df %>%
pull(qid) %>%
tw_get_property(p = "P31") %>%
group_by(value) %>%
tally(sort = TRUE) %>%
transmute(instance_of = tw_get_label(value),
n)
#> # A tibble: 67 × 2
#> instance_of n
#> <chr> <int>
#> 1 human 819
#> 2 Wikimedia list article 227
#> 3 political party 161
#> 4 <NA> 62
#> 5 election to the European Parliament 35
#> 6 political party in Germany 13
#> 7 Green party 13
#> 8 political group of the European Parliament 8
#> 9 Wikimedia project page 6
#> 10 electoral alliance 6
#> # … with 57 more rows
It’s mostly humans - likely, mostly MEPs themselves, but also political parties, elections, and other Wikipedia page lists.
We can filter results in order to keep only humans as follows:
humans_df <- mep_wiki_df %>%
pull(qid) %>%
tw_get_property(p = "P31") %>% # instance of
filter(value == "Q5") # human
meps_df <- humans_df %>%
tw_get_property(p = "P39") %>% # position held
filter(value == "Q27169") %>% # keep only MEPs
distinct(id)
It appears we have 819 humans, and 805 among them who have been members of the European Parliament mentioned in this page. Perhaps some of those mentioned in the page are people who never took their place, or perhaps MEPs in previous legislatures?
To get a step closer to clarifying this, we can check the qualifiers of a property. So if you look at the property “position held” for a member of the European Parliament (2019-2024), you may notice that the “position held” -> “member of the European Parliament” has a few qualifiers, including “parliamentary term” which may have value “Ninth European Parliament”.
meps9_df <- meps_df %>%
tw_get_qualifiers(p = "P39") %>%
filter(qualifier_property == "P2937",
qualifier_value == "Q64038205") %>%
distinct(id)
It appears that 801 have actually been members of the “Ninth European Parliament”.
Out of curiosity, and as a quality check, let’s see who are the MEPs mentioned in the page who have not been members of the ninth legislature.
anti_join(meps_df,
meps9_df,
by = "id") %>%
mutate(name = tw_get_label(id))
#> # A tibble: 4 × 2
#> id name
#> <chr> <chr>
#> 1 Q312901 Josep Borrell
#> 2 Q11769926 Mariya Gabriel
#> 3 Q1055449 Matteo Salvini
#> 4 Q57630 Valdis Dombrovskis
Indeed, all of them gave up their seat for one reason or another.
As we have reason to believe the Wikipedia page we have used as a starting point is reasonably complete, we can now proceed assuming that we now a reasonably complete list of MEPs of the ninth legislature. We can then take it as a starting point for all sorts questions, such as which position they held before becoming MEPs (or at the same time as being MEPs), or who among them joined the European Parliament after the beginning of the legislature. This is, for example, a list of MEPs who joined the parliament only in 2021, and with reference to whom did they replace:
replacing_meps_df <- meps9_df %>%
tw_get_qualifiers(p = "P39") %>%
filter(qualifier_id == "Q27169", # member of the european parliament
qualifier_property == "P580") %>% # start time
arrange(desc(qualifier_value)) %>%
filter(stringr::str_starts(qualifier_value, "\\+2021")) %>%
tw_get_qualifiers(p = "P39") %>%
filter(qualifier_id == "Q27169", # member of the european parliament
qualifier_property == "P1365") %>% # replaces
transmute(new_mep_id = id,
new_mep = tw_get_label(id),
previous_mep_id = qualifier_value,
previous_mep = tw_get_label(qualifier_value))
replacing_meps_df
#> # A tibble: 9 × 4
#> new_mep_id new_mep previous_mep_id previous_mep
#> <chr> <chr> <chr> <chr>
#> 1 Q108904462 Karolin Braunsberger-Reinhold Q17093506 Sven Schulze
#> 2 Q22443735 João Pimenta Lopes Q1671880 Inês Zuber
#> 3 Q22443735 João Pimenta Lopes Q11728715 João Ferreira
#> 4 Q106519771 Michiel Hoogeveen Q979348 Derk Jan Eppink
#> 5 Q106518433 Thijs Reuten Q17143453 Kati Piri
#> 6 Q104126196 Emma Wiesner Q513301 Fredrick Federley
#> 7 Q18225970 Tom Vandenkendelaere Q270530 Marianne Thyssen
#> 8 Q18225970 Tom Vandenkendelaere Q12869 Kris Peeters
#> 9 Q63224243 Ernő Schaller-Baross Q731042 József Szájer
Curiously, we find that there are two current MEPs who have replaced a colleague after the beginning of the legislature more than once. Can we confirm in which legislatures those who have replaced a colleague more than once have served as MEPs? Of course.
replacing_meps_df %>%
group_by(new_mep_id) %>%
count(name = "n") %>%
filter(n > 1) %>%
pull(new_mep_id) %>%
tw_get_qualifiers(p = "P39") %>%
filter(qualifier_property == "P2937", # parliamentary term
qualifier_id == "Q27169") %>% # as MEP
tw_label() %>%
transmute(mep = id, legislature = qualifier_value)
#> # A tibble: 4 × 2
#> mep legislature
#> <chr> <chr>
#> 1 Tom Vandenkendelaere Eighth European Parliament
#> 2 Tom Vandenkendelaere Ninth European Parliament
#> 3 João Pimenta Lopes Eighth European Parliament
#> 4 João Pimenta Lopes Ninth European Parliament
This blog post presents some examples of the way Wikidata works based on MEPs, and they can easily be replicated taking it from here.
Let’s make some more examples.
meps9_pob_df <- meps9_df %>%
mutate(mep = tw_get_label(id)) %>%
mutate(place_of_birth_id = tw_get_p(id = id, p = "P19", only_first = TRUE)) %>%
mutate(place_of_birth = tw_get_label(place_of_birth_id)) %>%
mutate(place_of_birth_coordinates = tw_get_p(id = place_of_birth_id,
p = "P625",
only_first = TRUE)) %>%
tidyr::separate(
col = place_of_birth_coordinates,
into = c(
"pob_latitude",
"pob_longitude"
),
sep = ",",
remove = TRUE,
convert = TRUE
) %>%
filter(is.na(pob_latitude)==FALSE)
We have a place of birth on the records for 795 MEPs out of 801. Not bad. We could use some proxy for the cases when such information is unavailable, for example using properties such as “residence” (P551) or the earliest place where they got their education, but for the time being we will leave it at that.
We could now easily place these data on a map (code example not run to remove dependency on external files and packages).
world_sf_file <- fs::path("world_geo_data", "world_sf.rds")
if (fs::file_exists(world_sf_file)) {
world_sf <- readr::read_rds(world_sf_file)
} else {
fs::dir_create(fs::path("world_geo_data"))
download.file(
url = "https://gisco-services.ec.europa.eu/distribution/v2/countries/download/ref-countries-2020-60m.geojson.zip",
destfile = fs::path("world_geo_data", "world.geojson.zip")
)
unzip(fs::path("world_geo_data", "world.geojson.zip"),
exdir = fs::path("world_geo_data")
)
world_sf <- sf::st_read(dsn = fs::path("world_geo_data", "CNTR_RG_60M_2020_4326.geojson"))
saveRDS(object = world_sf, file = fs::path("world_geo_data", "world_sf.rds"))
}
meps9_pob_df %>%
sf::st_as_sf(coords = c("pob_longitude", "pob_latitude"), crs = 4326) %>%
ggplot2::ggplot() +
ggplot2::geom_sf(data = world_sf) +
ggplot2::geom_sf(size = 0.1) +
ggplot2::theme_minimal()
And we can notice that quite a few MEPs were not born in an EU country. We could then geocode the locations, or ask Wikidata in which country their place of birth is located.
Here we already start to meet some of the quirks of Wikidata.
The question “in which country is this city located” seems
straightforward, but as illustrated in
the package readme it actually isn’t always the case, as, for
example, London has many values for the property “country”, all the way
from “Roman Empire” through “United Kingdom”. Enabling the
only_first
parameters, which is a the most efficient option
in many circumstances, is not a sensible solution in this case, as
predicting the order of properties is not straightforward.
To make things easier in such circumstances
tidywikidatar
includes dedicated parameters:
preferred
(which gives back the value recorded as
“preferred”), and latest_start_time
(which checks the
qualifiers, and selects the one with the most recent “start time”).
meps9_pob_df %>%
mutate(pob_country_id = tw_get_p(id = place_of_birth_id,
p = "P17",
only_first = TRUE,
preferred = TRUE)) %>%
group_by(pob_country_id) %>%
tally(sort = TRUE) %>%
mutate(pob_country = tw_get_label(pob_country_id))
#> # A tibble: 58 × 3
#> pob_country_id n pob_country
#> <chr> <int> <chr>
#> 1 Q183 95 Germany
#> 2 Q38 78 Italy
#> 3 Q142 77 France
#> 4 Q145 60 United Kingdom
#> 5 Q29 56 Spain
#> 6 Q36 50 Poland
#> 7 Q218 34 Romania
#> 8 Q55 30 Netherlands
#> 9 Q31 23 Belgium
#> 10 Q213 22 Czech Republic
#> # … with 48 more rows
Keep in mind that enabling latest_start_time
has an
impact on processing time (as it relies on
tw_get_qualifiers()
which caches data separately), but that
enabling preferred
is just as fast as having it
disabled.