new vignette, small text updates, cleanup
michalovadek committed Aug 21, 2023
1 parent 3467765 commit e4cc57b
Showing 40 changed files with 1,476 additions and 124 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,7 +1,8 @@
*.Rproj
.Rproj.user
.Rhistory
.RData
.Ruserdata
eurlex.Rproj
/doc/
/Meta/
inst/doc
3 changes: 0 additions & 3 deletions CRAN-SUBMISSION

This file was deleted.

1 change: 1 addition & 0 deletions DESCRIPTION
@@ -34,6 +34,7 @@ Suggests:
wordcloud,
purrr,
ggplot2,
+ggiraph,
testthat (>= 3.0.0)
URL: https://michalovadek.github.io/eurlex/
VignetteBuilder: knitr
60 changes: 24 additions & 36 deletions README.md
@@ -1,11 +1,14 @@
-# eurlex: Retrieve Data on European Union Law <img src="man/figures/logo.png" align="right" width="140" />

<!-- badges: start -->
[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/eurlex)](https://cran.r-project.org/package=eurlex)
[![CRAN\_Downloads](http://cranlogs.r-pkg.org/badges/grand-total/eurlex)](https://cran.r-project.org/package=eurlex)
[![R-CMD-check](https://github.com/michalovadek/eurlex/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/michalovadek/eurlex/actions/workflows/check-standard.yaml)
+# eurlex <img src="man/figures/logo.png" align="right" width="150" />
<!-- badges: end -->

-The `eurlex` R package attempts to significantly reduce the overhead associated with using SPARQL and REST APIs made available by the EU Publication Office and other EU institutions. Compared to pure web-scraping, the package provides more efficient and transparent access to data on European Union laws and policies.
+The `eurlex` R [package](https://michalovadek.github.io/eurlex/) reduces the overhead associated with using the SPARQL and REST APIs made available by the EU Publications Office and other EU institutions. Compared to pure web-scraping, the package provides more efficient and transparent access to data on European Union laws and policies.

-See the [vignette](https://michalovadek.github.io/eurlex/articles/eurlexpkg.html) for a walkthrough on how to use the package. Check function documentation for most up-to-date overview of features. Example use cases are shown in this [paper](https://www.tandfonline.com/doi/full/10.1080/2474736X.2020.1870150).
+See the [vignette](https://michalovadek.github.io/eurlex/articles/eurlexpkg.html) for a basic walkthrough on how to use the package. Check the function documentation for the most up-to-date overview of features. Example use cases are shown in this [paper](https://www.tandfonline.com/doi/full/10.1080/2474736X.2020.1870150).

You can use `eurlex` to create automatically updated overviews of EU decision-making activity, as shown [here](https://michalovadek.github.io/eulaw/).

@@ -18,31 +21,39 @@ The development version is available via `remotes::install_github("michalovadek/
Michal Ovádek (2021) **Facilitating access to data on European Union laws**, *Political Research Exchange*, 3:1, DOI: [10.1080/2474736X.2020.1870150](https://www.tandfonline.com/doi/full/10.1080/2474736X.2020.1870150)

## Basic usage

The typical use case `eurlex` currently envisions is getting bulk information about EU legislation into R as quickly as possible. The package contains three core functions to achieve that objective: `elx_make_query()` to create pre-defined or customized SPARQL queries; `elx_run_query()` to execute pre-made or manually specified queries; and `elx_fetch_data()` to fire GET requests for certain metadata to the REST API.

The function `elx_make_query()` takes as its first argument the type of resource to be retrieved (such as "directive" or "any") from Cellar, the semantic database that powers Eur-Lex (and other publications). If you are familiar with SPARQL, you can always specify your own queries and execute them with `elx_run_query()`.
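
If you do write your own SPARQL, a minimal sketch could look as follows. The `cdm:` prefix and the two predicates are assumptions based on the Cellar ontology, and `elx_run_query()` is assumed to accept a query string as its first argument:

``` r
# load library
library(eurlex)

# hand-written SPARQL: first ten directives and their CELEX numbers;
# the prefix and predicates below are assumed from the Cellar ontology
my_query <- "
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT DISTINCT ?work ?celex WHERE {
?work cdm:work_has_resource-type <http://publications.europa.eu/resource/authority/resource-type/DIR> .
?work cdm:resource_legal_id_celex ?celex .
} LIMIT 10"

custom_results <- elx_run_query(query = my_query)
```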

-`elx_run_query()` executes SPARQL queries on a pre-specified endpoint of the EU Publication Office. It outputs a `data.frame` where each column corresponds to one of the requested variables, while the rows accumulate observations of the resource type satisfying the query criteria. Obviously, the more data is to be returned, the longer the execution time, varying from a few seconds to several hours, depending also on your connection. The first column always contains the unique URI of a "work" (legislative act or court judgment) which identifies each resource in Cellar. Several human-readable identifiers are normally associated with each "work" but the most useful one is [CELEX](https://eur-lex.europa.eu/content/tools/TableOfSectors/types_of_documents_in_eurlex.html), retrieved by default.
+`elx_run_query()` executes SPARQL queries on a pre-specified endpoint of the EU Publications Office. It outputs a `data.frame` where each column corresponds to one of the requested variables, while the rows accumulate observations of the resource type satisfying the query criteria. The more data is to be returned, the longer the execution time, varying from a few seconds to several hours depending on the query and your connection. The first column always contains the unique URI of a "work" (usually a legislative act or court judgment) which identifies each resource in Cellar. Several human-readable identifiers are normally associated with each "work", but the most useful one tends to be [CELEX](https://eur-lex.europa.eu/content/tools/TableOfSectors/types_of_documents_in_eurlex.html), retrieved by default.

For the moment, it is recommended to retrieve metadata one variable at a time. For example, if you wish to obtain the legal bases of directives and the date of transposition, you should run separate calls:

-0. `ids <- elx_make_query("directive") |> elx_run_query()`
-1. `lbs <- elx_make_query("directive", include_lbs = TRUE) |> elx_run_query()`
-2. `dates <- elx_make_query("directive", include_date_transpos = TRUE) |> elx_run_query()`
-3. `ids |> dplyr::left_join(lbs) |> dplyr::left_join(dates)`
-
-rather than `elx_make_query("directive", include_lbs = TRUE, include_date_transpos = TRUE)`. This approach is usually faster and should also make it easier to understand the returned data frame(s), especially when some variables contain missing or duplicated data. Always keep an eye on whether the `work` and `celex` columns identify rows uniquely or not.
+``` r
+# load library
+library(eurlex)
+
+# create query
+query <- elx_make_query("directive", include_date_transpos = TRUE)
+
+# execute query
+results <- elx_run_query(query)
+```
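
To make the one-variable-at-a-time workflow concrete, here is a sketch of how the per-variable results might be combined and sanity-checked. It assumes the returned frames share the `work` and `celex` identifier columns described above:

``` r
# load libraries
library(eurlex)
library(dplyr)

# one variable per call, then join on the shared identifier columns
ids <- elx_make_query("directive") |> elx_run_query()
lbs <- elx_make_query("directive", include_lbs = TRUE) |> elx_run_query()
combined <- ids |> left_join(lbs)

# CELEX codes occurring more than once signal one-to-many metadata,
# e.g. several legal bases attached to the same directive
combined |> count(celex, sort = TRUE) |> filter(n > 1)
```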

-One of the most useful things about the API is that we obtain a comprehensive list of identifiers that we can subsequently use to obtain more data relating to the document in question. While the results of the SPARQL queries are useful also for webscraping (with the `rvest` package), the function `elx_fetch_data()` enables us to fire GET requests to retrieve data on documents with known identifiers (including Cellar URI). The function for example enables downloading the title and the full text of a document in all available languages.
+One of the most useful things about the API is that we obtain a comprehensive list of identifiers that we can subsequently use to obtain more data relating to the document in question. While the results of the SPARQL queries can also be useful for web-scraping, the function `elx_fetch_data()` makes it possible to fire GET requests to retrieve data on documents with known identifiers (including the Cellar URI). The function enables, for example, downloading the title and the full text of a document in all available languages.
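
A sketch of such a request; the CELEX-based resource URL and the `type` values are assumptions based on the package documentation:

``` r
# load library
library(eurlex)

# retrieve the title and the full text of a document identified by CELEX;
# URL pattern and type values assumed from the package documentation
celex_url <- "http://publications.europa.eu/resource/celex/32019L0790"
doc_title <- elx_fetch_data(url = celex_url, type = "title")
doc_text <- elx_fetch_data(url = celex_url, type = "text")
```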

## Note
-This package nor its author are in any way affiliated with the EU Publications Office. Please refer to the applicable [data reuse policies](https://eur-lex.europa.eu/content/welcome/data-reuse.html).
+Neither this package nor its author is in any way affiliated with the EU, its institutions, offices or agencies. Please refer to the applicable [data reuse policies](https://eur-lex.europa.eu/content/welcome/data-reuse.html).

Please consider contributing to the maintenance and development of the package by reporting bugs or suggesting new features.

## Latest changes

+### eurlex 0.4.5
+
+- breaking change: `elx_run_query()` now strips URIs (except Eurovoc ones) by default and keeps only the identifier to reduce object size
+- where `elx_fetch_data()` is used to retrieve texts from an HTML document, it now uses `rvest::html_text2()` by default instead of `rvest::html_text()`. This is slower but in some cases resembles more closely how the page actually renders. The new argument `html_text = "text2"` controls the setting.
+- new feature: `elx_make_query(..., include_court_origin = TRUE)` retrieves the country of origin of a court case. As per Eur-Lex documentation, this is primarily intended to be the country of the national court referring a preliminary question, but other countries are currently present in the data as well. It is recommended to use this variable in combination with the court procedure
+- new feature: `elx_make_query(..., include_original_language = TRUE)` retrieves the authentic language of a document, typically a court case
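
A sketch combining the two new court-related variables; the `"caselaw"` resource type is an assumption based on the function documentation:

``` r
# load library
library(eurlex)

# country of origin alongside procedure type, as recommended above
cases <- elx_make_query(resource_type = "caselaw",
                        include_court_origin = TRUE,
                        include_court_procedure = TRUE) |>
  elx_run_query()
```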

### eurlex 0.4.3

- all date variables retrieved through `elx_make_query(include_... = TRUE)` are now properly named
Expand All @@ -61,29 +72,6 @@ Please consider contributing to the maintenance and development of the package b
- fixed bug in `elx_download_xml()` parameter checking
- `elx_download_xml(notice = "object")` now retrieves metadata correctly

-### eurlex 0.4.0
-
-- download XML notices associated with Cellar URLs with `elx_download_xml()`
-- retrieve European Case Law Identifier (ECLI) with `elx_make_query(include_ecli = TRUE)`
-
-### eurlex 0.3.6
-
-- `elx_run_query()` now fails gracefully in presence of internet/server problems
-- `elx_fetch_data()` now automatically fixes urls with parentheses (e.g. "32019H1115(01)" used to fail)
-- minor fixes to vignette
-- `elx_parse_xml` no longer an exported function
-
-### eurlex 0.3.5
-
-- it is now possible to select all resource types available with `elx_make_query(resource_type = "any")`. Since there are nearly 1 million CELEX codes, use with discretion and expect long execution times
-- results can be restricted to a particular directory code with `elx_make_query(directory = "18")` (directory code "18" denotes Common Foreign and Security Policy)
-- results can be restricted to a particular sector with `elx_make_query(sector = 2)` (sector code 2 denotes EU international agreements)
-
-- new feature: request date of court case submission `elx_make_query(include_date_lodged = TRUE)`
-- new feature: request type of court procedure and outcome `elx_make_query(include_court_procedure = TRUE)`
-- new feature: request directory code of legal act `elx_make_query(include_directory = TRUE)`
-- `elx_curia_list()` has a new default parameter `parse = TRUE` which creates separate columns for `ecli`, `see_case`, `appeal` applying regular expressions on `case_info`
-
## Useful resources
Guide to CELEX numbers: https://eur-lex.europa.eu/content/tools/TableOfSectors/types_of_documents_in_eurlex.html

5 changes: 0 additions & 5 deletions cran-comments.md

This file was deleted.

9 changes: 4 additions & 5 deletions doc/eurlexpkg.R
@@ -108,7 +108,7 @@ dir_titles <- results[1:5,] %>% # take the first 5 directives only to save time
print(dir_titles)


-## ---- eval=FALSE--------------------------------------------------------------
+## ----dirsdata, eval=FALSE-----------------------------------------------------
# dirs <- elx_make_query(resource_type = "directive", include_date = TRUE, include_force = TRUE) %>%
# elx_run_query()

@@ -120,7 +120,7 @@ dirs %>%
ggplot(aes(x = force, y = n)) +
geom_col()

-## -----------------------------------------------------------------------------
+## ----dirforce-----------------------------------------------------------------
dirs %>%
filter(!is.na(force)) %>%
mutate(date = as.Date(date)) %>%
@@ -130,7 +130,7 @@ dirs %>%
axis.line.y = element_blank(),
axis.ticks.y = element_blank())

-## -----------------------------------------------------------------------------
+## ----dirtitles----------------------------------------------------------------
dirs_1970_title <- dirs %>%
filter(between(as.Date(date), as.Date("1970-01-01"), as.Date("1973-01-01")),
force == "true") %>%
@@ -140,12 +140,12 @@ dirs_1970_title <- dirs %>%
as_tibble()

print(dirs_1970_title)


## ----wordcloud, message = FALSE, warning=FALSE, error=FALSE-------------------
library(tidytext)
library(wordcloud)

-# wordcloud
dirs_1970_title %>%
select(celex,title) %>%
unnest_tokens(word, title) %>%
@@ -154,4 +154,3 @@ dirs_1970_title %>%
bind_tf_idf(word, celex, n) %>%
with(wordcloud(word, tf_idf, max.words = 40))


9 changes: 4 additions & 5 deletions doc/eurlexpkg.Rmd
@@ -203,7 +203,7 @@ Note that text requests are by far the most time-intensive; requesting the full

In this section I showcase a simple application of `eurlex`: making overviews of EU legislation. First, we collate data on directives.

-```{r, eval=FALSE}
+```{r dirsdata, eval=FALSE}
dirs <- elx_make_query(resource_type = "directive", include_date = TRUE, include_force = TRUE) %>%
elx_run_query()
```
@@ -221,7 +221,7 @@ dirs %>%

Directives naturally become outdated with time. It might be all the more interesting to see which older acts are thus still surviving.

-```{r}
+```{r dirforce}
dirs %>%
filter(!is.na(force)) %>%
mutate(date = as.Date(date)) %>%
@@ -234,7 +234,7 @@ dirs %>%

We want to know a bit more about some directives from the early 1970s that are still in force today. Their titles could give us a clue.

-```{r}
+```{r dirtitles}
dirs_1970_title <- dirs %>%
filter(between(as.Date(date), as.Date("1970-01-01"), as.Date("1973-01-01")),
force == "true") %>%
@@ -244,7 +244,6 @@ dirs_1970_title <- dirs %>%
as_tibble()
print(dirs_1970_title)
```

I will use the `tidytext` package to get a quick idea of what the legislation is about.
@@ -253,14 +252,14 @@ I will use the `tidytext` package to get a quick idea of what the legislation is
library(tidytext)
library(wordcloud)
-# wordcloud
dirs_1970_title %>%
select(celex,title) %>%
unnest_tokens(word, title) %>%
count(celex, word, sort = TRUE) %>%
filter(!grepl("\\d", word)) %>%
bind_tf_idf(word, celex, n) %>%
with(wordcloud(word, tf_idf, max.words = 40))
```

I use term frequency–inverse document frequency (tf-idf) to weight the importance of the words in the wordcloud. If we used pure frequencies, the wordcloud would largely consist of words conveying little meaning ("the", "and", ...).
Expand Down
