Fix spurious warnings in guess_dates #75

Open
thibautjombart opened this issue May 19, 2019 · 8 comments

@thibautjombart
Contributor

Sorry, I cannot share the data to reproduce this, but I get the following on the latest version of linelist:

> x <- x %>%
+   mutate_at(.vars = vars(contains("date")),
+             .funs = guess_dates)
Warning messages:
1: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 1 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-12-16  |  2019-12-16
2: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 2 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-07-04  |  2019-07-04
  2019-10-21  |  2019-10-21
3: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 9 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-08-08  |  2019-08-08
  2019-08-22  |  2019-08-22
  2019-09-02  |  2019-09-02
  2019-09-03  |  2019-09-03
  2019-09-11  |  2019-09-11
  2019-10-20  |  2019-10-20
  2019-11-02  |  2019-11-02
  2019-11-20  |  2019-11-20
  2019-12-12  |  2019-12-12
4: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 27 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-08-08  |  2019-08-08
  2019-08-15  |  2019-08-15
  2019-08-16  |  2019-08-16
  2019-08-18  |  2019-08-18
  2019-08-19  |  2019-08-19
  2019-08-22  |  2019-08-22
  2019-08-30  |  2019-08-30
  2019-09-06  |  2019-09-06
  2019-09-14  |  2019-09-14
  2019-09-16  |  2019-09-16
  2019-09-17  |  2019-09-17
  2019-09-19  |  2019-09-19
  2019-09-21  |  2019-09-21
  2019-09-27  |  2019-09-27
  2019-10-04  |  2019-10-04
  2019-10-08  |  2019-10-08
  2019-10-10  |  2019-10-10
  2019-10-12  |  2019-10-12
  2019-10-13  |  2019-10-13
  2019-10-24  |  2019-10-24
  2019-10-25  |  2019-10-25
  2019-10-30  |  2019-10-30
  2019-10-31  |  2019-10-31
  2019-11-02  |  2019-11-02
  2019-11-09  |  2019-11-09
  2019-11-13  |  2019-11-13
  2019-12-14  |  2019-12-14
5: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 16 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-08-17  |  2019-08-17
  2019-08-19  |  2019-08-19
  2019-09-04  |  2019-09-04
  2019-09-19  |  2019-09-19
  2019-09-20  |  2019-09-20
  2019-09-22  |  2019-09-22
  2019-09-28  |  2019-09-28
  2019-09-29  |  2019-09-29
  2019-10-13  |  2019-10-13
  2019-10-14  |  2019-10-14
  2019-10-16  |  2019-10-16
  2019-10-30  |  2019-10-30
  2019-11-03  |  2019-11-03
  2019-11-15  |  2019-11-15
  2019-11-17  |  2019-11-17
  2019-12-18  |  2019-12-18
> 
thibautjombart added the "high priority" label (this feature should be completed and tested as soon as possible) on May 19, 2019
@zkamvar
Member

zkamvar commented May 19, 2019 via email

zkamvar added the "More information needed" label (this issue cannot be resolved until more information is provided) on May 20, 2019
@thibautjombart
Contributor Author

Problem is having long lists of dates original / parsed that are identical.

@zkamvar
Member

zkamvar commented May 20, 2019

Problem is having long lists of dates original / parsed that are identical.

Is the problem the length or the fact that they appear to be identical?

@zkamvar
Member

zkamvar commented May 20, 2019

To give a bit of background as to what is happening:

Because guess_dates() attempts to convert YMD, DMY, and MDY in that specific order, it's possible for some dates to fail because they were parsed incorrectly (e.g. the DMY date 11/02/2019 is interpreted as 2019-11-02 under the MDY system). These results are collected as they are parsed and then presented in a table as you saw. Usually it looks something like this:

library("linelist")
x <- c("04 Feb 1982", "19 Sep 2018", "2001-01-01", "2011.12.13",
       "ba;abb;a: 03:11:2012!", "haha... 2013-12-13..",
       "that's a NA", "gender", "not a date", "01__Feb__1999___", 
       "19/09/18", "09/08/18", "2018-08-09")
last_date <- as.Date("2012-11-05")
first_date <- as.Date("1962-11-05")
res <- guess_dates(x, error_tolerance = 1, last_date = last_date)
#> Warning in guess_dates(x, error_tolerance = 1, last_date = last_date): 
#> The following 5 dates were not in the correct timeframe (1962-11-05 -- 2012-11-05):
#> 
#>   original              |  parsed    
#>   --------              |  ------    
#>   09/08/18              |  2018-08-09
#>   09/08/18              |  2018-09-08
#>   19 Sep 2018           |  2018-09-19
#>   19/09/18              |  2018-09-19
#>   2018-08-09            |  2018-08-09
#>   haha... 2013-12-13..  |  2013-12-13
res
#>  [1] "1982-02-04" NA           "2001-01-01" "2011-12-13" "2012-11-03"
#>  [6] NA           NA           NA           NA           "1999-02-01"
#> [11] NA           NA           NA

Created on 2019-05-20 by the reprex package (v0.3.0)
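
To make the ordering issue described above concrete, here is a minimal base R sketch. It is not how guess_dates() is implemented; parse_candidates() is a hypothetical helper, and the formats and timeframe (taken from the first report) are assumptions chosen for illustration. It parses the example string 11/02/2019 under both DMY and MDY and checks each candidate against the timeframe:

```r
# Hypothetical helper (not part of linelist): parse one string under several
# candidate formats and flag the candidates that fall outside a timeframe.
parse_candidates <- function(x, formats, first_date, last_date) {
  parsed <- do.call(c, lapply(formats, function(f) as.Date(x, format = f)))
  parsed <- unname(parsed)
  data.frame(
    order        = names(formats),
    parsed       = parsed,
    in_timeframe = !is.na(parsed) & parsed >= first_date & parsed <= last_date,
    stringsAsFactors = FALSE
  )
}

# Assumed formats: day-month-year and month-day-year with "/" separators
formats <- c(dmy = "%d/%m/%Y", mdy = "%m/%d/%Y")

cand <- parse_candidates(
  "11/02/2019",
  formats,
  first_date = as.Date("1969-05-19"),
  last_date  = as.Date("2019-05-19")
)
print(cand)
```

Under these assumptions the DMY candidate (2019-02-11) sits inside the timeframe while the MDY candidate (2019-11-02) does not, which is the kind of rejected candidate that ends up in the warning table.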

Do you want me to get rid of this warning altogether?

zkamvar removed the "high priority" label (this feature should be completed and tested as soon as possible) on Oct 11, 2019
@ffinger
Collaborator

ffinger commented Oct 27, 2019

I think it would already be a step forward if the warning could say which column is concerned. If guess_dates is called by clean_data you don't necessarily know which part of the warning comes from which column.

@zkamvar
Member

zkamvar commented Oct 28, 2019

I think it would already be a step forward if the warning could say which column is concerned. If guess_dates is called by clean_data you don't necessarily know which part of the warning comes from which column.

Thank you for adding this clarification, @ffinger, and I agree with you. Collecting warnings in a loop is not a straightforward problem, but luckily, I've already written some code to handle this situation in clean_variable_spelling() (see below) and can implement it in clean_dates() if you want.

I think adopting the warning pattern that readr::parse_date() uses will be helpful: https://readr.tidyverse.org/reference/parse_datetime.html

  library("linelist")
  my_data_frame <- data.frame(
    raboof    = c(letters[1:5], "foubar", "foobr", "fubar", "", "unknown", "fumar"),
    treatment = c(letters[5:1], "Y", "Yes", "N", NA, "No", "yes"),
    region    = state.name[1:11]
  )
  corrections <- data.frame(
    bad = c("foubar", "foobr", "fubar", ".missing", "unknown", "Yes", "Y", "No", "N", ".missing"),
    good = c("foobar", "foobar", "foobar", "missing", "missing", "yes", "yes", "no", "no", "missing"),
    column = c(rep("raboof", 5), rep("treatment", 5)),
    orders = c(1:5, 5:1),
    stringsAsFactors = FALSE
  )
  corr <- data.frame(bad = c(".default", ".default"),
                     good = c("check data", "check data"),
                     column = c("raboof", "treatment"),
                     orders = Inf,
                     stringsAsFactors = FALSE
  )
  corr <- rbind(corrections, corr)
  clean_variable_spelling(my_data_frame, corr, warn = TRUE)
#> Warning in clean_variable_spelling(my_data_frame, corr, warn = TRUE): The following warnings were found...
#>   raboof_____:
#>   .... 'a', 'b', 'c', 'd', 'e', 'fumar' were changed to the default value ('check data')
#>   treatment__:
#>   .... 'a', 'b', 'c', 'd', 'e' were changed to the default value ('check data')
#>        raboof  treatment      region
#> 1  check data check data     Alabama
#> 2  check data check data      Alaska
#> 3  check data check data     Arizona
#> 4  check data check data    Arkansas
#> 5  check data check data  California
#> 6      foobar        yes    Colorado
#> 7      foobar        yes Connecticut
#> 8      foobar         no    Delaware
#> 9     missing    missing     Florida
#> 10    missing         no     Georgia
#> 11 check data        yes      Hawaii

Created on 2019-10-28 by the reprex package (v0.3.0)
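
For readers wondering what collecting warnings per column could look like, here is a rough base R sketch. It is not the code used in clean_variable_spelling(); apply_with_column_warnings() and toy_parser() are hypothetical names introduced only for this illustration. The idea is to muffle each warning as it is raised, store its message under the column name, and emit one combined warning at the end:

```r
# Hypothetical helper (not part of linelist): apply `f` to every column,
# capture any warnings per column, then raise a single summary warning.
apply_with_column_warnings <- function(df, f) {
  collected <- list()
  for (col in names(df)) {
    msgs <- character(0)
    df[[col]] <- withCallingHandlers(
      f(df[[col]]),
      warning = function(w) {
        msgs <<- c(msgs, conditionMessage(w))  # remember the message
        invokeRestart("muffleWarning")         # keep it from surfacing now
      }
    )
    if (length(msgs) > 0) collected[[col]] <- msgs
  }
  if (length(collected) > 0) {
    parts <- vapply(names(collected), function(col) {
      paste0("  ", col, ":\n  .... ",
             paste(collected[[col]], collapse = "\n  .... "))
    }, character(1))
    warning("The following warnings were found...\n",
            paste(parts, collapse = "\n"), call. = FALSE)
  }
  df
}

# Toy stand-in for a date parser that warns about unparseable values
toy_parser <- function(x) {
  out <- as.Date(x, format = "%Y-%m-%d")
  if (anyNA(out)) warning(sum(is.na(out)), " value(s) could not be parsed")
  out
}

toy_df <- data.frame(
  date_onset = c("2019-01-02", "oops"),
  date_exit  = c("2019-02-03", "2019-02-04"),
  stringsAsFactors = FALSE
)
cleaned <- apply_with_column_warnings(toy_df, toy_parser)
```

With this toy input only date_onset should appear in the combined warning, so the reader can tell which column each message came from.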

@thibautjombart
Contributor Author

I am getting warnings which look like they may not be appropriate. Example below:

dates <- c("18_03_2020", "19_03_2020", "20_03_2020", "21_03_2020", "22_03_2020", 
"23_03_2020", "24_03_2020", "25_03_2020", "26_03_2020", "27_03_2020", 
"28_03_2020", "29_03_2020", "30_03_2020", "31_03_2020", "01_04_2020", 
"02_04_2020", "03_04_2020", "04_04_2020", "05_04_2020", "06_04_2020", 
"07_04_2020", "08_04_2020")

res <- linelist::guess_dates(dates)

gives the following warning:

Warning message:
In linelist::guess_dates(dates) : 
The following 4 dates were not in the correct timeframe (1970-04-10 -- 2020-04-10):

  original    |  parsed    
  --------    |  ------    
  05_04_2020  |  2020-05-04
  06_04_2020  |  2020-06-04
  07_04_2020  |  2020-07-04
  08_04_2020  |  2020-08-04

This would suggest that the conversion did not go as planned, but that is not actually the case:

> res
 [1] "2020-03-18" "2020-03-19" "2020-03-20" "2020-03-21" "2020-03-22"
 [6] "2020-03-23" "2020-03-24" "2020-03-25" "2020-03-26" "2020-03-27"
[11] "2020-03-28" "2020-03-29" "2020-03-30" "2020-03-31" "2020-04-01"
[16] "2020-04-02" "2020-04-03" "2020-04-04" "2020-04-05" "2020-04-06"
[21] "2020-04-07" "2020-04-08"
> range(res)
[1] "2020-03-18" "2020-04-08"

@zkamvar
Member

zkamvar commented Apr 12, 2020

The warnings come from the fact that it's trying out both the "mdy" and "dmy" versions of the dates: the four dates listed are the ones whose "mdy" reading falls after the end of the timeframe. If you only expect dmy versions of dates, then set orders = "dmy".
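
As a rough check of this explanation, the base R sketch below re-creates the reported dates and parses them under both orders. It does not call linelist and is not how guess_dates() works internally; the timeframe bounds are copied from the warning above and the strptime formats are assumptions for illustration:

```r
# Re-create the 22 reported dates (18 March to 8 April 2020, as dd_mm_yyyy)
dates <- format(
  seq(as.Date("2020-03-18"), as.Date("2020-04-08"), by = "day"),
  "%d_%m_%Y"
)

# Timeframe copied from the warning message
first_date <- as.Date("1970-04-10")
last_date  <- as.Date("2020-04-10")

# Parse every string under both orders; impossible readings become NA
dmy <- as.Date(dates, format = "%d_%m_%Y")
mdy <- as.Date(dates, format = "%m_%d_%Y")

# "mdy" candidates that parse but land outside the timeframe
out_of_range <- !is.na(mdy) & (mdy < first_date | mdy > last_date)
flagged <- data.frame(
  original = dates[out_of_range],
  parsed   = mdy[out_of_range],
  stringsAsFactors = FALSE
)
print(flagged)

# All "dmy" readings are inside the timeframe
all(dmy >= first_date & dmy <= last_date)

# With linelist installed, the suggestion above would be (not run here):
# res <- linelist::guess_dates(dates, orders = "dmy")
```

The flagged rows are 05_04_2020 through 08_04_2020, the same four dates as in the warning: their "mdy" readings (May to August 2020) fall after 2020-04-10, whereas 01_04_2020 through 04_04_2020 read as January to April 2020 and stay inside the timeframe.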
