Date example #97

Open
lf-araujo opened this issue Nov 8, 2020 · 2 comments


Hi Vindaar,

First, thank you for this excellent module.

For future reference, here is a common use of Date types in R that ggplotnim's plotting could eventually support. I believe this is perhaps better handled in a separate data management module.

So, for the data set:

https://gist.github.com/lf-araujo/5da7266c44b3824b824d578411dee73c

The dates are entered arbitrarily, but one would usually want them printed standardised by year. Typical R code uses as.Date() and looks like this:

library(ggplot2)
library(magrittr)  # provides %>%

read.csv("finances.csv") %>%
  ggplot(aes(x = as.Date(date), y = yrate)) +
  geom_line() +
  geom_point()

Resulting in:

[image: plot produced by the R code above]

However, current ggplotnim code would look like:

import ggplotnim

proc plot() =
  let df = toDf(readCsv("./finances.csv"))
  ggplot(df, aes(x = "date")) +
    geom_line(aes(y = "yrate")) +
    ylab("Rate") +
    xlab(rotate = -45.0, margin = 1.75, alignTo = "right") +
    ggtitle("Rate evolution") +
    ggsave("finances.png")

plot()

and finances.png looks rather confusing:

[image: finances.png]

It does not handle negative values correctly, and the dates are not standardised by year. This is not currently supported, so the example is just for future testing.


Vindaar commented Nov 9, 2020

Hey!

Thanks for the example.
From this I can deduce two typical use cases:

  • dates are given as strings and are parsed to Time (your example)
  • dates are given as timestamps and are converted to Time

both with a sensible choice of ticks and formatting.

I will probably first try to accomplish this without requiring the data frame to handle a new Time data type. Such a type would make the plotting logic easier, but it would make the implementation significantly more complex (the Column and Value variant types would have to be extended). Instead, I envision handling this implicitly using either parse or fromUnix. Internally the data will still be treated as float / string.
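
To make the parse / fromUnix idea concrete, here is a minimal sketch using only Nim's standard times module. The helper names dateToUnix and unixToDate are hypothetical and not part of ggplotnim; the sketch also assumes the dates are plain yyyy-MM-dd strings in UTC, which may not match the actual data set.

```nim
import times

# Hypothetical helpers for illustration; only `parse`, `toTime`, `toUnix`,
# `fromUnix`, `utc` and `format` are standard library calls.
proc dateToUnix(s: string): float =
  ## Parse a "yyyy-MM-dd" date string (assumed UTC) into a Unix timestamp.
  parse(s, "yyyy-MM-dd", utc()).toTime.toUnix.float

proc unixToDate(ts: float): string =
  ## Format a Unix timestamp back into a "yyyy-MM-dd" string (UTC).
  fromUnix(ts.int64).utc.format("yyyy-MM-dd")

let ts = dateToUnix("2020-11-08")
echo ts              # the date as a float, usable as continuous data
echo unixToDate(ts)  # the same date, formatted again for a tick label
```

The column stays a plain float the whole time, so no new data frame type is needed; only the label formatting knows that the numbers are dates.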

> and finances.png looks rather confusing: ... It is not handling negative values correctly and the dates are not standardised by year. This is currently not possible and not supported.

The issue you're seeing here both for the x and y axis is a simple one.

For x: your dates are given as strings, so the data is interpreted as discrete (continuous string data is not a useful concept). The saner solution is to convert the dates to timestamps and format the labels manually. As timestamps the dates are treated as continuous data, and the number of ticks will be reasonable. Custom formatting then gives nice labels.
For the moment the conversion to timestamps has to be done manually (which can be a bit ugly, but we can easily add sugar for this).

For y:
The y axis is really simple. Your data is not "continuous" according to the heuristics ggplotnim uses. This is an unfortunate side effect of "trying to do the right thing". Arguably, float data should in general always be treated as continuous.
Essentially, ggplotnim looks at a subset of 100 rows of a column (random indices) and checks whether the number of distinct values exceeds a certain percentage. If it does not, the data is treated as discrete. Since most of your entries are 0.019, that threshold is not crossed. And because the labels are then compared as strings, the negative values suddenly appear at the top and the distance between values no longer corresponds to their numerical difference. One can easily force the scale to be continuous using scale_y_continuous().
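
The heuristic described above can be sketched roughly as follows. This is not ggplotnim's actual implementation: the proc name looksDiscrete is hypothetical, the 0.1 threshold is an assumed placeholder (only the sample size of 100 comes from the description), and every data value other than 0.019 is made up for illustration.

```nim
import random, sets, algorithm, sequtils, strutils

proc looksDiscrete(col: seq[float]; sampleSize = 100; threshold = 0.1): bool =
  ## Sample up to `sampleSize` random rows and report whether the fraction
  ## of distinct values stays at or below `threshold` (assumed value).
  if col.len == 0: return true
  let n = min(sampleSize, col.len)
  var seen = initHashSet[float]()
  for _ in 0 ..< n:
    seen.incl col[rand(col.high)]   # random index into the column
  result = seen.len.float / n.float <= threshold

# Mostly 0.019 with a few other (made-up) values.
let yrate = repeat(0.019, 200) & @[-0.012, -0.005, 0.021]
echo looksDiscrete(yrate)

# Once the values are labels, they are ordered as strings, not numbers.
let labels = @["0.019", "-0.005", "0.021", "-0.012"]
echo sorted(labels)                        # string comparison
echo sorted(labels.mapIt(parseFloat(it)))  # numeric comparison
```

The two sorted sequences put the negative values in a different order, which is the same effect that scrambles the discrete y axis.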

Full example with a few comments:

import ggplotnim, times

let df = toDf(readCsv("./data/finances.csv"))
  # perform calculation of the timestamp using `parseTime`
  # have to give type hints
  .mutate(f{string -> int64: "timestamp" ~ parseTime(df["date"][idx],
                                                     "YYYY-MM-dd",
                                                     utc()).toUnix})
# alternatively we could do (if `df` is mutable):
# df["timestamp"] = df["date"].toTensor(string).map_inline(
#   parseTime(x, "YYYY-MM-dd", utc()).toUnix)

proc formatDate(f: float): string =
  ## format timestamp to YYYY-MM-dd. We do not only format via
  ## YYYY, because that will fool us. The ticks will ``not`` be placed
  ## at year change (31/12 -> 01/01), but rather at the "sensible" positions
  ## in unix timestamp space. So if you only format via YYYY we end up with
  ## years not sitting where we expect! That's the major downside of having
  ## no "understanding" of what dates mean.
  result = fromUnix(f.int).format("YYYY-MM-dd")

ggplot(df, aes(x = timestamp, y = yrate)) + # can use raw identifiers if not ambiguous
  geom_line() +
  geom_point() +
  xlab("Date") + ylab("Rate") +
  scale_y_continuous() + # force y continuous
  # use `formatDate` to format the dates from timestamp
  scale_x_continuous(labels = formatDate) +
  ggtitle("Rate evolution") +
  ggsave("finances.png")
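
The doc comment on formatDate notes that the ticks do not fall on year changes. As a rough sketch of what date-aware ticks would need, the following computes the Unix timestamps of every 1 January inside a data range, using only the standard times module. The proc name yearTicks is hypothetical, and whether ggplotnim accepts explicit tick positions is not covered here; this only shows how the positions could be derived.

```nim
import times

proc yearTicks(tMin, tMax: float): seq[float] =
  ## Unix timestamps (UTC) of every 1 January within [tMin, tMax].
  let firstYear = fromUnix(tMin.int64).utc.year
  let lastYear = fromUnix(tMax.int64).utc.year
  for y in firstYear .. lastYear:
    # Midnight UTC on 1 January of year `y` (assumes four-digit years).
    let t = parse($y & "-01-01", "yyyy-MM-dd", utc()).toTime.toUnix.float
    if t >= tMin and t <= tMax:
      result.add t

# Example range: mid 2019 to mid 2021 (arbitrary illustration values).
let lo = parse("2019-06-15", "yyyy-MM-dd", utc()).toTime.toUnix.float
let hi = parse("2021-06-15", "yyyy-MM-dd", utc()).toTime.toUnix.float
for t in yearTicks(lo, hi):
  echo fromUnix(t.int64).utc.format("yyyy")
```

With ticks placed at these positions, formatting with only the year would no longer mislead.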

This issue will remain open until the handling of such things is more convenient.

[image: finances.png]


Vindaar commented Nov 9, 2020

As I said, I'll leave this open as a reminder for myself (and so that others can find it more easily) until a cleaner solution with less manual work is available.

Vindaar reopened this Nov 9, 2020