---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
options(width = 120)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# RITCH - an R interface to the ITCH Protocol
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/RITCH)](https://CRAN.R-project.org/package=RITCH) [![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/RITCH)](https://www.r-pkg.org/pkg/RITCH) [![R-CMD-check](https://github.com/DavZim/RITCH/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/DavZim/RITCH/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
The `RITCH` library provides an `R` interface to NASDAQ's ITCH protocol, which is used to distribute financial messages to participants.
Messages include orders, trades, market status, and much more financial information.
A full list of messages is shown later.
The main purpose of this package is to parse the binary ITCH files to a [`data.table`](https://CRAN.R-project.org/package=data.table) in `R`.
The package leverages [`Rcpp`](https://CRAN.R-project.org/package=Rcpp) and `C++` for efficient message parsing.
Note that the package provides a small simulated sample dataset in the `ITCH_50` format for testing and example purposes.
Helper functions are provided to list and download sample files from NASDAQ's official server.
## Install
To install `RITCH`, you can use the following:
```R
# stable version:
install.packages("RITCH")
# development version:
# install.packages("remotes")
remotes::install_github("DavZim/RITCH")
```
## Quick Overview
The main functions of `RITCH` are read-related and are easily identified by their `read_` prefix.
Due to the inherent structural differences between message classes, each class has its own read function.
A list of message types and their respective classes is provided later in this Readme.
The message classes used in this example are *orders* and *trades*.
First we define the file to load and count the messages, then we read in the orders and the first 100 trades:
```{r}
library(RITCH)
# use built in example dataset
file <- system.file("extdata", "ex20101224.TEST_ITCH_50", package = "RITCH")
# count the number of messages in the file
msg_count <- count_messages(file)
dim(msg_count)
names(msg_count)
# read the orders into a data.table
orders <- read_orders(file)
dim(orders)
names(orders)
# read the first 100 trades
trades <- read_trades(file, n_max = 100)
dim(trades)
names(trades)
```
Note that the file can be a plain `ITCH_50` file or a gzipped `ITCH_50.gz` file, which will be decompressed to the current directory.
You may also notice that the output reports a fairly low read speed in MB/s.
The reported speed includes the time spent parsing, and the fixed overhead of the setup code weighs more heavily on small files, so the number rises on larger files.
If you want to know more about the functions of the package, read on.
## Main Functions
`RITCH` provides the following main functions:
- `read_itch(file, ...)` to read an ITCH file.
  Convenient wrappers for different message classes such as `orders`, `trades`, etc. are also provided as `read_orders()`, `read_trades()`, ...
- `filter_itch(infile, outfile, ...)` to filter an ITCH file and write directly to another file without loading the data into R
- `write_itch(data, file, ...)` to write a dataset to an ITCH file
Some helper functions are also provided, including:
- `download_sample_file(choice)` to download a sample file from the NASDAQ server and `list_sample_files()` to get a list of all available sample files
- `download_stock_directory(exchange, date)` to download the stock locate information for a given exchange and date
- `open_itch_sample_server()` to open the official NASDAQ server in your browser, which hosts among other things example data files
- `gzip_file(infile, outfile)` and `gunzip_file(infile, outfile)` for gzip functionality
- `open_itch_specification()` to open the official NASDAQ ITCH specification PDF in your browser
## Writing ITCH Files
`RITCH` also provides functionality for writing ITCH files.
Although the data could be stored in other file formats (for example a database or a [`qs`](https://CRAN.R-project.org/package=qs) file), ITCH files are quite optimized regarding size as well as write/read speeds.
Thus the `write_itch()` function allows you to write one or more message types to an `ITCH_50` file.
Note, however, that only the standard columns are supported.
Additional columns will not be written to file!
Additional information can be saved in the filename.
By default, the date, exchange, and file format information is added to the filename unless you specify `add_meta = FALSE`, in which case the given name is used as is.
As a last note: if you write your data to an ITCH file and want to filter for stocks later on, make sure to save the stock directory of that day/exchange, either externally or in the ITCH file directly (see example below).
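
To make the filename meta data concrete, the sketch below pulls the date, exchange, and file format out of a name shaped like the bundled example file `ex20101224.TEST_ITCH_50`. `parse_itch_filename()` is a hypothetical base R helper and not part of `RITCH`; the pattern is an assumption inferred from that one file name, so names written with other settings may not match.

```r
# Hypothetical helper (not part of RITCH): extract the meta data from a file
# name of the assumed form <prefix><YYYYMMDD>.<EXCHANGE>_ITCH_<version>[.gz]
parse_itch_filename <- function(path) {
  name <- basename(path)
  compressed <- grepl("\\.gz$", name)
  name <- sub("\\.gz$", "", name)

  pattern <- "([0-9]{8})\\.([A-Za-z]+)_(ITCH_[0-9]+)$"
  m <- regmatches(name, regexec(pattern, name))[[1]]
  if (length(m) == 0) stop("file name does not follow the assumed pattern")

  list(
    date = as.Date(m[2], format = "%Y%m%d"),
    exchange = m[3],
    format = m[4],
    compressed = compressed
  )
}

meta <- parse_itch_filename("ex20101224.TEST_ITCH_50.gz")
str(meta)
```

Keeping this information in the file name means it survives even though additional columns are not written to the ITCH file itself.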
### Simple Write Example
A simple write example would be to read all modifications from an ITCH file and save them to a separate file to save space, reduce read times later on, etc.
```{r}
file <- system.file("extdata", "ex20101224.TEST_ITCH_50", package = "RITCH")
md <- read_modifications(file, quiet = TRUE)
dim(md)
names(md)
outfile <- write_itch(md, "modifications", compress = TRUE)
# compare file sizes
files <- c(full_file = file, subset_file = outfile)
format_bytes(sapply(files, file.size))
```
```{r, include = FALSE}
unlink(outfile)
```
### Comprehensive Write Example
A typical workflow would look like this:
- read in some message classes from file and filter for certain stocks
- save the results for later analysis, also compress to save disk space
```{r}
## Read in the different message classes
file <- system.file("extdata", "ex20101224.TEST_ITCH_50", package = "RITCH")
# read in the different message types
data <- read_itch(file,
c("system_events", "stock_directory", "orders"),
filter_stock_locate = c(1, 3),
quiet = TRUE)
str(data, max.level = 1)
## Write the different message classes
outfile <- write_itch(data,
"alc_char_subset",
compress = TRUE)
outfile
# compare file sizes
format_bytes(
sapply(c(full_file = file, subset_file = outfile),
file.size)
)
## Lastly, compare the two datasets to see if they are identical
data2 <- read_itch(outfile, quiet = TRUE)
all.equal(data, data2)
```
```{r, include=FALSE}
# remove files from write_itch again...
unlink(outfile)
outfile_unz <- gsub("\\.gz$", "", outfile)
unlink(outfile_unz)
```
For comparison, the same data stored in the [`qs`](https://CRAN.R-project.org/package=qs) format takes `44788` bytes.
<!---qs::qsave(data, "data.qs", preset = "archive");file.info("data.qs")[["size"]];unlink("data.qs")-->
## ITCH Messages
There are a total of 22 different message types which are grouped into 13 classes by `RITCH`.
The messages and their respective classes are:
```{r, echo=FALSE}
d <- get_msg_classes()
d$msg_type <- paste0("<code>", d$msg_type, "</code>")
d$read_function <- paste0("<code>", "read_", d$msg_class, "()", "</code>")
data.table::setcolorder(d, c("msg_type", "msg_class", "read_function",
"msg_name", "doc_nr"))
data.table::setnames(d, c("Type", "<code>RITCH</code> Class",
"<code>RITCH</code> Read Function", "ITCH Name",
"ITCH Spec Section"))
knitr::kable(d, escape = FALSE)
```
Note that if you are interested in the exact definition of the messages and their components, you should look into the [official ITCH specification](https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/NQTVITCHspecification.pdf), which can also be opened by calling `open_itch_specification()`.
## Data
The `RITCH` package provides a small, artificial dataset in the ITCH format for example and test purposes.
To learn more about the dataset check `?ex20101224.TEST_ITCH_50`.
To access the dataset use:
```{r}
file <- system.file("extdata", "ex20101224.TEST_ITCH_50", package = "RITCH")
count_messages(file, add_meta_data = TRUE, quiet = TRUE)
```
Note that the example dataset does not contain messages from all classes but is limited to 6 system messages, 3 stock directory, 3 stock trading action, 5000 trade, 5000 order, and 2000 order modification messages.
As seen by the 3 stock directory messages, the file contains data about 3 made-up stocks (see also the plot later in the Readme).
NASDAQ provides sample ITCH files on their official server at <https://emi.nasdaq.com/ITCH/Nasdaq%20ITCH/> (or in R use `open_itch_sample_server()`) which can be used to test code on larger datasets.
Note that the sample files are up to 5GB compressed and inflate to about 13GB when decompressed.
To interact with the sample files, use `list_sample_files()` and `download_sample_file()`.
## Notes on Memory and Speed
There are some tweaks available to deal with memory and speed issues.
For faster reading speeds, you can increase the buffer size of the `read_` functions to something around 1 GB or more (`buffer_size = 1e9`).
### Provide Message Counts
If you have to read from a single file multiple times, for example because you want to extract orders and trades, you can count the messages beforehand and pass the result to the `n_max` argument of each read, which saves an extra pass over the file to count the messages.
```{r}
# count messages once
n_msgs <- count_messages(file, quiet = TRUE)
# use counted messages multiple times, saving file passes
orders <- read_orders(file, quiet = TRUE, n_max = n_msgs)
trades <- read_trades(file, quiet = TRUE, n_max = n_msgs)
```
### Batch Read
If the dataset does not fit entirely into RAM, you can do a partial read specifying `skip` and `n_max`, similar to this:
```{r}
file <- system.file("extdata", "ex20101224.TEST_ITCH_50", package = "RITCH")
n_messages <- count_orders(count_messages(file, quiet = TRUE))
n_messages
# read 1000 messages at a time
n_batch <- 1000
n_parsed <- 0
while (n_parsed < n_messages) {
cat(sprintf("Parsing Batch %04i - %04i", n_parsed, n_parsed + n_batch))
# read in a batch
df <- read_orders(file, quiet = TRUE, skip = n_parsed, n_max = n_batch)
cat(sprintf(": with %04i orders\n", nrow(df)))
# use the data
# ...
n_parsed <- n_parsed + n_batch
}
```
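
The loop above always requests `n_batch` messages, so the last batch may come back shorter than requested. If you prefer to know the exact batch boundaries up front, for example to pre-allocate results, they can be computed with a few lines of base R. `batch_plan()` is a hypothetical helper and not part of `RITCH`; its two columns are meant to be passed on as the `skip` and `n_max` arguments used above.

```r
# Hypothetical helper (not part of RITCH): compute skip/n_max pairs that
# cover n_total messages in batches of at most n_batch messages
batch_plan <- function(n_total, n_batch) {
  stopifnot(n_total >= 0, n_batch > 0)
  if (n_total == 0) {
    return(data.frame(skip = numeric(0), n_max = numeric(0)))
  }
  # the first message of each batch, counted from zero
  skip <- seq(0, n_total - 1, by = n_batch)
  # the last batch only holds the remaining messages
  data.frame(skip = skip, n_max = pmin(n_batch, n_total - skip))
}

plan <- batch_plan(n_total = 2500, n_batch = 1000)
plan
```

Each row then describes one call, along the lines of `read_orders(file, skip = plan$skip[i], n_max = plan$n_max[i])`.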
### Filter when Reading Data
You can also filter a dataset while reading it, by `msg_type`, `stock_locate`, `timestamp` range, and `stock`.
Note that filtering for a specific stock is just a shorthand lookup for the stock's `stock_locate` code. A `stock_directory` therefore needs to be supplied (the output of either `read_stock_directory()` or `download_stock_directory()`); otherwise the function will try to extract the stock directory from the file, which might take some time depending on the size of the file.
```{r}
# read in the stock directory as we filter for stock names later on
sdir <- read_stock_directory(file, quiet = TRUE)
od <- read_orders(
file,
filter_msg_type = "A", # take only 'No MPID add orders'
min_timestamp = 43200000000000, # start at 12:00:00.000000
max_timestamp = 55800000000000, # end at 15:30:00.000000
filter_stock_locate = 1, # take only stock with code 1
filter_stock = "CHAR", # but also take stock CHAR
stock_directory = sdir # provide the stock_directory to match stock names to stock_locates
)
# count the different message types
od[, .(n = .N), by = msg_type]
# see if the timestamp is in the specified range
range(od$timestamp)
# count the stock/stock-locate codes
od[, .(n = .N), by = .(stock_locate, stock)]
```
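
The `min_timestamp` and `max_timestamp` values above are nanoseconds since midnight, as the comments in the chunk indicate (12:00:00 corresponds to `43200000000000`). Typing such numbers by hand is error-prone, so a small conversion function can help. `time_to_ns()` below is a hypothetical base R helper, not a `RITCH` function.

```r
# Hypothetical helper (not part of RITCH): convert "HH:MM:SS" clock times to
# nanoseconds since midnight
time_to_ns <- function(hms) {
  vapply(strsplit(hms, ":", fixed = TRUE), function(parts) {
    parts <- as.numeric(parts)
    stopifnot(length(parts) == 3, !anyNA(parts))
    # hours, minutes, and seconds to seconds, then to nanoseconds
    (parts[1] * 3600 + parts[2] * 60 + parts[3]) * 1e9
  }, numeric(1))
}

ts_range <- time_to_ns(c("12:00:00", "15:30:00"))
format(ts_range, scientific = FALSE)
```

The two values match the ones used in the chunk above and could be passed as `min_timestamp = ts_range[1]` and `max_timestamp = ts_range[2]`.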
### Filter Data to File
On larger files, reading the data into memory might not be the best idea, especially if only a small subset is actually needed.
In this case, the `filter_itch` function will come in handy.
The basic design is identical to the `read_itch` function but instead of reading the messages into memory, they are immediately written to a file.
Taking the filter data example from above, we can do the following:
```{r}
# the function returns the final name of the output file
outfile <- filter_itch(
infile = file,
outfile = "filtered",
filter_msg_type = "A", # take only 'No MPID add orders'
min_timestamp = 43200000000000, # start at 12:00:00.000000
max_timestamp = 55800000000000, # end at 15:30:00.000000
filter_stock_locate = 1, # take only stock with code 1
filter_stock = "CHAR", # but also take stock CHAR
stock_directory = sdir # provide the stock_directory to match stock names to stock_locates
)
format_bytes(file.size(outfile))
# read in the orders from the filtered file
od2 <- read_orders(outfile)
# check that the filtered dataset contains the same information as in the example above
all.equal(od, od2)
```
```{r, include=FALSE}
# remove files from filter_itch again...
unlink(outfile)
```
## Create a Plot with Trades and Orders of the Example Stocks
As a last step, here is a quick visualization of the example dataset:
```{r ETF_plot}
library(ggplot2)
file <- system.file("extdata", "ex20101224.TEST_ITCH_50", package = "RITCH")
# load the data
orders <- read_orders(file, quiet = TRUE)
trades <- read_trades(file, quiet = TRUE)
# replace the buy-factor with something more useful
orders[, buy := ifelse(buy, "Bid", "Ask")]
ggplot() +
geom_point(data = orders,
aes(x = as.POSIXct(datetime), y = price, color = buy), alpha = 0.2) +
geom_step(data = trades, aes(x = as.POSIXct(datetime), y = price), linewidth = 0.2) +
facet_grid(stock~., scales = "free_y") +
theme_light() +
labs(title = "Orders and Trades of Three Simulated Stocks",
subtitle = "Date: 2010-12-24 | Exchange: TEST",
caption = "Source: RITCH package", x = "Time", y = "Price", color = "Side") +
scale_y_continuous(labels = scales::dollar) +
scale_color_brewer(palette = "Set1")
```
## Other Notes
If you find this package useful or have any other kind of feedback, I'd be happy if you let me know. Otherwise, if you need more functionality, please feel free to create an issue or a pull request.
Citation and CRAN release are WIP.
If you are interested in gaining a better understanding of the internal data structures, such as how data is converted to and from binary, have a look at the `debug` folder and its contents (only available on the [RITCH GitHub page](https://github.com/DavZim/RITCH/)).
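
As a small taste of such binary conversions, the sketch below encodes and decodes a timestamp in base R. It assumes, based on the ITCH 5.0 specification rather than on anything stated in this Readme, that timestamps are stored as 6-byte big-endian integers counting nanoseconds since midnight. Both functions are hypothetical illustrations and are unrelated to the `C++` code that `RITCH` uses internally.

```r
# Hypothetical helpers (not part of RITCH)

# encode a non-negative whole number as n big-endian bytes
num_to_bytes <- function(x, n = 6) {
  stopifnot(x >= 0, x < 256^n)
  as.raw((x %/% 256^((n - 1):0)) %% 256)
}

# decode big-endian bytes back into a number
bytes_to_num <- function(b) {
  sum(as.integer(b) * 256^((length(b) - 1):0))
}

b <- num_to_bytes(43200000000000) # 12:00:00 in nanoseconds since midnight
b
bytes_to_num(b)
```

Six bytes hold values up to `256^6 - 1`, which is well within the range that an R double represents exactly, so plain numerics are sufficient for this illustration.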