-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathForecasting MW 9_30.Rmd
803 lines (633 loc) · 35.2 KB
/
Forecasting MW 9_30.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
---
title: "Prediction on test data"
author: "Mark Wang and Paola Aleman"
date: "9/30/2019"
output: word_document
editor_options:
chunk_Output_type: console
Output: word_document
---
# Instructions
This markdown file is used to make predictions based on the whole existing dataset, as well as the four best-performing algorithms identified on the existing dataset.
This markdown should be used along with the training markdown. Existing data shall be processed and models trained using the training markdown. Then this markdown file can be applied.
Note the models are label by "c1k". which stands for c -> colombia, 1 -> club 6101 and k -> kitchen trash bags.
It is necessary to change "c1k" whenever you're analyzing a different set of item and club. This way graphs and models are saved into folders with their specific name.
For easier substitution ctrl+f. In the first box type the name you want to substitute, for example "c1k". and on the second box include the prefix you want to substitute with, for example "c2k". Then click the last All option.
# Installing packages
```{r Installing packages, include=FALSE, results = 'hide'}
# Install the packages just once. Afterwards you only need to load the libraries
source("Code/Install Packages.R")
```
# Prepare data
```{r prepare data}
knitr::opts_knit$set(root.dir = normalizePath("C:/Users/mwang/Desktop/Forecast"))
setwd("C:/Users/mwang/Desktop/Forecast")
getwd()
len_dec<-1 #The frequency in which buyers make purchase decisions
len_inv<-17 #The time span from purchase to selling of an item
#Item, club and country we are analyzing
item <- "Kitchen Trash Bag"
club <- 6101
country <- "Colombia"
load(paste0("Output/Countries/Colombia/", club, ".RData"))
c1k_prediction<-data.frame(TransactionTime=(as.Date(c1k_week$TransactionTime[nrow(c1k_week)])+7*(1:(len_dec+len_inv))), quantity_avrg=NA)
```
# Identify seasons
The periodogram function does not detect high frequency seasons accurately. To solve this problem, one of the built-in functions need to be modified.
Run the following line in r (which is included in this chunk of code):
trace(spec.pgram, edit=TRUE)
Then change line 9 to:
N <- N0 <- nrow(x) * 128
Then click "save"
The function is hence temporarily changed.
Similar to in the training markdown, the identification of seasons here is not completely automated. Personal judgement is still needed.
Firstly, 52.14, the number of weeks per year, is always included.
Secondly, only periods shorter than 52.14 should be included, and their length should be adjusted to its closest whole-month length. For example, if 13.02 is reported is a strong period by Fourier transformation, 12.53 (three months) should be the season used in analyses.
Thirdly, none period should be multiple of the other (otherwise Fourier regressor does not work). Therefore, half-year period is changed from 26.07 to 25.07 weeks. We believe better ways to deal with this problem exist.
```{r iedntify seasons}
trace(spec.pgram, edit=TRUE)
p = periodogram(c1k_week$quantity_avrg)
dd = data.frame(freq=p$freq, spec=p$spec)
order = dd[order(-dd$spec),]
order<-order[(1/order)<60,]
(1/(order$freq))%>%head(10)
season1<-25.07
season2<-52.14
season3<-12.53
```
# Creating time series
```{r creating TS}
t_log<-ts(log(c1k_week$quantity_avrg), frequency = 52.14, start = c(c1k_week$week_number_year[1], c1k_week$week_number[1]))
plot(t_log)
decompose_log <- stl(t_log, s.window = 13, t.window = 13)
plot(decompose_log)
adjust_log<- t_log - decompose_log$time.series[,1] # deseasonalize data
outlier_free_log<- tsclean(adjust_log)
trend_log<- decompose_log$time.series[, 2]
detrend_ts_log <- outlier_free_log-(trend_log - trend_log[1]) # corrected detrending part
plot(adjust_log)
plot(outlier_free_log)
autoplot(adjust_log)+autolayer(outlier_free_log)
plot(detrend_ts_log)
##Training Set: uni-variate methods
training_log<-ts(detrend_ts_log, frequency = 52.14, start = c(c1k_week$week_number_year[1], c1k_week$week_number[1]))
##future trend and seasonality, which will be added back:
trend_fit_log <- auto.arima(trend_log)
trend_for_log <- forecast(trend_fit_log, len_inv+len_dec)$mean
retrend_log<-trend_for_log
reseasonal_log<-forecast(decompose_log$time.series[,1], len_inv+len_dec)$mean
##data visualization
theme_ts <- theme(panel.border = element_rect(fill = NA,
colour = "grey10"),
panel.background = element_blank(),
panel.grid.minor = element_line(colour = "grey85"),
panel.grid.major = element_line(colour = "grey85"),
panel.grid.major.x = element_line(colour = "grey85"),
axis.text = element_text(size = 13, face = "bold"),
axis.title = element_text(size = 15, face = "bold"),
plot.title = element_text(size = 16, face = "bold"),
strip.text = element_text(size = 16, face = "bold"),
strip.background = element_rect(colour = "black"),
legend.text = element_text(size = 15),
legend.title = element_text(size = 16, face = "bold"),
legend.background = element_rect(fill = "white"),
legend.key = element_rect(fill = "white"))
decomp_stl_log<- data.table(Quant = c(t_log, decompose_log$time.series[, 1], decompose_log$time.series[, 2]-decompose_log$time.series[, 2][1], detrend_ts_log),
Date = rep(c1k_week$TransactionTime, ncol(decompose_log$time.series)+1),
Type = factor(rep(c("log quantity", colnames(decompose_log$time.series)),
each = nrow(decompose_log$time.series)),
levels = c("log quantity", colnames(decompose_log$time.series))))
ggplot(decomp_stl_log, aes(x = Date, y = Quant)) +
geom_line() +
facet_grid(Type ~ ., scales = "free_y", switch = "y") +
labs(x = "Time", y = NULL,
title = "Time Series Decomposition by STL (log quantity)") +
theme_ts
decomp_stl_training_log<- decomp_stl_log
decomp_stl_training_log$set<-"training"
decomp_stl_forecast_log<-data.table(Quant=c(rep(NA, len_dec+len_inv), reseasonal_log, retrend_log-trend_log[1], rep(NA, len_dec+len_inv)),
Date = rep(c1k_prediction$TransactionTime, ncol(decompose_log$time.series)+1),
Type = factor(rep(c("log quantity", colnames(decompose_log$time.series)),
each = len_dec+len_inv),
levels = c("log quantity", colnames(decompose_log$time.series))))
decomp_stl_forecast_log$set<-"forecast"
decomp_stl_combined_log<-rbind(decomp_stl_training_log, decomp_stl_forecast_log)
ggplot(decomp_stl_combined_log, aes(x = Date, y = Quant, col=set)) +
geom_line() +
facet_grid(Type ~ ., scales = "free_y", switch = "y") +
labs(x = "Time", y = NULL,
title = "Time Series Decomposition by STL (log quantity)") +
theme_ts
```
# Simple model: Average method
```{r average method}
Average_Method<-meanf(training_log, h = len_dec+len_inv)
autoplot(training_log) +
autolayer(Average_Method,
series = "Average", PI = FALSE)+
ggtitle("test data: average method") +
xlab("Year") + ylab( paste(item, "\n logged, detrended and de-seasonalized"))+
guides(colour=guide_legend(title="Forecast"))
c1k_prediction$Average_Method<-exp(Average_Method$mean+(retrend_log-trend_log[1])+reseasonal_log)
```
# Simple model: Naive method
```{r naive method}
Naive_Method<-naive(training_log, h = len_dec+len_inv)
autoplot(training_log) +
autolayer(Naive_Method,
series = "Naive", PI = FALSE)+
ggtitle("test data: naive method") +
xlab("Year") + ylab( paste(item, "\n logged, detrended and de-seasonalized"))+
guides(colour=guide_legend(title="Forecast"))
c1k_prediction$Naive_Method<-exp(Naive_Method$mean+(retrend_log-trend_log[1])+reseasonal_log)
```
# Simple model: Seasonal Naive method
```{r seasonal naive method}
Seasonal_Naive_Method<-snaive(training_log, h = len_dec+len_inv)
autoplot(training_log) +
autolayer(Seasonal_Naive_Method,
series = "Naive", PI = FALSE)+
ggtitle("test data: naive method") +
xlab("Year") + ylab( paste(item, "\n logged, detrended and de-seasonalized"))+
guides(colour=guide_legend(title="Forecast"))
c1k_prediction$Seasonal_Naive_Method<-exp(Seasonal_Naive_Method$mean+(retrend_log-trend_log[1])+reseasonal_log)
```
# Simple modeol: Drift method
```{r drift method}
# Drift method
Drift_Method<-rwf(training_log, h = len_dec+len_inv, drift = TRUE)
autoplot(training_log) +
autolayer(Drift_Method,
series = "Drift", PI = FALSE)+
ggtitle("test data: drift method") +
xlab("Year") + ylab( paste(item, "\n logged, detrended and de-seasonalized"))+
guides(colour=guide_legend(title="Forecast"))
c1k_prediction$Drift_Method<-exp(Drift_Method$mean+(retrend_log-trend_log[1])+reseasonal_log)
```
# Arima: auto.arima
```{r autoarima}
Simple_Arima<- auto.arima(training_log)
fc_Simple_Arima_1<- forecast(Simple_Arima, len_dec+len_inv, bootstrap = TRUE)
autoplot(fc_Simple_Arima_1) +
xlab("Year") + ylab("Logged Quantity Sold of Kitchen Trash Bags")+
guides(colour=guide_legend(title="Validation Set"))
cr_Simple_Arima_1<-checkresiduals(fc_Simple_Arima_1)
c1k_prediction$Simple_Arima_1<-exp(fc_Simple_Arima_1$mean+(retrend_log-trend_log[1])+reseasonal_log)
```
# ARIMA double season
This section makes Arima predictions based on seasons 1 and 2
```{r ARIMA double season}
#this arima model uses fourier and the three seasonality periods obtained above.
Arima_AIC <- auto.arima(training_log)
bestfit <- list(aicc=Arima_AIC$aicc, i=0, j=0, Arima_Seasons1_2=Arima_AIC)
fc_ARIMA_fourier<- forecast(Arima_AIC, h = len_dec+len_inv)
autoplot(fc_ARIMA_fourier) +
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
for(i in 1:3) {
for (j in 1:3){
z1 <- fourier(ts(training_log, frequency= season1), K=i)
z2 <- fourier(ts(training_log, frequency= season2), K=j)
Arima_Seasons1_2 <- auto.arima(training_log, xreg=cbind(z1, z2), seasonal=F)
if(Arima_Seasons1_2$aicc < bestfit$aicc) {
bestfit <- list(aicc=Arima_Seasons1_2$aicc, i=i, j=j, Arima_Seasons1_2=Arima_Seasons1_2)
}
}
}
bestfit
fc_Arima_Seasons1_2 <- forecast(bestfit$Arima_Seasons1_2, h=len_dec+len_inv,
xreg=cbind(
fourier(ts(training_log, frequency=season1), K=bestfit$i, h=len_dec+len_inv),
fourier(ts(training_log, frequency=season2), K=bestfit$j, h=len_dec+len_inv)))
autoplot(fc_Arima_Seasons1_2) +
xlab("Year") + ylab("Logged Quantity Sold of Kitchen Trash Bags")+
guides(colour=guide_legend(title="Validation Set"))
cr_ARIMA_seasons1_2<-checkresiduals(fc_Arima_Seasons1_2)
c1k_prediction$Arima_Seasons1_2<-exp(fc_Arima_Seasons1_2$mean+(retrend_log-trend_log[1])+reseasonal_log)
```
# ARIMA single season
This section uses grid to tune the K parameter to get the Fourier regressor
```{r ARIMA single season}
Arima_Fourier_AIC<-list(aicc=Inf)
for(K in seq(25)) {
fit <- auto.arima(training_log, xreg=fourier(training_log, K=K),
seasonal=FALSE)
if(fit[["aicc"]] < Arima_Fourier_AIC[["aicc"]]) {
Arima_Fourier_AIC <- fit
bestK <- K
}
}
Arima_Fourier_AIC
fc_Arima_Fourier_AIC <- forecast(Arima_Fourier_AIC,xreg=fourier(training_log, K=bestK, h=len_dec+len_inv))
autoplot(fc_Arima_Fourier_AIC)+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
cr_ARIMA_Fourier_AIC<-checkresiduals(fc_Arima_Fourier_AIC)
c1k_prediction$Arima_Fourier_AIC<-exp(fc_Arima_Fourier_AIC$mean+(retrend_log-trend_log[1])+reseasonal_log)
```
# TBATS
```{r TBATS}
# Uses a combination of Fourier terms with an exponential smoothing state space model and a Box-Cox transformation. Seasonality is allowed to change slowly over time.
# Raw_1
# Feed Raw Data to this model [with seasonality, trend and outliers]
fit_TBATS_Raw_1<- tbats(t_log, use.box.cox = NULL, use.trend = TRUE, use.damped.trend = NULL, seasonal.periods = 52, use.arma.errors = TRUE, biasadj = TRUE)
fc_TBATS_Raw_1<- forecast(fit_TBATS_Raw_1, h=len_dec+len_inv, bootstrap = TRUE)
autoplot(fc_TBATS_Raw_1) +
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_TBATS_Raw_1)
c1k_prediction$fc_TBATS_Raw_1<-exp(fc_TBATS_Raw_1$mean)
# TBATS Model with top 3 seasonal periods
## Seasons 1 and 2
fc_TBATS_Season1_2 <- forecast(tbats(t_log, use.box.cox = NULL, use.trend = NULL, use.damped.trend = NULL, seasonal.periods = c(season2,season1), use.arma.errors = TRUE, biasadj = TRUE), h=len_dec+len_inv, bootstrap = TRUE)
autoplot(fc_TBATS_Season1_2)+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_TBATS_Season1_2)
c1k_prediction$fc_TBATS_Season1_2<-exp(fc_TBATS_Season1_2$mean)
## Seasons 2 and 3
fc_TBATS_Season2_3<- forecast(tbats(t_log, use.box.cox = NULL, use.trend = NULL, use.damped.trend = NULL, seasonal.periods = c(season2,season3), use.arma.errors = TRUE, biasadj = TRUE),h=len_dec+len_inv)
autoplot(fc_TBATS_Season2_3)+ xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_TBATS_Season2_3)
c1k_prediction$fc_TBATS_Season2_3<-exp(fc_TBATS_Season2_3$mean)
## Seasons 1 and 3
fc_TBATS_Season1_3 <- forecast(tbats(t_log, use.box.cox = NULL, use.trend = NULL, use.damped.trend = NULL, seasonal.periods = c(season1,season3), use.arma.errors = TRUE, biasadj = TRUE), h=len_dec+len_inv, bootstrap = TRUE)
autoplot(fc_TBATS_Season1_3)+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_TBATS_Season1_3)
c1k_prediction$fc_TBATS_Season1_3<-exp(fc_TBATS_Season1_3$mean)
# really simple tbats
fit_TBATS_NoSeason<- tbats(t_log)
fc_TBATS_NoSeason<- forecast(fit_TBATS_NoSeason, h=len_dec+len_inv)
autoplot(fc_TBATS_NoSeason)+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
ggtitle(paste("Forecast for Quantity Sold of", item, "\n", country, ": Club", club)) +
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_TBATS_NoSeason)
c1k_prediction$fc_TBATS_NoSeason<-exp(fc_TBATS_NoSeason$mean)
```
# Neural network
```{r Neural Networks}
#Neural Networks with detrneded and de-seasonalized data
#NNAR(p,P,k)m -> p = lagged inputs, P = equivalent to ARIMA(p,0,0)(P,0,0)m, k = nods in the single hidden layer
fit_NN_1<- nnetar(training_log, lambda = "auto")
fc_NN_1<- forecast(fit_NN_1, h=len_dec+len_inv)
autoplot(fc_NN_1) + xlab("Year") + ylab(paste("Logged Quantity Sold of", item))
c1k_week_quantity_models<-performance_index_dtds(df_Actual = c1k_week_quantity_models, pred_name = "fc_NN_1", pred_value = fc_NN_1$mean)
c1k_prediction$fc_NN_1<-exp(fc_NN_1$mean+(retrend_log-trend_log[1])+reseasonal_log)
# neural network with raw data
fit_NN_Raw<- nnetar(t_log, lambda = "auto")
fc_NN_Raw<- forecast(fit_NN_Raw, h=len_dec+len_inv)
autoplot(fc_NN_Raw) +
xlab("Year") + ylab("Logged Quantity Sold of Kitchen Trash Bags")
c1k_prediction$fc_NN_Raw<-exp(fc_NN_Raw$mean)
```
# Machine learning setup
```{r Machine Learning setup}
# feature engineering
c1k_week$month<-month(c1k_week$TransactionTime)
c1k_week$month<-as.factor(c1k_week$month)
c1k_week$log1p_transaction_avrg<-log1p(c1k_week$transaction_avrg) #
c1k_week$log1p_members_avrg<-log1p(c1k_week$members_avrg) #
c1k_week$log1p_sales_local_avrg<-log1p(c1k_week$sales_local_avrg) #
c1k_week$log1p_sales_usd_avrg<-log1p(c1k_week$sales_usd_avrg) #
c1k_week$log1p_category_sales_local_avrg<-log1p(c1k_week$category_sales_local_avrg) #
c1k_week$log1p_quantity_avrg<-log1p(c1k_week$quantity_avrg) #
c1k_week$log1p_category_sales_local_avrg<- log1p(c1k_week$category_sales_local_avrg)
c1k_week$log1p_category_sales_usd_avrg<-log1p(c1k_week$category_sales_usd_avrg)
c1k_week$log1p_category_quantity_avrg<- log1p(c1k_week$category_quantity_avrg)
c1k_week$log1p_salePrice_local_avrg<- log1p(c1k_week$salePrice_local_avrg)
c1k_week$log1p_salePrice_usd_avrg<- log1p(c1k_week$salePrice_usd_avrg)
c1k_week_prediction<-data.frame(matrix(nrow=len_dec+len_inv, ncol = ncol(c1k_week)))
colnames(c1k_week_prediction)<-colnames(c1k_week)
c1k_week_prediction$TransactionTime<-c1k_prediction$TransactionTime
c1k_week_prediction$week_number<-c1k_week_prediction$TransactionTime%>%date2ISOweek()%>%substr(7,8)
c1k_week_prediction$week_number_year<-c1k_week_prediction$TransactionTime%>%date2ISOweek()%>%substr(1,4)
c1k_week_prediction$month<-month(c1k_week_prediction$TransactionTime)
t_v_c1k<-rbind(as.data.frame(c1k_week), c1k_week_prediction)
t_v_c1k<- t_v_c1k[order(t_v_c1k$TransactionTime),]%>%ungroup()
t_v_c1k<- feature_engineering(t_v_c1k, c("quantity_avrg", "transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg","salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg","log1p_category_sales_local_avrg","log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg"))
# creating training matrix
t_v_c1k<- fastDummies::dummy_cols(t_v_c1k, select_columns = c("week_number_year", "week_number", "month"))
t_v_c1k$month<-NULL
t_v_c1k$week_number_year<-NULL
t_v_c1k$week_number<-NULL
train_c1k<-t_v_c1k[1:(nrow(t_v_c1k)-len_dec-len_inv),]
test_c1k<-t_v_c1k[(nrow(t_v_c1k)-len_dec-len_inv+1):nrow(t_v_c1k),]
```
# XGB model
```{r XGB MODEL}
trainc1k_XGB<-train_c1k%>%select(-c("transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg","salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg", "log1p_transaction_avrg", "log1p_members_avrg", "log1p_sales_local_avrg", "log1p_sales_usd_avrg", "log1p_category_sales_local_avrg","log1p_category_sales_usd_avrg", "log1p_category_quantity_avrg","log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg","TransactionTime"))
testc1k_XGB<-test_c1k%>%select(-c("transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg","salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg", "log1p_transaction_avrg", "log1p_members_avrg", "log1p_sales_local_avrg", "log1p_sales_usd_avrg", "log1p_category_sales_local_avrg","log1p_category_sales_usd_avrg", "log1p_category_quantity_avrg","log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg","TransactionTime"))
testc1k_XGB$quantity_avrg<-0
trainTask <- makeRegrTask(data = trainc1k_XGB, target = "quantity_avrg")
testTask <- makeRegrTask(data = testc1k_XGB, target = "quantity_avrg")
xgb_learner <- makeLearner(
"regr.xgboost",
predict.type = "response",
par.vals = list(
objective = "reg:squarederror",
eval_metric = "rmse",
nrounds = 200
)
)
# Create a model
xgb_model <- mlr::train(xgb_learner, task = trainTask)
xgb_params <- makeParamSet(
# The number of trees in the model (each one built sequentially)
makeIntegerParam("nrounds", lower = 100, upper = 500),
# number of splits in each tree
makeIntegerParam("max_depth", lower = 1, upper = 10),
# "shrinkage" - prevents overfitting
makeNumericParam("eta", lower = .1, upper = .5),
# L2 regularization - prevents overfitting
makeNumericParam("lambda", lower = -1, upper = 0, trafo = function(x) 10^x)
)
control <- makeTuneControlRandom(maxit = 1)
resample_desc <- makeResampleDesc("CV", iters = 10)
tuned_params <- tuneParams(
learner = xgb_learner,
task = trainTask,
resampling = resample_desc,
par.set = xgb_params,
control = control
)
xgb_tuned_learner <- setHyperPars(
learner = xgb_learner,
par.vals = tuned_params$x
)
xgb_model <- mlr::train(xgb_tuned_learner, trainTask)
XGBoost_pred <- predict(xgb_model ,testTask)
c1k_week_quantity_models<-performance_index_raw(df_Actual = c1k_week_quantity_models, pred_name = "XGBoost", pred_value = log(XGBoost_pred$data$response))
c1k_prediction$XGBoost<-log(XGBoost_pred$data$response)
```
# Random Forest 1
```{r Random Forest 1, fig.width=12}
# RF1
trainc1k_RF<-train_c1k%>%select(-c("transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg","salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg", "log1p_transaction_avrg", "log1p_members_avrg", "log1p_sales_local_avrg", "log1p_sales_usd_avrg", "log1p_category_sales_local_avrg","log1p_category_sales_usd_avrg", "log1p_category_quantity_avrg","log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg","TransactionTime"))
testc1k_RF<-test_c1k%>%select(-c("transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg","salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg", "log1p_transaction_avrg", "log1p_members_avrg", "log1p_sales_local_avrg", "log1p_sales_usd_avrg", "log1p_category_sales_local_avrg","log1p_category_sales_usd_avrg", "log1p_category_quantity_avrg","log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg","TransactionTime"))
RF_1 <- randomForest(quantity_avrg ~. , data = trainc1k_RF,
ntree = 1000, mtry = 3, nodesize = 5, importance = TRUE, na.action = na.omit)
varImpPlot(RF_1, main = "Variable importance")
RF1_pred<-predict(RF_1, testc1k_RF)
c1k_prediction$RF_1<-RF1_pred
```
# Random Forest 2
```{r Random Forest 2}
# RF2
#Defining the Control
trControl<- trainControl(method = "cv", number = 10, search = "grid")
metric <- "RMSE"
seed<- set.seed(156230)
```
## Step 1 Run a Default model
```{r Random Forest 2 (1)}
rf_default<- caret::train(quantity_avrg~ .
, data = trainc1k_RF
, method = "rf", metric = "RMSE", trControl = trControl, na.action=na.exclude)
rf_default
#RMSE was used to select the optimal model using the smallest value.
```
## Step 2 Search best mtry
```{r Random Forest 2 (2)}
#mtry is the number of variables available for splitting at each tree node
tuneGrid<- expand.grid(.mtry = seq(1, 60, by=5))
rf_mtry<- caret::train(quantity_avrg ~ .,
data = trainc1k_RF,
method = "rf", metric = "RMSE", tuneGrid = tuneGrid, trControl = trControl, importance = TRUE, na.action=na.exclude)
rf_mtry
#RMSE was used to select the optimal model using the smallest value.
best_mtry<- rf_mtry$bestTune$mtry #store the best value for mtry
min(rf_mtry$results$RMSE)
```
## Step 3 Search Best Maxnodes
```{r Random Forest 2 (3)}
store_maxnode<- list() # create a list to find the optimal max of nodes
tuneGrid<- expand.grid(.mtry= best_mtry)
for (maxnodes in c(1, 2, 3, 4, 5, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70)){
set.seed(156230)
rf_maxnode<- caret::train(quantity_avrg ~ .,
data = trainc1k_RF,
method = "rf", metric = "RMSE", tuneGrid = tuneGrid, trControl = trControl, importance = TRUE, maxnodes = maxnodes, nodesize = 4, na.action = na.exclude)
current_iteration<- toString(maxnodes)
store_maxnode[[current_iteration]]<- rf_maxnode
}
results_node<- resamples(store_maxnode)
results_node<-summary(results_node)
nnode_optimal<-results_node$models[results_node$statistics$RMSE%>%as.data.frame()%>%select("Mean")%>%as.matrix()%>%as.numeric()%>%which.min()]%>%as.numeric()
```
## Step 4 Search the best ntrees
```{r Random Forest 2 (4)}
store_maxtrees <- list()
for (ntree in c(10, 20, 30, 40, 50, 100 , 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700)) {
set.seed(156230)
rf_maxtrees <- caret::train(quantity_avrg ~ .,
data = trainc1k_RF,
method = "rf",
metric = "RMSE",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
maxnodes = nnode_optimal,
ntree = ntree,
na.action = na.exclude)
key <- toString(ntree)
store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
results_tree<-summary(results_tree)
ntrees_optimal<-results_tree$models[results_tree$statistics$RMSE%>%as.data.frame()%>%select("Mean")%>%as.matrix()%>%as.numeric()%>%which.min()]%>%as.numeric()
```
## Step 5 Run model with the best specifications found above
```{r Random Forest 2 (5)}
fit_rf<- caret::train(quantity_avrg ~ .,
data = trainc1k_RF,
method = "rf",
metric = "RMSE",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
maxnodes = nnode_optimal,
ntree = ntrees_optimal,
na.action = na.exclude)
```
## Step 6 Evaluate Model
```{r Random Forest 2 (6)}
pred_rf2<- predict(fit_rf,test_c1k, predict.all = TRUE)
c1k_prediction$RF_2<-c(pred_rf2, rep(NA, len_dec+len_inv-length(pred_rf2)))
```
# Random Forest 3
```{r Random Forest 3, fig.width=12}
RF_3<- randomForest(quantity_avrg ~ .,
data = trainc1k_RF,
mtry = best_mtry,
importance = TRUE,
maxnodes = nnode_optimal,
ntree = ntrees_optimal,
na.action = na.exclude)
varImpPlot(RF_3)
pred_rf3 <- predict(RF_3, test_c1k)
pred_rf3<- data.table(Predicted_Quantity = pred_rf3, TransactionTime = test_c1k$TransactionTime)
ggplot() +
geom_line(pred_rf3, mapping = aes(TransactionTime, Predicted_Quantity,color = "Predicted Quantity"))+
labs(x = "Time", y = "Quantity", title = "Random Forest forecasts") +
scale_color_manual(values = c("Predicted Quantity" = 'darkblue', "Actual Quantity" = 'red'))
c1k_prediction$RF_3<-c(pred_rf3$Predicted_Quantity, rep(NA, len_dec+len_inv-length(pred_rf3$Predicted_Quantity)))
```
# Linear regression: Ridge, Random Forest and elastic
```{r Linear Regression}
train_c1k_GLMNET<-drop_na(train_c1k%>%select(-c("transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg", "salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg", "log1p_transaction_avrg", "log1p_members_avrg", "log1p_sales_local_avrg", "log1p_sales_usd_avrg", "log1p_category_sales_local_avrg","log1p_category_sales_usd_avrg", "log1p_category_quantity_avrg", "log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg", "TransactionTime")))
test_c1k_GLMNET<-drop_na(test_c1k%>%select(-c("quantity_avrg", "transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg", "salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg", "log1p_transaction_avrg", "log1p_members_avrg", "log1p_sales_local_avrg", "log1p_sales_usd_avrg", "log1p_category_sales_local_avrg","log1p_category_sales_usd_avrg", "log1p_category_quantity_avrg","log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg", "TransactionTime")))
y_train<- data.matrix(train_c1k_GLMNET["quantity_avrg"])
x_train<-data.matrix(subset(train_c1k_GLMNET, select=-c(quantity_avrg)))
x_test<-data.matrix(test_c1k_GLMNET)
```
## Ridge
```{r Ridge}
ridge <- glmnet(x_train,y_train,alpha = 0)
plot(ridge, xvar = "lambda")
ridge$lambda %>% head()
#Tuning to find the right value for lamda
ridge_cv <- cv.glmnet(x_train,y_train,alpha = 0)
plot(ridge_cv)
# as we constrain our coefficients with log ( λ ) ≥ 0 penalty, the MSE rises considerably. The numbers at the top of the plot (38) just refer to the number of variables in the model. Ridge regression does not force any variables to exactly zero so all features will remain in the model
#The first and second vertical dashed lines represent the λ value with the minimum MSE and the largest λ value within one standard error of the minimum MSE.
#extract our minimum and one standard error MSE and λ values
min(ridge_cv$cvm) #minimum MSE
ridge_cv$lambda.min #lambda for this minimum MSE
ridge_cv$cvm[ridge_cv$lambda == ridge_cv$lambda.1se] # 1 st.error of min MSE
ridge_cv$lambda.1se # lambda for this MSE
ridge_min <- glmnet(x_train,y_train, alpha = 0)
plot(ridge_min, xvar = "lambda")
abline(v = log(ridge_cv$lambda.1se), col = "red", lty = "dashed")
#Most Influential Feautures to predict accuracy
coef(ridge_cv, s = "lambda.1se") %>%
tidy() %>%
filter(row != "(Intercept)") %>%
top_n(10, wt = abs(value)) %>%
ggplot(aes(value, reorder(row, value), color = value > 0)) +
geom_point(show.legend = FALSE) +
ggtitle("Influential variables") +
xlab("Coefficient") +
ylab(NULL)
min(ridge_cv$cvm)
#Ridge model will retian all variables. Therefore, a ridge model is good only if we beleve that we need to retain all features in the model yet reduce the noise that less influential variable smay create and minimize collinearity. Ridge doesn't perform feature selection
##PREDICTING
ridge_pred<- predict(ridge_cv, s=ridge_cv$lambda.1se, x_test, type = "response")
ridge_pred_train<-predict(ridge_cv, s=ridge_cv$lambda.1se, x_train, type = "response")
ridge_pred_train<-merge(data.frame(week=1:nrow(train_c1k)), data.frame(week=ridge_pred_train%>%rownames()%>%as.numeric(), pred=ridge_pred_train), all.x = TRUE)
colnames(ridge_pred_train)<-c("week", "pred")
#Graph
pred_ridge<- data.frame(Predicted_Quantity = ridge_pred, TransactionTime = test_c1k$TransactionTime[1:length(ridge_pred)])
colnames(pred_ridge)<-c("Predicted_Quantity", "TransactionTime")
ggplot() +
geom_line(pred_ridge, mapping = aes(TransactionTime, Predicted_Quantity, color = "Predicted Quantity"))+
labs(x = "Time", y = "Quantity", title = "Ridge") + scale_color_manual(values = c("Predicted Quantity" = 'darkblue', "Actual Quantity" = 'red'))
c1k_prediction$Ridge<-c(pred_ridge$Predicted_Quantity, rep(NA, len_dec+len_inv-length(pred_ridge$Predicted_Quantity)))
```
## Lasso
```{r Lasso}
lasso <- glmnet(x_train,y_train,alpha = 1)
plot(lasso, xvar = "lambda")
# when log(λ)=− 3 all 8 variables are in the model, when log(λ)=−1 2 variables are retained
#Tuning to find the right value for lamda
lasso_cv <- cv.glmnet(x_train,y_train,alpha = 1)
plot(lasso_cv)
#extract our minimum and one standard error MSE and λ values
min(lasso_cv$cvm) #minimum MSE
lasso_cv$lambda.min #lambda for this minimum MSE
lasso_cv$cvm[lasso_cv$lambda == lasso_cv$lambda.1se] # 1 st.error of min MSE
lasso_cv$lambda.1se # lambda for this MSE
lasso_min <- glmnet(x_train,y_train, alpha = 1)
plot(lasso_min, xvar = "lambda")
abline(v = log(lasso_cv$lambda.min), col = "red", lty = "dashed")
abline(v = log(lasso_cv$lambda.1se), col = "red", lty = "dashed")
#Most Influential Feautures to predict accuracy
coef(lasso_cv, s = "lambda.1se") %>%
tidy() %>%
filter(row != "(Intercept)") %>%
ggplot(aes(value, reorder(row, value), color = value > 0)) +
geom_point(show.legend = FALSE) +
ggtitle("Influential variables") +
xlab("Coefficient") +
ylab(NULL)
##Predicting
lasso_pred<- predict(lasso_cv, s=lasso_cv$lambda.min, x_test, type = "response")
c1k_prediction$Lasso<-c(lasso_pred, rep(NA, len_dec+len_inv-length(lasso_pred)))
```
## Elstic Net
```{r Elastic Net}
#Tune λ and the alpha parameters.
# maintain the same folds across all models
fold_id <- sample(1:10, size = length(y_train), replace=TRUE)
# search across a range of alphas
tuning_grid <- tibble::tibble(
alpha = seq(0, 1, by = .1),
mse_min = NA,
mse_1se = NA,
lambda_min = NA,
lambda_1se = NA
)
#Now we can iterate over each alpha value, apply a CV elastic net, and extract the minimum and one standard error MSE values and their respective λ values.
for(i in seq_along(tuning_grid$alpha)) {
# fit CV model for each alpha value
fit <- cv.glmnet(x_train, y_train, alpha = tuning_grid$alpha[i], foldid = fold_id)
# extract MSE and lambda values
tuning_grid$mse_min[i] <- fit$cvm[fit$lambda == fit$lambda.min]
tuning_grid$mse_1se[i] <- fit$cvm[fit$lambda == fit$lambda.1se]
tuning_grid$lambda_min[i] <- fit$lambda.min
tuning_grid$lambda_1se[i] <- fit$lambda.1se
}
tuning_grid
#plot the MSE
elastic_tuning<-tuning_grid %>%mutate(se = mse_1se - mse_min)
elastic_tuning%>%
ggplot(aes(alpha, mse_min)) +
geom_line(size = 2) +
geom_ribbon(aes(ymax = mse_min + se, ymin = mse_min - se), alpha = .25) +
ggtitle("MSE ± one standard error")
alpha_optimal<-(elastic_tuning%>%as.data.frame()%>%select("alpha"))[which.min(elastic_tuning%>%as.data.frame()%>%select("mse_1se")%>%unlist()%>%as.numeric()),1]
#advantage of the elastic net model is that it enables effective regularization via the ridge penalty with the feature selection characteristics of the lasso penalty
elastic_cv<-cv.glmnet(x_train,y_train,alpha = alpha_optimal)
elastic_pred<- predict(elastic_cv, s=elastic_cv$lambda.1se, x_test, type = "response")
c1k_prediction$elastic<-c(elastic_pred, rep(NA, len_dec+len_inv-length(lasso_pred)))
```
# Final prediction
```{r Final prediction}
c1k_prediction$quantity_avrg<-rowMeans(c1k_prediction%>%select(best_two_total_ml[1],
best_two_total_ml[2],
best_two_total[1],
best_two_total[2]))
c1k_prediction$week_begin<-c1k_prediction$TransactionTime-2
c1k_prediction$week_end<-c1k_prediction$TransactionTime+4
```
# Benchmark model 1: HoltWinter Smoothing
```{r HW Smoothing}
t_log_HW<- ts(log(c1k_week$quantity_avrg), frequency = 52)
expsmo<-HoltWinters(t_log_HW, seasonal = "additive")
pred_HW<-predict(expsmo, n.ahead = len_dec+len_inv)
c1k_prediction$HW<-exp(c(pred_HW%>%as.numeric(), rep(NA, len_dec+len_inv-length(pred_HW)))+(retrend_log-trend_log[1])+reseasonal_log)
```
# Benchmark model 2: Annual Growth
```{r Annual Growth}
annual_growth_last<-t_log_HW[length(t_log_HW)]/lag(t_log_HW%>%as.numeric(), n = 52)[length(t_log_HW)]
pred_AG<-c(c1k_week$quantity_avrg, rep(NA, len_dec+len_inv))
pred_AG_lag52<-lag(pred_AG%>%as.numeric(), n = 52)%>%tail(len_dec+len_inv)
c1k_prediction$AG<-pred_AG_lag52*annual_growth_last
```
# Output
```{r Output}
# Spreadsheet
write.csv(c1k_prediction%>%select("quantity_avrg", "week_begin", "week_end", "HW", "AG"), "Output/Countries/Colombia/pred_c1k.csv")
# Grphs
test_history<-ts(c1k_week$quantity_avrg, frequency = 52.14, start = c(c1k_week$week_number_year[1], c1k_week$week_number[1]))
c1k_prediction$week_number<-c1k_prediction$TransactionTime%>%date2ISOweek()%>%substr(7,8)%>%as.numeric()
c1k_prediction$week_number_year<-c1k_prediction$TransactionTime%>%date2ISOweek()%>%substr(1,4)%>%as.numeric()
test_pred<-ts(c1k_prediction$quantity_avrg, frequency = 52.14, start = c(c1k_prediction$week_number_year[1], c1k_prediction$week_number[1]))
test_HW<-ts(c1k_prediction$HW, frequency = 52.14, start = c(c1k_prediction$week_number_year[1], c1k_prediction$week_number[1]))
test_AG<-ts(c1k_prediction$AG, frequency = 52.14, start = c(c1k_prediction$week_number_year[1], c1k_prediction$week_number[1]))
autoplot(test_history)+autolayer(test_pred)+autolayer(test_HW)+autolayer(test_AG)+ggtitle("Predictions: Model and Benchmark, 6101")
save.image(paste0("Output/Countries/Colombia/", club, "test.RData"))
```