---
title: "Model training"
author: "Mark Wang & Paola Aleman"
date: "9/30/2019"
output: html_document
editor_options:
  chunk_output_type: console
---
# Instructions
For the following code to run smoothly, the folder structure it reads from and writes to must already exist.
The main folder, Forecast, holds all R Markdown files and has three subfolders: Code, Data and Output. Code holds scripts.
Data holds two further subfolders: Countries and Raw Data. Countries (an optional subfolder) contains one folder per country, for saving country-specific data. Raw Data holds only the raw data.
Output contains a subfolder called Countries with one folder per country; each country folder has two additional folders: Plots and Tables.
The Plots folder stores all graphs created in the script.
The Tables folder stores all CSV files with the performance indices.
Note that models are labeled c 1 k, which stands for c -> Colombia, 1 -> club 6101 and k -> kitchen trash bags.
Change c 1 k whenever you analyze a different combination of item and club, so that graphs and models are saved into folders under their specific name.
For easier substitution use Ctrl+F: in the first box type the prefix to replace, such as c 1 k, in the second box type the replacement prefix, such as c 2 k, then click Replace All.
# Setup
This section sets the language in which the code runs and the working directory. Two parameters, len_dec and len_inv, reflect the purchasing strategy.
Parameter len_dec refers to the frequency (in weeks) at which buyers make purchase decisions; len_inv refers to the time span from purchase to the selling of an item. With len_dec = 1 and len_inv = 17, forecasts cover the next 18 weeks, and the last of those weeks is the target of the purchase decision.
```{r setup, include=FALSE, results='hide'}
Sys.setenv(LANG = "en")
Sys.setlocale("LC_TIME", "English")
knitr::opts_knit$set(root.dir = normalizePath("C:/Users/kangyuwang/OneDrive/portfolio/sales_forecast"))
setwd("C:/Users/kangyuwang/OneDrive/portfolio/sales_forecast")
# The following parameters should be set to reflect the purchasing strategy of Pricemart, which differs across items and clubs. The parameters are in weeks.
len_dec<-1 #The frequency in which buyers make purchase decisions
len_inv<-17 #The time span from purchase to selling of an item
#Item, club and country we are analyzing
item <- "Kitchen Trash Bag"
club <- 6101
country <- "Colombia"
```
# Installing packages
```{r Installing packages, include=FALSE, results = 'hide'}
# Install the packages just once. Afterwards you only need to load the libraries
source("Code/Install Packages.R")
```
# Define functions
Three functions are defined in this section. First, performance_index_dtds calculates the performance of prediction models on de-trended and de-seasonalized quantity data. Second, performance_index_raw calculates the performance of prediction models on raw data. Third, feature_engineering performs feature engineering to prepare data for machine learning.
```{r performance index function}
# This scripts stores performance_index function
performance_index_dtds<-function(df_Actual, pred_name, pred_value){
assign(pred_name, exp(pred_value+(retrend_log-trend_log[1])+reseasonal_log))
df_Actual[,colnames(df_Actual)==pred_name]<-NULL
df_Actual<-cbind(df_Actual, get(pred_name)%>%as.numeric())
colnames(df_Actual)[colnames(df_Actual)=="get(pred_name) %>% as.numeric()"]<-pred_name
assign(paste0("MEAN_", pred_name, "_total"),
mean(get(pred_name))/df_Actual%>%select("Actual")%>%data.matrix()%>%mean()-1,
envir = .GlobalEnv
)
assign(paste0("MEAN_", pred_name, "_target"),
mean((get(pred_name))[(len_inv+1):(len_inv+len_dec)])/mean((df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)])-1,
envir=.GlobalEnv
)
assign(paste0("RMSE_", pred_name, "_total" ),
RMSE(get(pred_name),df_Actual%>%select("Actual")%>%data.matrix()),
envir = .GlobalEnv)
assign(paste0("RMSE_", pred_name, "_target" ),
RMSE((get(pred_name)[(len_inv+1):(len_inv+len_dec)]),(df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)]),
envir = .GlobalEnv)
assign(paste0("RMSE_", pred_name, "_total" ),
RMSE(get(pred_name),df_Actual%>%select("Actual")%>%data.matrix()),
envir = .GlobalEnv)
assign(paste0("RMSE_", pred_name, "_target" ),
RMSE((get(pred_name)[(len_inv+1):(len_inv+len_dec)]),(df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)]),
envir = .GlobalEnv)
assign(paste0("MAE_", pred_name, "_total" ),
MAE(get(pred_name),df_Actual%>%select("Actual")%>%data.matrix()),
envir = .GlobalEnv)
assign(paste0("MAE_", pred_name, "_target" ),
MAE((get(pred_name)[(len_inv+1):(len_inv+len_dec)]),(df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)]),
envir = .GlobalEnv)
assign(paste0("MPE_", pred_name, "_total" ),
MPE(get(pred_name),df_Actual%>%select("Actual")%>%data.matrix()),
envir = .GlobalEnv)
assign(paste0("MPE_", pred_name, "_target" ),
MPE((get(pred_name)[(len_inv+1):(len_inv+len_dec)]),(df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)]),
envir = .GlobalEnv)
assign(paste0("MAPE_", pred_name, "_total" ),
Metrics::mape(get(pred_name),df_Actual%>%select("Actual")%>%data.matrix()),
envir = .GlobalEnv)
assign(paste0("MAPE_", pred_name, "_target" ),
Metrics::mape((get(pred_name)[(len_inv+1):(len_inv+len_dec)]),(df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)]),
envir = .GlobalEnv)
assign(paste0("MASE_", pred_name, "_total" ),
Metrics::mase(df_Actual%>%select("Actual")%>%data.matrix()%>%as.numeric(), get(pred_name)%>%as.numeric()),
envir = .GlobalEnv)
assign(paste0("MASE_", pred_name, "_target" ),
Metrics::mase((df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)]%>%as.numeric(), (get(pred_name)[(len_inv+1):(len_inv+len_dec)])%>%as.numeric()),
envir = .GlobalEnv)
return(df_Actual)
}
performance_index_raw<-function(df_Actual, pred_name, pred_value){
assign(pred_name, exp(pred_value))
df_Actual[,colnames(df_Actual)==pred_name]<-NULL
df_Actual<-cbind(df_Actual, get(pred_name)%>%as.numeric())
colnames(df_Actual)[colnames(df_Actual)=="get(pred_name) %>% as.numeric()"]<-pred_name
assign(paste0("MEAN_", pred_name, "_total"),
mean(get(pred_name))/df_Actual%>%select("Actual")%>%data.matrix()%>%mean()-1,
envir = .GlobalEnv
)
assign(paste0("MEAN_", pred_name, "_target"),
mean((get(pred_name))[(len_inv+1):(len_inv+len_dec)])/mean((df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)])-1,
envir=.GlobalEnv
)
assign(paste0("RMSE_", pred_name, "_total" ),
RMSE(get(pred_name),df_Actual%>%select("Actual")%>%data.matrix()),
envir = .GlobalEnv)
assign(paste0("RMSE_", pred_name, "_target" ),
RMSE((get(pred_name)[(len_inv+1):(len_inv+len_dec)]),(df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)]),
envir = .GlobalEnv)
assign(paste0("MAE_", pred_name, "_total" ),
MAE(get(pred_name),df_Actual%>%select("Actual")%>%data.matrix()),
envir = .GlobalEnv)
assign(paste0("MAE_", pred_name, "_target" ),
MAE((get(pred_name)[(len_inv+1):(len_inv+len_dec)]),(df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)]),
envir = .GlobalEnv)
assign(paste0("MPE_", pred_name, "_total" ),
MPE(get(pred_name),df_Actual%>%select("Actual")%>%data.matrix()),
envir = .GlobalEnv)
assign(paste0("MPE_", pred_name, "_target" ),
MPE((get(pred_name)[(len_inv+1):(len_inv+len_dec)]),(df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)]),
envir = .GlobalEnv)
assign(paste0("MAPE_", pred_name, "_total" ),
Metrics::mape(get(pred_name),df_Actual%>%select("Actual")%>%data.matrix()),
envir = .GlobalEnv)
assign(paste0("MAPE_", pred_name, "_target" ),
Metrics::mape((get(pred_name)[(len_inv+1):(len_inv+len_dec)]),(df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)]),
envir = .GlobalEnv)
assign(paste0("MASE_", pred_name, "_total" ),
Metrics::mase(df_Actual%>%select("Actual")%>%data.matrix()%>%as.numeric(), get(pred_name)%>%as.numeric()),
envir = .GlobalEnv)
assign(paste0("MASE_", pred_name, "_target" ),
Metrics::mase((df_Actual%>%select("Actual")%>%data.matrix())[(len_inv+1):(len_inv+len_dec)]%>%as.numeric(), (get(pred_name)[(len_inv+1):(len_inv+len_dec)])%>%as.numeric()),
envir = .GlobalEnv)
return(df_Actual)
}
feature_engineering<-function(t_v, feature_list){
function_names<-c("avg" , "diff")
function_list<-list(function(x) roll_meanr(x, n= 4), function(x) c(diff(x), NA))
weeks<-c(1, 4, 5, 8, 9, 12, 13, 17, 18, 22, 26, 52)
for (feature in feature_list){
for (lag_number in (weeks[weeks>=len_dec+len_inv])){
assign(paste0("lag", "_", lag_number, "_", feature), lag((t_v%>%select(feature))[[1]], lag_number))
t_v<-cbind(t_v, get(paste0("lag", "_", lag_number, "_", feature))%>%as.numeric())
colnames(t_v)[colnames(t_v)=="get(paste0(\"lag\", \"_\", lag_number, \"_\", feature)) %>% as.numeric()"]<-paste0("lag", "_", lag_number, "_", feature)
for (i in (1:length(function_names))){
assign(paste0(function_names[i], "_", lag_number, "_", feature), lag(function_list[[i]]((t_v%>%select(feature))[[1]]), lag_number))
t_v<-cbind(t_v, get(paste0(function_names[i], "_", lag_number, "_", feature))%>%as.numeric())
colnames(t_v)[colnames(t_v)=="get(paste0(function_names[i], \"_\", lag_number, \"_\", feature)) %>% "]<-paste0(function_names[i], "_", lag_number, "_", feature)
}
}
}
return(t_v)
}
```
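As a quick illustration of the columns feature_engineering creates (a hypothetical toy input, not part of the pipeline): with len_dec = 1 and len_inv = 17, only lags of 18, 22, 26 and 52 weeks qualify, and each selected feature gains lag_<n>_<feature>, avg_<n>_<feature> (4-week rolling mean, then lagged) and diff_<n>_<feature> (first difference, then lagged) columns.
```{r feature engineering example, eval=FALSE}
# Hypothetical toy example, only to show the naming scheme:
toy <- data.frame(quantity_avrg = rnorm(60, mean = 100, sd = 10))
toy_fe <- feature_engineering(toy, c("quantity_avrg"))
colnames(toy_fe)
# "quantity_avrg", "lag_18_quantity_avrg", "avg_18_quantity_avrg",
# "diff_18_quantity_avrg", ..., "lag_52_quantity_avrg",
# "avg_52_quantity_avrg", "diff_52_quantity_avrg"
```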
# Importing Training Dataset
```{r importing, include=FALSE, results='hide'}
train<-read_csv("Data/Raw Data/trainingSet.csv")
head(train, 10)
```
# Dealing with Variables
This section transforms the formats of some variables and creates dummy variables for the month and day-of-week categorical variables.
```{r dealing with variables, include=FALSE, results='hide'}
#Changing the format of variables
##Changing "TransactionDate" from integer to date
train$TransactionDate<- as.Date(train$TransactionDate%>%as.character(), "%Y%m%d")
train$TransactionDate<- ymd(train$TransactionDate)
##Changing "Club Number" from integer to Factor
train$ClubNumber<- as.factor(train$ClubNumber)
##Changing "item" from integer to Factor
train$item<- as.factor(train$item)
#Subsetting to country
country_set<-subset(train, Country == country)
##Changing Month into Factor
country_set$month<-as.factor(strftime(country_set$TransactionDate, "%B"))
##Changing clubNumber into Factor
country_set$ClubNumber<- as.factor(country_set$ClubNumber)
##Day of week
country_set$Day_of_week <- as.factor(strftime(country_set$TransactionDate, "%A"))
#Create Dummy Variables for month and weekday
country_set<- cbind(country_set, fastDummies::dummy_cols(country_set$month))
country_set<- cbind(country_set, fastDummies::dummy_cols(country_set$Day_of_week))
#Take out unnecessary variables
country_set$.data<- NULL
country_set$.data<- NULL # .data appears twice: dummy_cols added one copy for month and one for Day_of_week
country_set$Year<- NULL
country_set$.data_February<- NULL
country_set$.data_Monday <- NULL
#Rename Month Dummy Variables
colnames(country_set)[colnames(country_set)==".data_January"] <- "Jan"
colnames(country_set)[colnames(country_set)==".data_March"] <- "Mar"
colnames(country_set)[colnames(country_set)==".data_April"] <- "Apr"
colnames(country_set)[colnames(country_set)==".data_May"] <- "May"
colnames(country_set)[colnames(country_set)==".data_June"] <- "June"
colnames(country_set)[colnames(country_set)==".data_July"] <- "July"
colnames(country_set)[colnames(country_set)==".data_August"] <- "Aug"
colnames(country_set)[colnames(country_set)==".data_September"] <- "Sep"
colnames(country_set)[colnames(country_set)==".data_October"] <- "Oct"
colnames(country_set)[colnames(country_set)==".data_November"] <- "Nov"
colnames(country_set)[colnames(country_set)==".data_December"] <- "Dec"
#Rename Weekday dummy variables
colnames(country_set)[colnames(country_set)==".data_Tuesday"] <- "Tu"
colnames(country_set)[colnames(country_set)==".data_Thursday"] <- "Th"
colnames(country_set)[colnames(country_set)==".data_Wednesday"] <- "Wed"
colnames(country_set)[colnames(country_set)==".data_Friday"] <- "Fri"
colnames(country_set)[colnames(country_set)==".data_Saturday"] <- "Sat"
colnames(country_set)[colnames(country_set)==".data_Sunday"] <- "Sun"
#Subsetting
c1k<-country_set[country_set$Description==item & country_set$ClubNumber==club,]
c1k<-c1k[order(c1k$TransactionDate),]
c1k$TransactionDate<- as.Date(c1k$TransactionDate)
#Drop unnecessary variables
c1k$ClubNumber<-NULL
c1k$Country<-NULL
c1k$item<-NULL
c1k$Description<-NULL
c1k$Day_of_month<-NULL
c1k$month<- NULL
c1k$weekday<- NULL
c1k$Day_of_week<- NULL
c1k$Year<- NULL
```
# Summary statistics
c1k contains data for one item in one club.
```{r summary statistics}
head(c1k, 10)
# Time span
## Beginning and Ending of dataset
summary(c1k$TransactionDate)
## Gaps: dates in the sample range with no recorded transactions
c(min(c1k$TransactionDate):max(c1k$TransactionDate))[c(min(c1k$TransactionDate):max(c1k$TransactionDate))%in%c1k$TransactionDate==FALSE]%>%as.Date(origin="1970-1-1")
# Quantity distribution, counting the gap dates as zero-sale days
c1k_quant_dist<-c(c1k$quantity, rep(0, sum(c(min(c1k$TransactionDate):max(c1k$TransactionDate))%in%c1k$TransactionDate==FALSE)))
hist(c1k_quant_dist, breaks = 20)
c1k$SalePrice_local<- c1k$sales_local/c1k$quantity
c1k$SalePrice_usd<- c1k$sales_usd/c1k$quantity
# trends and cycles
## Price in local currency hiked in late 2015 and again in late 2018
ggplot(data=c1k, aes(x=TransactionDate, y=SalePrice_local))+geom_line()
## Price in USD increased gradually
ggplot(data=c1k, aes(x=TransactionDate, y=SalePrice_usd))+geom_line()
## Volumes of sales remained relatively stable
ggplot(data=c1k, aes(x=TransactionDate, y=quantity))+geom_line()
```
# Loading the Category Sales
Category-level data, namely the summed sales of all items in the same category as the item under study, are also used.
```{r category sales, include=FALSE, results='hide'}
category<- read_excel("Data/Book1.xlsx")
#renaming variables
colnames(category)[colnames(category)=="quantity"] <- "Category_quantity"
colnames(category)[colnames(category)=="salesusd"] <- "Category_Sales_usd"
colnames(category)[colnames(category)=="sales"] <- "Category_Sales_local"
#changing date formats
category$TransactionDate<-ymd(category$TransactionDate)
#subset to the club under study (ClubNumber == club)
category_club<- subset(category, ClubNumber == club)
category_club$ClubNumber<- NULL
#Merge c1k with category
c1k<- merge(c1k, category_club)
```
# Aggregate data to the weekly level
category_club contains data for the same club, with the sales of all goods in the same category added up. The daily data are then aggregated to weekly averages.
```{r weekly level}
head(category_club)
c1k$TransactionYear<-year(c1k$TransactionDate) #Extract Transaction Year
c1k$week_number<-c1k$TransactionDate%>%date2ISOweek()%>%substr(7,8) # ISO week ("YYYY-Www-d", chars 7-8)
c1k$week_number_year<-c1k$TransactionDate%>%date2ISOweek()%>%substr(1,4) # ISO week-based year (chars 1-4)
#aggregation
c1k_week<-c1k%>%group_by(week_number_year,week_number)%>%summarise(
transaction_avrg=mean(number_transactions, na.rm = TRUE),
members_avrg=mean(number_members, na.rm = TRUE),
sales_local_avrg=mean(sales_local, na.rm = TRUE),
sales_usd_avrg=mean(sales_usd, na.rm = TRUE),
exchange_rate_avrg=mean(exchange_rate, na.rm = TRUE),
category_sales_usd_avrg=mean(Category_Sales_usd, na.rm = TRUE),
category_sales_local_avrg=mean(Category_Sales_local, na.rm = TRUE),
category_quantity_avrg=mean(Category_quantity, na.rm = TRUE),
quantity_avrg=mean(quantity, na.rm = TRUE),
salePrice_local_avrg = mean(SalePrice_local, na.rm = TRUE),
salePrice_usd_avrg = mean(SalePrice_usd, na.rm = TRUE))
c1k_week$TransactionTime<-paste0(c1k_week$week_number_year,"-W", c1k_week$week_number, "-3")%>%ISOweek2date() # "-3": Wednesday of each ISO week
c1k_week$week_number<- as.numeric(c1k_week$week_number)
c1k_week$week_number_year<-as.numeric(c1k_week$week_number_year)
head(c1k_week, 10)
```
# Identify Seasonality and create time series
The periodogram function does not detect high-frequency seasons accurately. To solve this problem, one of the built-in functions needs to be modified.
Run the following line in R:
trace(spec.pgram, edit=TRUE)
Then change line 9 to:
N <- N0 <- nrow(x) * 128
Then click "save".
The function is hence temporarily changed. Judgement is still needed to determine what type of seasonality it is capturing; in this case it is annual, semi-annual and three-month.
The detected seasons are only rough estimates, and we need knowledge about annual time series to determine the real seasons (semi-annual, annual, etc.). The identification of seasons here is therefore not completely automated; personal judgement is still needed.
First, 52.14, the number of weeks per year, is always included.
Second, only periods shorter than 52.14 should be included, and their length should be adjusted to the closest whole-month length. For example, if 13.02 is reported as a strong period by the Fourier transformation, 12.53 (three months) should be the season used in the analyses.
Third, no period should be a multiple of another (otherwise the Fourier regressor does not work). Therefore, the half-year period is changed from 26.07 to 25.07 weeks. We believe better ways to deal with this problem exist; a sketch after the next chunk shows one alternative that avoids editing spec.pgram by hand.
```{r seasons}
trace(spec.pgram, edit=TRUE)
#detecting seasonality
p <- periodogram(c1k_week$quantity_avrg[1:(nrow(c1k_week)-len_inv-len_dec)])
dd <- data.frame(freq=p$freq, spec=p$spec)
order <- dd[order(-dd$spec),]
top10 <- head(order, 10)
# convert frequency to time periods
time = 1/top10$freq
time
season1<-25.07
season2<-52.14
season3<-13.14
```
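Instead of editing spec.pgram through trace(), the same zero-padding can be obtained with the function's pad argument; the following sketch is a suggested alternative, not the method used above, and its output need not match TSA::periodogram exactly:
```{r seasons alternative, eval=FALSE}
# pad = 127 zero-pads the series to (1 + 127) = 128 times its length,
# mirroring the N * 128 edit above.
spec_padded <- spec.pgram(c1k_week$quantity_avrg[1:(nrow(c1k_week)-len_inv-len_dec)],
                          pad = 127, taper = 0, detrend = TRUE, plot = FALSE)
top10_padded <- order(-spec_padded$spec)[1:10]
1/spec_padded$freq[top10_padded] # candidate periods, in weeks
```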
# Creating Time Series
The ts function is used to create the time series; frequency is set to 52.14, the number of weeks in a year.
```{r TS}
#creating time series: log
t_log<-ts(log(c1k_week$quantity_avrg)[1:(nrow(c1k_week)-len_inv-len_dec)], frequency = 52.14, start = c(c1k_week$week_number_year[1], c1k_week$week_number[1]))
plot(t_log)
title(paste("Log sales quantity, club", club))
ggplot(c1k_week)+geom_line(aes(x=TransactionTime, y=quantity_avrg))+xlab("Transaction time")+ylab("weekly average")+ggtitle(paste("Sales quantity", club))
```
## Deseasonalizing Methods
In the stl function, s.window and t.window are both set to 13, which simply follows the example [here](https://otexts.com/fpp2/stl.html).
```{r deseasonalizing}
decompose_log <- stl(t_log, s.window = 13, t.window = 13)
plot(decompose_log)
adjust_log<- t_log - decompose_log$time.series[,1]
```
## Detrend after taking out the outliers
The tsclean function is used to remove outliers.
```{r detrend}
outlier_free_log<- tsclean(adjust_log)
trend_log<- decompose_log$time.series[, 2]
detrend_ts_log <- outlier_free_log-(trend_log - trend_log[1])
# adjust_log is trend and random combined, and outlier_free_log is adjust_log with outliers taken out
autoplot(adjust_log)+autolayer(outlier_free_log)
# detrend_ts_log is random part of time series
plot(detrend_ts_log)
```
# Training and Validation Set: uni-variate methods
The validation set is the last part of the existing data set, with length equal to the sum of len_inv and len_dec (18 weeks here). The training set is the rest of the data set.
```{r training and validation Set}
training_log<-ts(detrend_ts_log[1:(nrow(c1k_week)-len_inv-len_dec)], frequency = 52.14, start = c(c1k_week$week_number_year[1], c1k_week$week_number[1]))
# future trend and seasonality, which will be added back:
trend_fit_log <- auto.arima(trend_log)
trend_for_log <- forecast(trend_fit_log, len_inv+len_dec)$mean
retrend_log<-trend_for_log
reseasonal_log<-forecast(decompose_log$time.series[,1], len_inv+len_dec)$mean
##validation set
validation_log<-ts(log(c1k_week$quantity_avrg)[(nrow(c1k_week)-len_inv-len_dec+1):nrow(c1k_week)], frequency = 52.14, end = c(c1k_week$week_number_year[length(c1k_week$week_number_year)], c1k_week$week_number[length(c1k_week$week_number_year)]))
validation<-ts(c1k_week$quantity_avrg[(nrow(c1k_week)-len_inv-len_dec+1):nrow(c1k_week)], frequency = 52.14, end = c(c1k_week$week_number_year[length(c1k_week$week_number_year)], c1k_week$week_number[length(c1k_week$week_number_year)]))
```
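For reference, the uni-variate models below forecast the random component on the log scale; performance_index_dtds converts such a forecast back into unit sales by re-adding the forecast trend and seasonal components and exponentiating:
```{r back transformation, eval=FALSE}
# The back-transformation applied inside performance_index_dtds:
# pred_value is a model's forecast of the de-trended, de-seasonalized log series
quantity_hat <- exp(pred_value + (retrend_log - trend_log[1]) + reseasonal_log)
```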
# Visualization of Trend, season and random
```{r visualization}
theme_ts <- theme(panel.border = element_rect(fill = NA,
colour = "grey10"),
panel.background = element_blank(),
panel.grid.minor = element_line(colour = "grey85"),
panel.grid.major = element_line(colour = "grey85"),
panel.grid.major.x = element_line(colour = "grey85"),
axis.text = element_text(size = 13, face = "bold"),
axis.title = element_text(size = 15, face = "bold"),
plot.title = element_text(size = 16, face = "bold"),
strip.text = element_text(size = 16, face = "bold"),
strip.background = element_rect(colour = "black"),
legend.text = element_text(size = 15),
legend.title = element_text(size = 16, face = "bold"),
legend.background = element_rect(fill = "white"),
legend.key = element_rect(fill = "white"))
# Decomposition of log series
decomp_stl_log<- data.table(Quant = c(t_log, decompose_log$time.series[, 1], decompose_log$time.series[, 2]-decompose_log$time.series[, 2][1], detrend_ts_log),
Date = rep(c1k_week$TransactionTime[1:(nrow(c1k_week)-len_inv-len_dec)], ncol(decompose_log$time.series)+1),
Type = factor(rep(c("log quantity", colnames(decompose_log$time.series)),
each = nrow(decompose_log$time.series)),
levels = c("log quantity", colnames(decompose_log$time.series))))
ggplot(decomp_stl_log, aes(x = Date, y = Quant)) +
geom_line() +
facet_grid(Type ~ ., scales = "free_y", switch = "y") +
labs(x = "Time", y = NULL,
title = "Time Series Decomposition by STL (log quantity)") +
theme_ts
ggsave("Output/Countries/Colombia/Plots/decomp_training_log_c1k.jpg", width = 6, height = 10)
decomp_stl_training_log<- decomp_stl_log
decomp_stl_training_log$set<-"training"
decomp_stl_forecast_log<-data.table(Quant=c(rep(NA, len_dec+len_inv), reseasonal_log, retrend_log-trend_log[1], rep(NA, len_dec+len_inv)),
Date = rep(c1k_week$TransactionTime[(nrow(c1k_week)-len_dec-len_inv+1): nrow(c1k_week)], ncol(decompose_log$time.series)+1),
Type = factor(rep(c("log quantity", colnames(decompose_log$time.series)),
each = len_dec+len_inv),
levels = c("log quantity", colnames(decompose_log$time.series))))
decomp_stl_forecast_log$set<-"forecast"
decomp_stl_combined_log<-rbind(decomp_stl_training_log, decomp_stl_forecast_log)
ggplot(decomp_stl_combined_log, aes(x = Date, y = Quant, col=set)) +
geom_line() +
facet_grid(Type ~ ., scales = "free_y", switch = "y") +
labs(x = "Time", y = NULL,
title = "Time Series Decomposition by STL (log quantity)") +
theme_ts
ggsave("Output/Countries/Colombia/Plots/decomp_combined_log_c1k.jpg", width = 8, height = 8)
```
# Simple forecasting models
Four simple benchmark models (mean, naïve, seasonal naïve and drift) are used in this section.
```{r simple forecasting}
#Average Method
Average_Method<-meanf(training_log, h = len_dec+len_inv)
#Naive Method
Naive_Method<-naive(training_log, h = len_dec+len_inv)
#Seasonal Naive Method
Seasonal_Naive_Method<-snaive(training_log , h = len_dec+len_inv)
#Drift Method
Drift_Method<-rwf(training_log, h = len_dec+len_inv, drift = TRUE)
autoplot(training_log) +
autolayer(validation_log-(retrend_log-trend_log[1])-reseasonal_log, series="Validation")+
autolayer(Average_Method,
series="Mean", PI=FALSE) +
autolayer(Naive_Method,
series="Naïve", PI=FALSE) +
autolayer(Seasonal_Naive_Method,
series="Seasonal naïve", PI=FALSE) +
autolayer(Drift_Method,
series = "Drift", PI = FALSE)+
ggtitle("Forecasts for Random Component: simple log models") +
xlab("Year") + ylab( paste(item, "\n logged, detrended and de-seasonalized"))+
guides(colour=guide_legend(title="Forecast"))
ggsave("Output/Countries/Colombia/Plots/simple_log_total_c1k.jpg", width = 10, height = 5)
```
## Summarize performance
c1k_week_quantity_models is the dataset that stores the forecast results of all models.
```{r performance}
c1k_week_quantity_models<-as.data.frame(validation)
colnames(c1k_week_quantity_models)<-"Actual"
c1k_week_quantity_models$TransactionYear<-c1k_week$week_number_year[(nrow(c1k_week)-len_dec-len_inv+1):nrow(c1k_week)]
c1k_week_quantity_models$week_number<-c1k_week$week_number[(nrow(c1k_week)-len_dec-len_inv+1):nrow(c1k_week)]
c1k_week_quantity_models$TransactionTime<-paste0(c1k_week_quantity_models$TransactionYear,"-", c1k_week_quantity_models$week_number, "-3")%>%as.Date("%Y-%W-%w")
c1k_week_quantity_models<-performance_index_dtds(df_Actual = c1k_week_quantity_models, pred_name = "Average_Method", pred_value = Average_Method$mean)
c1k_week_quantity_models<-performance_index_dtds(df_Actual = c1k_week_quantity_models, pred_name = "Naive_Method", pred_value = Naive_Method$mean)
c1k_week_quantity_models<-performance_index_dtds(df_Actual = c1k_week_quantity_models, pred_name = "Seasonal_Naive_Method", pred_value = Seasonal_Naive_Method$mean)
c1k_week_quantity_models<-performance_index_dtds(df_Actual = c1k_week_quantity_models, pred_name = "Drift_Method", pred_value = Drift_Method$mean)
# re-trended and re-seasonalized data: plots
c1k_week_quantity_models_plot<-gather(c1k_week_quantity_models%>%select(-c("Actual", "TransactionYear", "week_number")), model, quantity, Average_Method:Drift_Method, factor_key=TRUE)
#for club 6107 run the following:
#c1k_week_quantity_models_plot<-gather(c1k_week_quantity_models%>%select(-c("Actual", "TransactionYear", "week_number")), model, quantity, factor_key=TRUE)
c1k_week_quantity_models_plot$model<-as.character(c1k_week_quantity_models_plot$model)
c1k_week_quantity_models_plot$quantity<-as.numeric(c1k_week_quantity_models_plot$quantity)
c1k_week_int_Actual<-c1k_week%>%ungroup()%>%select(c("TransactionTime", "quantity_avrg"))
c1k_week_int_Actual$model<-"Actual"
c1k_week_int_Actual$quantity<-as.numeric(c1k_week_int_Actual$quantity_avrg)
c1k_week_int_Actual<- c1k_week_int_Actual[(nrow(c1k_week)-len_dec-len_inv+1):nrow(c1k_week),]
c1k_week_int_Actual$quantity_avrg<-NULL
c1k_week_quantity_models_plot<-rbind(as.data.frame(c1k_week_int_Actual), as.data.frame(c1k_week_quantity_models_plot))
```
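The chunk above assembles c1k_week_quantity_models_plot in long format but does not draw it; a minimal sketch of how it could be plotted (an addition, not part of the original pipeline):
```{r simple models plot, eval=FALSE}
# Hypothetical plot of the long-format forecasts against the actuals:
ggplot(c1k_week_quantity_models_plot,
       aes(x = TransactionTime, y = quantity, colour = model)) +
  geom_line() +
  labs(x = "Transaction time", y = "Weekly average quantity",
       title = paste("Simple model forecasts vs actual,", item))
```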
# ARIMA models
## auto.arima
This section uses the auto.arima function.
```{r autoarima}
Simple_Arima<- auto.arima(training_log)
summary(Simple_Arima)
fc_Simple_Arima_1<- forecast(Simple_Arima, len_dec+len_inv, bootstrap = TRUE)
#Graph the forecasted results
autoplot(fc_Simple_Arima_1) +
autolayer(log(validation)-(retrend_log-trend_log[1])-reseasonal_log, series = "Logged Quantity")+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
#Checking accuracy
cr_Simple_Arima_1<-checkresiduals(fc_Simple_Arima_1)
print(cr_Simple_Arima_1$data.name)
print(cr_Simple_Arima_1$p.value)
# Bring fc_Simple_Arima_1$mean back through de-trending and de-seasonalizing
c1k_week_quantity_models<-performance_index_dtds(df_Actual = c1k_week_quantity_models, pred_name = "Simple_Arima_1", pred_value = fc_Simple_Arima_1$mean)
```
## ARIMA single season
This section uses a grid search to tune the K parameter of the Fourier regressors, selecting by AICc.
```{r ARIMA single season}
##Finding best fit model
Arima_Fourier_AIC<-list(aicc=Inf)
for(K in seq(25)) {
fit <- auto.arima(training_log, xreg=fourier(training_log, K=K),
seasonal=FALSE)
if(fit[["aicc"]] < Arima_Fourier_AIC[["aicc"]]) {
Arima_Fourier_AIC <- fit
bestK <- K
}
}
Arima_Fourier_AIC
fc_Arima_Fourier_AIC <- forecast(Arima_Fourier_AIC,xreg=fourier(training_log, K=bestK, h=len_dec+len_inv))
autoplot(fc_Arima_Fourier_AIC) + autolayer(log(validation)-(retrend_log-trend_log[1])-reseasonal_log, series = "Logged Quantity")+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
cr_ARIMA_Fourier_AIC<-checkresiduals(fc_Arima_Fourier_AIC)
print(cr_ARIMA_Fourier_AIC$data.name)
# Bring fc_Arima_Fourier_AIC$mean back through de-trending and de-seasonalizing
c1k_week_quantity_models<-performance_index_dtds(df_Actual = c1k_week_quantity_models, pred_name = "Arima_Fourier_AIC", pred_value = fc_Arima_Fourier_AIC$mean)
```
## ARIMA double season
This ARIMA model uses Fourier regressors with the two seasonal periods obtained above.
```{r ARIMA double season}
Arima_AIC <- auto.arima(training_log)
bestfit <- list(aicc=Arima_AIC$aicc, i=0, j=0, fit=Arima_AIC)
fc_ARIMA_fourier<- forecast(Arima_AIC, h = len_dec+len_inv)
autoplot(fc_ARIMA_fourier) +
autolayer(log(validation)-(retrend_log-trend_log[1])-reseasonal_log, series = "Logged Quantity")+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
#choose the best model by AICc
for(i in 1:3) {
for (j in 1:3){
z1 <- fourier(ts(training_log, frequency= season1), K=i)
z2 <- fourier(ts(training_log, frequency= season2), K=j)
Arima_Seasons1_2 <- auto.arima(training_log, xreg=cbind(z1, z2), seasonal=F)
if(Arima_Seasons1_2$aicc < bestfit$aicc) {
bestfit <- list(aicc=Arima_Seasons1_2$aicc, i=i, j=j, fit=Arima_Seasons1_2)
}
}
}
bestfit
fc_Arima_Seasons1_2 <- forecast(bestfit$fit,
xreg=cbind(
fourier(ts(training_log, frequency=season1), K=bestfit$i, h=len_dec+len_inv),
fourier(ts(training_log, frequency=season2), K=bestfit$j, h=len_dec+len_inv)))
autoplot(fc_Arima_Seasons1_2) +
autolayer(validation_log-(retrend_log-trend_log[1])-reseasonal_log, series = "Logged Quantity")+
xlab("Year") + ylab("Logged Quantity Sold of Kitchen Trash Bags")+
guides(colour=guide_legend(title="Validation Set"))
cr_ARIMA_seasons1_2<-checkresiduals(fc_Arima_Seasons1_2)
print(cr_ARIMA_seasons1_2$data.name)
print(cr_ARIMA_seasons1_2$p.value)
# Bring fc_Arima_Seasons1_2$mean back through de-trending and de-seasonalizing
c1k_week_quantity_models<-performance_index_dtds(df_Actual = c1k_week_quantity_models, pred_name = "Arima_Seasons1_2", pred_value = fc_Arima_Seasons1_2$mean)
```
## Visualization
This section generates visualizations of three ARIMA models.
```{r ARIMA visualization}
# before de-trending and de-seasoning
autoplot(training_log) +
autolayer(validation_log-(retrend_log-trend_log[1])-reseasonal_log, series="Validation", PI=FALSE)+
autolayer(fc_Simple_Arima_1,
series="Default Arima", PI=FALSE) +
autolayer(fc_Arima_Seasons1_2,
series="Fourier double seasons", PI=FALSE) +
autolayer(fc_Arima_Fourier_AIC,
series="Fourier single season", PI=FALSE)+
ggtitle(paste("Forecast for Log Quantity Sold of", item, "\n", country, ": Club", club)) +
xlab("Year") + ylab(paste(item,"\n detrended and de-seasonalized"))+
guides(colour=guide_legend(title="Forecast"))
ggsave("Output/Countries/Colombia/Plots/ARIMA_log_total_weekly_c1k.jpg", width = 10, height = 5)
autoplot(validation_log-(retrend_log-trend_log[1])-reseasonal_log, series="Validation")+
autolayer(fc_Simple_Arima_1,
series="Default Arima", PI=FALSE) +
autolayer(fc_Arima_Seasons1_2,
series="Fourier double seasons", PI=FALSE) +
autolayer(fc_Arima_Fourier_AIC,
series="Fourier single season", PI=FALSE)+
ggtitle(paste("Forecast for Log Quantity Sold of", item, "\n", country, ": Club", club)) +
xlab("Year") + ylab(paste(item,"\n detrended and de-seasonalized"))+
guides(colour=guide_legend(title="Forecast"))
ggsave("Output/Countries/Colombia/Plots/ARIMA_log_total_weekly1_c1k.jpg", width = 10, height = 5)
decomp_stl<- data.table(Quant = c(t_log, as.numeric(decompose_log$time.series)),
Date = rep(c1k_week$TransactionTime[1:(nrow(c1k_week)-len_dec-len_inv)], ncol(decompose_log$time.series)+1),
Type = factor(rep(c("original data", colnames(decompose_log$time.series)),
each = nrow(decompose_log$time.series)),
levels = c("original data", colnames(decompose_log$time.series))))
decomp_ARIMA_forecast_log<-data.table(Quant=c(fc_Simple_Arima_1$mean+reseasonal_log+(retrend_log-trend_log[1]), fc_Arima_Seasons1_2$mean+reseasonal_log+(retrend_log-trend_log[1]), fc_Arima_Fourier_AIC$mean+reseasonal_log+(retrend_log-trend_log[1]), reseasonal_log, (retrend_log-trend_log[1]), fc_Simple_Arima_1$mean, fc_Arima_Seasons1_2$mean, fc_Arima_Fourier_AIC$mean),
Date = rep(c1k_week$TransactionTime[(nrow(c1k_week)-len_dec-len_inv+1): nrow(c1k_week)], ncol(decompose_log$time.series)+5),
Type = factor(rep(c(rep("log quantity", 3), colnames(decompose_log$time.series)[1:2], rep(colnames(decompose_log$time.series)[3], 3)),
each = len_dec+len_inv),
levels = c("log quantity", colnames(decompose_log$time.series))))
decomp_ARIMA_forecast_log$set<-rep(c("default arima", "Fourier double seasons", "Fourier single season", "forecast seasonal", "forecast trend", "default arima", "Fourier double seasons", "Fourier single season"), each=len_dec+len_inv)
decomp_ARIMA_combined_log<-rbind(decomp_stl_training_log, decomp_ARIMA_forecast_log)
ggplot(decomp_ARIMA_combined_log, aes(x = Date, y = Quant, col=set)) +
geom_line() +
facet_grid(Type ~ ., scales = "free_y", switch = "y") +
labs(x = "Time", y = NULL,
title = "Decomposition by STL (log quantity) and ARIMA predictions") +
theme_ts
ggsave("Output/Countries/Colombia/Plots/ARIMA_log_grid_weekly_c1k.jpg", width = 10, height = 5)
ggplot(decomp_ARIMA_combined_log[decomp_ARIMA_combined_log$Date>=(max(decomp_ARIMA_combined_log$Date, na.rm = TRUE)-119),], aes(x = Date, y = Quant, col=set)) +
geom_line() + facet_grid(Type ~ ., scales = "free_y", switch = "y") +
labs(x = "Time", y = NULL,
title = "Decomposition by STL (log quantity) and ARIMA predictions") +
theme_ts
ggsave("Output/Countries/Colombia/Plots/ARIMA_log_grid_weekly1_c1k.jpg", width = 10, height = 5)
```
# TBATS
Five TBATS models (Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components) are tried out in the following sections.
TBATS uses a combination of Fourier terms with an exponential smoothing state space model and a Box-Cox transformation, and allows seasonality to change slowly over time.
## Feed Raw Data to this model (with seasonality, trend and outliers)
```{r TBATS 1}
fit_TBATS_Raw_1<- tbats(t_log, use.box.cox = NULL, use.trend = TRUE, use.damped.trend = NULL, seasonal.periods = 52, use.arma.errors = TRUE, biasadj = TRUE)
fc_TBATS_Raw_1<- forecast(fit_TBATS_Raw_1, h=len_dec+len_inv, bootstrap = TRUE)
autoplot(fc_TBATS_Raw_1) + autolayer(log(validation), series = "Logged Quantity")+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_TBATS_Raw_1)
c1k_week_quantity_models<-performance_index_raw(df_Actual = c1k_week_quantity_models, pred_name = "fc_TBATS_Raw_1", pred_value = fc_TBATS_Raw_1$mean)
```
## TBATS Model with top 1 and 2 seasonal periods
```{r TBATS 2}
fit_TBATS_Season1_2<- tbats(t_log, use.box.cox = NULL, use.trend = NULL, use.damped.trend = NULL, seasonal.periods = c(season2,season1), use.arma.errors = TRUE, biasadj = TRUE)
fc_TBATS_Season1_2<- forecast(fit_TBATS_Season1_2, h=len_dec+len_inv, bootstrap = TRUE)
autoplot(fc_TBATS_Season1_2) + autolayer(log(validation), series = "Logged Quantity")+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_TBATS_Season1_2)
c1k_week_quantity_models<-performance_index_raw(df_Actual = c1k_week_quantity_models, pred_name = "fc_TBATS_Season1_2", pred_value = fc_TBATS_Season1_2$mean)
```
## TBATS Model with top 2 and 3 seasonal periods
```{r TBATS 3}
fc_TBATS_Season2_3<- forecast(tbats(t_log, use.box.cox = NULL, use.trend = NULL, use.damped.trend = NULL, seasonal.periods = c(season2,season3), use.arma.errors = TRUE, biasadj = TRUE),h=len_dec+len_inv)
autoplot(fc_TBATS_Season2_3)+ autolayer(log(validation), series = "Logged Quantity")+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_TBATS_Season2_3)
c1k_week_quantity_models<-performance_index_raw(df_Actual = c1k_week_quantity_models, pred_name = "fc_TBATS_Season2_3", pred_value = fc_TBATS_Season2_3$mean)
```
## TBATS Model with top 1 and 3 seasonal periods
```{r TBATS 4}
fc_TBATS_Season1_3 <- forecast(tbats(t_log, seasonal.periods=c(season1,season3)), h=len_dec+len_inv)
autoplot(fc_TBATS_Season1_3)+ autolayer(log(validation), series = "Logged Quantity")+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_TBATS_Season1_3)
c1k_week_quantity_models<-performance_index_raw(df_Actual = c1k_week_quantity_models, pred_name = "fc_TBATS_Season1_3", pred_value = fc_TBATS_Season1_3$mean)
```
## Really simple TBATS
```{r TBATS 5}
fit_TBATS_NoSeason<- tbats(t_log)
fc_TBATS_NoSeason<- forecast(fit_TBATS_NoSeason, h=len_dec+len_inv)
autoplot(fc_TBATS_NoSeason) + autolayer(validation_log)
autoplot(fc_TBATS_NoSeason)+ autolayer(log(validation), series = "TBATS No Season")+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
ggtitle(paste("Forecast for Quantity Sold of", item, "\n", country, ": Club", club)) +
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_TBATS_NoSeason)
c1k_week_quantity_models<-performance_index_raw(df_Actual = c1k_week_quantity_models, pred_name = "fc_TBATS_NoSeason", pred_value = fc_TBATS_NoSeason$mean)
```
## visualization
This section generates visualizations of TBATS models.
```{r TBATS visualization}
autoplot(t_log) +
autolayer(validation_log, series="Validation", PI=FALSE)+
autolayer(fc_TBATS_Raw_1,
series="Single Season", PI=FALSE) +
autolayer(fc_TBATS_Season1_2,
series=paste("Season", season1, "and", season2), PI=FALSE) +
autolayer(fc_TBATS_Season2_3,
series=paste("Season", season2, "and", season3), PI=FALSE) +
autolayer(fc_TBATS_Season1_3,
series=paste( "Season", season1, "and", season3), PI=FALSE) +
autolayer(fc_TBATS_NoSeason,
series="Season Unspecified", PI=FALSE) +
ggtitle(paste("Forecast for Logged Quantity Sold of", item, "\n", country, ": Club", club)) +
xlab("Year") + ylab(paste(item, "\n detrended and de-seasonalized"))+
guides(colour=guide_legend(title="Forecast"))
ggsave("Output/Countries/Colombia/Plots/tbats_log_total_weekly_c1k.jpg", width = 10, height = 5)
autoplot(validation_log, series="Validation", PI=FALSE)+
autolayer(fc_TBATS_Raw_1,
series="Single Season", PI=FALSE) +
autolayer(fc_TBATS_Season1_2,
series=paste("Season", season1, "and", season2), PI=FALSE) +
autolayer(fc_TBATS_Season2_3,
series=paste("Season", season2, "and", season3), PI=FALSE) +
autolayer(fc_TBATS_Season1_3,
series=paste("Season", season1, "and", season3), PI=FALSE) +
autolayer(fc_TBATS_NoSeason,
series="Season Unspecified", PI=FALSE) +
ggtitle(paste("Forecast for Logged Quantity Sold of", item, "\n", country, ": Club", club)) +
xlab("Year") + ylab(paste(item, "\n detrended and de-seasonalized"))+
guides(colour=guide_legend(title="Forecast"))
ggsave("Output/Countries/Colombia/Plots/tbats_log_total_weekly1_c1k.jpg", width = 10, height = 5)
```
# Neural network
## Neural Networks (input detrended and de-seasonalized data)
```{r NNETAR 1}
#In NNAR(p,P,k)m: p = number of lagged inputs, P = number of seasonal lags (equivalent to ARIMA(p,0,0)(P,0,0)m), k = nodes in the single hidden layer.
fit_NN_1<- nnetar(training_log, lambda = "auto")
fc_NN_1<- forecast(fit_NN_1, h=len_dec+len_inv)
autoplot(fc_NN_1) + autolayer(log(validation)-(retrend_log-trend_log[1])-reseasonal_log, series = "Logged Quantity")+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_NN_1)
c1k_week_quantity_models<-performance_index_dtds(df_Actual = c1k_week_quantity_models, pred_name = "fc_NN_1", pred_value = fc_NN_1$mean)
```
## Neural networks (input raw data)
```{r NNETAR 2}
fit_NN_Raw<- nnetar(t_log, lambda = "auto")
fc_NN_Raw<- forecast(fit_NN_Raw, h=len_dec+len_inv)
autoplot(fc_NN_Raw) + autolayer(log(validation), series = "Logged Quantity")+
xlab("Year") + ylab(paste("Logged Quantity Sold of", item))+
guides(colour=guide_legend(title="Validation Set"))
checkresiduals(fc_NN_Raw)
c1k_week_quantity_models<-performance_index_raw(df_Actual = c1k_week_quantity_models, pred_name = "fc_NN_Raw", pred_value = fc_NN_Raw$mean)
```
# Combination of Uni-variate Forecasts
The combined uni-variate model is the average of four forecasts: the two models with the lowest RMSE over the whole forecast period and the two with the lowest RMSE over the target (len_dec) period. (If a model ranks in the top two on both criteria, it enters the average twice.)
```{r combination,echo=FALSE}
tsmodels<-c("Average_Method","Naive_Method","Seasonal_Naive_Method","Drift_Method","Simple_Arima_1","Arima_Seasons1_2", "Arima_Fourier_AIC","fc_TBATS_Raw_1","fc_TBATS_Season1_2","fc_TBATS_Season2_3","fc_TBATS_Season1_3","fc_TBATS_NoSeason","fc_NN_1","fc_NN_Raw")
best_two_total<-tsmodels[rank(mget(paste0("RMSE_", tsmodels, "_total"))%>%as.numeric())<=2]
best_two_target<-tsmodels[rank(mget(paste0("RMSE_", tsmodels, "_target"))%>%as.numeric())<=2]
Combination_uni<-(1/4*(c1k_week_quantity_models%>%select(best_two_total[1])+c1k_week_quantity_models%>%select(best_two_total[2])+c1k_week_quantity_models%>%select(best_two_target[1])+c1k_week_quantity_models%>%select(best_two_target[2])))%>%unlist()
c1k_week_quantity_models<-performance_index_raw(df_Actual = c1k_week_quantity_models, pred_name = "Combination_uni", pred_value = log(Combination_uni))
```
# Summarize performance of uni-variate models
```{r current models performance}
tsmodels<-c("Average_Method","Naive_Method","Seasonal_Naive_Method","Drift_Method","Simple_Arima_1","Arima_Seasons1_2", "Arima_Fourier_AIC","fc_TBATS_Raw_1","fc_TBATS_Season1_2","fc_TBATS_Season2_3","fc_TBATS_Season1_3","fc_TBATS_NoSeason","fc_NN_1","fc_NN_Raw", "Combination_uni")
indices<-c("MEAN", "RMSE", "MAE", "MPE", "MAPE", "MASE")
performance_week_total<-matrix(nrow = length(tsmodels), ncol = length(indices))
for (i in 1:length(tsmodels)){
for (j in 1:length(indices)){
performance_week_total[i, j]=get(paste0(indices[j], "_", tsmodels[i], "_total"))
}
}
performance_week_total<-as.data.frame(performance_week_total)
colnames(performance_week_total)<-indices
rownames(performance_week_total)<-tsmodels
write.csv(x = performance_week_total, file = "Output/Countries/Colombia/Tables/performance_week_total_c1k.csv")
performance_week_total
performance_week_target<-matrix(nrow = length(tsmodels), ncol = length(indices))
for (i in 1:length(tsmodels)){
for (j in 1:length(indices)){
performance_week_target[i, j]=get(paste0(indices[j], "_", tsmodels[i], "_target"))
}
}
performance_week_target<-as.data.frame(performance_week_target)
colnames(performance_week_target)<-indices
rownames(performance_week_target)<-tsmodels
write.csv(x = performance_week_total, file = "Output/Countries/Colombia/Tables/performance_week_target_c1k.csv")
```
# Machine learning setup
This part of the code is based on [this site](https://rpubs.com/mattBrown88/TimeSeriesMachineLearning).
The feature_engineering function defined in the "Define functions" section is applied to the level forms of the variables and, for some, their log forms.
```{r ML}
# feature engineering
c1k_week$month<-month(c1k_week$TransactionTime)
c1k_week$month<-as.factor(c1k_week$month)
c1k_week$log1p_transaction_avrg<-log1p(c1k_week$transaction_avrg) #
c1k_week$log1p_members_avrg<-log1p(c1k_week$members_avrg) #
c1k_week$log1p_sales_local_avrg<-log1p(c1k_week$sales_local_avrg) #
c1k_week$log1p_sales_usd_avrg<-log1p(c1k_week$sales_usd_avrg) #
c1k_week$log1p_category_sales_local_avrg<-log1p(c1k_week$category_sales_local_avrg) #
c1k_week$log1p_quantity_avrg<-log1p(c1k_week$quantity_avrg) #
c1k_week$log1p_category_sales_usd_avrg<-log1p(c1k_week$category_sales_usd_avrg)
c1k_week$log1p_category_quantity_avrg<- log1p(c1k_week$category_quantity_avrg)
c1k_week$log1p_salePrice_local_avrg<- log1p(c1k_week$salePrice_local_avrg)
c1k_week$log1p_salePrice_usd_avrg<- log1p(c1k_week$salePrice_usd_avrg)
t_v_c1k<-c1k_week
t_v_c1k<- t_v_c1k[order(t_v_c1k$TransactionTime),]%>%ungroup()
t_v_c1k<- feature_engineering(t_v_c1k, c("quantity_avrg", "transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg","salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg","log1p_category_sales_local_avrg","log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg"))
# creating training matrix
t_v_c1k<- fastDummies::dummy_cols(t_v_c1k, select_columns = c("week_number_year", "week_number", "month"))
t_v_c1k$month<-NULL
t_v_c1k$week_number_year<-NULL
t_v_c1k$week_number<-NULL
train_c1k<-t_v_c1k[1:(nrow(t_v_c1k)-len_dec-len_inv),]
test_c1k<-t_v_c1k[(nrow(t_v_c1k)-len_dec-len_inv+1):nrow(t_v_c1k),]
c1k_week_quantity_models_train<-data.frame(Actual=train_c1k$quantity_avrg, TransactionTime=train_c1k$TransactionTime)
c1k_week_quantity_models_train<-c1k_week_quantity_models_train[rowSums(is.na(train_c1k))==0,]
```
# XGB model
```{r XGB MODEL}
trainc1k_XGB<-train_c1k%>%select(-c("transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg","salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg", "log1p_transaction_avrg", "log1p_members_avrg", "log1p_sales_local_avrg", "log1p_sales_usd_avrg", "log1p_category_sales_local_avrg","log1p_category_sales_usd_avrg", "log1p_category_quantity_avrg","log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg","TransactionTime"))
testc1k_XGB<-test_c1k%>%select(-c("transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg","salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg", "log1p_transaction_avrg", "log1p_members_avrg", "log1p_sales_local_avrg", "log1p_sales_usd_avrg", "log1p_category_sales_local_avrg","log1p_category_sales_usd_avrg", "log1p_category_quantity_avrg","log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg","TransactionTime"))
trainTask <- makeRegrTask(data = trainc1k_XGB, target = "quantity_avrg")
testTask <- makeRegrTask(data = testc1k_XGB, target = "quantity_avrg")
xgb_learner <- makeLearner(
"regr.xgboost",
predict.type = "response",
par.vals = list(
objective = "reg:linear",
eval_metric = "rmse",
nrounds = 200
)
)
# Create a model
xgb_model <- mlr::train(xgb_learner, task = trainTask)
xgb_params <- makeParamSet(
# The number of trees in the model (each one built sequentially)
makeIntegerParam("nrounds", lower = 100, upper = 500),
# number of splits in each tree
makeIntegerParam("max_depth", lower = 1, upper = 10),
# "shrinkage" - prevents overfitting
makeNumericParam("eta", lower = .1, upper = .5),
# L2 regularization - prevents overfitting
makeNumericParam("lambda", lower = -1, upper = 0, trafo = function(x) 10^x)
)
control <- makeTuneControlRandom(maxit = 1) # a single random draw; increase maxit for a more thorough search
resample_desc <- makeResampleDesc("CV", iters = 10)
tuned_params <- tuneParams(
learner = xgb_learner,
task = trainTask,
resampling = resample_desc,
par.set = xgb_params,
control = control
)
xgb_tuned_learner <- setHyperPars(
learner = xgb_learner,
par.vals = tuned_params$x
)
xgb_model <- mlr::train(xgb_tuned_learner, trainTask)
XGBoost_pred <- predict(xgb_model ,testTask)
XGBoost_pred_train <- predict(xgb_model ,trainTask)
c1k_week_quantity_models_train<-performance_index_raw(df_Actual = c1k_week_quantity_models_train, pred_name = "XGBoost_train", pred_value = log(XGBoost_pred_train$data$response[rowSums(is.na(trainc1k_XGB))==0]))
c1k_week_quantity_models<-performance_index_raw(df_Actual = c1k_week_quantity_models, pred_name = "XGBoost", pred_value = log(XGBoost_pred$data$response))
```
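To see what the random search selected, mlr's TuneResult stores the chosen hyperparameters in $x and the aggregated cross-validated error in $y; a quick check:
```{r XGB tuning check, eval=FALSE}
# Chosen hyperparameters (nrounds, max_depth, eta, lambda) from the search:
tuned_params$x
# Cross-validated error (mlr's default regression measure, MSE):
tuned_params$y
```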
# Random Forest 1
This section uses very simple pre-set parameters (ntree = 1000, mtry = 3, nodesize = 5, importance = TRUE) in the randomForest function.
```{r RF 1, fig.width=12}
trainc1k_RF<-train_c1k%>%select(-c("transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg","salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg", "log1p_transaction_avrg", "log1p_members_avrg", "log1p_sales_local_avrg", "log1p_sales_usd_avrg", "log1p_category_sales_local_avrg","log1p_category_sales_usd_avrg", "log1p_category_quantity_avrg","log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg","TransactionTime"))
testc1k_RF<-test_c1k%>%select(-c("transaction_avrg", "members_avrg", "sales_local_avrg", "exchange_rate_avrg", "sales_usd_avrg", "category_sales_local_avrg", "category_sales_usd_avrg", "category_quantity_avrg","salePrice_local_avrg","salePrice_usd_avrg", "log1p_quantity_avrg", "log1p_transaction_avrg", "log1p_members_avrg", "log1p_sales_local_avrg", "log1p_sales_usd_avrg", "log1p_category_sales_local_avrg","log1p_category_sales_usd_avrg", "log1p_category_quantity_avrg","log1p_salePrice_local_avrg","log1p_salePrice_usd_avrg","TransactionTime"))
RF_1 <- randomForest(quantity_avrg ~. , data = trainc1k_RF,
ntree = 1000, mtry = 3, nodesize = 5, importance = TRUE, na.action = na.omit)
varImpPlot(RF_1, main = "Variable importance")
RF1_pred<-predict(RF_1, testc1k_RF)
RF1_pred_training<- predict(RF_1, trainc1k_RF)
c1k_week_quantity_models_train<-performance_index_raw(df_Actual = c1k_week_quantity_models_train, pred_name = "RF1_train", pred_value = log(RF1_pred_training[rowSums(is.na(trainc1k_XGB))==0]))
c1k_week_quantity_models<-performance_index_raw(df_Actual = c1k_week_quantity_models, pred_name = "RF_1", pred_value = log(RF1_pred))
```
# Random Forest 2