Welcome to Financial Machine Learning in Python!
This package applies machine learning methods to financial data, which generally has a very low signal-to-noise ratio (SNR) and is therefore hard to model with ML directly. More detailed explanations of the methods implemented in this package can be found in Advances in Financial Machine Learning. An R version of this package is available as fmlr.
There are three main obstacles people may encounter when applying machine learning to financial data:
- Financial data are usually very large. The memory needed to store the limit order book of a single stock is usually on the TB scale, so training is extremely slow when the algorithm has many parameters to fit;
- The signal-to-noise ratio in financial data is very low. Because the market is so noisy, it is hard to detect, or even define, what the signal is, which can easily lead to overfitting;
- Financial data are highly correlated, which violates the independence assumption of most machine learning models. Since financial data are mostly time series, cross-validation is also hard: if we use "future data" to predict "past data", the measured accuracy does not reflect the model's real performance.
To deal with the three problems above, we use the following scheme before we apply any traditional machine learning algorithm to financial data:
- Sample data into information bars. The goal of this step is to reduce the size of data and only preserve the data with information.
- Label the bars using the meta-labelling method. The goal of this step is to build a feature matrix so that traditional machine learning algorithms can be applied.
- Split data using purged cross-validation. The goal of this step is to avoid the information leakage that occurs during cross-validation when the model is trained on "future data" and tested on "past data".
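To make the third step more concrete, here is a minimal sketch of a purged K-fold split. It is not this package's API: the function name `purged_kfold_indices` and the `purge` parameter are hypothetical, and the purge is simplified to a fixed number of observations dropped on each side of the test fold. The method described in the book purges by label overlap instead.

```python
import numpy as np

def purged_kfold_indices(n_samples, n_splits=5, purge=0):
    """Yield (train_idx, test_idx) pairs for contiguous test folds.

    Assumption for illustration: `purge` observations on each side of
    the test fold are removed from the training set, so that training
    samples adjacent in time to the test fold cannot leak information.
    """
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_splits):
        start, stop = test_idx[0], test_idx[-1] + 1
        # Keep only observations outside the purged window around the fold.
        keep = (indices < start - purge) | (indices >= stop + purge)
        yield indices[keep], test_idx

if __name__ == "__main__":
    for train_idx, test_idx in purged_kfold_indices(10, n_splits=5, purge=1):
        print("train:", train_idx, "test:", test_idx)
```

A larger `purge` trades training data for a wider safety margin; a sensible value depends on how far each label looks into the future.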
This package includes the four modules listed below:
- preprocessing
Used to preprocess raw price series, including generating all kinds of structured bars, meta-labelling, generating fractionally differentiated series, etc.
- model
Used to train machine learning models. Mainly deals with cross-validation and the sequential bootstrap method.
- backtest
Used to backtest quantitative investment strategies.
* This module will not be included in the first version.
- tests
Used to test the correctness of the code during development and to provide examples to users after the package is deployed.
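As an illustration of what a structured bar is, the sketch below builds dollar bars from tick data with pandas. It is an assumption-laden example rather than the preprocessing module's actual interface: the function name `dollar_bars` and the `price` and `volume` column names are hypothetical. Bar boundaries are approximated by cutting whenever the cumulative traded value crosses a multiple of the threshold.

```python
import pandas as pd

def dollar_bars(ticks, threshold):
    """Aggregate ticks into bars of roughly `threshold` traded dollar value.

    `ticks` is assumed to be a DataFrame in time order with hypothetical
    'price' and 'volume' columns.
    """
    dollar_value = ticks["price"] * ticks["volume"]
    # The bar id changes each time cumulative dollar value crosses
    # another multiple of the threshold.
    bar_id = (dollar_value.cumsum() // threshold).astype(int)
    grouped = ticks.groupby(bar_id)
    bars = pd.DataFrame({
        "open": grouped["price"].first(),
        "high": grouped["price"].max(),
        "low": grouped["price"].min(),
        "close": grouped["price"].last(),
        "volume": grouped["volume"].sum(),
    })
    return bars.reset_index(drop=True)

if __name__ == "__main__":
    ticks = pd.DataFrame({
        "price": [10.0, 10.5, 10.2, 10.8, 11.0, 10.9],
        "volume": [100, 50, 200, 100, 150, 50],
    })
    print(dollar_bars(ticks, threshold=2000))
```

Sampling by traded value rather than by clock time produces more bars when the market is active and fewer when it is quiet, which is how the sampling step reduces data size while keeping the informative observations.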
- pandas 0.24.1
- numpy 1.16.1
To install, run:
pip install fmlpy
See this pipeline for how to use most of the functions in this package.