Machine_Learning

This repository contains example implementations of some widely used machine learning and data-analysis techniques.

When new files are added, a corresponding description will be added along with the file name and date (DDMMYY).

These programs are based on machine learning code influenced by Muller's Machine Learning with Python book. Many additional techniques have been implemented since.

190818: PCA_Muller.py, principal component analysis example using the breast cancer data set. A detailed description of this code is available on Towards Data Science.
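For orientation, here is a minimal sketch of PCA on the breast cancer data set. It is not the contents of PCA_Muller.py; it assumes the copy of the data bundled with scikit-learn (`load_breast_cancer`), and the choice of two components and of standardizing first is an assumption made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: use scikit-learn's bundled breast cancer data (569 samples, 30 features).
cancer = load_breast_cancer()

# PCA is sensitive to feature scale, so standardize each feature first.
X_scaled = StandardScaler().fit_transform(cancer.data)

# Keep the first two principal components (an illustrative choice).
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                    # (569, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```

The two columns of `X_pca` can be scatter-plotted and colored by `cancer.target` to see how well the classes separate in the reduced space.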

270918: RidgeandLin.py, LassoandLin.py, Lasso and Ridge regression examples. The code shows the progression from coefficient shrinkage in Ridge to feature selection in Lasso. The concepts and a discussion of the results are described here.
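The difference between shrinkage and feature selection can be sketched as follows. This is not the code in RidgeandLin.py or LassoandLin.py; the synthetic data from `make_regression` and the `alpha` values are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Assumption: synthetic data with 20 features, only 5 of which carry signal.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lin = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=5.0).fit(X, y)    # L1 penalty: can set coefficients exactly to zero

lin_norm = np.linalg.norm(lin.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("coefficient norm, linear:", lin_norm)
print("coefficient norm, ridge: ", ridge_norm)
print("coefficients set to zero by lasso:", n_zero_lasso)
```

Ridge reduces the overall size of the coefficient vector but typically leaves every feature in the model, while Lasso drops some features entirely.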

081018: bank.csv, data set on a Portuguese company selling products to random customers over phone call(s). A detailed description is available here.

161018: gender_purchase.csv, data-set of two columns describing customers buying a product depending on gender.

111118: winequality-red.csv, red wine data set, where the output is the quality column which ranges from 0 to 10.

121118: pipelineWine.py, contains a simple example of using Pipeline and GridSearchCV together on the red wine data. A fuller description can be found here.
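A rough sketch of combining a pipeline with a grid search is shown below. It is not pipelineWine.py: to stay self-contained it uses scikit-learn's bundled `load_wine` data, which is a different data set from winequality-red.csv, and the estimator (`LogisticRegression`) and the parameter grid are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: stand-in data set; swap in the red wine CSV as needed.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
# and no information leaks from the validation fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
test_score = grid.score(X_test, y_test)

print("best parameters:", grid.best_params_)
print("test accuracy:", test_score)
```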

24112018: lagmult.py, demonstrates a simple constrained optimization problem with figures, using the Lagrange multiplier method.
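As a small worked illustration of the Lagrange multiplier method, consider a hypothetical problem (not necessarily the one in lagmult.py): minimize f(x, y) = x^2 + y^2 subject to x + y = 1. Setting the gradient of the Lagrangian to zero gives a linear system that can be solved directly.

```python
import numpy as np

# Hypothetical problem: minimize f(x, y) = x^2 + y^2 subject to x + y = 1.
# Lagrangian: L(x, y, lam) = x^2 + y^2 - lam * (x + y - 1)
# Stationarity conditions:
#   dL/dx   = 2x - lam = 0
#   dL/dy   = 2y - lam = 0
#   dL/dlam = -(x + y - 1) = 0  ->  x + y = 1
A = np.array([[2.0, 0.0, -1.0],
              [0.0, 2.0, -1.0],
              [1.0, 1.0,  0.0]])
b = np.array([0.0, 0.0, 1.0])

x_opt, y_opt, lam = np.linalg.solve(A, b)
print(x_opt, y_opt, lam)  # approximately 0.5 0.5 1.0
```

Geometrically, the solution is the point where a level circle of f touches the constraint line, so the two gradients are parallel there.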

11122018: Consumer_Complaints_short.csv, three columns describing the complaints, product_label, and category. The complete file can be obtained from Govt.data; its size is around 650 MB. More details about the usage of this file will be uploaded soon, when the text classification program is ready.

13122018: Text-classification_compain_suvo.py, classifies the consumer complaints data described above. The file deals with the complete data set (650 MB). After testing several ML algorithms, linear SVM worked best. The more computing resources are available, the more rows can be passed to TfidfVectorizer.
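The general shape of a TF-IDF plus linear SVM text classifier can be sketched as follows. This is not the repository's script: the six example sentences and the two labels are invented for illustration, and a real run would read the complaint text and labels from the CSV instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the consumer complaints data.
texts = [
    "I was charged a late fee on my credit card",
    "my credit card was declined at the store",
    "credit card interest rate increased without notice",
    "the mortgage payment was applied to the wrong loan",
    "my mortgage escrow account has an error",
    "mortgage servicer lost my home loan paperwork",
]
labels = ["credit_card", "credit_card", "credit_card",
          "mortgage", "mortgage", "mortgage"]

# TfidfVectorizer turns each text into a sparse weighted word-count vector;
# LinearSVC then fits a linear decision boundary in that space.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

predictions = model.predict([
    "unexpected fee on my credit card statement",
    "problem with my mortgage escrow",
])
print(predictions)
```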

1912018: SVMdemo.py, shows the effect of using an RBF kernel to map from 2D space to 3D space. The animation requires ffmpeg on Unix systems.
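The idea of lifting 2D points into 3D can be sketched with an explicit radial feature. This is an illustrative simplification and not SVMdemo.py: the concentric-circle data, the value of `gamma`, and the use of a single Gaussian bump centered at the origin are all assumptions. An actual RBF kernel works implicitly and never builds these coordinates.

```python
import numpy as np
from sklearn.datasets import make_circles

# Assumption: two concentric circles, which no straight line can separate in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add a third coordinate: a Gaussian (RBF-style) function of distance from the origin.
gamma = 1.0
z = np.exp(-gamma * np.sum(X ** 2, axis=1))
X_3d = np.column_stack([X, z])

# Points on the inner circle (label 1) sit higher than those on the outer circle
# (label 0), so a horizontal plane now separates the two classes.
inner_min = z[y == 1].min()
outer_max = z[y == 0].max()
print("separable by a plane:", inner_min > outer_max)
```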

05032019: IBM_Python_Web_Scrapping.ipynb, deals with basic web scraping, string handling, and image manipulation while generating a fake cover for our band.

06042019: datacleaning, Folder containing files and images related to data cleaning with pandas. For more details check Medium.

09062019: DBSCAN_Complete, folder containing files and images related to applying the DBSCAN algorithm to cluster weather stations in Canada. Apart from the usual scikit-learn, numpy, and pandas, I have used Basemap to show the clusters on a map. More details can be found on Medium.
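A minimal DBSCAN sketch is given below. It does not use the weather station data or Basemap: the points are synthetic 2D coordinates, and `eps` and `min_samples` are assumed values picked to suit this toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Assumption: two dense synthetic blobs plus one isolated point.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.3, size=(50, 2))
outlier = np.array([[5.0, 5.0]])
points = np.vstack([blob_a, blob_b, outlier])

# eps is the neighborhood radius; min_samples is the number of neighbors
# (including the point itself) needed for a point to count as a core point.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(points)

# DBSCAN marks noise points with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("label of the isolated point:", labels[-1])
```

Unlike k-means, the number of clusters is not specified in advance; it follows from the density parameters.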

13072019: SVM_Decision_Boundary, I set up a pipeline with StandardScaler, PCA, and SVM, then performed grid-search cross-validation to find the best-fit parameters. These are used to plot the decision function contours of the SVM classifier for binary classification. Read more on TDS.
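A sketch of this kind of workflow follows. It is not the code in the folder: the breast cancer data, the two PCA components, and the parameter grid are assumptions, and plotting is left out so the block has no display dependency.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: a bundled binary classification data set as a stand-in.
X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("svc", SVC(kernel="rbf")),
])
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

# Project the data into the 2D PCA space used by the fitted SVM.
best = search.best_estimator_
X_2d = best.named_steps["pca"].transform(best.named_steps["scaler"].transform(X))

# Evaluate the SVM decision function on a regular grid in that 2D space.
xx, yy = np.meshgrid(
    np.linspace(X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1, 100),
    np.linspace(X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1, 100),
)
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = best.named_steps["svc"].decision_function(grid_points).reshape(xx.shape)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```

The arrays `xx`, `yy`, and `Z` can be passed to matplotlib's `contourf` to draw the contours, with the level `Z = 0` marking the decision boundary.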

28122019: DecsTree, folder containing a notebook that uses a decision tree classifier on the Bank Marketing data set. The best parameters are obtained using grid-search cross-validation. Materials used for the TDS post are also included.