Hoping that my work might help beginners better understand what machine learning is and what we can do with it, I decided to put a short explanation of the types of ML techniques before some of my machine learning projects.
I hope that after reading the following definitions it will be easier to understand what I am trying to show and how it is done.
At this point feel free to skip to the introduction section.
Data in machine learning
A sample is a single row of data in our data set.
In machine learning problems we distinguish two types of data:
- Labeled data - here Wikipedia gives a really good explanation: "Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags. For example, a data label might indicate whether a photo contains a horse or a cow". The features are the columns describing a specific sample. The label is the tag that has been assigned to the sample based on its features. It can be either a continuous value (e.g. a price of $500.99) or a discrete value ("cat", "dog" - a member of a finite set).
- Unlabeled data - in contrast to labeled data, here we only have the features describing a sample. The tags are not pre-assigned; instead, the algorithm we are using comes up with reasonable clusters of the data.
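As a small illustration of the two kinds of data, here is a sketch using plain Python lists. The feature names, values and labels are made up for this example and do not come from any of the projects.

```python
# Hypothetical toy data: each sample is one row, each column is a feature.
feature_names = ["size_m2", "rooms"]

# Unlabeled data: features only, no tags attached.
unlabeled_samples = [
    [54.0, 2],
    [120.0, 5],
    [75.0, 3],
]

# Labeled data: the same kind of rows, each paired with a label.
# A continuous label (a price) leads to a regression problem.
regression_labels = [150_000.0, 410_000.0, 230_000.0]
# A discrete label from a finite set leads to a classification problem.
classification_labels = ["flat", "house", "flat"]

labeled_regression = list(zip(unlabeled_samples, regression_labels))
labeled_classification = list(zip(unlabeled_samples, classification_labels))

for features, label in labeled_classification:
    described = dict(zip(feature_names, features))
    print(described, "->", label)
```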
Machine learning problems can be approached using 3 different methods.
- Supervised learning - deals with labeled data. Having input data X and some output data Y, we want to find parameters θ of the function f such that f(X; θ) ≈ Y.
Example: the famous Titanic survival classification or house price prediction. In the latter, the algorithm uses the data set to learn how location, size and number of rooms impact the price of a house. Afterwards it can be used to predict the unknown price of a house from its features (size, location, number of rooms etc.).
- Unsupervised learning - deals with unlabeled data. This technique determines similarities/patterns between samples and then clusters similar samples together.
Example: a song recommendation system based on liked songs. To make a recommendation, the algorithm would first determine the cluster to which a user belongs. Then it would use the favorite songs of other users from that cluster as recommendations.
- Reinforcement learning - deals with reward systems, such as winning video games or reducing traffic problems. The agent is an entity able to modify the environment - in a game it could, for example, move our character or make it perform certain actions. The agent is supposed to take actions in such a way that the reward function is maximized. The reward function could be, for example, the distance our character has covered in Mario Bros.
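To make the supervised case more concrete, below is a minimal sketch of learning parameters θ from labeled data. It assumes the simplest possible model, a straight line with one feature (house size), fitted with the closed-form least-squares solution. The function names and the numbers are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the squared error of slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(theta, x):
    """Apply the learned parameters theta = (slope, intercept) to a new feature value."""
    slope, intercept = theta
    return slope * x + intercept


# Made-up labeled data: features X (size in square meters) and labels Y (price).
sizes_m2 = [50.0, 70.0, 90.0, 110.0]
prices = [150_000.0, 210_000.0, 270_000.0, 330_000.0]

theta = fit_line(sizes_m2, prices)

# Predict the unknown price of a house that was not in the data set.
print(predict(theta, 80.0))
```

Real projects usually have many features and noisy labels, so a library model is the more typical choice; the sketch only shows the idea of finding θ so that f(X; θ) is close to Y.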
PS: Feel free to open a merge request with a better definition.
This repository presents step-by-step usage of the KMeans algorithm from the sklearn library, using Python in a Jupyter notebook (an open source scientific tool supporting many languages). I am first going to show how to cluster the data with a guessed number of clusters and then present the Elbow Technique, used to determine the right number of clusters.
KMeans is an unsupervised clustering method that clusters the data based on the similarity between the samples.
It requires the user to provide the number of clusters, K, and then finds K centroids such that the centroid of each cluster is the closest representation of all of the samples belonging to that cluster.
The most common similarity metric is squared Euclidean Distance.
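The following is a hedged sketch of how clustering with a guessed K and the Elbow Technique might look with sklearn. The data is randomly generated for this example (it is not the data set used in the notebook), and the range of K values tried is an arbitrary assumption. `inertia_` is the sum of squared distances from each sample to its closest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D data: three well-separated groups of 50 points each.
rng = np.random.default_rng(0)
group_centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack(
    [center + rng.normal(scale=0.5, size=(50, 2)) for center in group_centers]
)

# Cluster the data with a guessed number of clusters.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = model.labels_              # cluster index assigned to each sample
centroids = model.cluster_centers_  # one row per centroid

# Elbow technique: fit KMeans for several K and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting the recorded inertia against K (for example with matplotlib) typically shows a sharp drop followed by a flat tail; the K at the bend, the "elbow", is a reasonable choice for the number of clusters.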
The iterative naive KMeans (a slow version, but one that gives good intuition behind the algorithm) starts from K initial centroids and repeats two steps until the cluster assignments stop changing:
Step one: assign every sample to the cluster whose centroid has the lowest squared Euclidean distance to it.
Step two: move every cluster centroid to the mean of the samples belonging to that cluster.
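The two steps above can be sketched in plain Python as follows. This is an illustrative version, not the sklearn implementation. The initial centroids are passed in by the caller, the loop stops when assignments no longer change, and an empty cluster keeps its previous centroid; all three are assumptions made for this sketch.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def naive_kmeans(samples, initial_centroids, max_iterations=100):
    """Alternate the assignment and update steps until assignments stop changing."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Step one: assign each sample to the cluster with the nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(s, centroids[k]))
            for s in samples
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the centroids would not move either
        assignments = new_assignments
        # Step two: move each centroid to the mean of the samples assigned to it.
        for k in range(len(centroids)):
            members = [s for s, a in zip(samples, assignments) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Made-up data: two visually obvious groups of three points each.
samples = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = naive_kmeans(samples, [samples[0], samples[3]])
print(assignments)
print(centroids)
```

The result depends on the initial centroids, which is one reason library implementations run the algorithm several times from different starting points.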
python 3.6+
jupyter notebook
sklearn
matplotlib