# https://translate.google.co.in/?sl=auto&tl=es&text=language&op=translate
# https://py-googletrans.readthedocs.io/en/latest/
# Telugu sample text. English gist: "Whatever one's mother tongue is, it is equal to a mother.
# Just as we respect our mother, we must respect our mother tongue equally; it may be Telugu,
# English, or Hindi."
text_telugu = '''మాతృభాష ఏదైనా అది తల్లి తో సమానం.మనం తల్లిని ఎంత గౌరవము ఇస్తామో, మన మాతృభాష ని కూడా అంతే గౌరవించాలి.అది తెలుగు కావచ్చు, ఆంగ్లం కావచ్చు, హింది కావచ్చు.'''
"""from googletrans import Translator
translator = Translator()
print(translator.detect(text_telugu).lang)
translation = translator.translate(text_telugu, dest='en')
"""
'''
LANGUAGES = {
'af': 'afrikaans',
'sq': 'albanian',
'am': 'amharic',
'ar': 'arabic',
'hy': 'armenian',
'az': 'azerbaijani',
'eu': 'basque',
'be': 'belarusian',
'bn': 'bengali',
'bs': 'bosnian',
'bg': 'bulgarian',
'ca': 'catalan',
'ceb': 'cebuano',
'ny': 'chichewa',
'zh-cn': 'chinese (simplified)',
'zh-tw': 'chinese (traditional)',
'co': 'corsican',
'hr': 'croatian',
'cs': 'czech',
'da': 'danish',
'nl': 'dutch',
'en': 'english',
'eo': 'esperanto',
'et': 'estonian',
'tl': 'filipino',
'fi': 'finnish',
'fr': 'french',
'fy': 'frisian',
'gl': 'galician',
'ka': 'georgian',
'de': 'german',
'el': 'greek',
'gu': 'gujarati',
'ht': 'haitian creole',
'ha': 'hausa',
'haw': 'hawaiian',
'iw': 'hebrew',
'he': 'hebrew',
'hi': 'hindi',
'hmn': 'hmong',
'hu': 'hungarian',
'is': 'icelandic',
'ig': 'igbo',
'id': 'indonesian',
'ga': 'irish',
'it': 'italian',
'ja': 'japanese',
'jw': 'javanese',
'kn': 'kannada',
'kk': 'kazakh',
'km': 'khmer',
'ko': 'korean',
'ku': 'kurdish (kurmanji)',
'ky': 'kyrgyz',
'lo': 'lao',
'la': 'latin',
'lv': 'latvian',
'lt': 'lithuanian',
'lb': 'luxembourgish',
'mk': 'macedonian',
'mg': 'malagasy',
'ms': 'malay',
'ml': 'malayalam',
'mt': 'maltese',
'mi': 'maori',
'mr': 'marathi',
'mn': 'mongolian',
'my': 'myanmar (burmese)',
'ne': 'nepali',
'no': 'norwegian',
'or': 'odia',
'ps': 'pashto',
'fa': 'persian',
'pl': 'polish',
'pt': 'portuguese',
'pa': 'punjabi',
'ro': 'romanian',
'ru': 'russian',
'sm': 'samoan',
'gd': 'scots gaelic',
'sr': 'serbian',
'st': 'sesotho',
'sn': 'shona',
'sd': 'sindhi',
'si': 'sinhala',
'sk': 'slovak',
'sl': 'slovenian',
'so': 'somali',
'es': 'spanish',
'su': 'sundanese',
'sw': 'swahili',
'sv': 'swedish',
'tg': 'tajik',
'ta': 'tamil',
'te': 'telugu',
'th': 'thai',
'tr': 'turkish',
'uk': 'ukrainian',
'ur': 'urdu',
'ug': 'uyghur',
'uz': 'uzbek',
'vi': 'vietnamese',
'cy': 'welsh',
'xh': 'xhosa',
'yi': 'yiddish',
'yo': 'yoruba',
'zu': 'zulu'
}
# Build the <option> tags for an HTML language selector:
s = ''
for code, name in LANGUAGES.items():
    s += '<option value="' + code + '">' + name + '</option>\n'
print(s)
'''
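# A live sketch of the same <option> generation, using a hypothetical three-entry
# subset of the LANGUAGES mapping so it runs without uncommenting the full dict above:
sample_langs = {'en': 'english', 'te': 'telugu', 'hi': 'hindi'}
options_html = '\n'.join(
    f'<option value="{code}">{name}</option>' for code, name in sample_langs.items()
)
print(options_html)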
import HelperTools as ht
text = """
Machine Learning (ML) is essentially extracting knowledge from data sets.
It is a field at the intersection of statistics, artificial intelligence, and computer science, and covers predictive analytics and statistical learning. The application of machine learning methods has in recent years become common in our lives. Over the past decade, machine learning has produced self-driving cars, practical speech recognition, effective web search, and a better understanding of the human genome. Additionally, Artificial Intelligence (AI) is a branch of computer science that studies building machines capable of intelligent behavior, while Stanford University defines machine learning as the science of getting computers to act without being explicitly programmed. 1 Clearly, machine learning is a subset of the broader field of artificial intelligence. Also, Deep Learning (DL) is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks. Deep learning is distinguished by its algorithms' ability to discover and learn good representations of the data through feature learning. This paper will discuss some aspects of deep learning as well.

Machine learning extracts value from big and disparate data sources with far less reliance on human direction. It is data driven and runs at machine scale. It is well suited to the complexity of dealing with disparate data sources and the huge variety of variables and amounts of data involved. And unlike traditional analysis, machine learning thrives on growing datasets: the more data fed into a machine learning system, the more it can learn and apply the results as insights. The goal of machine learning is to build new, and leverage existing, algorithms that learn from datasets in order to build generalizable models that give accurate predictions or find patterns, particularly with new and unseen similar data.

Traditionally, insights were gathered from datasets by manually developing decision rules. This is feasible for some applications, particularly those in which humans have a good understanding of the process to model. However, using hand-coded rules to make decisions has disadvantages. First, the process logic is specific to a single task; changing the task even slightly might require a rewrite of the whole rule system. Second, designing rules requires a deep understanding of how a decision should be made by a human expert. Using machine learning, however, simply presenting a program with a dataset is enough for an algorithm to determine the insights. 2

In a typical machine learning dataset, the rows can be thought of as observations, or data points, and the columns represent the features of each observation and their values. Each entity or row is called a sample (or data point), and the columns that describe these entities are called features. In a machine learning process, a dataset is usually split into multiple subsets. The minimum subsets are the training and test datasets, and often an optional third validation dataset is created as well. Once these data subsets are created from the primary dataset, a predictive model or classifier is trained using the training data, and the model's predictive accuracy is then determined using the test data. As discussed, machine learning leverages algorithms to automatically model and find patterns in data, usually with the intention of predicting some target output. ML algorithms are heavily based on statistics and mathematical optimization. 
In summary, machine learning automatically learns a highly accurate predictive or classifier model, or finds unknown patterns in data, by leveraging learning algorithms and optimization techniques. Machine learning algorithms can be categorized into the following major areas: supervised, unsupervised, and semi-supervised learning. We will focus on the first two sets of algorithms in this paper.

In supervised learning, the data contains a label, which is the response variable being modeled, and the goal is for the algorithm to predict the value or class of unseen data. Machine learning algorithms that learn from input/output pairs are called supervised learning algorithms because a supervisor provides guidance to the algorithms in the form of the desired outputs for each dataset they learn from. Supervised learning algorithms are well understood and their performance is easy to measure. Supervised machine learning will likely solve the problem if the application can be modeled as a supervised learning problem and a dataset that includes the desired outcome is available. There are many examples of supervised machine learning algorithms: predicting equipment health and failure from historical equipment data, image analysis of an astronomical phenomenon based on a large collection of images, prediction of sports team performance based on historical datasets, and many more. Again, the key is to collect a dataset, run it through a machine learning algorithm, and seek the desired outcomes.

Unsupervised learning involves learning from a dataset that has no label or response variable, and is therefore more about finding patterns than prediction. In unsupervised learning, only the input data is known; no output data is given to the algorithm. Unsupervised learning can be harder to understand and evaluate. There are several examples of unsupervised learning: grouping people by their photos in a collection of pictures, grouping drugs based on their molecular structure, and many more. Considering the above machine learning types, the alternative is to write an explicit program that seeks through the data and understands the statistics, the thresholds to take into account for each statistic, and so forth. It would take a substantial amount of time to write such code, and different programs would need to be written for every problem needing an answer.

3. SUPERVISED LEARNING ALGORITHMS

There are two major types of supervised machine learning algorithms, namely classification and regression; every supervised machine learning algorithm can be seen as one or the other. In classification, the goal is to predict a class label, which is a choice from a predefined list of possibilities, for example, identifying a set of people from photos. Classification is sometimes separated into binary classification, the special case of distinguishing exactly two classes, and multiclass classification, which is classification between more than two classes. Binary classification answers a yes/no question; classifying emails as either spam or not spam is an example of a binary classification problem. In regression, the objective is to predict a continuous number. Predicting a person's annual income from their education, their age, and where they live is an example of a regression task. 2 When predicting income, the predicted value is an amount and can be any number in a given range. 
An easy way to differentiate between classification and regression tasks is to look for continuity in the possible outcomes; if the output is continuous, the problem is a regression problem. In supervised learning, if a model is able to make accurate predictions on unseen data, it is said to generalize from the training set to the test set. The goal is to build a model that is able to generalize as accurately as possible. Usually models are built in such a way that they can make accurate predictions on the training set; if the training and test sets have enough in common, one would expect the model to also be accurate on the test set. However, there are some cases where this can go wrong. If we allow ourselves to build very complex models, we can always be as accurate as we like on the training set. Hence the objective is to find the simplest adequate model. Building a model that is too complex for the dataset is called overfitting: the model fits the particulars of the training set so closely that it works well on the training set but is not able to generalize to new datasets. Conversely, if the model is too simple, it cannot capture all the aspects of, and variability in, the data, and it will do badly even on the training set. Choosing too simple a model is called underfitting. The more complex the model, the better we will be able to predict on the training data; however, if the model becomes too complex, it becomes too individualized to the data points in the training set and will not generalize well to new data. Hence, the key is to find the sweet spot that yields the best generalization performance.

The k-NN algorithm is a simple machine learning algorithm. It makes a prediction for a new data point by finding the closest data points in the training dataset, its nearest neighbors. In its simplest version, the k-NN algorithm considers exactly one nearest neighbor, the closest training data point to the point to predict; the prediction is then simply the known output for that training point. Instead of considering only the single closest neighbor, we can also consider an arbitrary number, k, of neighbors, hence the name k-nearest neighbors algorithm. When considering more than one neighbor, voting is used to assign a label: for each test point, count how many neighbors belong to class 0 and how many belong to class 1, and assign the most frequent class, the majority class among the k nearest neighbors. There is also a regression variant of the k-nearest neighbors algorithm; here too, one can use more than the single closest neighbor, and when using multiple nearest neighbors, the prediction is the average, or mean, of the relevant neighbors. One of the strengths of the k-NN model is that it is very easy to understand and often gives good performance. Building the nearest neighbors model is also very fast; however, when the training set is very large (either in number of features or in number of samples), prediction can be slow. Hence, when using the k-NN algorithm, it is important to preprocess the data, which serves many purposes, including not allowing one feature to dominate. The approach often does not perform well on datasets with many features. So, while the k-nearest neighbors algorithm is easy to understand, it is not often used in practice, due to prediction being slow and its inability to handle many features. 
Linear Models

Linear models make a prediction using a linear function of the input features: y = w_0*x_0 + w_1*x_1 + ... + w_p*x_p + b. For a dataset with a single feature, w_0 is the slope and b is the y-axis offset; for more features, w contains the slopes along each feature axis. Let's look at two forms of linear regression models.

Ordinary least squares. Linear regression, or ordinary least squares (OLS), is the simplest linear method for regression. It calculates the parameters w and b that minimize the mean squared error between predictions and the true regression targets, y, on the training set. The mean squared error is the sum of the squared differences between the predictions and the true values, divided by the number of samples. There are five OLS assumptions that are extremely important; if they hold, then by the Gauss-Markov theorem the OLS estimator is the Best Linear Unbiased Estimator. The assumptions can be summarized as follows. First, the linear regression model is linear in its parameters. Second, the observations are randomly sampled. Third, the conditional mean of the errors is zero. Fourth, there is no multicollinearity (or perfect collinearity). Fifth, there is no autocorrelation.

Ridge regression. Ridge regression is also a linear model for regression and uses the same formula as ordinary least squares. In ridge regression, however, the coefficients (w) are chosen not only to predict well on the training data but also to fit an additional constraint: the entries of w should be close to zero, meaning each feature should have as little effect on the outcome as possible. This constraint is an example of regularization, which restricts a model to avoid overfitting.

Linear models for classification. Linear models are also used for classification. In this case, a prediction is made using a formula very similar to the one for linear regression, the only difference being that instead of returning the weighted sum of the features, the prediction is thresholded at zero: if the function value is smaller than zero, the prediction is class -1; if it is larger than zero, the prediction is class +1. In linear models for classification, the decision boundary is a linear function of the input; hence, a linear classifier separates two classes using a line, a plane, or a hyperplane. There is a variety of learning algorithms for linear models, and they differ in how they measure how well a particular combination of coefficients and intercept fits the training data; different algorithms choose different measures of fit to the training set. The two most common linear classification algorithms are logistic regression and linear support vector machines (linear SVMs).

Linear models for multiclass classification. Typically, linear classification models handle binary classification only, and a common technique to extend a binary classification algorithm to multiclass classification is the one-vs.-rest approach. In this approach, a binary model is learned for each class, resulting in as many binary models as there are classes. To make a prediction, all binary classifiers are run on a test point; the classifier with the highest score is selected, and its class label is returned as the prediction. Having one binary classifier per class results in one vector of coefficients (w) and one intercept (b) per class, and the class for which the classification confidence formula w_0*x_0 + w_1*x_1 + ... + w_p*x_p + b is highest is the assigned class label. 
The approach used in multiclass logistic regression differs somewhat from the one-vs.-rest approach, but it also results in one coefficient vector and intercept per class, and the same method of making a prediction is applied.

Naive Bayes classifiers. Naive Bayes classifiers are a family of classifiers quite similar to the linear models, but they are often more efficient and faster in training. Naive Bayes models gain this efficiency by learning parameters by looking at each feature individually and collecting simple per-class statistics from each feature. They share many of the strengths and weaknesses of the linear models: they are very fast to train and to predict, and the training procedure is easy to understand. The models work well with high-dimensional sparse data and are relatively robust to the parameters. Naive Bayes models are great baseline models and are often used on very large datasets, where training even a linear model might take too long.

Decision trees. Decision trees are common models for classification and regression. They learn a hierarchy of if/else questions, leading to a decision. Building a decision tree means partitioning the dataset by asking if/else questions of it; in machine learning, these questions are called tests. A leaf of the tree that contains data points that all share the same target value is called pure. Typically, data does not come in the form of binary yes/no features, so the tests used on continuous data are of the form "is feature x larger than value y?". A prediction for a new data point is made by determining which region of the partition of the feature space the point lies in; the region is found by traversing the tree from the root and going left or right, depending on whether each test is fulfilled. Trees can also be used for regression tasks, using exactly the same technique: the tree is traversed based on the tests in each node to find the leaf the new data point falls into, and the output for the data point is the mean target of the training points in that leaf. Decision trees have two advantages compared to other algorithms: the ease of model visualization, and the fact that the algorithms are completely invariant to the scaling of the data. As each feature is processed separately, and the possible splits of the data don't depend on scaling, no preprocessing like normalization or standardization of features is required for decision tree algorithms. In particular, decision trees work well when the features are on completely different scales, or when there is a mix of binary and continuous features. The main disadvantage of decision trees is that, even with the use of pre-pruning, they tend to overfit and provide poor generalization performance. Therefore, in most applications, the ensemble methods discussed next are usually used in place of a single decision tree.

Ensembles of decision trees. Ensembles are methods that combine multiple machine learning models to create more powerful models. The two ensemble models that are deployed on a wide range of datasets for classification and regression, and that are built from decision trees, are random forests and gradient boosted decision trees.

Random forests. Random forests address the overfitting of the training data that is a shortcoming of decision trees. A random forest is essentially a collection of decision trees, where each tree is slightly different from the others. 
The idea behind random forests is that each tree individually may predict relatively well but will likely overfit on part of the data; averaging the results of many such trees reduces the overfitting. The approach is called a random forest because randomness is injected into the tree building to ensure each tree is different. The trees are randomized in two ways: by selecting the data points used to build each tree and by selecting the features considered in each split test.

Gradient boosted trees. The gradient boosted tree is another ensemble method that combines multiple decision trees into a more powerful model; these models can be used for regression and classification. The gradient boosting algorithm builds trees in a serial manner, where each tree corrects the mistakes of the previous one. There is no randomization in gradient boosted trees; instead, shallow trees of depth one to five are used, which makes the model smaller in terms of memory and makes predictions faster. Gradient boosting iteratively improves performance by combining many simple models, where each tree can provide good predictions only on part of the data. Gradient boosted trees are typically more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.

Kernelized support vector machines. Another type of supervised learning model is the kernelized support vector machine. Kernelized support vector machines (usually just referred to as SVMs) are an extension that allows for more complex models that are not defined simply by hyperplanes in the input space. During training, the SVM learns how important each training data point is for representing the decision boundary between the two classes. Typically, only a subset of the training points matters for defining the decision boundary: the ones that lie on the border between the classes. These are called support vectors and give the support vector machine its name. To make a prediction for a new point, the distance to each of the support vectors is measured, and a classification decision is made based on those distances. The distance between data points is measured by the Gaussian (RBF) kernel: k(x1, x2) = exp(-gamma * ||x1 - x2||^2), where gamma is a parameter that controls the width of the kernel.

4. UNSUPERVISED LEARNING ALGORITHMS

In unsupervised learning there is no known output and no teacher to instruct the learning algorithm; the algorithm is just given the input data and asked to extract knowledge from it.

Transformation and clustering. Unsupervised transformations of a dataset are algorithms that create a new representation of the data that can be easier for humans or other machine learning algorithms to understand than the original representation. 3 A common application of transformation is dimensionality reduction, which takes a high-dimensional representation of the data, consisting of many features, and summarizes its essential characteristics with fewer features; this is useful for visualization purposes, among others. Another application of unsupervised transformations is finding the parts or components that make up the data. An example of this is topic extraction on collections of text documents, where the algorithm finds the topics discussed in each document. Clustering algorithms partition data into distinct groups of similar items. Consider the example of sorting animal species based on photos: the approach would be to take all the photos and divide them into groups of species that look similar. 
Applications of transforming data include visualization, compressing the data, and finding a representation that is more useful for other applications. The Principal Component Analysis (PCA) approach is the one most commonly used for data transformations, but there are also other algorithms such as non-negative matrix factorization (NMF), singular value decomposition (SVD), linear discriminant analysis (LDA), and t-SNE.

Principal Component Analysis (PCA). Principal component analysis is a method that rotates the dataset in such a way that the rotated features are statistically uncorrelated. 4 This rotation is often followed by selecting only a subset of the new features, according to how important they are for explaining the data. PCA is an unsupervised method and does not use any class information when finding the rotation; it only considers the correlations in the data.

Non-Negative Matrix Factorization (NMF). The non-negative matrix factorization algorithm extracts useful features from a dataset. It works similarly to PCA and can also be used for dimensionality reduction. In NMF, however, the components and the coefficients are required to be greater than or equal to zero. As a result, this approach applies only to data where each feature is non-negative, as a non-negative sum of non-negative components cannot become negative. NMF often leads to more interpretable components than PCA, as negative components and coefficients can produce hard-to-interpret cancellation effects.

Manifold learning with t-SNE. There is a class of algorithms for visualization, called manifold learning algorithms, that allow for more complex mappings than PCA and often provide better visualizations; the t-SNE algorithm is one of them. Manifold learning algorithms are usually used for visualization: they compute a new representation of the training data but don't allow transformations of new data, so they can only transform the data they were trained on. Manifold learning is useful for exploring data, but it is typically not used when the final goal is supervised learning. The main idea behind t-SNE is to find a two-dimensional representation of the data that preserves the distances between points as well as possible. t-SNE starts with a random two-dimensional representation for each data point, and then moves points that are close in the original feature space closer together, and points that are far apart in the original feature space farther apart. t-SNE puts more emphasis on points that are close by than on preserving distances between far-apart points; as a result, it preserves the information about which points are neighbors of each other.

Clustering. Similar to classification, clustering is the task of partitioning the dataset into groups, called clusters. Clustering algorithms assign (or predict) a number to each data point, indicating which cluster that point belongs to, so that points within a cluster are similar to each other and points in different clusters are different.

k-means clustering. k-means clustering is one of the most commonly used clustering algorithms; it finds cluster centers that are representative of certain regions of the data. The algorithm alternates between two steps: assigning each data point to the closest cluster center, and then recomputing each cluster center as the mean of the data points assigned to it. The algorithm is finished when the assignment of instances to clusters no longer changes.

Agglomerative clustering. Agglomerative clustering refers to a collection of clustering algorithms that build on the same traditional approach: each point starts as its own cluster, and the two most similar clusters are merged until some stopping criterion is met. 
There are several linkage criteria that specify how the most similar cluster is measured; this measure is always defined between two existing clusters.

5. SAMPLE USE CASES

Data security. Machine learning has several applications in security. Malware signatures are collected, and ML algorithms are used to predict the nature of an attack. In other situations, machine learning algorithms can look for patterns in how data in the cloud is accessed and report anomalies that could predict security breaches.

Financial trading. Humans cannot match machines when it comes to consuming vast quantities of data or the speed with which trades can be executed. Machine learning algorithms are getting closer to predicting stock market behavior, and many trading firms use ML systems to predict and execute trades at high speed and high volume, which can turn huge profits for the firms.

Healthcare. Machine learning algorithms can process more information and spot more patterns than human medical doctors. One study used computer-assisted diagnosis (CAD) to review the early mammography scans of women who later developed breast cancer, and the computer spotted 52% of the cancers as much as a year before the women were officially diagnosed. 6 Additionally, machine learning can be used to understand risk factors for disease in large populations. Another example of ML in healthcare is the analysis of radiology images: ML algorithms can perform such image analysis and enhance patient care and diagnosis.

Aerospace. There are many use cases in the aerospace industry, including image analysis and prediction of the behavior and attributes of celestial objects, as well as predictive maintenance of aerospace equipment as components age and are used. Aerospace was an early adopter of AI/ML; in fact, many pilots have been flying with very primitive forms of ML for years, since autopilot systems all use computing power to make intelligent decisions. Additionally, applying real AI to an autopilot, instead of just programming it to fly certain pre-planned profiles, makes it a more resilient autopilot that can adapt to changing conditions. 5

Computer networks. Machine learning is quickly being applied to networks in the form of network analytics. Telemetry from network elements is gathered, and machine learning algorithms are applied to provide application mapping, security anomaly detection, security forensics, and network configuration recommendations.
"""
text1 = """Machine Learning (ML) is essentially extracting knowledge from data sets. It is a field at the intersection of statistics, artificial intelligence, and computer science, and covers predictive analytics and statistical learning. The application of machine learning methods has in recent years become common in our lives. Over the past decade, machine learning has produced self-driving cars, practical speech recognition, effective web search, and a better understanding of the human genome. Additionally, Artificial Intelligence (AI) is a branch of computer science that studies building machines capable of intelligent behavior, while Stanford University defines machine learning as the science of getting computers to act without being explicitly programmed."""
from transformers import pipeline

# Summarize the short excerpt with BART fine-tuned on CNN/DailyMail
# (the model weights are downloaded on first use).
smr_bart = pipeline(task="summarization", model="facebook/bart-large-cnn")
smbart = smr_bart(text1, max_length=150)  # max_length caps the summary length in tokens
print(smbart[0]['summary_text'])
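# The full `text` above is far longer than BART's roughly 1024-token input limit, so it
# cannot be summarized in a single call. A minimal summarize-by-chunks sketch, assuming
# simple whole-word splitting is acceptable (a tokenizer-based splitter would be more precise):
def summarize_long(document, summarizer, chunk_words=400):
    words = document.split()
    chunks = [' '.join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    partials = [summarizer(chunk, max_length=120)[0]['summary_text'] for chunk in chunks]
    return ' '.join(partials)

# Uncomment to summarize the long survey text (slow on CPU):
# print(summarize_long(text, smr_bart))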