How I Built a Classification Model for Source Code Languages

This project provides an NLP solution that detects the programming language of a piece of source code.
The trained model supports the following languages:
C | C++ | Objective-C | C# | Swift |
Ruby | Julia | Lua | Java | Groovy |
Kotlin | Scala | Shell | Batchfile | PowerShell |
Python | Markdown | HTML | PHP | CSS |
TypeScript | JavaScript | CoffeeScript | Haskell | Perl |
Go | SQL | Rust | TeX | Erlang |
Visual Basic | Dart | Pascal | Jupyter Notebook |
To try the model out, you can follow this link to the Demo App deployed on Heroku.
To train the model, you first need to download the dataset we used through this Kaggle notebook. You can read through it to see how we extracted the data from the "GitHub Repos" dataset, or run all the cells to skip directly to the download link at the end.
Once you have the dataset, set the DATA_PATH variable in train.py to its location, then run the script and check the accuracy it reports. It should be around 97%.
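For a sense of what such a training script typically involves, here is a minimal sketch of a common approach to source-language classification: character n-gram TF-IDF features fed into a linear classifier. The pipeline choices and the toy snippet/label data below are assumptions for illustration, not the project's actual code or dataset.

```python
# A minimal sketch of a source-language classifier, assuming a
# character n-gram TF-IDF + logistic regression pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the real dataset: (code snippet, language) pairs.
train_snippets = [
    "def main():\n    print('hello')",
    "int main() { printf(\"hello\"); return 0; }",
    "console.log('hello');",
    "puts 'hello'",
]
train_labels = ["Python", "C", "JavaScript", "Ruby"]

model = Pipeline([
    # char_wb n-grams capture language-specific tokens and punctuation
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_snippets, train_labels)

# Predict languages for unseen snippets.
predictions = model.predict(["print('hi there')", "int x = 0;"])
print(predictions)
```

With the real dataset (millions of files rather than four snippets), this kind of pipeline is what makes accuracies in the high nineties plausible.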
If you need to use the trained model at a later time, you can serialize it with a library such as joblib, pickle, or piskle.
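Saving and restoring the model with the standard-library pickle module looks like this; the dictionary below is a hypothetical stand-in for the trained classifier object (any picklable estimator works the same way, and joblib offers an equivalent dump/load API that is often faster for large NumPy-backed models).

```python
import pickle

# Hypothetical stand-in for the trained model object from train.py.
model = {"classes": ["Python", "C"], "weights": [0.1, 0.9]}

# Serialize the model to disk...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and load it back later, ready for prediction.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```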
If you like the project and want to support us, you can buy us a coffee here: