Spot-Language Model

This project provides an NLP model that detects the programming language of a piece of source code.


Link to Medium Article

How I Built a Classification Model for Source Code Languages

Supported Languages

The trained model supports the following languages:

C, C++, Objective-C, C#, Swift
Ruby, Julia, Lua, Java, Groovy
Kotlin, Scala, Shell, Batchfile, PowerShell
Python, Markdown, HTML, PHP, CSS
TypeScript, JavaScript, CoffeeScript, Haskell, Perl
Go, SQL, Rust, TeX, Erlang
Visual Basic, Dart, Pascal, Jupyter Notebook

Demonstration

To try the model out, you can follow this link to the Demo App deployed on Heroku.

Training

To train the model, first download the dataset we used through this Kaggle notebook. You can read the notebook to see how we extracted the data from the "GitHub Repos" dataset, or run all the cells and skip straight to the download link at the end.

Once you have the dataset, set the DATA_PATH variable in train.py to the dataset's location and run the script. It reports the model's accuracy, which should be around 97%.
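For readers who want a feel for the train-then-measure flow before downloading the dataset, here is a minimal sketch. It is not the project's train.py and says nothing about the model the project actually uses: it is a toy token-based naive Bayes classifier written with the standard library only, and the tiny in-memory corpus, the function names, and the tokenizer are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for the real dataset (not the project's data).
SAMPLES = [
    ("def add(a, b):\n    return a + b", "Python"),
    ("import os\nprint(os.getcwd())", "Python"),
    ("for i in range(3):\n    print(i)", "Python"),
    ("function add(a, b) { return a + b; }", "JavaScript"),
    ("const x = () => { console.log(1); };", "JavaScript"),
    ("var items = []; items.push(2);", "JavaScript"),
]

# Identifiers/keywords become one token each; every punctuation character is its own token.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(code):
    return TOKEN_RE.findall(code)


def train(samples):
    """Count tokens per language; returns (token_counts, label_counts, vocab)."""
    token_counts = defaultdict(Counter)
    label_counts = Counter()
    for code, label in samples:
        label_counts[label] += 1
        token_counts[label].update(tokenize(code))
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, label_counts, vocab


def predict(model, code):
    """Return the language with the highest log-probability (add-one smoothing)."""
    token_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    tokens = tokenize(code)
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(token_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for tok in tokens:
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    """Fraction of samples whose predicted language matches the label."""
    correct = sum(predict(model, code) == label for code, label in samples)
    return correct / len(samples)


model = train(SAMPLES)
print(predict(model, "def foo(x):\n    return x"))
print(f"accuracy on the toy corpus: {accuracy(model, SAMPLES):.2f}")
```

The accuracy printed here is measured on the same handful of snippets the toy model was trained on, so it is not comparable to the figure quoted for the real model.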

To reuse the trained model later, serialize it with a library such as joblib, pickle, or piskle.
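As a rough illustration of the save-and-reload round trip, the sketch below uses pickle from the standard library. The dictionary is only a hypothetical stand-in for whatever object train.py produces, and the file name is an assumption; substitute your own trained model and path.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the trained model object (not the project's model).
model = {"labels": ["Python", "JavaScript"], "weights": [0.25, 0.75]}

# Assumed output location; any writable path works.
model_path = Path(tempfile.gettempdir()) / "spot_language_model.pkl"

# Save the object to disk in binary mode.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, possibly in another process: load it back.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

joblib follows the same pattern with joblib.dump(model, path) and joblib.load(path). Whichever library you choose, only load files you trust, since unpickling can execute arbitrary code.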


If you like the project and want to support us, you can buy us a coffee here:

Buy Me A Coffee