Spot-Language Model

This project is an NLP solution for detecting the programming language of a piece of source code.

Link to Medium Article

How I Built a Classification Model for Source Code Languages

Supported Languages

The trained model supports the following languages:

C	C++	Objective-C	C#	Swift
Ruby	Julia	Lua	Java	Groovy
Kotlin	Scala	Shell	Batchfile	PowerShell
Python	Markdown	HTML	PHP	CSS
TypeScript	JavaScript	CoffeeScript	Haskell	Perl
Go	SQL	Rust	TeX	Erlang
Visual Basic	Dart	Pascal	Jupyter Notebook

Demonstration

To try the model out, you can follow this link to the Demo App deployed on Heroku.

Training:

To train the model, you need to download the dataset we used through this Kaggle notebook. You can read it to see how we extracted it from the "Github Repos" dataset, or run all the cells to skip directly to the download link at the end.

Once you have the dataset, set the DATA_PATH variable in train.py to its location and run the script to see the accuracy it reports. It should be around 97%.

If you need to reuse the trained model later, you can serialize it with a library such as joblib, pickle or piskle.
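As a rough illustration of the serialization step, here is a minimal sketch using the standard-library pickle module. The helper names `save_model` and `load_model`, the file name `spot_language.pkl`, and the placeholder dict are all assumptions made for this example; they are not part of train.py. In practice you would pass the fitted model object produced by the training script instead of the placeholder.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize any picklable object (e.g. a fitted model) to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load a previously saved object. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the trained model: a plain dict, used only so this example
# runs without the dataset. Replace it with the fitted object from train.py
# (its variable name is not specified in this README).
placeholder_model = {"labels": ["Python", "Go", "Rust"], "version": 1}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "spot_language.pkl"  # hypothetical file name
    save_model(placeholder_model, model_path)
    restored = load_model(model_path)
```

joblib follows the same pattern with `joblib.dump(model, path)` and `joblib.load(path)`, and is often preferred for models that hold large NumPy arrays.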

If you like the project and want to support us, you can buy us a coffee here:
