Indonesian Language Model
This program trains an Indonesian language model using the n-gram technique.
The training data comes from 9 of the most popular news sites in Indonesia.
It also includes an automatic sentence generator that builds a sentence from a given word using the Shannon visualization technique.
The articles were collected with the Python article scraper tool at this link.
I scraped the 9 news sites automatically and saved the results as .txt files.
The demo website uses only 4 corpora because the free hosting plan has very limited memory.
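To make the n-gram idea concrete, here is a minimal sketch of counting word n-grams in PHP. It is not the code in Tools.php: the function name `countNgrams`, its parameters, and the tokenization rule (lowercase, split on anything other than ASCII letters and digits) are assumptions made for illustration only.

```php
<?php
// Hypothetical helper for illustration; not the implementation in Tools.php.
// Counts word n-grams in $text and adds them to the running totals in $counts.
function countNgrams(string $text, int $n, array $counts = []): array
{
    // Assumption: lowercase the text and split on anything that is not a-z or 0-9.
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $last = count($tokens) - $n;
    for ($i = 0; $i <= $last; $i++) {
        // The key is the n words joined by single spaces, e.g. "saya makan".
        $key = implode(' ', array_slice($tokens, $i, $n));
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    return $counts;
}

$bigrams = countNgrams('saya makan nasi dan saya makan roti', 2);
print_r($bigrams);
```

Passing the previous result back in as `$counts` lets the totals accumulate across several files, which is the same calling pattern the repository's counting functions use.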
See the datasets/ directory for the datasets:
- datajpnn.txt
- datakompas.txt
- datamerdeka.txt
- datametrotv.txt
- datarepublika.txt
- datasuara.txt
- datatempo.txt
- datatribunn.txt
- dataviva.txt
See application/model/Tools.php for the functions.
The following functions can be used:
unigramCount($data,$indexes);
bigramCount($data,$indexes);
trigramCount($data,$indexes);
shannonVisual($model,$first,$min);
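As a rough illustration of what Shannon-style generation does, the sketch below starts from a given word and repeatedly samples the next word in proportion to its bigram count. It is not the body of `shannonVisual`: the function name `generateSentence`, the `$maxWords` cap, and the assumption that the model maps "w1 w2" strings to counts are all hypothetical.

```php
<?php
// Hypothetical sketch of Shannon-style generation; not the code in Tools.php.
// Assumes $bigramCounts maps "w1 w2" => count. $maxWords caps the sentence length.
function generateSentence(array $bigramCounts, string $first, int $maxWords = 10): string
{
    // Index continuations by their first word: $next['saya']['makan'] = 2.
    $next = [];
    foreach ($bigramCounts as $pair => $count) {
        [$w1, $w2] = explode(' ', (string) $pair, 2);
        $next[$w1][$w2] = $count;
    }

    $words = [$first];
    $current = $first;
    // Stop at the length cap or when the current word has no known continuation.
    while (count($words) < $maxWords && isset($next[$current])) {
        // Pick the next word with probability proportional to its bigram count.
        $r = mt_rand(1, array_sum($next[$current]));
        foreach ($next[$current] as $candidate => $count) {
            $r -= $count;
            if ($r <= 0) {
                $current = (string) $candidate;
                break;
            }
        }
        $words[] = $current;
    }
    return implode(' ', $words);
}

$counts = ['saya makan' => 2, 'makan nasi' => 1, 'makan roti' => 1];
echo generateSentence($counts, 'saya', 5);
```

Because the next word is sampled at random, repeated calls with the same starting word can produce different sentences.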
- Clone the repo using Git
# clone repository into your htdocs dir
git clone https://github.com/faisalsyfl/IndoLangModel.git
- Open the project on your local Apache server, e.g. http://localhost/IndoLangModel
/* Your dataset filenames */
$corpus = array('datatribunn.txt','datakompas.txt','datatempo.txt','datajpnn.txt','datamerdeka.txt');
$modelUni = array();
$modelBi = array();
$modelTri = array();
foreach($corpus as $file){
    // Read each file once and reuse the text for all three models
    $text = file_get_contents(FCPATH.'datasets/'.$file);
    $modelUni = $this->Tools->unigramCount($text,$modelUni);
    $modelBi = $this->Tools->bigramCount($text,$modelBi);
    $modelTri = $this->Tools->trigramCount($text,$modelTri);
}
// Print the accumulated n-gram counts
$this->Tools->pre_print_r($modelUni);
$this->Tools->pre_print_r($modelBi);
$this->Tools->pre_print_r($modelTri);
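Once counts like these are available, a bigram probability can be estimated as count(w1 w2) / count(w1). The sketch below shows that calculation under stated assumptions: the helper `bigramProbability` is hypothetical, and it assumes unigram keys are single words and bigram keys are "w1 w2" strings, which may differ from the array layout that Tools.php actually returns.

```php
<?php
// Hypothetical helper; assumes unigram keys are words and bigram keys are "w1 w2".
// Returns the maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1).
function bigramProbability(array $unigramCounts, array $bigramCounts, string $w1, string $w2): float
{
    $pairCount = $bigramCounts["$w1 $w2"] ?? 0;
    $wordCount = $unigramCounts[$w1] ?? 0;
    // Unseen context word: return 0.0 instead of dividing by zero (no smoothing).
    return $wordCount > 0 ? $pairCount / $wordCount : 0.0;
}

$uni = ['saya' => 2, 'makan' => 2, 'nasi' => 1, 'roti' => 1];
$bi  = ['saya makan' => 2, 'makan nasi' => 1, 'makan roti' => 1];
// "makan nasi" accounts for 1 of the 2 occurrences of "makan".
echo bigramProbability($uni, $bi, 'makan', 'nasi');
```

This estimate assigns zero probability to any pair that never appeared in the corpus, so a smoothing method would be needed before using it to score unseen text.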