cmd.csp.stemmer


A simple Java implementation of the Snowball stemmer (http://snowballstem.org/) with stemmers for 20+ languages. It reduces tokens to their stems, which is especially helpful when processing them in Machine Learning (ML) models. It is used in the Cognitive Service Platform cmd.csp as part of the NLP (Natural Language Processing) features.

Prerequisites

There are no prerequisites or dependencies other than the Java core libraries.

Installing/Usage

To use, merge the following into your Maven POM (or the equivalent into your Gradle build script):

<repository>
  <id>github</id>
  <name>GitHub swelcker Apache Maven Packages</name>
  <url>https://maven.pkg.github.com/swelcker</url>
</repository>

<dependency>
  <groupId>cmd.csp</groupId>
  <artifactId>cspstemmer</artifactId>
  <version>1.0.0</version>
</dependency>

Then import cmd.csp.stemmer.* in your application:

// Example
import java.util.Locale;
import cmd.csp.stemmer.*;

private SnowballStemmer stemmer;
private Locale locale = null;
...
		if(this.locale==null) {
			this.locale = Locale.getDefault();
		}
...
		switch(locale.getISO3Language().toLowerCase()){
			case "ara":stemmer=new ArabicStemmer();break;
			case "dan":stemmer=new DanishStemmer();break;
			case "nld":stemmer=new DutchStemmer();break;
			case "eng":stemmer=new EnglishStemmer();break;
			case "fin":stemmer=new FinnishStemmer();break;
			case "fra":stemmer=new FrenchStemmer();break;
			case "deu":stemmer=new GermanStemmer();break;
			case "hun":stemmer=new HungarianStemmer();break;
			case "ind":stemmer=new IndonesianStemmer();break;
			case "gle":stemmer=new IrishStemmer();break;
			case "ita":stemmer=new ItalianStemmer();break;
			case "nep":stemmer=new NepaliStemmer();break;
			case "nor":stemmer=new NorwegianStemmer();break;
			case "por":stemmer=new PortugueseStemmer();break;
			case "ron":stemmer=new RomanianStemmer();break;
			case "spa":stemmer=new SpanishStemmer();break;
			case "rus":stemmer=new RussianStemmer();break;
			case "swe":stemmer=new SwedishStemmer();break;
			case "tam":stemmer=new TamilStemmer();break;
			case "tur":stemmer=new TurkishStemmer();break;
			default:stemmer=new NaiveStemmer();break;
		}
        
		// Set the token to be stemmed
		String tkn = "Testvariable";
		String result = "";
		stemmer.setCurrent(tkn);
		// Run the stemmer
		stemmer.stem();
		// Read back the stemmed result
		result = stemmer.getCurrent();

...
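
The set/stem/get sequence above is easy to wrap in a small helper when you have many tokens. The following is a minimal sketch, not part of this library: `TokenStemmer` is a hypothetical stand-in interface that mirrors only the three calls used in the example (its `stem()` return type is an assumption), and `TrailingSStemmer` is a toy placeholder, not a Snowball algorithm, so that the sketch compiles without the package on the classpath. `switchKey` shows the three-letter code that the switch above would receive for a given locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class StemmingSketch {

    /** Hypothetical stand-in exposing the same three calls as the usage example. */
    public interface TokenStemmer {
        void setCurrent(String token);
        void stem();
        String getCurrent();
    }

    /** Toy placeholder that only strips one trailing "s"; NOT a Snowball stemmer. */
    public static class TrailingSStemmer implements TokenStemmer {
        private String current = "";

        @Override
        public void setCurrent(String token) {
            current = token;
        }

        @Override
        public void stem() {
            if (current.length() > 1 && current.endsWith("s")) {
                current = current.substring(0, current.length() - 1);
            }
        }

        @Override
        public String getCurrent() {
            return current;
        }
    }

    /** Runs setCurrent/stem/getCurrent for each token, reusing one stemmer instance. */
    public static List<String> stemAll(TokenStemmer stemmer, List<String> tokens) {
        List<String> stems = new ArrayList<>();
        for (String token : tokens) {
            stemmer.setCurrent(token);
            stemmer.stem();
            stems.add(stemmer.getCurrent());
        }
        return stems;
    }

    /** Returns the ISO 639-2 code used as the switch key, defaulting to the JVM locale. */
    public static String switchKey(Locale locale) {
        Locale effective = (locale == null) ? Locale.getDefault() : locale;
        return effective.getISO3Language().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("tokens", "models", "stem");
        System.out.println(stemAll(new TrailingSStemmer(), tokens));
        System.out.println(switchKey(Locale.GERMAN));
    }
}
```

In real use you would pass one of the stemmers selected by the switch instead of the toy class, adapting the helper's parameter type to the library's own stemmer type. Any language code not listed in the switch falls through to the default branch.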

Built With

  • Maven - Dependency Management

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

  • Stefan Welcker - Modifications

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Acknowledgments

  • Forked and modified from the original with Copyright (c) 2001, Dr Martin Porter, Copyright (c) 2002, Richard Boulton. All rights reserved.