GitHub - spencermountain/compromise at v0.5.2


#No training, no prolog. A natural-language-processing library in JavaScript, small enough for the browser and quick enough to run on keypress 👬

It does tons of clever things. It's smaller than jQuery, and scores 86% on the Penn Treebank.

nlp.pos('she sells seashells by the seashore').to_past().text()
//she sold seashells by the seashore

##Check it out

##Justification

If the 80-20 rule applies for most things, the "94-6 rule" applies when working with language, by Zipf's law:

The top 10 words account for 25% of used language.

The top 100 words account for 50% of used language.

The top 50,000 words account for 95% of used language.

On the Penn treebank, for example, this is possible:

just a 1,000-word lexicon: 45% accuracy
... then falling back to nouns: 70% accuracy
... then some suffix regexes: 74% accuracy
... then some sentence-level postprocessing: 81% accuracy

The process is to get some curated data, find the patterns, and list the exceptions. Bada bing, bada boom. In this way a satisfactory NLP library can be built with breathtaking lightness.
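
The cascade above (lexicon, noun fallback, suffix regexes, sentence-level postprocessing) might be sketched roughly like this. The lexicon entries, the suffix rules, the single postprocessing rule, and the names `tagWord` and `tagSentence` are all made up for illustration; they are not the library's actual data or API.

```javascript
// Hypothetical miniature lexicon; the real one is far larger.
const lexicon = { the: "DT", by: "IN", she: "PRP", sells: "VBZ" };

// Hypothetical suffix rules, tried in order.
const suffixRules = [
  [/ly$/, "RB"],
  [/ing$/, "VBG"],
  [/ed$/, "VBD"],
  [/(ous|ful|able)$/, "JJ"],
];

function tagWord(word) {
  const w = word.toLowerCase().replace(/[^a-z']/g, "");
  // 1. curated lexicon first
  if (Object.prototype.hasOwnProperty.call(lexicon, w)) return lexicon[w];
  // 2. suffix regexes override the default
  for (const [pattern, tag] of suffixRules) {
    if (pattern.test(w)) return tag;
  }
  // 3. everything else falls back to noun
  return "NN";
}

function tagSentence(text) {
  const tokens = text
    .split(/\s+/)
    .filter(Boolean)
    .map((w) => ({ text: w, pos: tagWord(w) }));
  // 4. one example of sentence-level postprocessing:
  //    a "verb" directly after a determiner is retagged as a noun
  for (let i = 1; i < tokens.length; i++) {
    if (tokens[i - 1].pos === "DT" && tokens[i].pos.startsWith("VB")) {
      tokens[i].pos = "NN";
    }
  }
  return tokens;
}

console.log(tagSentence("she walked by the building").map((t) => t.pos));
// [ 'PRP', 'VBD', 'IN', 'DT', 'NN' ]
```

Each layer is cheap and only handles what the layer before it missed, which is why the whole thing can stay small.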

Namely, it can be run right on the user's computer instead of a server.

Client-side

<script src="https://rawgit.com/spencermountain/nlp_compromise/master/client_side/nlp.min.js"> </script>
<script>
  nlp.noun("dinosaur").pluralize()
  //dinosaurs
</script>

or, use the angular module

Server-side

$ npm install nlp_compromise

var nlp = require("nlp_compromise")
nlp.syllables("hamburger")
//[ 'ham', 'bur', 'ger' ]

API

###Sentence methods

  var s= nlp.pos("Tony Danza is dancing").sentences[0]
  s.tense()
  //present
  s.text()
  //"Tony Danza is dancing"
  s.to_past().text()
  //Tony Danza was dancing
  s.to_present().text()
  //Tony Danza is dancing
  s.to_future().text()
  //Tony Danza will be dancing
  s.negate().text()
  //Tony Danza is not dancing
  s.tags()
  //[ 'NNP', 'CP', 'VB' ]
  s.entities()
  //[{text:"Tony Danza"...}]
  s.people()
  //[{text:"Tony Danza"...}]
  s.nouns()
  //[{text:"Tony Danza"...}]
  s.adjectives()
  //[]
  s.adverbs()
  //[]
  s.verbs()
  //[{text:"dancing"}]
  s.values()
  //[]

###Noun methods:

nlp.noun("earthquakes").singularize()
//earthquake

nlp.noun("earthquake").pluralize()
//earthquakes

nlp.noun('veggie burger').is_plural
//false

nlp.noun('hour').article()
//an

nlp.inflect('mayors of toronto')
//{ plural: 'mayors of toronto', singular: 'mayor of toronto' }

###Verb methods:

nlp.verb("walked").conjugate()
//{ infinitive: 'walk',
//  present: 'walks',
//  past: 'walked',
//  gerund: 'walking'}
nlp.verb('swimming').to_past()
//swam

###Adjective methods:

nlp.adjective("quick").conjugate()
//  { comparative: 'quicker',
//    superlative: 'quickest',
//    adverb: 'quickly',
//    noun: 'quickness'}

###Adverb methods

nlp.adverb("quickly").conjugate()
//  { adjective: 'quick'}

Part-of-speech tagging

86% on the Penn treebank

nlp.pos("Tony Hawk walked quickly to the store.").tags()
// [ [ 'NN', 'VB', 'RB', 'IN', 'DT', 'NN' ] ]

nlp.pos("they would swim").tags()
// [ [ 'PRP', 'MD', 'VBP' ] ]
nlp.pos("the obviously good swim").tags()
// [ [ 'DT', 'RB', 'JJ', 'NN' ] ]

Named-Entity recognition

nlp.spot("Tony Hawk walked quickly to the store.")
// ["Tony Hawk", "store"]
nlp.spot("joe carter loves toronto")
// ["joe carter", "toronto"]

Sentence segmentation

nlp.sentences("Hi Dr. Miller the price is 4.59 for the U.C.L.A. Ph.Ds.").length
//1

nlp.tokenize("she sells sea-shells").length
//3
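
The `nlp.sentences` example above only works if periods inside abbreviations, initialisms, and decimals are not treated as sentence ends. A minimal sketch of one way to do that, assuming a hand-made abbreviation list: `splitSentences`, `endsWithAbbreviation`, and the list itself are hypothetical, not the library's implementation.

```javascript
// Hypothetical abbreviation list; a real one would be much longer.
const abbreviations = new Set(["dr", "mr", "mrs", "ms", "prof", "jr", "sr", "st"]);

function endsWithAbbreviation(chunk) {
  const lastWord = chunk.trim().split(/\s+/).pop();
  // initialisms such as U.C.L.A.
  if (/^([a-z]\.){2,}$/i.test(lastWord)) return true;
  const bare = lastWord.replace(/[.!?]+$/, "").toLowerCase();
  return abbreviations.has(bare);
}

function splitSentences(text) {
  // Naive split: punctuation only counts when followed by whitespace or end-of-text,
  // so the period in "4.59" never ends a chunk.
  const chunks = text.match(/\S.*?[.!?]+(?=\s|$)|\S.*$/g) || [];
  const sentences = [];
  let pending = "";
  for (const chunk of chunks) {
    pending = pending ? pending + " " + chunk : chunk;
    // Glue the chunk to the next one if it ended on an abbreviation.
    if (!endsWithAbbreviation(chunk)) {
      sentences.push(pending);
      pending = "";
    }
  }
  if (pending) sentences.push(pending);
  return sentences;
}

console.log(splitSentences("Hi Dr. Miller the price is 4.59 for the U.C.L.A. Ph.Ds.").length);
// 1
```

The trade-off is that a sentence that genuinely ends on an initialism will be glued to the one after it.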

Syllable hyphenization

70% on the moby hyphenization corpus 0.5k

nlp.syllables("hamburger")
//[ 'ham', 'bur', 'ger' ]

US-UK Localization

nlp.americanize("favourite")
//favorite
nlp.britishize("synthesized")
//synthesised

N-gram

var str = "She sells seashells by the seashore. The shells she sells are surely seashells."
nlp.ngram(str, {min_count:1, max_size:5})
// [{ word: 'she sells', count: 2, size: 2 },
// ...
options.min_count // throws away seldom-repeated grams. defaults to 1
options.max_size // prevents the result from becoming gigantic. defaults to 5
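
For a feel of what the two options do, here is a rough, self-contained sketch of n-gram counting. It is not the library's code: the flat, sorted return shape is an assumption, and this version ignores sentence boundaries.

```javascript
function ngram(text, options = {}) {
  const minCount = options.min_count || 1; // drop grams seen fewer times than this
  const maxSize = options.max_size || 5;   // longest gram, in words
  const words = text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, " ")
    .split(/\s+/)
    .filter(Boolean);

  // Count every run of 1..maxSize consecutive words.
  const counts = new Map();
  for (let size = 1; size <= maxSize; size++) {
    for (let i = 0; i + size <= words.length; i++) {
      const gram = words.slice(i, i + size).join(" ");
      counts.set(gram, (counts.get(gram) || 0) + 1);
    }
  }

  return [...counts.entries()]
    .map(([word, count]) => ({ word, count, size: word.split(" ").length }))
    .filter((g) => g.count >= minCount)
    .sort((a, b) => b.count - a.count || b.size - a.size);
}

const text = "She sells seashells by the seashore. The shells she sells are surely seashells.";
console.log(ngram(text, { min_count: 2 })[0]);
// { word: 'she sells', count: 2, size: 2 }
```

Raising `min_count` is usually the quickest way to shrink the output, since most grams in a short text occur only once.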

Unicode Normalisation

A hugely ignorant and widely subjective transliteration of Latin, Cyrillic, and Greek Unicode characters to English ASCII.

nlp.normalize("Björk")
//Bjork

and for fun,

nlp.denormalize("The quick brown fox jumps over the lazy dog", {percentage:50})
// The ɋӈїck brown fox juӎÞs over tӊe laζy dog
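
The same effect as the `nlp.normalize` example can be approximated for Latin letters with the standard `String.prototype.normalize`. This is a different technique from whatever mapping the library uses, shown only as a sketch: `toAscii` and the `extras` table are hypothetical, and Cyrillic or Greek would need a hand-written table of their own.

```javascript
// Hypothetical extras for letters that have no Unicode decomposition.
const extras = { "ø": "o", "Ø": "O", "ß": "ss", "æ": "ae", "Æ": "AE", "ł": "l" };

function toAscii(str) {
  return str
    .normalize("NFD")                // split "ö" into "o" + combining diaeresis
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .replace(/[^\x00-\x7f]/g, (ch) => (ch in extras ? extras[ch] : ch));
}

console.log(toAscii("Björk"));
// Bjork
```

Characters that are in neither the decomposition tables nor `extras` pass through unchanged.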

Details

Tags

  "verb":
    "VB" : "verb, generic (eat)"
    "VBD" : "past-tense verb (ate)"
    "VBN" : "past-participle verb (eaten)"
    "VBP" : "infinitive verb (eat)"
    "VBZ" : "present-tense verb (eats, swims)"
    "CP" : "copula (is, was, were)"
    "VBG" : "gerund verb (eating,winning)"
  "adjective":
    "JJ" : "adjective, generic (big, nice)"
    "JJR" : "comparative adjective (bigger, cooler)"
    "JJS" : "superlative adjective (biggest, fattest)"
  "adverb":
    "RB" : "adverb (quickly, softly)"
    "RBR" : "comparative adverb (faster, cooler)"
    "RBS" : "superlative adverb (fastest (driving), coolest (looking))"
  "noun":
    "NN" : "noun, singular (dog, rain)"
    "NNP" : "singular proper noun (Edinburgh, skateboard)"
    "NNPS" : "plural proper noun (Smiths)"
    "NNS" : "plural noun (dogs, foxes)"
    "NNO" : "possessive noun (spencer's, sam's)"
    "NG" : "gerund noun (eating, winning, but used grammatically as a noun)"
    "PRP" : "personal pronoun (I,you,she)"
  "glue":
    "PP" : "possessive pronoun (my,one's)"
    "FW" : "foreign word (mon dieu, voila)"
    "IN" : "preposition (of,in,by)"
    "MD" : "modal verb (can,should)"
    "CC" : "co-ordinating conjunction (and,but,or)"
    "DT" : "determiner (the,some)"
    "UH" : "interjection (oh, oops)"
    "EX" : "existential there (there)"
  "value":
    "CD" : "cardinal value, generic (one, two, june 5th)"
    "DA" : "date (june 5th, 1998)"
    "NU" : "number (89, half-million)"

####Lexicon

Because the library can conjugate all sorts of forms, it only needs to store one grammatical form of each word. The lexicon was built using the American National Corpus, then intersected with the regex rule-list. For example, it lists only 300 verbs, then blasts out their 1200+ derived forms.
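
A toy version of that expansion, assuming very naive regular rules plus a short exception list. `conjugate`, `expandLexicon`, the `irregular` table, and the form-to-tag mapping are illustrative stand-ins, not the library's internals, and the regular rules will get plenty of real verbs wrong.

```javascript
// Hypothetical exception list: only the forms that break the regular rules.
const irregular = { swim: { past: "swam", gerund: "swimming" } };

function conjugate(infinitive) {
  const regular = {
    infinitive,
    present: /(s|sh|ch|x|z)$/.test(infinitive) ? infinitive + "es" : infinitive + "s",
    past: infinitive.endsWith("e") ? infinitive + "d" : infinitive + "ed",
    gerund: infinitive.endsWith("e") ? infinitive.slice(0, -1) + "ing" : infinitive + "ing",
  };
  // Exceptions overwrite whatever the rules produced.
  return Object.assign(regular, irregular[infinitive] || {});
}

// Tag names follow the tag table in this README.
const tagFor = { infinitive: "VBP", present: "VBZ", past: "VBD", gerund: "VBG" };

function expandLexicon(verbs) {
  const lexicon = {};
  for (const verb of verbs) {
    for (const [form, word] of Object.entries(conjugate(verb))) {
      lexicon[word] = tagFor[form];
    }
  }
  return lexicon;
}

const lexicon = expandLexicon(["walk", "swim", "dance"]);
console.log(Object.keys(lexicon).length); // 12
console.log(lexicon.swam);                // VBD
```

Three stored verbs become twelve lookup entries here, the same four-to-one ratio as 300 verbs and 1200+ forms.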

####Contractions

Unlike other NLP toolkits, this library puts a 'silent token' into the phrase for contractions. Otherwise something would be neglected.

nlp.pos("i'm good.")
   [{
     text:"i'm",
     normalised:"i",
     pos:"PRP"
   },
   {
     text:"",
     normalised:"am",
     pos:"CP"
   },
   {
     text:"good.",
     normalised:"good",
     pos:"JJ"
   }]
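
The silent-token idea can be sketched like this: the first token keeps the original text, and the second carries the expanded word with empty text, so joining the `text` fields still gives back the input. The contraction table and the `tokenize` function here are hypothetical and leave out tagging entirely.

```javascript
// Hypothetical contraction table.
const contractions = new Map([
  ["i'm", ["i", "am"]],
  ["it's", ["it", "is"]],
  ["don't", ["do", "not"]],
]);

function tokenize(text) {
  const tokens = [];
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const key = word.toLowerCase().replace(/[.,!?]+$/, "");
    const parts = contractions.get(key);
    if (parts) {
      tokens.push({ text: word, normalised: parts[0] });
      // the silent token: no visible text, but a word the tagger can see
      tokens.push({ text: "", normalised: parts[1] });
    } else {
      tokens.push({ text: word, normalised: key });
    }
  }
  return tokens;
}

console.log(tokenize("i'm good."));
// [ { text: "i'm", normalised: 'i' },
//   { text: '', normalised: 'am' },
//   { text: 'good.', normalised: 'good' } ]
```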

####Tokenization

Neighbouring words with the same part of speech are merged together, unless there is punctuation, different capitalisation, or special cases.

nlp.pos("tony hawk won")
//tony hawk   NN
//won   VB

To turn this off:

nlp.pos("tony hawk won", {dont_combine:true})
//tony   NN
//hawk   NN
//won   VB
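
A rough sketch of the merging rule, assuming tokens are already tagged. `combine` and its helpers are made-up names, and the "special cases" mentioned above are left out; only the `dont_combine` option name comes from the example.

```javascript
const isCapitalised = (word) => /^[A-Z]/.test(word);
const endsWithPunctuation = (word) => /[.,;:!?]$/.test(word);

function combine(tokens, options = {}) {
  if (options.dont_combine) return tokens.map((t) => ({ ...t }));
  const merged = [];
  for (const token of tokens) {
    const prev = merged[merged.length - 1];
    const sameKind =
      prev &&
      prev.pos === token.pos &&
      !endsWithPunctuation(prev.text) &&
      isCapitalised(prev.text) === isCapitalised(token.text);
    if (sameKind) prev.text += " " + token.text; // grow the previous token
    else merged.push({ ...token });
  }
  return merged;
}

const tagged = [
  { text: "tony", pos: "NN" },
  { text: "hawk", pos: "NN" },
  { text: "won", pos: "VB" },
];
console.log(combine(tagged));
// [ { text: 'tony hawk', pos: 'NN' }, { text: 'won', pos: 'VB' } ]
```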

Licence

MIT
