How to deal with indeterminacy? #41

holtzermann17 · 2016-01-14T15:44:11Z

Evaluating (treebank-parser ["What can happen in a second ."]) using the set-up in the README file here, I get the following parse:

(TOP
 (SBARQ
  (WHNP (WP What))
  (SQ
   (VP (MD can) (VP (VB happen) (PP (IN in) (NP (DT a) (JJ second))))))
  (. .)))

Actually I'm pretty sure the JJ should be an NN. Is this alternative known to the OpenNLP engine at some point in its parse, and if so, can I get it to report on the known alternative(s)?

jacopofar · 2016-05-19T15:23:15Z

OpenNLP does allow retrieving that information. I'd be happy to try to add it to the library, but I'm a Clojure newbie and can't be sure how long it'll take me :)

dakrone · 2016-05-19T15:59:50Z

@jacopofar it would be great if that could be added, perhaps in metadata (or wherever it fits)?

jacopofar · 2016-05-19T16:54:14Z

I think the parser should return a vector of parse trees along with their probabilities (currently it forces the number of parses to 1).

I made a first attempt at allowing this to be specified here, but have yet to write a test for it.

eihli · 2020-11-04T19:29:35Z

Here's some code I added to one of my projects that gives access to the probabilities for part-of-speech tagging. I imagine something similar could be done for parsing.

(ns com.owoga.prhyme.util.nlp
  (:require [opennlp.nlp :as nlp]
            [opennlp.treebank :as tb]
            [clojure.string :as string]
            [clojure.java.io :as io]
            [clojure.zip :as zip]
            [com.owoga.prhyme.nlp.tag-sets.treebank-ii :as tb2])
  (:import (opennlp.tools.postag POSModel POSTaggerME)))

(def tokenize (nlp/make-tokenizer (io/resource "models/en-token.bin")))
(def get-sentences (nlp/make-sentence-detector (io/resource "models/en-sent.bin")))
(def parse (tb/make-treebank-parser (io/resource "models/en-parser-chunking.bin")))
(def pos-tagger (nlp/make-pos-tagger (io/resource "models/en-pos-maxent.bin")))

;;;; The tagger that opennlp.nlp gives us doesn't provide access
;;;; to the probabilities of all tags. It gives us the probability of the
;;;; top tag through some metadata. But to get probs for all tags, we
;;;; need to do something like implement our own tagger.
(defprotocol Tagger
  (tags [this sent])
  (probs [this])
  (top-k-sequences [this sent]))

(defn make-pos-tagger
  [modelfile]
  (let [model (with-open [model-stream (io/input-stream modelfile)]
                (POSModel. model-stream))
        tagger (POSTaggerME. model)]
    (reify Tagger
      (tags [_ tokens]
        (let [token-array (into-array String tokens)]
          (map vector tokens (.tag tagger ^"[Ljava.lang.String;" token-array))))
      (probs [_] (seq (.probs tagger)))
      (top-k-sequences [_ tokens]
        (let [token-array (into-array String tokens)]
          (.topKSequences tagger ^"[Ljava.lang.String;" token-array))))))

(def prhyme-pos-tagger (make-pos-tagger (io/resource "models/en-pos-maxent.bin")))

(comment
  (let [phrase "The feeling hurts."]
    (map (juxt #(.getOutcomes %)
               #(map float (.getProbs %)))
         (top-k-sequences prhyme-pos-tagger (tokenize phrase))))
  ;; => ([["DT" "NN" "VBZ" "."] (0.9758878 0.93964833 0.7375927 0.95285994)]
  ;;     [["DT" "VBG" "VBZ" "."] (0.9758878 0.03690145 0.27251 0.9286113)])
  )
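
For parsing, a rough equivalent might look like the sketch below. It relies on OpenNLP's `ParserTool/parseLine`, which takes the number of parses to return, and on `Parse.getProb`, which reports a log-probability. The names `make-top-k-parser`, `:tree` and `:log-prob` are hypothetical, the model path simply reuses the one from the code above, and the input is assumed to be whitespace-tokenized already. This has not been run against the sentence from the original question, so whether an NN reading of "second" appears among the alternatives is an open question.

```clojure
(ns example.top-k-parses
  (:require [clojure.java.io :as io])
  (:import (opennlp.tools.parser Parse Parser ParserFactory ParserModel)
           (opennlp.tools.cmdline.parser ParserTool)))

(defn make-top-k-parser
  "Hypothetical helper: returns a function of [sentence k] that yields up to
  k parses, each as a map holding the bracketed tree and its log-probability."
  [modelfile]
  (let [model (with-open [model-stream (io/input-stream modelfile)]
                (ParserModel. model-stream))
        ^Parser parser (ParserFactory/create model)]
    (fn [^String sentence k]
      ;; parseLine splits the sentence on whitespace and returns a Parse array
      (mapv (fn [^Parse p]
              (let [sb (StringBuffer.)]
                (.show p sb) ; writes the bracketed tree into the buffer
                {:tree (str sb)
                 :log-prob (.getProb p)}))
            (ParserTool/parseLine sentence parser (int k))))))

(comment
  ;; Assumes the parser model is on the classpath at this path.
  (def top-k-parses
    (make-top-k-parser (io/resource "models/en-parser-chunking.bin")))
  (top-k-parses "What can happen in a second ." 3))
```

Comparing the `:log-prob` values of the returned maps should give some sense of how close the runner-up parses are to the top one.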
