How to deal with indeterminacy? #41

Open
holtzermann17 opened this issue Jan 14, 2016 · 4 comments

@holtzermann17

Evaluating (treebank-parser ["What can happen in a second ."]) using the set-up in the README file here, I get the following parse:

(TOP
 (SBARQ
  (WHNP (WP What))
  (SQ
   (VP (MD can) (VP (VB happen) (PP (IN in) (NP (DT a) (JJ second))))))
  (. .)))

Actually I'm pretty sure the JJ should be an NN. Is this alternative known to the OpenNLP engine at some point in its parse, and if so, can I get it to report on the known alternative(s)?

@jacopofar

OpenNLP lets you retrieve that information. I'd be happy to try to add it to the library, but I'm a Clojure newbie and can't be sure how long it'll take me :)

@dakrone
Owner

dakrone commented May 19, 2016

@jacopofar it would be great if that could be added, perhaps as metadata (or wherever it fits)?

@jacopofar

jacopofar commented May 19, 2016

I think it should return a vector of parse trees along with their probabilities (currently it forces the value to 1).

I made a first attempt at allowing this to be specified here, but have yet to write a test for it.
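
For illustration, a minimal sketch of what "a vector of parse trees along with their probabilities" could look like when calling OpenNLP's Java API directly. This is not the linked attempt and not part of this library: the namespace, the `make-top-k-parser` name, and the shape of the returned maps are all assumptions. It relies on `ParserTool/parseLine` accepting a number of parses and on `Parse.getProb`, which OpenNLP documents as a log probability.

```clojure
(ns example.top-k-parses
  ;; Hypothetical namespace; talks to OpenNLP's Java classes directly.
  (:require [clojure.java.io :as io])
  (:import (opennlp.tools.cmdline.parser ParserTool)
           (opennlp.tools.parser Parse Parser ParserFactory ParserModel)))

(defn make-top-k-parser
  "Returns a function of [sentence k] that yields up to k parses, each as
  {:parse <bracketed string> :log-prob <double>}. The sentence is expected
  to be whitespace-tokenized, e.g. \"What can happen in a second .\""
  [modelfile]
  (let [model (with-open [in (io/input-stream modelfile)]
                (ParserModel. in))
        ^Parser parser (ParserFactory/create model)]
    (fn [^String sentence k]
      (mapv (fn [^Parse p]
              (let [sb (StringBuffer.)]
                ;; .show writes the bracketed tree into the buffer
                (.show p sb)
                {:parse (str sb) :log-prob (.getProb p)}))
            ;; the last argument is the number of parses requested
            (ParserTool/parseLine sentence parser (int k))))))

(comment
  ;; The model path is an assumption; point it at whichever parser model you use.
  (def top-k-parse (make-top-k-parser "models/en-parser-chunking.bin"))
  (top-k-parse "What can happen in a second ." 3))
```

Whether the NN reading of "second" shows up among the returned parses depends on the model and on how many parses are requested.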

@eihli

eihli commented Nov 4, 2020

Here's some code I added to one of my projects that gives access to the probabilities for part-of-speech tagging. I imagine something similar could be done for parsing.

(ns com.owoga.prhyme.util.nlp
  (:require [opennlp.nlp :as nlp]
            [opennlp.treebank :as tb]
            [clojure.string :as string]
            [clojure.java.io :as io]
            [clojure.zip :as zip]
            [com.owoga.prhyme.nlp.tag-sets.treebank-ii :as tb2])
  (:import (opennlp.tools.postag POSModel POSTaggerME)))

(def tokenize (nlp/make-tokenizer (io/resource "models/en-token.bin")))
(def get-sentences (nlp/make-sentence-detector (io/resource "models/en-sent.bin")))
(def parse (tb/make-treebank-parser (io/resource "models/en-parser-chunking.bin")))
(def pos-tagger (nlp/make-pos-tagger (io/resource "models/en-pos-maxent.bin")))

;;;; The tagger that opennlp.nlp gives us doesn't provide access
;;;; to the probabilities of all tags. It gives us the probability of the
;;;; top tag through some metadata. But to get probs for all tags, we
;;;; need to do something like implement our own tagger.
(defprotocol Tagger
  (tags [this sent])
  (probs [this])
  (top-k-sequences [this sent]))

(defn make-pos-tagger
  [modelfile]
  (let [model (with-open [model-stream (io/input-stream modelfile)]
                (POSModel. model-stream))
        tagger (POSTaggerME. model)]
    (reify Tagger
      (tags [_ tokens]
        (let [token-array (into-array String tokens)]
          (map vector tokens (.tag tagger ^"[Ljava.lang.String;" token-array))))
      (probs [_] (seq (.probs tagger)))
      (top-k-sequences [_ tokens]
        (let [token-array (into-array String tokens)]
          (.topKSequences tagger ^"[Ljava.lang.String;" token-array))))))

(def prhyme-pos-tagger (make-pos-tagger (io/resource "models/en-pos-maxent.bin")))

(comment
  (let [phrase "The feeling hurts."]
    (map (juxt #(.getOutcomes %)
               #(map float (.getProbs %)))
         (top-k-sequences prhyme-pos-tagger (tokenize phrase))))
  ;; => ([["DT" "NN" "VBZ" "."] (0.9758878 0.93964833 0.7375927 0.95285994)]
  ;;     [["DT" "VBG" "VBZ" "."] (0.9758878 0.03690145 0.27251 0.9286113)])
  )
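
To turn output of that shape into per-token alternatives, something like the following could work. It is only a sketch: `tag-alternatives` and `ambiguous` are hypothetical helper names, and the input is assumed to be a seq of `[tags probs]` pairs like the one printed above. Each probability is conditional on the rest of its sequence, so keeping the highest value seen per tag is a rough indicator, not a true marginal.

```clojure
(defn tag-alternatives
  "Given tokens and a seq of [tags probs] pairs (one per candidate sequence),
  returns, per token, a map of tag -> highest probability seen for that tag."
  [tokens sequences]
  (mapv (fn [i token]
          {:token token
           :tags (reduce (fn [m [tags probs]]
                           ;; keep the larger of the stored and the new probability
                           (update m (nth tags i) (fnil max 0.0) (nth probs i)))
                         {}
                         sequences)})
        (range)
        tokens))

(defn ambiguous
  "Keeps only the tokens for which more than one tag was proposed."
  [alternatives]
  (filterv #(> (count (:tags %)) 1) alternatives))

(comment
  (ambiguous
   (tag-alternatives
    ["The" "feeling" "hurts" "."]
    [[["DT" "NN" "VBZ" "."] [0.9758878 0.93964833 0.7375927 0.95285994]]
     [["DT" "VBG" "VBZ" "."] [0.9758878 0.03690145 0.27251 0.9286113]]]))
  ;; => [{:token "feeling", :tags {"NN" 0.93964833, "VBG" 0.03690145}}]
  )
```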
