Restrict annotators to some spans (annotations) #125

joancf · 2021-09-06T23:05:15Z

In Java GATE, an annotator can be applied to a subset of the document (to some annotations), for example the title or the index of a document.
In the Python version it seems that an annotator must run over the full document. Of course an annotator could implement such a restriction itself, but for a general-purpose annotator provided by the project this is not possible.
So if I want to use spaCy, I have to run it over the full document. I might instead want to run my own sentence splitter and then process the document sentence by sentence, or apply different pipelines to different parts of the document.

Thanks for the tool, it looks great!

johann-petrak · 2021-09-07T08:40:13Z

Maintainer

Hi @joancf thanks for the suggestion! When you refer to Java Gate doing this, I assume you mean the segment processing PR?

I think functionality like that would be a great addition for gatenlp as well and I have added a FR issue for this:
#127

0 replies

joancf · 2021-09-07T11:56:19Z

Author

I'm not sure whether it's a "segment PR". I used GATE for some years, but not in the last 5 years :-( Now, in Python, there is a new opportunity to use GATE again ;-) and the way it manages documents and annotations. I'm just describing how I remember it working.
What I mean is that when you add a PR to a Java GATE pipeline (an application), one of the parameters is
inputASName - the name of the annotation set used for input
So the PR gets the text spans (with the corresponding annotations) of the selected annotation set, or the full text if none is selected.

2 replies

johann-petrak Sep 7, 2021
Maintainer

Ah, this really depends on the Java GATE PR: some PRs have a parameter that specifies the input annotation type, while other PRs always use the text (e.g. the Tokenizer).
This is no different in gatenlp: depending on the annotator, there is a way to parametrize the input annotation type, the output annotation type, or both.

What I thought you were referring to is this: if a PR processes text or input annotations and creates output annotations, should that processing be restricted to only the part of the document indicated by another annotation type (e.g. "Abstract")? This is possible in Java GATE via the Segment Processing PR, and I think this kind of generic limitation to just part of a document would also be useful in gatenlp.
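
To make the idea concrete, here is a minimal sketch of segment-restricted processing in plain Python. It does not use the gatenlp or Java GATE APIs: `process_segments`, `find_capitalized` and the `(start, end, type)` tuples are hypothetical stand-ins for an annotator and its annotations. The only point is the offset bookkeeping: the annotator sees just the covered text, and its results are shifted back into document offsets.

```python
from typing import Callable, List, Tuple

# (start, end, type), with offsets into the text that was processed
Span = Tuple[int, int, str]


def process_segments(
    text: str,
    segments: List[Tuple[int, int]],
    annotate: Callable[[str], List[Span]],
) -> List[Span]:
    """Run `annotate` on each segment only and return spans in document offsets."""
    results: List[Span] = []
    for seg_start, seg_end in segments:
        covered = text[seg_start:seg_end]
        for start, end, ann_type in annotate(covered):
            # The annotator only saw `covered`, so shift back to document offsets.
            results.append((start + seg_start, end + seg_start, ann_type))
    return results


def find_capitalized(text: str) -> List[Span]:
    """Toy annotator: mark every whitespace-separated word starting with a capital."""
    spans: List[Span] = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        if word[0].isupper():
            spans.append((start, end, "Capitalized"))
        pos = end
    return spans


document = "Title Of Paper\nThe body mentions Vienna."
title_segment = [(0, 14)]  # covers "Title Of Paper" only
title_spans = process_segments(document, title_segment, find_capitalized)
```

Only the title is processed here; "Vienna" in the body is never seen by the annotator.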

joancf Sep 7, 2021
Author

I made the changes to allow running spaCy on part of a document. They are in a fork, where you can see the code (not yet a pull request):
https://github.com/joancf/python-gatenlp/tree/Spacy_on_subdocument
I also wrote a test.
But I can't run it, sorry. I could build the package with pip install . but I could not run the tests. What is the right way to proceed?

johann-petrak · 2021-09-07T18:30:19Z

Maintainer

The most generic way to run the tests should be to run python setup.py test from your forked repo.

Or run pytest tests/test_spacy.py to run only the spaCy tests.

For this, spaCy, the spaCy model and pytest obviously need to be installed in your environment.

The proper way to install from the repo would be pip install -e . or pip install -e .[alldev] to include all dependencies including the ones needed for development.

2 replies

joancf Sep 10, 2021
Author

I've been integrating spaCy with the capability to process annotations individually instead of the full document.
But then I ran into some extra problems (which I have already solved):
1: how to pass parameters to spaCy components (using component_cfg...)
2: how to get annotations from different spaCy components (apart from tokens, sentences...)

To solve the first one I added a component_cfg parameter. It takes the name of the component (or maybe better, a list of components) and creates a component_cfg from the features of the annotation (the component must be prepared to accept "any" parameter list via **kwargs, not just the expected parameters).

For the second one I added a parameter that specifies a list of span types to import.

I can upload the code (a few lines at the end).

johann-petrak Sep 12, 2021
Maintainer

hi - yes, I am not sure I understand everything you describe here, feel free to submit a pull request or attach code here and we can discuss how to merge it!
Thanks!

joancf · 2021-09-13T13:51:25Z

Author

I will try to explain by adding code, and then we can open an issue.

1: how to pass parameters to spaCy components (using component_cfg...)
spaCy components may have parameters, so when you call the spaCy pipeline you can add those parameters to the call.

For example, here is the definition of a new component:

from spacy.language import Language
from spacy.tokens import Doc

@Language.factory("my_factory")
def create_my_factory(nlp: Language, name: str):
    return myFactory(nlp)
...
# the class must implement __call__ with doc as a parameter, but extra parameters are allowed, like
    def __call__(self, doc: Doc, number: str, **kwargs) -> Doc:
...
# number is the parameter, and kwargs is there to avoid errors when other features are passed in

When a component takes such parameters, the way to pass them is to add the component_cfg parameter when calling nlp: nlp(text, component_cfg={'my_factory': {'number': '3'}})
component_cfg is a Dict[str, Dict[str, Any]], where the first key is the component name and the second one is the parameter name.
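
The role of `**kwargs` is easier to see in a small stand-in. The sketch below does not use spaCy: `run_pipeline` and `NumberComponent` are hypothetical names that only imitate how the keyword arguments stored under a component's name in `component_cfg` are forwarded to that component's call, and the document is a plain dict rather than a `Doc`.

```python
from typing import Any, Callable, Dict, List, Tuple


class NumberComponent:
    """Toy component: reads `number` and ignores any other keyword argument."""

    def __call__(self, doc: Dict[str, Any], number: str = "0", **kwargs: Any) -> Dict[str, Any]:
        doc["number"] = number
        return doc


def run_pipeline(
    doc: Dict[str, Any],
    pipeline: List[Tuple[str, Callable[..., Dict[str, Any]]]],
    component_cfg: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Call each component with the keyword arguments registered under its name."""
    for name, component in pipeline:
        doc = component(doc, **component_cfg.get(name, {}))
    return doc


pipeline = [("my_factory", NumberComponent())]
# The extra "section" entry would raise a TypeError if __call__ had no **kwargs.
cfg = {"my_factory": {"number": "3", "section": "title"}}
result = run_pipeline({"text": "some text"}, pipeline, cfg)
```

Because every annotation feature ends up as a keyword argument, a component without `**kwargs` would fail as soon as an annotation carries a feature it does not expect.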

From gatenlp we know neither which components are in the pipeline nor which parameters they take, so one simple solution (it could be more sophisticated) is to pass all the annotation features to a given component.
So the function apply_spacy becomes

def apply_spacy(nlp, gatenlpdoc, setname="", containing_anns=None, component_cfg=None, retrieveSpans=[]):
    .....
            if component_cfg:
                component_config = {component_cfg: ann.features.to_dict()}
                spacydoc = nlp(covered, component_cfg=component_config)
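
As a sketch of the "list of components" variant, the dictionary could be built as shown below. `build_component_cfg` is a hypothetical helper, not part of gatenlp, and the plain dict of features stands in for `ann.features.to_dict()`.

```python
from typing import Any, Dict, Iterable, Union


def build_component_cfg(
    components: Union[str, Iterable[str]],
    features: Dict[str, Any],
) -> Dict[str, Dict[str, Any]]:
    """Map every named component to its own copy of the annotation features."""
    if isinstance(components, str):
        components = [components]
    # Copy per component so one component cannot modify another's arguments.
    return {name: dict(features) for name in components}


features = {"number": "3", "lang": "en"}  # stand-in for ann.features.to_dict()
single = build_component_cfg("my_factory", features)
several = build_component_cfg(["my_factory", "other_component"], features)
```

Accepting either a single name or a list keeps the call with one component name working unchanged.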

2: how to get annotations from different spaCy components (apart from tokens, sentences...)
Once the component has run, how can we get the data back? It can be done in two ways: adding new annotations or adding features to existing ones.
The current code copies a predefined set of spans from spaCy to GATE. With the new retrieveSpans parameter, user-defined spans (other than tokens or noun chunks) can also be put back into the GATE document:

def spacy2gatenlp(....     retrieveSpans=[] ):
    ....  # at the end we do...
    for spanType in retrieveSpans:
        try:
            for span in spacydoc.spans[spanType]:
                annset.add(span.start_char + start_offset, span.end_char + start_offset, spanType, {})
        except Exception as e:
            # report the problem (e.g. a span group that does not exist) and continue
            print(f"exception raised: {e}")
    return retdoc
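
The offset arithmetic in that loop can be checked in isolation. In this sketch a plain dict of `(start_char, end_char)` pairs stands in for `spacydoc.spans`, and `collect_spans` is a hypothetical helper, not part of gatenlp. Missing span groups are skipped with `dict.get` rather than caught as exceptions, which is a different choice from the code above.

```python
from typing import Dict, Iterable, List, Tuple


def collect_spans(
    span_groups: Dict[str, List[Tuple[int, int]]],
    retrieve_spans: Iterable[str],
    start_offset: int = 0,
) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) triples shifted into document offsets."""
    collected = []
    for span_type in retrieve_spans:
        # A group that the pipeline never created is simply skipped.
        for start_char, end_char in span_groups.get(span_type, []):
            collected.append((start_char + start_offset, end_char + start_offset, span_type))
    return collected


# Spans found in a segment that starts at character 100 of the full document.
groups = {"dates": [(0, 4), (10, 14)], "places": [(20, 26)]}
shifted = collect_spans(groups, ["dates", "missing"], start_offset=100)
```

Only the requested span types are returned, so "places" is left out even though the pipeline produced it.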

And finally, we can also retrieve the "spaCy document features" (the custom attributes under spacydoc._) as GATE features. If we process spans, these become features of the span annotation:

def apply_spacy(nlp, gatenlpdoc, setname="", containing_anns=None, component_cfg=None, retrieveSpans=[]):
    ....
            spacy2gatenlp(spacydoc, gatenlpdoc=gatenlpdoc, setname=setname, start_offset=ann.start, retrieveSpans=retrieveSpans)
            # dir(spacydoc._) also lists the accessor methods, so skip them
            elems = dir(spacydoc._)
            for elem in elems:
                if elem not in ['get', 'set', 'has']:
                    ann.features[elem] = spacydoc._.get(elem)

So the document features are copied to the current span.
And basically that's all the code.

0 replies
