pqai encoder
This service contains code that creates data representations more amenable to use in machine learning pipelines: for example, creating a vector representation for a piece of text by passing it through a pretrained transformer model, or creating a list of keyword features through tokenization.
Representations are a critical part of any intelligent system. In patent retrieval, representations often need to be created for a textual part of a patent (such as an abstract or a claim) or its metadata (such as a CPC class).
These representations are then fed to machine learning pipelines that perform end-operations such as classification, clustering, or ranking.
```
root
|-- core
|   |-- encoders.py            // defines a number of encoders
|   |-- representations.py     // defines wrappers around representations
|   |-- vectorizers.py         // defines encoders that return real-valued vectors
|   |-- utils.py               // contains utility functions
|-- assets                     // files needed by encoders (ML models, vocab, etc.)
|-- tests
|   |-- test_server.py         // tests for the REST API
|   |-- test_encoders.py
|   |-- test_representations.py
|   |-- test_vectorizers.py
|-- main.py                    // defines the REST API
|
|-- requirements.txt           // list of Python dependencies
|
|-- Dockerfile                 // Docker files
|-- docker-compose.yml
|
|-- env                        // .env file template
|-- deploy.sh                  // script for setting up on local system
```
encoders.py

This module defines the following classes:

- `Encoder`: an abstract class defining the interface common to all encoders. It makes provision for the basic operations shared by all encoders, defined by the following methods (a minimal sketch of this interface follows the list):
    - `encoder_fn`: the core encoding function, used internally by the implementation
    - `is_valid_input`: a validation method returning a boolean value that signifies whether the input is encodable
    - `encode`: the public method that combines the above two operations, validation and encoding
    - `encode_many`: the public method for encoding an array of inputs, optionally leveraging latency-reducing techniques such as batch processing
- `TextEncoder`: an abstract class defining the interface for encoders that take a string as input (for example, a vectorizer that embeds words or sentences).
- `BagOfEntitiesEncoder`: converts a piece of text into an unordered set (bag) of entities. Entities can be considered key concepts in the given text; they are themselves strings and may therefore be further encodable by a `TextEncoder`. In the current implementation, `BagOfEntitiesEncoder` works by discovering occurrences of a predefined set of entities within the given text, so it is a static, text-matching operation as opposed to an ML-based one.
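To make the interface concrete, here is a minimal sketch of how these methods are assumed to fit together. The method names come from the list above; the bodies are illustrative, not the service's actual implementation:

```python
from abc import ABC, abstractmethod

class Encoder(ABC):
    """Illustrative sketch of the Encoder interface described above."""

    @abstractmethod
    def encoder_fn(self, item):
        """Core encoding routine, supplied by each concrete encoder."""

    @abstractmethod
    def is_valid_input(self, item) -> bool:
        """Return True if `item` can be encoded."""

    def encode(self, item):
        """Validate the input, then encode it."""
        if not self.is_valid_input(item):
            raise ValueError(f"Cannot encode: {item!r}")
        return self.encoder_fn(item)

    def encode_many(self, items):
        """Encode a sequence of inputs; a real implementation may batch."""
        return [self.encode(item) for item in items]
```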
representations.py

This module defines the following classes:

- `BagOfEntities`: a generalized version of the bag-of-words representation, whose elements can also be multi-word entities.
- `BagOfVectors`: represents a set of vectors in no particular order. Typically, the vectors correspond to a set of embeddings in a vector space.
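As a quick illustration of the distinction, using plain Python sets and random vectors rather than these classes' actual constructors (which this README does not specify):

```python
import numpy as np

# A bag of words holds single tokens only ...
bag_of_words = {"mouse", "trap", "spring"}

# ... while a bag of entities may also hold multi-word elements
bag_of_entities = {"mouse trap", "spring", "bait holder"}

# A bag of vectors is an unordered collection of embeddings, e.g. one
# per entity; tuples are used here because numpy arrays aren't hashable
bag_of_vectors = {tuple(np.random.rand(4)) for _ in bag_of_entities}
```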
vectorizers.py

Vectorizers are encoders that create representations in the form of numerical feature vectors. For example, the activation vector of the output layer of a deep neural network can be considered a vector representation of the network's input.

This module defines the following vectorizers:
- `SIFTextVectorizer`: creates vector representations of word sequences by treating them as bags of words, retrieving the words' dense vector representations, and averaging them. The averaging is weighted according to each word's smooth inverse frequency (hence the name SIF), so common words get lower weights while rare words are assigned higher weights (the weighting is sketched after this list). Typical usage:

  ```python
  sent = "This invention is a mouse trap."
  vector = SIFTextVectorizer().encode(sent)  # np.ndarray of shape (dim,)

  sents = [
      "This invention is a mouse trap.",
      "This invention presents a bird cage.",
  ]
  vectors = SIFTextVectorizer().encode_many(sents)
  ```
- `SentBERTVectorizer`: creates vector representations of text snippets using a transformer network. The network is first pretrained in an unsupervised manner, then trained on a semantic similarity task such as paraphrase detection, and finally fine-tuned in a learning-to-rank or contrastive-learning setting. Typical usage:

  ```python
  sent = "This invention is a mouse trap."
  vector = SentBERTVectorizer().encode(sent)

  sents = [
      "This invention is a mouse trap.",
      "This invention presents a bird cage.",
  ]
  vectors = SentBERTVectorizer().encode_many(sents)
  ```
- `CPCVectorizer`: creates vector representations of CPC class codes using an embedding matrix. The embeddings were created by treating CPC classes as tokens and applying the GloVe algorithm.

  ```python
  cpc = "H04W52/02"
  vector = CPCVectorizer().encode(cpc)

  cpcs = ["H04W52/02", "H04W72/00"]
  vectors = CPCVectorizer().encode_many(cpcs)
  ```
- `EmbeddingMatrix`: a wrapper around an embedding matrix. It provides functionality such as retrieving the embeddings (vectors) of discrete items.

  ```python
  file = "path/to/tsv/file"
  em = EmbeddingMatrix.from_tsv(file)

  # get the number of dimensions of the vector space
  print(em.dims)

  # check whether it contains an embedding for "base"
  print("base" in em)

  # get the vector for a given word
  print(em["base"])
  ```
- `BagOfVectorsEncoder`: creates a `BagOfVectors` representation from a given text snippet. It works by first extracting entities from the text, creating a `BagOfEntities` representation from them, and finally embedding the (known) entities with a precomputed `EmbeddingMatrix`. Typical usage:

  ```python
  emb_matrix_file = "path/to/tsv/file"
  emb_matrix = EmbeddingMatrix.from_tsv(emb_matrix_file)

  entities = set(["base", "station"])
  bov = BagOfVectorsEncoder(emb_matrix).encode(entities)  # bag of vectors
  ```
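Two details from the list above can be made concrete with a short sketch: the smooth-inverse-frequency weighting behind `SIFTextVectorizer`, and how the resulting vectors are typically compared (cosine similarity). This is a generic illustration; the smoothing constant `a`, the frequency source, and the helper names are assumptions, not code taken from this service:

```python
import numpy as np

def sif_weighted_average(word_vectors, word_freqs, a=1e-3):
    """Average word vectors, down-weighting frequent words (generic SIF sketch).

    word_vectors: list of np.ndarray, one vector per word
    word_freqs:   list of float, each word's relative frequency
    a:            smoothing constant (values around 1e-3 are typical)
    """
    weights = [a / (a + freq) for freq in word_freqs]
    return np.average(np.stack(word_vectors), axis=0, weights=weights)

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-d vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```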
utils.py

This module contains some general-purpose functions used by the other core modules:

- `get_sentences`: a custom sentence tokenizer for patent text (includes rules to avoid breaking sentences in places such as "App. No. 16/234,543")
- `get_paragraphs`: a simple newline-based paragraph tokenizer
- `is_cpc_code`: checks whether a given string is a Cooperative Patent Classification code; it also works for IPC (International Patent Classification) codes, since they have the same format
- `normalize_along_axis`: normalizes a multi-dimensional numpy array along a given axis
- `normalize_rows`: normalizes the rows of a 2d numpy array (normalization = converting to unit vectors by dividing each row by its magnitude)
- `normalize_columns`: normalizes the columns of a 2d numpy array
- `Singleton`: a metaclass used to ensure that an encoder is instantiated only once in memory, after which the same instance is reused (to save memory); the pattern is sketched below
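For the last item, a minimal singleton metaclass looks roughly like this. It is a generic sketch of the pattern, not necessarily the exact code in utils.py:

```python
class Singleton(type):
    """Metaclass that creates each class's instance once and reuses it."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class DummyEncoder(metaclass=Singleton):
    pass

assert DummyEncoder() is DummyEncoder()  # the same instance is reused
```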
The assets required to run this service are stored in the `/assets` directory.

When you clone the GitHub repository, the `/assets` directory will contain nothing but a README file. You will need to download the actual asset files as a zip archive from the following link:

https://s3.amazonaws.com/pqai.s3/public/assets-pqai-encoders.zip

After downloading, extract the zip archive into the `/assets` directory. (Alternatively, you can use the `deploy.sh` script to do this step automatically - see the next section.)
The assets contain the following files/directories:

- `all-MiniLM-L6-v2-2022-07-11`: directory containing a transformer-based neural network model that acts as a vectorizer. This model has been trained in a contrastive-learning setting on the CPC dataset.
- `vectorizer_distilbert_poc`: directory containing another transformer-based neural network model that acts as a vectorizer. This model has been trained in a learning-to-rank setting using the PoC (Patents with one Citation) dataset.
- `cpc_vectors_256d.items.json`: labels (CPC class codes) for the vectors contained in the `cpc_vectors_256d.npy` file
- `cpc_vectors_256d.npy`: dense vectors for CPC classes, computed with a GloVe-like technique for creating word vectors
- `dfs.json`: document frequencies for common terms in patent abstracts
- `entities_blacklist.txt`: a list of keywords used by the bag-of-entities (BoE) encoder (`BagOfEntitiesEncoder` defined in `encoders.py`)
- `entities.npy`: static dense vectors for a set of entities
- `entities.txt`: labels for the vectors stored in `entities.npy`
- `glove-dictionary.json`: term-to-index mapping for a vocabulary
- `glove-dictionary.variations.json`: lemma-to-surface-forms mapping for a vocabulary, e.g. create => create, creating, created, creates
- `glove-vocab.json`: vocabulary for GloVe word embeddings
- `glove-vocab.lemmas.json`: term-to-lemma mapping for a vocabulary
- `glove-We.npy`: GloVe word embeddings
- `glove-Ww.npy`: term weights (based on smooth inverse frequencies)
- `stopwords.txt`: a list of stopwords specific to patents
Prerequisites
The following deployment steps assume that you are running a Linux distribution and have Git and Docker installed on your system.
Setup
The easiest way to get this service up and running on your local system is to follow these steps:
1. Clone the repository:

   ```bash
   git clone https://github.com/pqaidevteam/pqai-encoder.git
   ```

2. Using the `env` template in the repository, create a `.env` file and set the environment variables:

   ```bash
   cd pqai-encoder
   cp env .env
   nano .env
   ```

3. Run the `deploy.sh` script:

   ```bash
   chmod +x deploy.sh
   bash ./deploy.sh
   ```
This will create a Docker image and run it as a Docker container on the port number you specified in the `.env` file.
Alternatively, after following steps (1) and (2) above, you can use the command `python main.py` to run the service in a terminal.
This service is not dependent on any other PQAI service for its operation.
The following services depend on this service:
- pqai-gateway
- pqai-reranker
- pqai-indexer (while indexing, not while searching)
- pqai-snippet