pqai encoder
This service contains code that creates data representations more amenable to use in machine learning pipelines: for example, creating a vector representation for a piece of text by passing it through a pretrained transformer model, or creating a list of keyword features through tokenization.
Representations are a critical part of any intelligent system. In patent retrieval, representations often need to be created for a textual part of a patent (such as an abstract or a claim) or its metadata (such as a CPC class).
These representations are then fed to machine learning pipelines that perform end-operations such as classification, clustering, or ranking.
```
root
|-- core
|   |-- encoders.py            // defines a number of encoders
|   |-- representations.py     // defines wrappers around representations
|   |-- vectorizers.py         // defines encoders that return real-valued vectors
|   |-- utils.py               // contains utility functions
|-- assets                     // files needed by encoders (ML models, vocab, etc.)
|-- tests
|   |-- test_server.py         // tests for the REST API
|   |-- test_encoders.py
|   |-- test_representations.py
|   |-- test_vectorizers.py
|-- main.py                    // defines the REST API
|
|-- requirements.txt           // list of Python dependencies
|
|-- Dockerfile                 // Docker files
|-- docker-compose.yml
|
|-- env                        // .env file template
|-- deploy.sh                  // script for setting up on local system
```
encoders.py

This module defines the following classes:

- `Encoder`: an abstract class defining the interface common to all encoders. It makes provision for the basic operations shared by all encoders, defined by the following methods (a minimal sketch of this interface follows the list):
    - `encoder_fn`: the core encoding function, used internally by the implementation
    - `is_valid_input`: a validation method returning a boolean value that signifies whether the input is encodable
    - `encode`: the public method that combines the above two operations, validation and encoding
    - `encode_many`: the public method for encoding an array of inputs, optionally leveraging latency-reducing techniques such as batch processing
- `TextEncoder`: an abstract class defining the interface for encoders that take a string as input (for example, a vectorizer that embeds words or sentences).
- `BagOfEntitiesEncoder`: converts a piece of text into an unordered set (bag) of entities. Entities can be considered key concepts in the given text; they are themselves strings and may therefore be further encodable by a `TextEncoder`. In the current implementation, `BagOfEntitiesEncoder` works by discovering occurrences of a predefined set of entities within the given text, so it is a static, text-matching operation as opposed to an ML-based one.
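To make the interface concrete, here is a minimal sketch of how these methods are assumed to fit together. The method names come from the list above; the bodies are illustrative, not the service's actual implementation:

```python
from abc import ABC, abstractmethod

class Encoder(ABC):
    """Illustrative sketch of the Encoder interface described above."""

    @abstractmethod
    def encoder_fn(self, item):
        """Core encoding routine, supplied by each concrete encoder."""

    @abstractmethod
    def is_valid_input(self, item) -> bool:
        """Return True if `item` can be encoded."""

    def encode(self, item):
        """Validate the input, then encode it."""
        if not self.is_valid_input(item):
            raise ValueError(f"Cannot encode: {item!r}")
        return self.encoder_fn(item)

    def encode_many(self, items):
        """Encode a sequence of inputs; a real implementation may batch."""
        return [self.encode(item) for item in items]
```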
representations.py

This module defines the following classes:

- `BagOfEntities`: a generalized version of the bag-of-words representation, whose elements can also be multi-word entities.
- `BagOfVectors`: represents a set of vectors in no particular order. Typically, the vectors correspond to a set of embeddings in a vector space.
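As a quick illustration of the distinction, using plain Python sets and random vectors rather than these classes' actual constructors (which this README does not specify):

```python
import numpy as np

# A bag of words holds single tokens only ...
bag_of_words = {"mouse", "trap", "spring"}

# ... while a bag of entities may also hold multi-word elements
bag_of_entities = {"mouse trap", "spring", "bait holder"}

# A bag of vectors is an unordered collection of embeddings, e.g. one
# per entity; tuples are used here because numpy arrays aren't hashable
bag_of_vectors = {tuple(np.random.rand(4)) for _ in bag_of_entities}
```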
vectorizers.py

Vectorizers are encoders that create representations in the form of numerical feature vectors. For example, the activation vector of the output layer of a deep neural network can be considered a vector representation of the network's input.

This module defines the following vectorizers:
- `SIFTextVectorizer`: creates vector representations of word sequences by treating them as bags of words, retrieving the words' dense vector representations, and averaging them. The averaging is weighted according to each word's smooth inverse frequency (hence the name SIF), so common words get lower weights while rare words are assigned higher weights (the weighting is sketched after this list). Typical usage:

  ```python
  sent = "This invention is a mouse trap."
  vector = SIFTextVectorizer().encode(sent)  # np.ndarray of shape (dim,)

  sents = [
      "This invention is a mouse trap.",
      "This invention presents a bird cage.",
  ]
  vectors = SIFTextVectorizer().encode_many(sents)
  ```
- `SentBERTVectorizer`: creates vector representations of text snippets using a transformer network. The network is first pretrained in an unsupervised manner, then trained on a semantic similarity task such as paraphrase detection, and finally fine-tuned in a learning-to-rank or contrastive-learning setting. Typical usage:

  ```python
  sent = "This invention is a mouse trap."
  vector = SentBERTVectorizer().encode(sent)

  sents = [
      "This invention is a mouse trap.",
      "This invention presents a bird cage.",
  ]
  vectors = SentBERTVectorizer().encode_many(sents)
  ```
- `CPCVectorizer`: creates vector representations of CPC class codes using an embedding matrix. The embeddings were created by treating CPC classes as tokens and applying the GloVe algorithm.

  ```python
  cpc = "H04W52/02"
  vector = CPCVectorizer().encode(cpc)

  cpcs = ["H04W52/02", "H04W72/00"]
  vectors = CPCVectorizer().encode_many(cpcs)
  ```
- `EmbeddingMatrix`: a wrapper around an embedding matrix. It provides functionality such as retrieving the embeddings (vectors) of discrete items.

  ```python
  file = "path/to/tsv/file"
  em = EmbeddingMatrix.from_tsv(file)

  # get the number of dimensions of the vector space
  print(em.dims)

  # check whether it contains an embedding for "base"
  print("base" in em)

  # get the vector for a given word
  print(em["base"])
  ```
- `BagOfVectorsEncoder`: creates a `BagOfVectors` representation from a given text snippet. It works by first extracting entities from the text, creating a `BagOfEntities` representation from them, and finally embedding the (known) entities with a precomputed `EmbeddingMatrix`. Typical usage:

  ```python
  emb_matrix_file = "path/to/tsv/file"
  emb_matrix = EmbeddingMatrix.from_tsv(emb_matrix_file)

  entities = set(["base", "station"])
  bov = BagOfVectorsEncoder(emb_matrix).encode(entities)  # bag of vectors
  ```
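Two details from the list above can be made concrete with a short sketch: the smooth-inverse-frequency weighting behind `SIFTextVectorizer`, and how the resulting vectors are typically compared (cosine similarity). This is a generic illustration; the smoothing constant `a`, the frequency source, and the helper names are assumptions, not code taken from this service:

```python
import numpy as np

def sif_weighted_average(word_vectors, word_freqs, a=1e-3):
    """Average word vectors, down-weighting frequent words (generic SIF sketch).

    word_vectors: list of np.ndarray, one vector per word
    word_freqs:   list of float, each word's relative frequency
    a:            smoothing constant (values around 1e-3 are typical)
    """
    weights = [a / (a + freq) for freq in word_freqs]
    return np.average(np.stack(word_vectors), axis=0, weights=weights)

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-d vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```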
utils.py

This module contains some general-purpose functions used by the other core modules:

- `get_sentences`: a custom sentence tokenizer for patent text (includes rules to avoid breaking sentences in places such as "App. No. 16/234,543")
- `get_paragraphs`: a simple newline-based paragraph tokenizer
- `is_cpc_code`: checks whether a given string is a Cooperative Patent Classification code; it also works for IPC (International Patent Classification) codes, since they have the same format
- `normalize_along_axis`: normalizes a multi-dimensional numpy array along a given axis
- `normalize_rows`: normalizes the rows of a 2d numpy array (normalization = converting to unit vectors by dividing each row by its magnitude)
- `normalize_columns`: normalizes the columns of a 2d numpy array
- `Singleton`: a metaclass used to ensure that an encoder is instantiated only once in memory, after which the same instance is reused (to save memory); the pattern is sketched below
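For the last item, a minimal singleton metaclass looks roughly like this. It is a generic sketch of the pattern, not necessarily the exact code in utils.py:

```python
class Singleton(type):
    """Metaclass that creates each class's instance once and reuses it."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class DummyEncoder(metaclass=Singleton):
    pass

assert DummyEncoder() is DummyEncoder()  # the same instance is reused
```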
The assets required to run this service are stored in the `/assets` directory.

When you clone the GitHub repository, the `/assets` directory will contain nothing but a README file. You will need to download the actual asset files as a zip archive from the following link:

https://s3.amazonaws.com/pqai.s3/public/assets-pqai-encoders.zip

After downloading, extract the zip archive into the `/assets` directory. (Alternatively, you can use the `deploy.sh` script to do this step automatically - see the next section.)
The assets contain the following files/directories:

- `all-MiniLM-L6-v2-2022-07-11`: directory containing a transformer-based neural network model that acts as a vectorizer. This model has been trained in a contrastive-learning setting on the CPC dataset.
- `vectorizer_distilbert_poc`: directory containing another transformer-based neural network model that acts as a vectorizer. This model has been trained in a learning-to-rank setting using the PoC (Patents with one Citation) dataset.
- `cpc_vectors_256d.items.json`: labels (CPC class codes) for the vectors contained in the `cpc_vectors_256d.npy` file
- `cpc_vectors_256d.npy`: dense vectors for CPC classes, computed with a GloVe-like technique for creating word vectors
- `dfs.json`: document frequencies for common terms in patent abstracts
- `entities_blacklist.txt`: a list of keywords used by the bag-of-entities (BoE) encoder (`BagOfEntitiesEncoder` defined in `encoders.py`)
- `entities.npy`: static dense vectors for a set of entities
- `entities.txt`: labels for the vectors stored in `entities.npy`
- `glove-dictionary.json`: term-to-index mapping for a vocabulary
- `glove-dictionary.variations.json`: lemma-to-surface-forms mapping for a vocabulary, e.g. create => create, creating, created, creates
- `glove-vocab.json`: vocabulary for GloVe word embeddings
- `glove-vocab.lemmas.json`: term-to-lemma mapping for a vocabulary
- `glove-We.npy`: GloVe word embeddings
- `glove-Ww.npy`: term weights (based on smooth inverse frequencies)
- `stopwords.txt`: a list of stopwords specific to patents
Prerequisites
The following deployment steps assume that you are running a Linux distribution and have Git and Docker installed on your system.
Setup
The easiest way to get this service up and running on your local system is to follow these steps:
1. Clone the repository:

   ```bash
   git clone https://github.com/pqaidevteam/pqai-encoder.git
   ```

2. Using the `env` template in the repository, create a `.env` file and set the environment variables:

   ```bash
   cd pqai-encoder
   cp env .env
   nano .env
   ```

3. Run the `deploy.sh` script:

   ```bash
   chmod +x deploy.sh
   bash ./deploy.sh
   ```
This will create a Docker image and run it as a Docker container on the port number you specified in the `.env` file.
Alternatively, after following steps (1) and (2) above, you can use the command `python main.py` to run the service in a terminal.
This service is not dependent on any other PQAI service for its operation.
The following services depend on this service:
- pqai-gateway
- pqai-reranker
- pqai-indexer (while indexing, not while searching)
- pqai-snippet