SQL-Embedding is a versatile solution that simplifies the process of creating and utilizing embeddings for vector search directly within SQL queries. By providing a unified interface, you can seamlessly access a wide range of embedding models, enabling efficient and effective vector search operations.
- Simplified Embedding Creation: Easily generate embeddings from your data within SQL queries.
- Diverse Model Support: Access a variety of embedding models to suit your specific needs.
- Efficient Vector Search: Perform fast and accurate vector searches directly in your SQL database.
Make sure you have git and Docker desktop installed.
Note, due instability reasons when downloading models that we can't figure out why, some models are disabled by default. Thus, only the model langchain/fake-provider is available in the demo. Our apologies for this.
To access the demo, go to SQL Portal.
Then use the following SQL query to generate an embedding for the word 'test':
select dc.embedding('test', 'langchain/fake-provider')
-- table with two vector columns to store the embeddings
-- pay attention to the vectors dimensions, they must be the same as the model used
create table testtable(
document varchar(1024), -- the text to be embedded
embeddings vector(double, 384) -- store fastembed embeddings using its model bge-small-en-v1.5
)
insert into testtable (
document,
embeddings
) values (
'my text',
dc.embedding('my text', 'langchain/fake-provider')
)
-- show the results
select * from testtable
-- using VECTOR_DOT_PRODUCT function on the generated embeddings and ad-hoc embeddings
select
VECTOR_DOT_PRODUCT(
embeddings,
dc.embedding('my text', 'langchain/fake-provider')
)
from testtable
If you're planning to use an online embeddings service, you'll need an API key. Currently, the project supports OpenAI's Embeddings service.
Obtain your OpenAI API key by creating an account on their platform.
Environment variables are used to store sensitive information like API keys and tokens. These variables need to be set before building the Docker image.
There are two ways to configure these variables:
When launching the Docker container, you can set the OPENAI_API_KEY
the environment variables on a dotenv file using the -e flag:
# OpenAI API key
export OPENAI_API_KEY=$OPENAI_API_KEY
Currently, the project supports the following embeddings models/services, here called providers:
Provider ID | Provider supported models |
---|---|
fastembed | FastEmbed |
openai | OpenAI |
sentence_transformers | Sentence Transformers |
langchain | Langchain (just fake emebeddings for testing purposes) |
By default, the project will use FastEmbed embeddings model. It's free and you don't need an API key to use it.
Open IRIS installation with IPM client installed. Call in any namespace:
USER>zpm "install sql-embeddings"
Or call the following for installing programmatically:
set sc=$zpm("install sql-embeddings")
Note that you'll need to install the libraries manually. If you don't, you'll get an error when you try to use the dc.embedding
function.
To use custom models like spaCy, run the following commands in the IRIS terminal:
$ docker-compose exec iris iris session iris -U IRISAPP
Execute the commands to install spaCy and download the specified model:
USER> !pip3 install spacy
USER> !pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl
For your convenience, the project uses Docker Compose to build and run the containers.
Note that in the case of SentenceTransformers, you'll need to install the SentenceTransformers library manually, since this library is big and takes a long time to install making the build process slow.
Clone/git pull the repo into any local directory
$ git clone https://github.com/musketeers-br/sql-embeddings
Open the terminal in this directory and run:
$ docker-compose build
$ docker-compose up -d
dc.embedding('<text>', '<provider>/<modelName>')
You can control the logging level and file name using the following environment variables:
export SQLEMBEDDINGS_LOGGER_SETTINGS_LOG_LEVEL=INFO
export SQLEMBEDDINGS_LOGGER_SETTINGS_LOG_FILE=/tmp/sql-embeddings.log
To get started, you can use the following SQL query to generate an embedding for the word 'test':
-- using the default embedding provider/model to get the embeddings vector
select dc.embedding('test')
The following a more complex examples that creates a table to store the embeddings, generate and insert embeddings, query the table and show the results and get the similarity between 'my text' and 'lorem ipsum':
-- table with two vector columns to store the embeddings
-- pay attention to the vectors dimensions, they must be the same as the model used
create table testvector(
document varchar(1024), -- the text to be embedded
embFastEmbedModel1 vector(double, 384), -- store fastembed embeddings using its model bge-small-en-v1.5
embFastEmbedModel2 vector(double, 384), -- store fastembed embeddings using its model jina-embeddings-v2-small-en
embOpenAI vector(double, 1536) -- store openai embeddings
)
insert into testvector (
document,
embFastEmbedModel1,
embFastEmbedModel2,
embOpenAI
) values (
'my text',
-- get the fastembed embeddings using the BAAI/bge-small-en-v1.5 model
dc.embedding('my text', 'fastembed/BAAI/bge-small-en-v1.5'),
-- get the fastembed embeddings using another model (jinaai/jina-embeddings-v2-small-en)
dc.embedding('my text', 'fastembed/jinaai/jina-embeddings-v2-small-en'),
-- get the openai embeddings using the text-embedding-3-small model
dc.embedding('my text', 'openai/text-embedding-3-small')
)
-- show the results
select * from testvector
-- show the similarity between 'my text' and 'lorem ipsum'
select
VECTOR_DOT_PRODUCT(
embFastEmbedModel1,
dc.embedding('my text', 'fastembed/BAAI/bge-small-en-v1.5')
) "Similariy between 'my text' and itself",
VECTOR_DOT_PRODUCT(
embFastEmbedModel1,
dc.embedding('lorem ipsum', 'fastembed/BAAI/bge-small-en-v1.5')
) "Similariy between the 'my text' and 'lorem ipsum'"
from testvector
Note that the first execution of each model will take a few seconds to generate the embeddings since the libraries not loaded yet. The subsequent executions will be much faster.