This repository was used to develop a solution that uses word embeddings to gain insight from public documents. Much of the code is written so that it can easily be applied to other projects with other questions and other data. The embedding model `multilingual-e5-small` was used. Embedding a text and comparing similarity in an embedding space is the retrieval step of a RAG (Retrieval-Augmented Generation) method.
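As a minimal sketch of this retrieval step, assuming the model is loaded through the `sentence-transformers` library (the repository's own loading code lives in `src/embedding_model`, so adapt as needed); the query and passage strings are hypothetical. Note that the E5 model family expects `query:` / `passage:` prefixes on its inputs:

```python
from sentence_transformers import SentenceTransformer

# Load the same model used in this project.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# E5 models expect "query: " / "passage: " prefixes on their inputs.
query = "query: Hvilke klimatiltak rapporteres?"            # hypothetical query
passage = "passage: Direktoratet gjennomførte flere tiltak."  # hypothetical paragraph

q_emb, p_emb = model.encode([query, passage], normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity.
print(float(q_emb @ p_emb))
```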
- Clone and `cd` into the repository.
- Run `poetry install` to install the dependencies. You're now ready to run code.

Note: Poetry installs a virtual environment with all of your packages. This means that in order to run the `python` command in the CLI, you need to first activate the venv by running `poetry shell`.
To get started you will need the following software:

- Docker, for example Docker Desktop.
- VSCode and the following extension pack: Remote Development. This is used to connect to the running Docker image.
- The Kudos database `.sql` file downloaded.
- If developing on a Windows installation: WSL, so that you have a UNIX-based filesystem to work in.
- Clone and `cd` into the repository on your UNIX-based filesystem (WSL/MacOS/Linux).
- Place your downloaded Kudos database file in the `/db` folder.
- Open the Command Palette (shortcut: `ctrl + shift + p` / `cmd + shift + p`) inside VSCode. Type and select `Dev Containers: Reopen in Container`. Your dev container should now start initializing. This will take a while, but you are ready to run code once the process is done. If you get an error connecting to the database in the first 10 minutes after building, wait a bit and try again; otherwise, enter the database container to check whether the MariaDB service is operational (a hedged connectivity check is sketched below).
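If the connection keeps failing, a quick check from inside the dev container is a small script like the one below. The host name and credentials are hypothetical placeholders; substitute the values from your own container configuration, and any MariaDB-compatible client (here `pymysql`) will do:

```python
import pymysql  # any MariaDB-compatible client works

try:
    # Hypothetical host/credentials; use the values from your container setup.
    conn = pymysql.connect(host="db", user="root", password="secret", database="kudos")
    print("MariaDB is up, server version:", conn.get_server_info())
    conn.close()
except pymysql.err.OperationalError as exc:
    print("MariaDB is not reachable yet:", exc)
```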
Anything in the `src/embedding_model` folder does not need to be changed in order to use this code for other use cases. It contains the code for the embedding model and is not dependent on the data used.

All of the other code depends on what data is used. If the same columns with the same data types are used, the code should run without issue. The following is an explanation of all the columns that need to be present in the source DataFrame for the documents (a concrete example follows the list):
- `split_paragraphs`: a column where each element is a `list[tuple[int, str, np.array]]`. The first index is the page number that the paragraph is from, the second is the paragraph itself, and the third is the embedding of the paragraph.
- `name`: the name of the organisation owning the document.
- `title`: the title of the document.
- `type`: the type of the document. For our use case this was either `"Årsrapport"` or `"Tildelingsbrev"`, but it can easily be changed.
- `concerned_year`: which year the document is connected to. If this is not filled, `published_at` is needed.
- `id.4` and `parent_id`: these are somewhat optional, and are used for connecting directorates with their parent departments.
The columns that are populated by the code in the `utils` folder are listed below; a sketch of how some of them might be computed follows the list:

- `length_of_split_paragraphs`: the character length of each paragraph in `split_paragraphs`. For each row, the length of this list corresponds to the length of the list in `split_paragraphs`.
- `lengths`: the length of the list in `split_paragraphs`.
- `sims`: the cosine similarity of the document to the query.
- `sims_scaled`: `sims` scaled with the `MinMaxScaler` from `sklearn`.
- `pure_mean_sims`: the unweighted mean of `sims_scaled`.
- `weighted_mean_sims`: the mean of `sims_scaled` with `length_of_split_paragraphs` as the weights.
- `deviation_scaled_with_length`: a custom function that weights the deviation of each paragraph's score by the length of the paragraph.
- `top_k_mean`: the mean of the top k `deviation_scaled_with_length` values.
- `all_mean`: the mean of `top_k_mean`, `pure_mean_sims`, `weighted_mean_sims`, and `deviation_scaled_with_length`.
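As an illustration of the scoring columns above, here is a hedged sketch of how the per-document aggregates could be computed for one row. The function name and the `k` default are made up, and the custom `deviation_scaled_with_length` logic is not reproduced here, so `top_k_mean` is shown over `sims_scaled` instead:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def aggregate_row(sims: np.ndarray, lengths: np.ndarray, k: int = 5) -> dict:
    """Compute per-document aggregate scores from paragraph similarities."""
    # Scale this document's paragraph similarities to [0, 1].
    sims_scaled = MinMaxScaler().fit_transform(sims.reshape(-1, 1)).ravel()
    return {
        "pure_mean_sims": sims_scaled.mean(),
        # Longer paragraphs count more towards the weighted mean.
        "weighted_mean_sims": np.average(sims_scaled, weights=lengths),
        # Mean of the k highest scores (illustrative stand-in).
        "top_k_mean": np.sort(sims_scaled)[-k:].mean(),
    }

print(aggregate_row(np.array([0.52, 0.71, 0.64]), np.array([120, 480, 300])))
```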
- Perhaps the easiest way to improve performance would be to use a more powerful model. The model used in this project was `multilingual-e5-small`. A larger model, like `multilingual-e5-base` or `multilingual-e5-large`, would likely give better results, but would also require more computational power (see the sketch after this list).
- As stated previously, the premise of this project was to use word embeddings to gain insight from public documents. This could be expanded to include more documents, or to include documents from other sources.
- The code could be further optimized to run faster. This includes, but is not limited to, methods to reduce embedding time, reduce the time needed to find the most similar documents, or run the code on more specialised machines.
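For the first point, swapping to a larger E5 variant is a one-line change if the model is loaded via `sentence-transformers` (the actual loading code lives in `src/embedding_model`, so adapt as needed):

```python
from sentence_transformers import SentenceTransformer

# Same API, larger model: likely better quality, higher memory and compute cost.
model = SentenceTransformer("intfloat/multilingual-e5-base")
# model = SentenceTransformer("intfloat/multilingual-e5-large")
```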