Refactor HF loader and add poolingMethod #954

Status: Open. Wants to merge 80 commits into base: mainline. Showing changes from 76 commits.

Commits:
550c1f0
Finish initial commit
wanliAlex Aug 16, 2024
0685691
Finish tests
wanliAlex Aug 21, 2024
a8c9bf5
Upgrade requirements.txt
wanliAlex Aug 21, 2024
0d40b58
Upgrade requirements.txt
wanliAlex Aug 21, 2024
a2bef12
Remove max sequence length
wanliAlex Aug 21, 2024
c3fd6f5
Remove outdated open_clip tests
wanliAlex Aug 21, 2024
0cb12e9
Fix unit tests error message
wanliAlex Aug 21, 2024
aaa9878
Add mobile clipmodel
wanliAlex Aug 29, 2024
8c3ec59
Merge branch 'mainline' into li/update-oc
wanliAlex Aug 29, 2024
1d3fa81
Resolve farshid's comments
wanliAlex Sep 2, 2024
8995d54
Merge branch 'mainline' into li/update-oc
wanliAlex Sep 2, 2024
2e8e616
Fix tests
wanliAlex Sep 2, 2024
e51405e
Update version to 2.12.0
wanliAlex Sep 2, 2024
4f8f486
Change base version to 29
wanliAlex Sep 2, 2024
9e7116d
Fix exmaples
wanliAlex Sep 3, 2024
9a6089f
Fix tests
wanliAlex Sep 3, 2024
9a5922c
Fix tests
wanliAlex Sep 3, 2024
c21775f
Add some new multilingual clip models
wanliAlex Sep 4, 2024
39c1557
Add subtests for large clip models
wanliAlex Sep 4, 2024
a124c0c
Update file name
wanliAlex Sep 4, 2024
dd61ac7
Finish HF class
wanliAlex Sep 4, 2024
badcd44
Add max_seq_length back
wanliAlex Sep 4, 2024
3c6be4d
Merge branch 'li/update-oc' into li/add-pooling-hf
wanliAlex Sep 4, 2024
0123c82
Update open clip code
wanliAlex Sep 4, 2024
4e2b040
update open clip class
wanliAlex Sep 4, 2024
4d4ea24
Finish open_clip refactoring
wanliAlex Sep 4, 2024
4415f13
Merge branch 'li/update-oc' into li/add-pooling-hf
wanliAlex Sep 4, 2024
945c589
Finish the implementation. Need tests
wanliAlex Sep 5, 2024
c10e5fa
Catch mainline
wanliAlex Sep 17, 2024
3b451c7
Finish tests for test_hugging_face_model_properties
wanliAlex Sep 18, 2024
86dc3bc
Add tests for new hugging face module
wanliAlex Sep 20, 2024
6fe1f40
Fixing tests
wanliAlex Sep 20, 2024
c9e4cdf
Fixing tests
wanliAlex Sep 20, 2024
5ba9e69
Merge branch 'mainline' into li/add-pooling-hf
wanliAlex Sep 24, 2024
45c99bc
Fix tests
wanliAlex Sep 24, 2024
fc2c0d9
Fix all the tests
wanliAlex Sep 25, 2024
a0f9cd4
Fix marqo_docs()
wanliAlex Sep 25, 2024
f1661b7
Merge branch 'mainline' into li/add-pooling-hf
farshidz Oct 2, 2024
6ba7973
Fix tests
wanliAlex Oct 6, 2024
5e78e9b
Merge branch 'li/add-pooling-hf' of https://github.com/marqo-ai/marqo…
wanliAlex Oct 6, 2024
3e9717f
Fix tests
wanliAlex Oct 7, 2024
38aa780
Fix tests
wanliAlex Oct 7, 2024
878ace5
Fix tests
wanliAlex Oct 7, 2024
4c42251
Merge branch 'mainline' into li/add-pooling-hf
wanliAlex Oct 7, 2024
bc354ae
Fix tests
wanliAlex Oct 7, 2024
3640a03
Change name to inference models
wanliAlex Oct 7, 2024
5e53047
Update abstraction
wanliAlex Oct 7, 2024
0b2cf0f
Fix tests
wanliAlex Oct 7, 2024
c427407
Catch mainline
wanliAlex Oct 7, 2024
42a56c5
Fix tests
wanliAlex Oct 7, 2024
7aa5bef
Fix tests
wanliAlex Oct 7, 2024
835f6bc
Fix regression
wanliAlex Oct 10, 2024
bcb47fd
Fix tests
wanliAlex Oct 10, 2024
18e8cfd
Catch mainline
wanliAlex Oct 10, 2024
2a1132a
Fix load path regression
wanliAlex Oct 10, 2024
9109618
Catch mainline
wanliAlex Oct 10, 2024
a6ff0c6
Merge remote-tracking branch 'origin/mainline' into li/add-pooling-hf
wanliAlex Oct 11, 2024
4ad03a4
update dependencies
wanliAlex Oct 11, 2024
17a17e9
Upgrade comments
wanliAlex Oct 11, 2024
3098ce6
Finish abstract
wanliAlex Oct 14, 2024
8c70e7d
Merge remote-tracking branch 'origin/mainline' into li/add-pooling-hf
wanliAlex Oct 15, 2024
e28434e
Merge branch 'mainline' into li/add-pooling-hf
wanliAlex Oct 21, 2024
b9618de
Finish tests
wanliAlex Oct 21, 2024
f7c5d47
Add private model tests
wanliAlex Oct 21, 2024
f8fd031
Add private model tests
wanliAlex Oct 21, 2024
10f73dc
Add poolingMethod: mean to bge models
wanliAlex Oct 21, 2024
5c0f0dd
Remove unused code
wanliAlex Oct 21, 2024
213a02c
Add model_auth regression fix
wanliAlex Oct 21, 2024
c27d567
Add back the localpath for open_clip model properties:
wanliAlex Oct 21, 2024
0833120
Add validation for dimensions
wanliAlex Oct 21, 2024
e1f3e33
Add os.path.exists tests
wanliAlex Oct 21, 2024
20b6c0b
Add dimensions for model properties tests
wanliAlex Oct 21, 2024
9198acc
Add some extra tests for HF loader
wanliAlex Oct 22, 2024
4b4a804
Add secrets to largemodel unittests
wanliAlex Oct 22, 2024
3f3cb4d
Add model properties
wanliAlex Oct 22, 2024
8be1948
Merge branch 'mainline' into li/add-pooling-hf
farshidz Oct 22, 2024
0ed6606
Fix Yihan's comments
wanliAlex Oct 22, 2024
0e4c25d
Merge branch 'li/add-pooling-hf' of https://github.com/marqo-ai/marqo…
wanliAlex Oct 22, 2024
5ed9a21
Merge remote-tracking branch 'origin/mainline' into li/add-pooling-hf
wanliAlex Oct 22, 2024
81c2c4a
Catch mainline
wanliAlex Oct 22, 2024
4 changes: 4 additions & 0 deletions .github/workflows/largemodel_unit_test_CI.yml
@@ -150,6 +150,10 @@ jobs:
export MARQO_MAX_CPU_MODEL_MEMORY=15
export MARQO_MAX_CUDA_MODEL_MEMORY=15
export PRIVATE_MODEL_TESTS_AWS_ACCESS_KEY_ID=${{ secrets.PRIVATE_MODEL_TESTS_AWS_ACCESS_KEY_ID }}
export PRIVATE_MODEL_TESTS_AWS_SECRET_ACCESS_KEY=${{ secrets.PRIVATE_MODEL_TESTS_AWS_SECRET_ACCESS_KEY }}
export PRIVATE_MODEL_TESTS_HF_TOKEN=${{ secrets.PRIVATE_MODEL_TESTS_HF_TOKEN }}
export PYTHONPATH="./marqo/tests:./marqo/src:./marqo"
pytest marqo/tests --largemodel --ignore=marqo/tests/test_documentation.py
4 changes: 4 additions & 0 deletions .github/workflows/unit_test_200gb_CI.yml
@@ -155,6 +155,10 @@ jobs:
export VESPA_DOCUMENT_URL=http://localhost:8080
export VESPA_QUERY_URL=http://localhost:8080
export PRIVATE_MODEL_TESTS_AWS_ACCESS_KEY_ID=${{ secrets.PRIVATE_MODEL_TESTS_AWS_ACCESS_KEY_ID }}
export PRIVATE_MODEL_TESTS_AWS_SECRET_ACCESS_KEY=${{ secrets.PRIVATE_MODEL_TESTS_AWS_SECRET_ACCESS_KEY }}
export PRIVATE_MODEL_TESTS_HF_TOKEN=${{ secrets.PRIVATE_MODEL_TESTS_HF_TOKEN }}
cd marqo
export PYTHONPATH="./tests:./src:."
pytest --ignore=tests/test_documentation.py --durations=100 --cov=src --cov-branch --cov-context=test --cov-report=html:cov_html --cov-report=lcov:lcov.info tests
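Both workflows export the same three `PRIVATE_MODEL_TESTS_*` variables before running pytest. As a rough sketch of how a test could consume them, the snippet below checks for the variables and skips when any is missing. The variable names come from the workflow files; the helper `missing_private_model_secrets` and the base class `PrivateModelTestCase` are hypothetical names used only for illustration and are not taken from the PR.

```python
import os
import unittest

# Variable names as exported in the CI workflows.
REQUIRED_ENV_VARS = (
    "PRIVATE_MODEL_TESTS_AWS_ACCESS_KEY_ID",
    "PRIVATE_MODEL_TESTS_AWS_SECRET_ACCESS_KEY",
    "PRIVATE_MODEL_TESTS_HF_TOKEN",
)


def missing_private_model_secrets(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


class PrivateModelTestCase(unittest.TestCase):
    """Hypothetical base class: skip instead of failing when secrets are absent."""

    def setUp(self):
        missing = missing_private_model_secrets()
        if missing:
            self.skipTest("missing secrets: " + ", ".join(missing))
```

Skipping rather than failing keeps local runs and forks usable, where repository secrets are typically unavailable.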
@@ -2,7 +2,7 @@
from typing import Optional
from huggingface_hub import hf_hub_download
from marqo.s2_inference.logger import get_logger
from huggingface_hub.errors import RepositoryNotFoundError
from huggingface_hub.utils import RepositoryNotFoundError
from marqo.s2_inference.errors import ModelDownloadError

logger = get_logger(__name__)
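The hunk above lists `RepositoryNotFoundError` under both `huggingface_hub.errors` and `huggingface_hub.utils`. If the code ever needs to tolerate either location, one option is a small fallback importer. This is only a sketch: `import_first_available` is a hypothetical helper, not part of the PR or of `huggingface_hub`.

```python
import importlib


def import_first_available(candidates):
    """Return the first attribute importable from a list of (module, name) pairs."""
    errors = []
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_name}.{attr_name}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


# Intended use (not executed here, since it requires huggingface_hub):
# RepositoryNotFoundError = import_first_available([
#     ("huggingface_hub.errors", "RepositoryNotFoundError"),
#     ("huggingface_hub.utils", "RepositoryNotFoundError"),
# ])
```

Pinning the dependency version and using a single import path is the simpler choice when the supported versions are known.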
@@ -1,13 +1,15 @@
from abc import abstractmethod

import numpy as np
import torch
from PIL import UnidentifiedImageError

from marqo.core.inference.models.abstract_embedding_model import AbstractEmbeddingModel
from marqo.s2_inference.types import *
from marqo.core.inference.image_download import (_is_image, format_and_load_CLIP_images,
format_and_load_CLIP_image)
from marqo.core.inference.inference_models.abstract_embedding_model import AbstractEmbeddingModel
Contributor comment:

I think we should be consistent when renaming classes and directories. I notice there's a new directory called core/inference/inference_models. Maybe it should be core/inference/embedding_models to keep consistency if we're referring to the same objects.

Contributor comment:

This goes for any other reference to inference models vs. embedding models.

from marqo.core.inference.inference_models.image_download import (_is_image, format_and_load_CLIP_images,
format_and_load_CLIP_image)
from marqo.s2_inference.logger import get_logger
import torch
from marqo.s2_inference.types import *
from marqo.tensor_search.models.private_models import ModelAuth

logger = get_logger(__name__)

@@ -25,14 +27,14 @@ class AbstractCLIPModel(AbstractEmbeddingModel):
    """

    def __init__(self, device: Optional[str] = None, model_properties: Optional[dict] = None,
                 model_auth: Optional[dict] = None):
                 model_auth: Optional[ModelAuth] = None):
        """Instantiate the abstract CLIP model.

        Args:
            device (str): The device to load the model on, typically 'cpu' or 'cuda'.
            model_properties (dict): A dictionary containing additional properties or configurations
                specific to the model. Defaults to an empty dictionary if not provided.
            model_auth (dict): The authentication information for the model. Defaults to `None` if not provided
            model_auth (ModelAuth): The authentication information for the model. Defaults to `None` if not provided
        """

        super().__init__(model_properties, device, model_auth)
@@ -42,20 +44,20 @@ def __init__(self, device: Optional[str] = None, model_properties: Optional[dict
        self.preprocess = None

    @abstractmethod
    def encode_text(self, inputs: Union[str, List[str]], normalize: bool = True) -> FloatTensor:
    def encode_text(self, inputs: Union[str, List[str]], normalize: bool = True) -> np.ndarray:
        pass

    @abstractmethod
    def encode_image(self, inputs, normalize: bool = True, image_download_headers: dict = None) -> FloatTensor:
    def encode_image(self, inputs, normalize: bool = True, image_download_headers: dict = None) -> np.ndarray:
        pass

    def encode(self, inputs: Union[str, ImageType, List[Union[str, ImageType]]],
               default: str = 'text', normalize=True, **kwargs) -> FloatTensor:
               default: str = 'text', normalize=True, **kwargs) -> np.ndarray:
        infer = kwargs.pop('infer', True)

        if infer and _is_image(inputs):
            is_image = True
        else:
            is_image = False
            if default == 'text':
                is_image = False
            elif default == 'image':
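The visible part of `encode` chooses between the text and image paths: an image is inferred when `infer` is set and `_is_image(inputs)` is true, otherwise the `default` argument decides. The hunk is truncated, so the sketch below is only one reading of that dispatch. `choose_modality` and its `looks_like_image` callback (a stand-in for `_is_image`) are hypothetical, and raising on an unknown `default` is an assumption.

```python
from typing import Callable


def choose_modality(inputs, looks_like_image: Callable[[object], bool],
                    default: str = "text", infer: bool = True) -> str:
    """Return 'image' or 'text' for the given inputs."""
    # Inference wins when enabled and the inputs look like images.
    if infer and looks_like_image(inputs):
        return "image"
    # Otherwise fall back to the caller-supplied default.
    if default in ("text", "image"):
        return default
    # Assumption: an unrecognised default is treated as an error.
    raise ValueError(f"default must be 'text' or 'image', got {default!r}")


def is_png(value) -> bool:
    # Toy detector for the example; the real check is more involved.
    return isinstance(value, str) and value.endswith(".png")


print(choose_modality("cat.png", is_png))               # image
print(choose_modality("a photo of a cat", is_png))      # text
print(choose_modality("cat.png", is_png, infer=False))  # text
```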
@@ -1,12 +1,14 @@
from abc import ABC, abstractmethod
from typing import Optional

from marqo.tensor_search.models.private_models import ModelAuth


class AbstractEmbeddingModel(ABC):
    """This is the abstract base class for all models in Marqo."""

    def __init__(self, model_properties: Optional[dict] = None, device: Optional[str] = None,
                 model_auth: Optional[dict] = None):
                 model_auth: Optional[ModelAuth] = None):
        """Load the model with the given properties.

        Args:
@@ -20,7 +22,6 @@ def __init__(self, model_properties: Optional[dict] = None, device: Optional[str
        if model_properties is None:
            model_properties = dict()

        self.model_properties = self._build_model_properties(model_properties)
        self.device = device
        self.model_auth = model_auth

@@ -33,11 +34,6 @@ def load(self):
        self._load_necessary_components()
        self._check_loaded_components()

Collaborator comment on lines 34 to 35:

Why do we need to separate these two methods?


    @abstractmethod
    def _build_model_properties(self, model_properties: dict):
        """Parse the model properties from the user input and convert it to a pydantic model."""
        pass

    @abstractmethod
    def _load_necessary_components(self):
        """Load the necessary components for the model."""
@@ -54,4 +50,5 @@ def _check_loaded_components(self):

    @abstractmethod
    def encode(self):
        pass
        """Encode the input data."""
        pass
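The review question above asks why `load()` calls two separate hooks. One possible reading, offered as a sketch and not as the author's stated rationale, is the template-method pattern: each subclass loads its own components, and a separate check fails early when something is missing. `TwoStepLoader` and `ToyModel` are hypothetical classes that mirror only the method names from the diff.

```python
from abc import ABC, abstractmethod


class TwoStepLoader(ABC):
    """Hypothetical stand-in for the base class: load, then verify."""

    def load(self):
        self._load_necessary_components()
        self._check_loaded_components()

    @abstractmethod
    def _load_necessary_components(self):
        """Populate the attributes the model needs."""

    @abstractmethod
    def _check_loaded_components(self):
        """Raise if any required attribute is still missing."""


class ToyModel(TwoStepLoader):
    def __init__(self, provide_tokenizer: bool = True):
        self.model = None
        self.tokenizer = None
        self._provide_tokenizer = provide_tokenizer

    def _load_necessary_components(self):
        self.model = object()
        self.tokenizer = object() if self._provide_tokenizer else None

    def _check_loaded_components(self):
        missing = [name for name in ("model", "tokenizer") if getattr(self, name) is None]
        if missing:
            raise RuntimeError("components not loaded: " + ", ".join(missing))
```

With this split, a loader that silently leaves `tokenizer` unset fails in `load()` and not later in the first `encode` call. Whether that benefit justifies two abstract methods is the judgment the reviewer is asking about.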
33 changes: 33 additions & 0 deletions src/marqo/core/inference/inference_models/hf_tokenizer.py
@@ -0,0 +1,33 @@
import html
from typing import Union, List

import ftfy
import regex as re
import torch


def whitespace_clean(text):
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text

def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()

class HFTokenizer:
    # HuggingFace _tokenizer wrapper
    # Check https://github.com/mlfoundations/open_clip/blob/16e229c596cafaec46a4defaf27e0e30ffcca12d/src/open_clip/tokenizer.py#L188-L201
    def __init__(self, tokenizer_name: str):
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def __call__(self, texts: Union[str, List[str]]) -> torch.Tensor:
        # same cleaning as for default _tokenizer, except lowercasing
        # adding lower (for case-sensitive tokenizers) will make it more robust but less sensitive to nuance
        if isinstance(texts, str):
            texts = [texts]
        texts = [whitespace_clean(basic_clean(text)) for text in texts]
        input_ids = self.tokenizer(texts, return_tensors='pt', padding='max_length', truncation=True).input_ids
        return input_ids
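To illustrate what the two cleaning helpers do before tokenization, here is a standard-library-only sketch. It deliberately omits the `ftfy.fix_text` step and uses `re` in place of `regex`, so it approximates the file above and is not a drop-in replacement. `basic_clean_without_ftfy` is a hypothetical name.

```python
import html
import re


def whitespace_clean(text: str) -> str:
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    return re.sub(r"\s+", " ", text).strip()


def basic_clean_without_ftfy(text: str) -> str:
    # Unescape twice so doubly escaped entities such as "&amp;amp;" become "&".
    # Assumption: the ftfy.fix_text step from the original is skipped here.
    return html.unescape(html.unescape(text)).strip()


raw = "  Cats &amp;amp;\n\tdogs  "
print(whitespace_clean(basic_clean_without_ftfy(raw)))  # Cats & dogs
```

The double `html.unescape` matters for text scraped from pages that were escaped more than once; a single pass would leave `&amp;` in the output.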