-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor HF loader and add poolingMethod #954
Open
wanliAlex
wants to merge
80
commits into
mainline
Choose a base branch
from
li/add-pooling-hf
base: mainline
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 76 commits
Commits
Show all changes
80 commits
Select commit
Hold shift + click to select a range
550c1f0
Finish initial commit
wanliAlex 0685691
Finish tests
wanliAlex a8c9bf5
Upgrade requirements.txt
wanliAlex 0d40b58
Upgrade requirements.txt
wanliAlex a2bef12
Remove max sequence length
wanliAlex c3fd6f5
Remove outdated open_clip tests
wanliAlex 0cb12e9
Fix unit tests error message
wanliAlex aaa9878
Add mobile clipmodel
wanliAlex 8c3ec59
Merge branch 'mainline' into li/update-oc
wanliAlex 1d3fa81
Resolve farshid's comments
wanliAlex 8995d54
Merge branch 'mainline' into li/update-oc
wanliAlex 2e8e616
Fix tests
wanliAlex e51405e
Update version to 2.12.0
wanliAlex 4f8f486
Change base version to 29
wanliAlex 9e7116d
Fix exmaples
wanliAlex 9a6089f
Fix tests
wanliAlex 9a5922c
Fix tests
wanliAlex c21775f
Add some new multilingual clip models
wanliAlex 39c1557
Add subtests for large clip models
wanliAlex a124c0c
Update file name
wanliAlex dd61ac7
Finish HF class
wanliAlex badcd44
Add max_seq_length back
wanliAlex 3c6be4d
Merge branch 'li/update-oc' into li/add-pooling-hf
wanliAlex 0123c82
Update open clip code
wanliAlex 4e2b040
update open clip class
wanliAlex 4d4ea24
Finish open_clip refactoring
wanliAlex 4415f13
Merge branch 'li/update-oc' into li/add-pooling-hf
wanliAlex 945c589
Finish the implementation. Need tests
wanliAlex c10e5fa
Catch mainline
wanliAlex 3b451c7
Finish tests for test_hugging_face_model_properties
wanliAlex 86dc3bc
Add tests for new hugging face module
wanliAlex 6fe1f40
Fixing tests
wanliAlex c9e4cdf
Fixing tests
wanliAlex 5ba9e69
Merge branch 'mainline' into li/add-pooling-hf
wanliAlex 45c99bc
Fix tests
wanliAlex fc2c0d9
Fix all the tests
wanliAlex a0f9cd4
Fix marqo_docs()
wanliAlex f1661b7
Merge branch 'mainline' into li/add-pooling-hf
farshidz 6ba7973
Fix tests
wanliAlex 5e78e9b
Merge branch 'li/add-pooling-hf' of https://github.com/marqo-ai/marqo…
wanliAlex 3e9717f
Fix tests
wanliAlex 38aa780
Fix tests
wanliAlex 878ace5
Fix tests
wanliAlex 4c42251
Merge branch 'mainline' into li/add-pooling-hf
wanliAlex bc354ae
Fix tests
wanliAlex 3640a03
Change name to inference models
wanliAlex 5e53047
Update abstraction
wanliAlex 0b2cf0f
Fix tests
wanliAlex c427407
Catch mainline
wanliAlex 42a56c5
Fix tests
wanliAlex 7aa5bef
Fix tests
wanliAlex 835f6bc
Fix regression
wanliAlex bcb47fd
Fix tests
wanliAlex 18e8cfd
Catch mainline
wanliAlex 2a1132a
Fix load path regression
wanliAlex 9109618
Catch mainline
wanliAlex a6ff0c6
Merge remote-tracking branch 'origin/mainline' into li/add-pooling-hf
wanliAlex 4ad03a4
update dependencies
wanliAlex 17a17e9
Upgrade comments
wanliAlex 3098ce6
Finish abstract
wanliAlex 8c70e7d
Merge remote-tracking branch 'origin/mainline' into li/add-pooling-hf
wanliAlex e28434e
Merge branch 'mainline' into li/add-pooling-hf
wanliAlex b9618de
Finish tests
wanliAlex f7c5d47
Add private model tests
wanliAlex f8fd031
Add private model tests
wanliAlex 10f73dc
Add poolingMethod: mean to bge models
wanliAlex 5c0f0dd
Remove unused code
wanliAlex 213a02c
Add model_auth regression fix
wanliAlex c27d567
Add back the localpath for open_clip model properties:
wanliAlex 0833120
Add validation for dimensions
wanliAlex e1f3e33
Add os.path.exists tests
wanliAlex 20b6c0b
Add dimensions for model properties tests
wanliAlex 9198acc
Add some extra tests for HF loader
wanliAlex 4b4a804
Add secrets to largemodel unittests
wanliAlex 3f3cb4d
Add model properties
wanliAlex 8be1948
Merge branch 'mainline' into li/add-pooling-hf
farshidz 0ed6606
Fix Yihan's comments
wanliAlex 0e4c25d
Merge branch 'li/add-pooling-hf' of https://github.com/marqo-ai/marqo…
wanliAlex 5ed9a21
Merge remote-tracking branch 'origin/mainline' into li/add-pooling-hf
wanliAlex 81c2c4a
Catch mainline
wanliAlex File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
File renamed without changes.
File renamed without changes.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,14 @@ | ||
from abc import ABC, abstractmethod | ||
from typing import Optional | ||
|
||
from marqo.tensor_search.models.private_models import ModelAuth | ||
|
||
|
||
class AbstractEmbeddingModel(ABC): | ||
"""This is the abstract base class for all models in Marqo.""" | ||
|
||
def __init__(self, model_properties: Optional[dict] = None, device: Optional[str] = None, | ||
model_auth: Optional[dict] = None): | ||
model_auth: Optional[ModelAuth] = None): | ||
"""Load the model with the given properties. | ||
|
||
Args: | ||
|
@@ -20,7 +22,6 @@ def __init__(self, model_properties: Optional[dict] = None, device: Optional[str | |
if model_properties is None: | ||
model_properties = dict() | ||
|
||
self.model_properties = self._build_model_properties(model_properties) | ||
self.device = device | ||
self.model_auth = model_auth | ||
|
||
|
@@ -33,11 +34,6 @@ def load(self): | |
self._load_necessary_components() | ||
self._check_loaded_components() | ||
Comment on lines
34
to
35
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we need to separate these two method? |
||
|
||
@abstractmethod | ||
def _build_model_properties(self, model_properties: dict): | ||
"""Parse the model properties from the user input and convert it to a pydantic model.""" | ||
pass | ||
|
||
@abstractmethod | ||
def _load_necessary_components(self): | ||
"""Load the necessary components for the model.""" | ||
|
@@ -54,4 +50,5 @@ def _check_loaded_components(self): | |
|
||
@abstractmethod | ||
def encode(self): | ||
pass | ||
"""Encode the input data.""" | ||
pass |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
import html | ||
from typing import Union, List | ||
|
||
import ftfy | ||
import regex as re | ||
import torch | ||
|
||
|
||
def whitespace_clean(text): | ||
text = re.sub(r'\s+', ' ', text) | ||
text = text.strip() | ||
return text | ||
|
||
def basic_clean(text): | ||
text = ftfy.fix_text(text) | ||
text = html.unescape(html.unescape(text)) | ||
return text.strip() | ||
|
||
class HFTokenizer: | ||
# HuggingFace _tokenizer wrapper | ||
# Check https://github.com/mlfoundations/open_clip/blob/16e229c596cafaec46a4defaf27e0e30ffcca12d/src/open_clip/tokenizer.py#L188-L201 | ||
def __init__(self, tokenizer_name: str): | ||
from transformers import AutoTokenizer | ||
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) | ||
|
||
def __call__(self, texts: Union[str, List[str]]) -> torch.Tensor: | ||
# same cleaning as for default _tokenizer, except lowercasing | ||
# adding lower (for case-sensitive tokenizers) will make it more robust but less sensitive to nuance | ||
if isinstance(texts, str): | ||
texts = [texts] | ||
texts = [whitespace_clean(basic_clean(text)) for text in texts] | ||
input_ids = self.tokenizer(texts, return_tensors='pt', padding='max_length', truncation=True).input_ids | ||
return input_ids |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be consistent when renaming classes/directories. I notice there's a new directory called
core/inference/inference_models
. Maybe it should becore/inference/embedding_models
to keep consistency if we're referring to the same objects.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This goes for any other reference to inference models vs. embeddings models