-
-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ml): better multilingual search with nllb models #13567
base: main
Are you sure you want to change the base?
Conversation
c29a7ad
to
8ea15a9
Compare
📖 Documentation deployed to pr-13567.preview.immich.app |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome
@@ -66,6 +66,65 @@ | |||
SUPPORTED_PROVIDERS = ["CUDAExecutionProvider", "OpenVINOExecutionProvider", "CPUExecutionProvider"] | |||
|
|||
|
|||
WEBLATE_TO_FLORES200 = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder where this mapping should live. To some extent the server knows which model it is using when it sends the request so you could argue this belongs there. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say it's better for it to live here - it makes things easier if you query ML directly or use the models in another context like in notebooks.
@@ -144,6 +144,7 @@ | |||
page: nextPage, | |||
withExif: true, | |||
isVisible: true, | |||
language: $lang, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a new setting? I like to have my UI in English, but I prefer to use my native language for searching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we could have a Search Language user preference separate from the UI language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing is that the setting only applies to four non-default models, so I wonder if it'd be confusing to users. Maybe we can hide the setting if they're not using a relevant model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to also have this be something the admin can override?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean making it use a certain search language by default for all users? That could be nice, possibly doing the same for the UI language as well. But I think we should avoid making too many changes at once in this PR, so the admin stuff can be done separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few related FRs came up that I consolidated into #13651.
Description
The NLLB models have the highest quality results for multilingual text in our model catalog. However, these models expect the input to explicitly indicate the language of the text. Since we don't provide this, the results are worse than other multilingual models.
This PR makes the machine learning service accept a
language
option that it maps to the corresponding FLORES200 token that NLLB expects. The server is updated to accept this parameter and forward it to the machine learning service, while web and mobile are updated to provide the language based on current user settings. Search still works without it, and the option only has an impact when using an NLLB model.How Has This Been Tested?
I tested on web and can confirm the results for Turkish are significantly better. Changing the language in the user settings has a marked effect on the ranking. I haven't tested mobile so I'm not sure if all the language codes there line up with the map in the machine learning service.