
Allow payload request to support extra inference method kwargs #1505

Closed
wants to merge 23 commits

Conversation

nanbo-liu (Contributor)

This is for issue #1345.
This allows us to pass inference_kwargs into the payload.

Example below:

{
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["My kitten's name is JoJo,", "Tell me a story:"]
        }
    ],
    "parameters": {
        "extra": {"max_new_tokens": 200}
    }
}
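For illustration, a minimal sketch of how such a payload could be posted to MLServer's V2 inference endpoint (the model name "text-model", the port, and the single-string input are placeholders, not taken from this PR):

import requests

# Hypothetical example: POST a payload carrying parameters.extra to a local
# MLServer instance; model name and URL are placeholders.
payload = {
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["My kitten's name is JoJo,"],
        }
    ],
    "parameters": {"extra": {"max_new_tokens": 200}},
}

response = requests.post(
    "http://localhost:8080/v2/models/text-model/infer", json=payload
)
print(response.json()["outputs"])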

@nanbo-liu (Contributor Author)

@adriangonz, please take a look when you have a chance.

@nanbo-liu nanbo-liu marked this pull request as draft December 6, 2023 15:03
@nanbo-liu nanbo-liu marked this pull request as ready for review December 6, 2023 15:03
@sakoush sakoush self-requested a review December 11, 2023 08:36
@ajsalow (Contributor) commented Dec 14, 2023

@adriangonz @sakoush could we please have this reviewed when either of you have a chance?

@sakoush (Member) left a comment

Many thanks for your contribution. I left some comments, mainly around testing.

Also, please update the docs of the HF runtime accordingly.

(
{"max_length": 20},
{"max_length": 10},
True,
Member:

It is unclear what expected means given the test case; I suggest refactoring a bit to make it clearer. It might be that you just need to assert that the number of tokens in each request is as expected (effectively converting it into 2 test cases).

Contributor Author:

Updated this into 2 test cases. Also asserting that the number of predicted tokens is the expected number of tokens.
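For illustration, a rough sketch of what the two parametrized cases might look like (the runtime fixture and the generate_request / count_tokens helpers are hypothetical names, not the actual code in this PR, and async test support is assumed):

import pytest

@pytest.mark.parametrize("max_length", [10, 20])
async def test_extra_kwargs_limit_generated_tokens(runtime, max_length):
    # Build a request whose parameters.extra carries the generation kwarg
    # (generate_request is a hypothetical helper).
    payload = generate_request(extra={"max_length": max_length})

    prediction = await runtime.predict(payload)

    # Assert that the number of predicted tokens respects the requested limit
    # (count_tokens is a hypothetical helper).
    assert count_tokens(prediction) <= max_length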

Comment on lines 53 to 55
import pdb

pdb.set_trace()
Contributor:

^

Contributor Author:

oops

runtimes/huggingface/tests/test_common.py (resolved)
@nanbo-liu (Contributor Author)

@sakoush, would you take a look when you have a chance?

@emmettprexus

This seems to be the solution to my problem. I'm using MLServer with something like bhadresh-savani/distilbert-base-uncased-emotion and the default (top_k=1) just gives me the highest score - but I need all of them. Setting parameters.extra.top_k to null gives me the complete response from the model.

Fingers crossed that this will get merged soon.
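For context, a hypothetical V2 payload along the lines described above, with top_k passed through parameters.extra (the input name, text, and model are illustrative only):

payload = {
    "inputs": [
        {
            "name": "args",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["I am so happy today!"],
        }
    ],
    # JSON null (Python None) asks the pipeline to return scores for all labels
    "parameters": {"extra": {"top_k": None}},
}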

@sakoush (Member) left a comment

Many thanks for the changes so far. I left a few minor suggestions, mainly on additional test cases.

Apologies for the slow response due to the holiday season.

@@ -170,6 +171,10 @@ def encode_request(cls, payload: Dict[str, Any], **kwargs) -> InferenceRequest:

@classmethod
def decode_request(cls, request: InferenceRequest) -> Dict[str, Any]:
"""
Decode Inference requst into dictionary
Member:

Suggested change
Decode Inference requst into dictionary
Decode Inference request into dictionary

@@ -170,6 +171,10 @@ def encode_request(cls, payload: Dict[str, Any], **kwargs) -> InferenceRequest:

@classmethod
def decode_request(cls, request: InferenceRequest) -> Dict[str, Any]:
"""
Decode Inference requst into dictionary
extra Inference kwargs can be kept in 'InferenceRequest.parameters.extra'
Member:

Suggested change
extra Inference kwargs can be kept in 'InferenceRequest.parameters.extra'
extra Inference kwargs are extracted from 'InferenceRequest.parameters.extra'

values.update(extra)
else:
logging.warn(
"Extra inference kwargs should be kept in a dictionary."
Member:

Could you output the value of the parameter as well in the warning message? And perhaps change the warning message to something like "Extra parameters cannot be parsed, expected a dictionary" to make it more descriptive?
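For illustration, a sketch of what that suggestion might look like in the fragment above (extra and values are the variables from the quoted code; the final wording is up to the PR author):

import logging

if isinstance(extra, dict):
    values.update(extra)
else:
    # Include the offending value so the warning is actionable.
    logging.warning(
        "Extra parameters cannot be parsed, expected a dictionary, "
        f"got: {extra!r}"
    )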

runtimes/huggingface/tests/test_codecs.py (resolved)
tokenizer = runtime._model.tokenizer

prediction = await runtime.predict(payload)
generated_text = json.loads(prediction.outputs[0].data[0])["generated_text"]
Member:

Could you try to use the hf codec decode_response method and check if it makes this line a bit more readable?
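For illustration, a sketch of that suggestion; the codec class name and import path are assumptions (the thread only confirms the HF codec exposes a decode_response method), and the exact structure of the decoded value may differ:

from mlserver_huggingface.codecs import HuggingfaceRequestCodec  # assumed import path

prediction = await runtime.predict(payload)
# Let the codec turn the raw V2 response back into pipeline-style output,
# instead of manually json.loads-ing the first output tensor.
decoded = HuggingfaceRequestCodec.decode_response(prediction)
generated_text = decoded[0]["generated_text"]  # assumed structure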

@nanbo-liu (Contributor Author)

@sakoush, would you take a look? All the issues you mentioned above should be addressed.


@ajsalow (Contributor) commented Jan 16, 2024

Can we please get another review? @adriangonz @sakoush @seldondev @RafalSkolasinski

What's the status of this project? Is it going to continue to be maintained?

@sakoush (Member) left a comment

LGTM

@emmettprexus

🥳

@ahousley (Member) commented Jan 17, 2024

> What's the status of this project? Is it going to continue to be maintained?

Yes, MLServer continues to be maintained and we (Seldon) welcome contributions. There are releases scheduled for Seldon ecosystem projects and products in the coming weeks, which incorporate a significant amount of customer and community feedback and contributions.

@emmettprexus

> What's the status of this project? Is it going to continue to be maintained?

> Yes, MLServer continues to be maintained and we (Seldon) welcome contributions. There are releases scheduled for Seldon ecosystem projects and products in the coming weeks, which incorporate a significant amount of customer and community feedback and contributions.

Does the upcoming release include these changes?

@ajsalow (Contributor) commented Feb 5, 2024

Any reason this hasn't been merged yet?

@nanbo-liu (Contributor Author)

@ajsalow @emmettprexus @sakoush @adriangonz @seldondev @ahousley
Our teammate @geodavic recently found that the current mlserver-huggingface runtime already supports loading inference kwargs; they just need to be formatted in the KServe way.
An example below:

import requests
import json
payload = {
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["My kitten's name is JoJo,","Tell me a story:"],
        },  
        {
          "name": "max_new_tokens",
          "shape": [1],
          "datatype": "INT32",
          "data": [50],
          "parameters": {
                "content_type": "raw"
          }
        },
        {
          "name": "temperature",
          "shape": [1],
          "datatype": "FP64",
          "data": [0.9],
          "parameters": {
                "content_type": "raw",
          }
        }       
    ]
}

response = requests.post(
    "http://localhost:8080/v2/models/tinyllama/infer", json=payload
)

data = json.loads(response.text)
print(data["outputs"])

So, maybe we don't want to merge this PR since it's already supported. We may want to update the README file to show an example of how to use extra inference kwargs in the payload.

@emmettprexus

> Our teammate @geodavic recently found that the current mlserver-huggingface runtime already supports loading inference kwargs,

It looks like I owe @geodavic a cake for finding that out! I scoured through the code but didn't find any hints, so good job! Now I can move on with my project.

Thanks a ton.

@nanbo-liu nanbo-liu closed this by deleting the head repository Mar 6, 2024