I am getting an error that the prompt length exceeds the maximum input length when calling meta/llama-2-70b through the API. I have included the error log from the Replicate dashboard (see below). I have called the same model in the past without error, and I am almost sure that the prompts were identical or similar in length (prediction data for older predictions has expired, so I can't verify this 100%). The prompt is also not very long: just 6 question-answering demonstrations with a few intermediate reasoning steps.
Inspecting further, I discovered that two different replicate-internal models are being called to serve these requests: replicate-internal/staging-llama-2-70b-mlc (which gave me no error) and replicate-internal/llama-2-70b-triton (which gives the error below).
Do these models have different maximum input lengths? If so, how can I call replicate-internal/staging-llama-2-70b-mlc or another llama-2-70b model with a large enough maximum input length?
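For reference, the call is made through the Replicate Python client, roughly like this (the prompt text and generation parameters here are placeholders, not the exact values from the failing prediction):

```python
import replicate

# Placeholder few-shot prompt: in the real call this is 6 question-answering
# demonstrations with intermediate reasoning steps, followed by a new question.
demonstrations = [
    "Q: <question 1>\nReasoning: <steps>\nA: <answer>",
    # ... 5 more demonstrations ...
]
prompt = "\n\n".join(demonstrations) + "\n\nQ: <new question>\nReasoning:"

# Input keys other than "prompt" are assumptions based on the model's public
# input schema; the failing prediction may have used different settings.
output = replicate.run(
    "meta/llama-2-70b",
    input={"prompt": prompt, "max_new_tokens": 128},
)
print("".join(output))
```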
The error:
[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1 0x7f48929b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f48929b41cd]
2 0x7f48949dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3 0x7f48949dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4 0x7f49c93f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f49c93f2253]
5 0x7f49c9181ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f49c9181ac3]
6 0x7f49c9212a04 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 1728927168: Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1 0x7f48929b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f48929b41cd]
2 0x7f48949dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3 0x7f48949dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4 0x7f49c93f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f49c93f2253]
5 0x7f49c9181ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f49c9181ac3]
6 0x7f49c9212a04 clone + 68
[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1 0x7fbb0e9b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7fbb0e9b41cd]
2 0x7fbb109dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3 0x7fbb109dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4 0x7fbc3cbf2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fbc3cbf2253]
5 0x7fbc3c981ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fbc3c981ac3]
6 0x7fbc3ca12a04 clone + 68
[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1 0x7fa92a9b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7fa92a9b41cd]
2 0x7fa92c9dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3 0x7fa92c9dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4 0x7faa585f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7faa585f2253]
5 0x7faa58381ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7faa58381ac3]
6 0x7faa58412a04 clone + 68
[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1 0x7f9a329b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f9a329b41cd]
2 0x7f9a349dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3 0x7f9a349dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4 0x7f9b5fbf2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f9b5fbf2253]
5 0x7f9b5f981ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f9b5f981ac3]
6 0x7f9b5fa12a04 clone + 68
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/cog/server/worker.py", line 224, in _handle_predict_error
    yield
  File "/usr/local/lib/python3.10/dist-packages/cog/server/worker.py", line 253, in _predict_async
    async for r in result:
  File "/src/predict.py", line 180, in predict
    output = event.json()["text_output"]
KeyError: 'text_output'
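The KeyError on 'text_output' at the end looks like a downstream symptom rather than a separate problem: the request is rejected by the TensorRT-LLM batch manager, so the event that predict.py reads back has no 'text_output' field.

In case it helps with debugging, here is how I am checking prompt length against the 1024-token limit client-side. This is a minimal sketch that assumes the backend tokenizes with the standard Llama-2 tokenizer (meta-llama/Llama-2-70b-hf is gated on Hugging Face, so it needs access to that repo):

```python
# Count Llama-2 tokens for a prompt and compare against the serving limit.
# The 1024-token limit is the value reported by the triton backend above;
# the exact tokenization used server-side is an assumption on my part.
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 1024

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

def fits_input_limit(prompt: str) -> bool:
    n_tokens = len(tokenizer.encode(prompt))
    print(f"prompt is {n_tokens} tokens (limit {MAX_INPUT_TOKENS})")
    return n_tokens <= MAX_INPUT_TOKENS
```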