fix: clear KV cache and reset batch state between sequential decode calls #393
📝 Walkthrough

Pre-generation KV-cache clearing and decode-state resets were added to the LlamaCPP backend to avoid stale KV-cache state and to track decode failures; generate now reports "error" on decode failures. The RAC glue now returns an error on an "error" finish_reason. JNI surfaces backend error messages and frees partial results on failure.
Important
Looks good to me! 👍
Reviewed everything up to 3fb52db in 21 seconds.
- Reviewed 53 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 0 draft comments
Actionable comments posted: 4
🧹 Nitpick comments (1)
sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp (1)
850-853: `racLlmComponentGenerateStream` does not propagate errors as Java exceptions.

The non-streaming path now throws a `RuntimeException` on failure, but the streaming path (`racLlmComponentGenerateStream`, lines 850-853) still silently returns `nullptr` on non-SUCCESS status. Kotlin callers using streaming receive `null` with no exception, creating inconsistent error-handling semantics across the two code paths. Consider applying the same `ThrowNew` pattern at line 851 for consistency with the non-streaming path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp` around lines 850 - 853, In racLlmComponentGenerateStream, when status != RAC_SUCCESS, throw a Java RuntimeException via env->ThrowNew (same pattern used in the non-streaming path) instead of silently returning nullptr; compose the exception message to include context (e.g., "rac_llm_component_generate_stream failed") and the status code, call ThrowNew with "java/lang/RuntimeException", then return nullptr after throwing to preserve JNI semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`:
- Around line 551-557: The post-generation KV cache clear currently calls
llama_memory_clear(llama_get_memory(context_), true) without checking for null;
change that call to mirror the pre-generation logic by retrieving llama_memory_t
mem = llama_get_memory(context_), then if (mem) { llama_memory_clear(mem, true);
} so that llama_get_memory(context_) is null-guarded before calling
llama_memory_clear (refer to the existing pre-generation block that uses
llama_get_memory, mem, and llama_memory_clear).
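The null-guarded clear described above can be sketched as follows. Note that `llama_context`, `llama_get_memory`, and `llama_memory_clear` are stubbed here so the pattern compiles standalone; in the SDK these names come from llama.h, and the stub bodies are assumptions, not the real implementations.

```cpp
#include <cstdio>

// Stub stand-ins for the llama.cpp API so this sketch is self-contained;
// in the SDK these come from llama.h with the same names.
struct llama_memory_i;
using llama_memory_t = llama_memory_i*;
struct llama_context;

static llama_memory_t llama_get_memory(llama_context* ctx) {
    // Stub: the real function returns the context's memory handle, or null.
    return reinterpret_cast<llama_memory_t>(ctx);
}

static int g_clears = 0;  // test instrumentation, not part of the pattern

static void llama_memory_clear(llama_memory_t /*mem*/, bool /*data*/) {
    ++g_clears;  // stub: the real function wipes the KV cache
}

// The pattern the comment asks for: fetch the handle once, clear only if
// it is non-null, mirroring the pre-generation block.
void clear_kv_cache(llama_context* context_) {
    llama_memory_t mem = llama_get_memory(context_);
    if (mem) {
        llama_memory_clear(mem, /*data=*/true);
    }
}
```

With this guard, a null context (or a context whose memory handle is null) simply skips the clear instead of dereferencing null.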
In `@sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp`:
- Around line 210-213: generate_stream currently returns a boolean that
conflates cancellation with a decode failure (called during the token loop when
llama_decode fails), so update generate_stream to propagate an explicit error
state (e.g., return an enum or set an internal flag like decode_failed_) when
llama_decode fails mid-loop instead of relying on cancel_requested_; adjust the
llama_cpp::generate caller to check this new error indicator and preserve/leave
finish_reason as "error" (do not override to "stop" or "length") when
decode_failed_ is set; ensure the initial prompt-decode path still returns false
directly and that cancel_requested_ remains only for user cancellation, with
clear handling in generate_stream and generate to distinguish cancel vs decode
failure.
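One way to realize the separation this prompt describes is an explicit tri-state outcome instead of a boolean. `StreamOutcome` and `run_decode_loop` below are illustrative names for a sketch, not the SDK's actual API; `decode_ok` stands in for the per-token `llama_decode` call.

```cpp
#include <atomic>

// Tri-state outcome so a decode failure is never conflated with user
// cancellation (both previously collapsed into one bool return).
enum class StreamOutcome { Completed, Cancelled, DecodeFailed };

// decode_ok stands in for a llama_decode call on one token; the real loop
// also samples and emits tokens.
StreamOutcome run_decode_loop(int max_tokens,
                              bool (*decode_ok)(int),
                              const std::atomic<bool>& cancel_requested) {
    for (int i = 0; i < max_tokens; ++i) {
        if (cancel_requested.load()) {
            return StreamOutcome::Cancelled;    // user cancellation only
        }
        if (!decode_ok(i)) {
            return StreamOutcome::DecodeFailed; // maps to finish_reason "error"
        }
    }
    return StreamOutcome::Completed;            // "stop" / "length"
}
```

The caller can then set `finish_reason` from the outcome directly, so "error" is never overwritten by the stop/length logic.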
- Around line 210-213: The new early-return for generation failure in
rac_llm_llamacpp_generate should set error details before returning and compare
the finish reason via the enum rather than a string: call
rac_error_set_details(...) with a descriptive message (e.g., "generation failed
(llama_decode error)") immediately before returning RAC_ERROR_GENERATION_FAILED,
replace the std::string literal check result.finish_reason == "error" with a
typed comparison against TextGenerationFinishReason::ERROR (or the appropriate
enum member) on the TextGenerationResult, and keep the RAC_LOG_ERROR call for
logging; this ensures callers see fresh diagnostics and removes the raw-string
comparison.
In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 568-579: The code path in racLlmComponentGenerate returns nullptr
without throwing if rac_error_message(status) returns null/empty; change the
error handling so that whenever status != RAC_SUCCESS you always throw a Java
exception via env->ThrowNew (use the same exClass "java/lang/RuntimeException")
with a fallback message when msg is null/empty (e.g. "rac error <status> (no
message)"), include the status integer in the message for diagnostics, and
ensure you still call env->DeleteLocalRef(exClass) before returning nullptr so
no Java exception case is silently suppressed.
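The fallback-message logic this comment calls for can be isolated into a small helper. `build_error_message` is a hypothetical name (not part of the SDK), and the possibly-null result of `rac_error_message(status)` is simulated by passing it in as a parameter.

```cpp
#include <cstdio>
#include <string>

// Compose a guaranteed non-empty message for env->ThrowNew. msg is whatever
// rac_error_message(status) returned, which may be null or empty.
// build_error_message is a hypothetical helper, not part of the SDK.
std::string build_error_message(const char* msg, int status) {
    if (msg && *msg) {
        return msg;  // backend supplied a real diagnostic
    }
    char fallback[64];
    std::snprintf(fallback, sizeof(fallback), "rac error %d (no message)", status);
    return fallback;
}
```

The status integer always appears in the fallback, so even a message-less error remains diagnosable from the Java side.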
---
Nitpick comments:
In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 850-853: In racLlmComponentGenerateStream, when status !=
RAC_SUCCESS, throw a Java RuntimeException via env->ThrowNew (same pattern used
in the non-streaming path) instead of silently returning nullptr; compose the
exception message to include context (e.g., "rac_llm_component_generate_stream
failed") and the status code, call ThrowNew with "java/lang/RuntimeException",
then return nullptr after throwing to preserve JNI semantics.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp (1)
1015-1019: ⚠️ Potential issue | 🟠 Major

`racLlmComponentGenerateStreamWithCallback` returns `nullptr` silently on failure; no Java exception is thrown.

Both `racLlmComponentGenerate` (lines 568–584) and `racLlmComponentGenerateStream` (lines 876–890) now throw a `RuntimeException` on non-SUCCESS status. This variant only deletes the global ref and returns `nullptr`, leaving Kotlin callers with no recoverable error signal.

🛡️ Proposed fix (mirrors the streaming path at lines 876-890)

     if (status != RAC_SUCCESS) {
         env->DeleteGlobalRef(globalCallback);
         LOGe("rac_llm_component_generate_stream failed with status=%d", status);
    +    const char* msg = rac_error_message(status);
    +    jclass exClass = env->FindClass("java/lang/RuntimeException");
    +    if (exClass) {
    +        char fallback[64];
    +        if (!msg || !*msg) {
    +            snprintf(fallback, sizeof(fallback),
    +                     "LLM stream with callback failed (status=%d)", status);
    +            msg = fallback;
    +        }
    +        env->ThrowNew(exClass, msg);
    +        env->DeleteLocalRef(exClass);
    +    }
         return nullptr;
     }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp` around lines 1015 - 1019, When racLlmComponentGenerateStreamWithCallback returns a non-SUCCESS status it currently only deletes the globalCallback ref and returns nullptr; update this path to mirror racLlmComponentGenerate and racLlmComponentGenerateStream by throwing a Java RuntimeException (use env->ThrowNew with java/lang/RuntimeException) that includes the formatted status message, ensure you still call env->DeleteGlobalRef(globalCallback) before throwing, and keep the LOGe call for native logging so Kotlin callers receive a recoverable exception instead of a silent nullptr.
🧹 Nitpick comments (2)
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.h (1)
137: `decode_failed_` should be `std::atomic<bool>` for consistency and thread safety.

`cancel_requested_` (line 136) is `std::atomic<bool>`, but `decode_failed_` is a plain `bool`. Both are written in `generate_stream()` under `mutex_` and read in `generate()` without any lock. While single-threaded sequential calls are safe (sequenced-before), concurrent calls to `generate()` create a formal C++ data race: the mutex release in `generate_stream()` does not establish a happens-before relationship with the unlocked read in `generate()`.

♻️ Proposed fix

    - bool decode_failed_ = false;
    + std::atomic<bool> decode_failed_{false};

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.h` at line 137, The field decode_failed_ is a plain bool causing a potential data race because it's written under mutex_ in generate_stream() but read without locking in generate(), whereas cancel_requested_ is std::atomic<bool>; change decode_failed_ to std::atomic<bool> and update any initialization/uses to operate atomically (keep assignments and reads compatible with std::atomic<bool>) so both cancel_requested_ and decode_failed_ are consistent and thread-safe across generate_stream() and generate().

sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp (1)
532-534: Two optional improvements: string-based `finish_reason` and misleading `generate_stream()` return value on decode failure.

1. String literals for `finish_reason`: lines 532–538 and the surrounding logic use raw string comparisons ("error", "cancelled", "stop", "length"). Per the coding guidelines, structured types should replace raw strings for consistency and scalability. An `enum class FinishReason` would make exhaustive handling enforceable at compile time.

2. `generate_stream()` returns `true` on decode failure: line 746 returns `!cancel_requested_.load()`, which evaluates to `true` when `decode_failed_` is set (since cancellation was not requested). `generate()` compensates by directly inspecting `decode_failed_`, but any other direct caller of `generate_stream()` would silently receive `true` while a decode has failed.

♻️ Sketch for improvement (2)

    - return !cancel_requested_.load();
    + return !cancel_requested_.load() && !decode_failed_;

As per coding guidelines: "Always use structured types, never use strings directly for consistency and scalability."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp` around lines 532 - 534, The code uses raw string literals for result.finish_reason and returns true from generate_stream() even when decode_failed_ is set; create an enum class FinishReason { Error, Cancelled, Stop, Length, Unknown } and change the type of result.finish_reason (or introduce a new field) to use FinishReason instead of string literals, update all places that set or compare finish_reason (e.g., the branches currently assigning "error", "cancelled", "stop", "length") to use the enum values, and modify generate_stream() to return false if decode_failed_ is true (i.e., return !cancel_requested_.load() && !decode_failed_.load()) so callers correctly observe decode failure; ensure generate() and any other callers are updated to map/serialize the enum back to the previous string representation where external interfaces require it.
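A minimal version of the enum this prompt proposes could look like the following, with a serializer for external interfaces that still expect the string form. The names are the review's suggestion, not existing SDK types.

```cpp
#include <string>

// Structured finish reason replacing raw string comparisons.
enum class FinishReason { Stop, Length, Cancelled, Error, Unknown };

// Serialize back to the legacy string representation where external
// interfaces (JSON results, JNI) still expect it.
std::string to_string(FinishReason r) {
    switch (r) {
        case FinishReason::Stop:      return "stop";
        case FinishReason::Length:    return "length";
        case FinishReason::Cancelled: return "cancelled";
        case FinishReason::Error:     return "error";
        case FinishReason::Unknown:   return "unknown";
    }
    return "unknown";  // unreachable; silences -Wreturn-type
}
```

A comparison such as `result.finish_reason == FinishReason::Error` is then checked by the compiler, and a missing case in the switch triggers a warning instead of a silent string mismatch.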
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 1015-1019: When racLlmComponentGenerateStreamWithCallback returns
a non-SUCCESS status it currently only deletes the globalCallback ref and
returns nullptr; update this path to mirror racLlmComponentGenerate and
racLlmComponentGenerateStream by throwing a Java RuntimeException (use
env->ThrowNew with java/lang/RuntimeException) that includes the formatted
status message, ensure you still call env->DeleteGlobalRef(globalCallback)
before throwing, and keep the LOGe call for native logging so Kotlin callers
receive a recoverable exception instead of a silent nullptr.
---
Duplicate comments:
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`:
- Around line 739-741: The post-generation KV-cache clear should be null-guarded
like the pre-generation clear: call llama_get_memory(context_) into a local
llama_memory_t variable (post_mem) and only call llama_memory_clear(post_mem,
true) if post_mem is non-null; ensure you use the same pattern as the
pre-generation guard (the symbols to change/verify are llama_get_memory,
llama_memory_t post_mem, and llama_memory_clear with context_).
In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 568-584: There are duplicate error-handling blocks after
racLlmComponentGenerate causing double logging, double
rac_llm_result_free(&result) and a shadowed/unused msg; remove the outer
duplicate so there is a single error path that (1) logs the failure once, (2)
frees result exactly once via rac_llm_result_free(&result), (3) obtains const
char* msg = rac_error_message(status) and if msg is null or empty substitute a
clear fallback string (e.g., "Unknown error from rac_error_message"), (4) find
java/lang/RuntimeException and call env->ThrowNew with that non-empty message,
delete the local ref, and (5) return nullptr; ensure this logic is applied in
the non‑streaming path (the block around racLlmComponentGenerate) mirroring the
correct streaming implementation (lines ~876–890).
---
Nitpick comments:
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`:
- Around line 532-534: The code uses raw string literals for
result.finish_reason and returns true from generate_stream() even when
decode_failed_ is set; create an enum class FinishReason { Error, Cancelled,
Stop, Length, Unknown } and change the type of result.finish_reason (or
introduce a new field) to use FinishReason instead of string literals, update
all places that set or compare finish_reason (e.g., the branches currently
assigning "error", "cancelled", "stop", "length") to use the enum values, and
modify generate_stream() to return false if decode_failed_ is true (i.e., return
!cancel_requested_.load() && !decode_failed_.load()) so callers correctly
observe decode failure; ensure generate() and any other callers are updated to
map/serialize the enum back to the previous string representation where external
interfaces require it.
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.h`:
- Line 137: The field decode_failed_ is a plain bool causing a potential data
race because it's written under mutex_ in generate_stream() but read without
locking in generate(), whereas cancel_requested_ is std::atomic<bool>; change
decode_failed_ to std::atomic<bool> and update any initialization/uses to
operate atomically (keep assignments and reads compatible with
std::atomic<bool>) so both cancel_requested_ and decode_failed_ are consistent
and thread-safe across generate_stream() and generate().
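The flag pair this prompt describes can be sketched as below. `Generator` is an illustrative class for the pattern, not the SDK's actual backend type.

```cpp
#include <atomic>

// Both flags atomic: safe to set from the generation thread (even while
// holding mutex_) and to read from generate() without any lock.
class Generator {
public:
    void request_cancel()      { cancel_requested_.store(true); }
    void mark_decode_failure() { decode_failed_.store(true); }

    // Stream succeeded only if it was neither cancelled nor decode-failed,
    // matching the suggested return expression for generate_stream().
    bool stream_succeeded() const {
        return !cancel_requested_.load() && !decode_failed_.load();
    }

private:
    std::atomic<bool> cancel_requested_{false};
    std::atomic<bool> decode_failed_{false};
};
```

With both members atomic, the unlocked read no longer constitutes a data race under the C++ memory model, regardless of which thread set the flag.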
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp (2)
902-905: ⚠️ Potential issue | 🟡 Minor

Streaming completion-error path silently returns `null`; no Java exception is thrown.

Lines 877–887 now properly throw a `RuntimeException` when `rac_llm_component_generate_stream` fails at call time, but the async error path (when `llm_stream_error_callback` fires, or the 10-minute wait times out) at lines 902–905 still returns `nullptr` with no Java exception. The caller cannot distinguish this from a legitimate empty result.

The same gap exists in `racLlmComponentGenerateStreamWithCallback`: the initial-call failure at lines 1014–1018 is silent, and the `has_error` path at lines 1034–1037 is silent too.

Since both paths execute on the JNI calling thread (the function blocks on `cv.wait_for`), `env->ThrowNew` is valid there.

🛡️ Proposed fix for `racLlmComponentGenerateStream` lines 902–905

     if (ctx.has_error) {
         LOGe("Streaming failed: %s", ctx.error_message.c_str());
    +    jclass exClass = env->FindClass("java/lang/RuntimeException");
    +    if (exClass) {
    +        env->ThrowNew(exClass, ctx.error_message.empty()
    +            ? "LLM stream generation failed" : ctx.error_message.c_str());
    +        env->DeleteLocalRef(exClass);
    +    }
         return nullptr;
     }

Apply the same pattern to `racLlmComponentGenerateStreamWithCallback` at lines 1014–1018 and 1034–1037.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp` around lines 902 - 905, The async/error completion paths currently return nullptr silently; update racLlmComponentGenerateStream to call env->ThrowNew(env->FindClass("java/lang/RuntimeException"), ctx.error_message.c_str()) when ctx.has_error is true (instead of returning nullptr) so the Java caller sees an exception; apply the same pattern to racLlmComponentGenerateStreamWithCallback for both the initial-call failure path (use the result/error message returned from rac_llm_component_generate_stream) and the post-wait ctx.has_error path, ensuring each thrown RuntimeException includes the relevant error_message string.
17-24: ⚠️ Potential issue | 🟡 Minor

Add an explicit `#include <chrono>`; the code uses `std::chrono` without declaring the header.

The file uses `std::chrono::minutes(10)` at lines 894 and 1023 but lacks an explicit `#include <chrono>`. Although the code may compile on Android NDK/Clang/libc++ due to transitive inclusion through `<condition_variable>` or `<mutex>`, the C++17 standard does not guarantee this. Modern libc++ actively reduces transitive includes to enforce "include what you use." Explicitly including `<chrono>` ensures portability across toolchains and NDK versions.

Diff

     #include <condition_variable>
    +#include <chrono>
     #include <cstring>

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp` around lines 17 - 24, The file runanywhere_commons_jni.cpp uses std::chrono (e.g. std::chrono::minutes(10) at usages around the code) but does not explicitly include the <chrono> header; add an explicit `#include` <chrono> near the top of runanywhere_commons_jni.cpp alongside the other includes so symbols like std::chrono::minutes resolve portably across toolchains and NDK versions.
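A compilable illustration of why `<chrono>` belongs in the include list: the timed wait below names a `std::chrono` duration directly, which fails to build on a strict libc++ if `<chrono>` is only pulled in transitively. The helper and the short timeout are illustrative; the JNI code itself waits with `std::chrono::minutes(10)`.

```cpp
#include <chrono>              // explicit: std::chrono durations used below
#include <condition_variable>
#include <mutex>

// Block until `done` becomes true or the timeout elapses; returns whether
// the predicate held. Mirrors the JNI pattern of cv.wait_for with a
// std::chrono duration as the timeout.
bool wait_done(std::condition_variable& cv, std::mutex& m, bool& done,
               std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(m);
    return cv.wait_for(lock, timeout, [&] { return done; });
}
```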
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 902-905: The async/error completion paths currently return nullptr
silently; update racLlmComponentGenerateStream to call
env->ThrowNew(env->FindClass("java/lang/RuntimeException"),
ctx.error_message.c_str()) when ctx.has_error is true (instead of returning
nullptr) so the Java caller sees an exception; apply the same pattern to
racLlmComponentGenerateStreamWithCallback for both the initial-call failure path
(use the result/error message returned from rac_llm_component_generate_stream)
and the post-wait ctx.has_error path, ensuring each thrown RuntimeException
includes the relevant error_message string.
- Around line 17-24: The file runanywhere_commons_jni.cpp uses std::chrono (e.g.
std::chrono::minutes(10) at usages around the code) but does not explicitly
include the <chrono> header; add an explicit `#include` <chrono> near the top of
runanywhere_commons_jni.cpp alongside the other includes so symbols like
std::chrono::minutes resolve portably across toolchains and NDK versions.
---
Duplicate comments:
In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 568-583: No change required: the error path in
racLlmComponentGenerate already frees result with rac_llm_result_free(&result),
constructs a safe fallback message using snprintf, retrieves the error string
via rac_error_message(status), and throws a Java RuntimeException via
env->FindClass + env->ThrowNew while deleting the local ref; keep the existing
logic as-is (no code modifications needed in racLlmComponentGenerate,
rac_llm_result_free, rac_error_message or the exception-throwing block).
…RunanywhereAI#393)

* fix: clear KV cache and reset batch state between sequential decode calls on arm64
* fix: address bot review comments - null guard, decode failure flag, error details, and JNI exception fallback
* fix: make decode_failed_ std::atomic for thread safety (review)

---------

Co-authored-by: sakirr <sakirahmed75531@gmail.com>
Android SIGABRT on second inference (arm64)
Fixes #356
Problem
On Android arm64, running two LLM generations in a row could crash the process with SIGABRT. The crash happened inside `llama_context::decode` → `ggml_abort` because of a position mismatch: the KV cache was not cleared between calls. The first `generate()` left the context's KV cache and internal position state in use; the second `generate()` then reused the same context, so the decoder saw inconsistent state and aborted. This was not catchable from Java and appeared as a hard process kill.

Solution
Two parts:

1. KV-cache hygiene: clear the KV cache before generation so `generate_stream()` starts from a clean state (the same approach as in the VLM code path). A clear at the end of generation is also done for hygiene.
2. Error propagation: a decode failure during generation is now reported as an error instead of a silent empty result.

Code changes
1. LlamaCPP backend: clear KV cache in `generate_stream()`

File: `sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`

At the start of `generate_stream()`, after the readiness check and under the mutex, the KV cache is cleared. A second `llama_memory_clear(...)` is called at the end of the generation loop (after the decode loop) so the context is left clean.

2. LlamaCPP LLM layer: treat generation failure as error
File: `sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp`

After `text_gen->generate(request)`, if the result indicates a failure, an error is returned instead of success.

3. JNI: throw on non-success
File: `sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`

When `rac_llm_component_generate` returns a non-success status, a Java exception is thrown before returning.

Testing
Manual (in-app)
Install the app (`./gradlew installDebug` from `examples/android/RunAnywhereAI`) and run two consecutive generations.

Before fix: the second generation could trigger SIGABRT and kill the process. After fix: both generations complete; any real failure is reported as an exception (e.g. a `RuntimeException` with the message from `rac_error_message`).

Instrumented regression test
The test `Issue356TwoGenerationRegressionTest` runs two consecutive non-streaming generations and asserts both return non-empty results. It is skipped if the SDK is not initialized or no LLM model is downloaded.

Snippet: `examples/android/RunAnywhereAI/app/src/androidTest/.../Issue356TwoGenerationRegressionTest.kt`

Run the test (device/emulator connected, at least one LLM model downloaded in the app):

    cd examples/android/RunAnywhereAI
    ./gradlew connectedAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=com.runanywhere.runanywhereai.Issue356TwoGenerationRegressionTest

Files changed
- `sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`: KV-cache clears in `generate_stream()`.
- `sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp`: return `RAC_ERROR_GENERATION_FAILED` when `result.finish_reason == "error"`.
- `sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`: on failure of `rac_llm_component_generate`, throw `RuntimeException` with `rac_error_message(status)` before returning null.

Important
Fixes Android arm64 crash by clearing KV cache and resetting batch state between sequential LLM generations, and improves error handling by propagating failures as exceptions.
- Clear KV cache in `generate_stream()` in `llamacpp_backend.cpp` to prevent position conflicts in sequential calls.
- Propagate decode failures as errors from `rac_llm_llamacpp_generate()` in `rac_llm_llamacpp.cpp`.
- Throw `RuntimeException` in `racLlmComponentGenerate` in `runanywhere_commons_jni.cpp` on non-success status.
- Add `Issue356TwoGenerationRegressionTest` to ensure two consecutive generations return non-empty results.

This description was created by
for 3fb52db.
Greptile Summary
Fixes critical SIGABRT crash on Android arm64 when running two consecutive LLM generations by clearing KV cache state between calls
Key changes:
- Call `llama_memory_clear()` at the start and end of `generate_stream()` to reset KV cache and position state (matches the existing VLM pattern at rac_vlm_llamacpp.cpp:520-524)
- Return `RAC_ERROR_GENERATION_FAILED` when `finish_reason == "error"` instead of silently succeeding with an empty result
- Throw `RuntimeException` in the JNI layer when generation fails, converting native errors to catchable Java exceptions instead of uncatchable process kills

Testing:
`Issue356TwoGenerationRegressionTest` validates the fix

Note on review comment:
`result` is zero-initialized and the backend only allocates `result.text` on success. Adding `rac_llm_result_free(&result)` would be safe but unnecessary defensive programming.

Confidence Score: 4/5
Important Files Changed
- `llamacpp_backend.cpp`: clears KV cache in `generate_stream()` to prevent position conflicts on sequential calls; matches the existing VLM pattern
- `rac_llm_llamacpp.cpp`: checks `finish_reason == "error"` and returns `RAC_ERROR_GENERATION_FAILED` instead of success; properly propagates decode failures
- `runanywhere_commons_jni.cpp`: throws `RuntimeException` with error message when generation fails; converts native errors to Java exceptions. Note: potential memory leak if `result` was partially filled

Flowchart
    %%{init: {'theme': 'neutral'}}%%
    flowchart TD
        A[Java: generate call] --> B[JNI: racLlmComponentGenerate]
        B --> C[Initialize result struct to zero]
        C --> D[Call rac_llm_component_generate]
        D --> E[Call rac_llm_generate]
        E --> F[Call backend: rac_llm_llamacpp_generate]
        F --> G[Call text_gen->generate]
        G --> H[generate_stream with mutex]
        H --> I{Check is_ready?}
        I -->|No| J[Return false]
        I -->|Yes| K[**NEW: Clear KV cache**]
        K --> L[llama_memory_clear mem, true]
        L --> M[Create batch & tokenize]
        M --> N[llama_decode prompt]
        N -->|Decode fails| O[Free batch, return false]
        N -->|Success| P[Generation loop]
        P --> Q[Sample & decode tokens]
        Q --> R{Loop complete?}
        R -->|No| Q
        R -->|Yes| S[**NEW: Clear KV cache again**]
        S --> T[Free batch]
        T --> U{Success?}
        U -->|No| V[Set finish_reason = error]
        U -->|Yes| W[Set finish_reason = stop/length]
        V --> X[**NEW: Check finish_reason**]
        W --> Y[Fill out_result]
        X -->|error| Z[**NEW: Return RAC_ERROR_GENERATION_FAILED**]
        Z --> AA[**NEW: JNI throws RuntimeException**]
        AA --> AB[Return null to Java]
        Y --> AC[Return RAC_SUCCESS]
        AC --> AD[JNI returns JSON to Java]

Last reviewed commit: 3fb52db