fix: clear KV cache and reset batch state between sequential decode c #393

Merged
Siddhesh2377 merged 4 commits into RunanywhereAI:main from sakirr05:fix/issue-356-sigabrt-decode-arm64
Feb 20, 2026

Conversation


sakirr05 (Contributor) commented Feb 19, 2026

Android SIGABRT on second inference (arm64)

Fixes #356

Problem

On Android arm64, running two LLM generations in a row could crash the process with SIGABRT. The crash happened inside llama_context::decode, which hit ggml_abort because of a position mismatch: the KV cache was not cleared between calls. The first generate() left the context's KV cache and internal position state populated; the second generate() then reused the same context, so the decoder saw inconsistent state and aborted. The abort was not catchable from Java and appeared as a hard process kill.

Solution

Two parts:

  1. Clear KV cache at the start of each generation so every call to generate_stream() starts from a clean state (same approach as in the VLM code path). A clear at the end of generation is also done for hygiene.
  2. Propagate generation failures as errors so decode/generation failures are reported via the API and surfaced as a Java exception instead of an uncatchable native crash.

Code changes

1. LlamaCPP backend — clear KV cache in generate_stream()

File: sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp

At the start of generate_stream(), after the readiness check and under the mutex:

// Clear KV cache before each new generation to avoid position conflicts on
// sequential calls (fixes #356: SIGABRT on second decode on Android arm64).
llama_memory_t mem = llama_get_memory(context_);
if (mem) {
    llama_memory_clear(mem, true);
}

A second llama_memory_clear(...) is called at the end of the generation loop (after the decode loop) so the context is left clean.

2. LlamaCPP LLM layer — treat generation failure as error

File: sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp

After text_gen->generate(request), if the result indicates a failure, return an error instead of success:

result = h->text_gen->generate(request);
// ...
if (result.finish_reason == "error") {
    RAC_LOG_ERROR("LLM.LlamaCpp", "rac_llm_llamacpp_generate: generation failed (e.g. llama_decode error)");
    return RAC_ERROR_GENERATION_FAILED;
}

3. JNI — throw on non-success

File: sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp

When rac_llm_component_generate returns a non-success status, throw a Java exception before returning:

rac_result_t status = rac_llm_component_generate(...);

if (status != RAC_SUCCESS) {
    const char* msg = rac_error_message(status);
    if (msg && *msg) {
        jclass exClass = env->FindClass("java/lang/RuntimeException");
        if (exClass) {
            env->ThrowNew(exClass, msg);
            env->DeleteLocalRef(exClass);
        }
    }
    return nullptr;
}

Testing

Manual (in-app)

  1. Build and install the Android app (e.g. ./gradlew installDebug from examples/android/RunAnywhereAI).
  2. On an arm64 device or emulator: open the app → Chat → load an LLM model (download one if needed).
  3. Send a first message (e.g. “Hello”) and wait for the reply.
  4. Send a second message (e.g. “Hi again”) without closing the app or unloading the model.

Before fix: the second generation could trigger SIGABRT and kill the process.
After fix: both generations complete; any real failure is reported as an exception (e.g. RuntimeException with message from rac_error_message).

Instrumented regression test

The test Issue356TwoGenerationRegressionTest runs two consecutive non-streaming generations and asserts both return non-empty results. It is skipped if the SDK is not initialized or no LLM model is downloaded.

Snippet: examples/android/RunAnywhereAI/app/src/androidTest/.../Issue356TwoGenerationRegressionTest.kt

@Test
fun twoConsecutiveGenerates_doNotCrash_andReturnResults() = runBlocking {
    // ... assume SDK initialized and (if needed) load first downloaded LLM ...

    // First generation (pre-fix: this succeeded)
    val result1 = RunAnywhere.generate("Say exactly: one.", null)
    assertNotNull("First generate should return non-null result", result1)
    assertFalse("First generate should return non-empty text", result1.text.isBlank())

    // Second generation (pre-fix: this caused SIGABRT on arm64)
    val result2 = RunAnywhere.generate("Say exactly: two.", null)
    assertNotNull("Second generate should return non-null result", result2)
    assertFalse("Second generate should return non-empty text", result2.text.isBlank())
}

Run the test (device/emulator connected, at least one LLM model downloaded in the app):

cd examples/android/RunAnywhereAI
./gradlew connectedAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=com.runanywhere.runanywhereai.Issue356TwoGenerationRegressionTest
  • Pass: both generations completed without crash (fix verified).
  • Skip: ensure RunAnywhere is initialized and at least one LLM model is downloaded.
  • Fail: inspect the exception/stack trace; with the fix, generation failures should surface as thrown exceptions rather than SIGABRT.

Files changed

  • sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp: Clear KV cache at start (and end) of generate_stream().
  • sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp: Return RAC_ERROR_GENERATION_FAILED when result.finish_reason == "error".
  • sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp: On non-success from rac_llm_component_generate, throw RuntimeException with rac_error_message(status) before returning null.

Important

Fixes Android arm64 crash by clearing KV cache and resetting batch state between sequential LLM generations, and improves error handling by propagating failures as exceptions.

  • Behavior:
    • Clear KV cache at the start and end of generate_stream() in llamacpp_backend.cpp to prevent position conflicts in sequential calls.
    • Propagate generation failures as errors in rac_llm_llamacpp_generate() in rac_llm_llamacpp.cpp.
    • Throw RuntimeException in racLlmComponentGenerate in runanywhere_commons_jni.cpp on non-success status.
  • Testing:
    • Manual testing on Android arm64 to verify crash resolution and error propagation.
    • Added Issue356TwoGenerationRegressionTest to ensure two consecutive generations return non-empty results.

This description was created by Ellipsis for 3fb52db.

Summary by CodeRabbit

  • Bug Fixes
    • Prevented crashes on sequential generation by resetting internal memory and state before each run.
    • Improved detection and reporting of decode failures so finish reasons reflect errors vs cancellations/length.
    • Safer cleanup to avoid invalid memory clears after generation.
    • Propagated generation errors to the host with clearer exception messages and early exit on failure.

Greptile Summary

Fixes critical SIGABRT crash on Android arm64 when running two consecutive LLM generations by clearing KV cache state between calls

Key changes:

  • Adds llama_memory_clear() at the start and end of generate_stream() to reset KV cache and position state (matches existing VLM pattern at rac_vlm_llamacpp.cpp:520-524)
  • Returns RAC_ERROR_GENERATION_FAILED when finish_reason == "error" instead of silently succeeding with empty result
  • Throws RuntimeException in JNI layer when generation fails, converting native errors to catchable Java exceptions instead of uncatchable process kills

Testing:

  • Manual testing on arm64 device with two consecutive generations
  • New regression test Issue356TwoGenerationRegressionTest validates fix

Note on review comment:

  • The "logic" comment about a potential memory leak in JNI error handling is overly cautious: the current code is correct, since result is zero-initialized and the backend only allocates result.text on success. Adding rac_llm_result_free(&result) would be safe but unnecessary defensive programming.

Confidence Score: 4/5

  • Safe to merge with minor review comment that should be reconsidered
  • The fix correctly addresses the root cause (KV cache not cleared between calls) and follows existing patterns from VLM code. Error propagation is properly implemented. Score reduced by 1 because: (1) one review comment about memory leak is overly cautious and may not be necessary, and (2) would benefit from verification that all error paths properly clean up resources
  • sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp - review the memory leak comment (it may be overly defensive)

Important Files Changed

  • sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp: Adds KV cache clearing at start and end of generate_stream() to prevent position conflicts on sequential calls; matches the existing VLM pattern.
  • sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp: Checks finish_reason == "error" and returns RAC_ERROR_GENERATION_FAILED instead of success; properly propagates decode failures.
  • sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp: Throws RuntimeException with an error message when generation fails; converts native errors to Java exceptions. Note: potential memory leak if result was partially filled.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Java: generate call] --> B[JNI: racLlmComponentGenerate]
    B --> C[Initialize result struct to zero]
    C --> D[Call rac_llm_component_generate]
    D --> E[Call rac_llm_generate]
    E --> F[Call backend: rac_llm_llamacpp_generate]
    F --> G[Call text_gen->generate]
    G --> H[generate_stream with mutex]
    H --> I{Check is_ready?}
    I -->|No| J[Return false]
    I -->|Yes| K[**NEW: Clear KV cache**]
    K --> L[llama_memory_clear mem, true]
    L --> M[Create batch & tokenize]
    M --> N[llama_decode prompt]
    N -->|Decode fails| O[Free batch, return false]
    N -->|Success| P[Generation loop]
    P --> Q[Sample & decode tokens]
    Q --> R{Loop complete?}
    R -->|No| Q
    R -->|Yes| S[**NEW: Clear KV cache again**]
    S --> T[Free batch]
    T --> U{Success?}
    U -->|No| V[Set finish_reason = error]
    U -->|Yes| W[Set finish_reason = stop/length]
    V --> X[**NEW: Check finish_reason**]
    W --> Y[Fill out_result]
    X -->|error| Z[**NEW: Return RAC_ERROR_GENERATION_FAILED**]
    Z --> AA[**NEW: JNI throws RuntimeException**]
    AA --> AB[Return null to Java]
    Y --> AC[Return RAC_SUCCESS]
    AC --> AD[JNI returns JSON to Java]

Last reviewed commit: 3fb52db


@sakirr05 sakirr05 marked this pull request as ready for review February 19, 2026 23:12

coderabbitai bot commented Feb 19, 2026

📝 Walkthrough

Pre-generation KV-cache clearing and decode-state resets were added to the LlamaCPP backend to avoid stale KV-cache and track decode failures; generate now reports "error" on decode failures. The RAC glue now returns an error on "error" finish_reason. JNI surfaces backend error messages and frees partial results on failure.

Changes

  • LlamaCPP backend implementation (sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp, sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.h): Added decode_failed_ atomic flag; clear KV-cache and reset decode_failed_/cancel state before each generation; set decode_failed_ on decode failures; safer post-generation memory clear; finish_reason now prefers error when decode failed.
  • Backend-to-RAC glue (sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp): After generation, if finish_reason == "error", log the error, set the error detail, and return RAC_ERROR_GENERATION_FAILED early to avoid returning partial results.
  • JNI error handling (sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp): On non-SUCCESS statuses, free the partial rac_llm_result where applicable, retrieve rac_error_message(status), and throw a Java RuntimeException with that message (or a fallback); removed unused #include <chrono>.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Java as Java client
participant JNI as JNI bridge
participant Backend as LlamaCPP backend
participant Llama as Llama/C++ engine
Java->>JNI: racLlmComponentGenerate(...)
JNI->>Backend: rac_llm_llamacpp_generate(...)
Backend->>Backend: clear KV-cache; reset decode_failed_ and cancel state
Backend->>Llama: start generate_stream / decode loop
alt decode failure occurs
Llama-->>Backend: decode error
Backend->>Backend: set decode_failed_ = true; stop generation
end
Backend-->>JNI: return status + result (or error)
alt status != SUCCESS
JNI->>JNI: free partial result (if any)
JNI-->>Java: throw RuntimeException(with backend message)
else
JNI-->>Java: return success result
end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped the cache and swept it wide,

Reset my flags before the stride.
If decode falters, I raise the cry,
Free the pieces, tell Java why.
Fresh runs now — no sudden slide. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Title check ✅ Passed: The title clearly identifies the main fix: clearing KV cache and resetting batch state between sequential decode calls to address the SIGABRT crash.
  • Description check ✅ Passed: The description follows the template with a clear problem statement, solution, code changes, testing instructions, and checklist; it includes comprehensive details about the Android arm64 crash and fix validation.
  • Linked Issues check ✅ Passed: The PR fully addresses all objectives from issue #356: it clears KV cache between calls to eliminate ggml_abort crashes, resets internal state for consistency, and propagates failures as catchable Java exceptions instead of uncatchable SIGABRT.
  • Out of Scope Changes check ✅ Passed: All changes directly address the linked issue #356: KV cache clearing in llamacpp_backend.cpp, error propagation in rac_llm_llamacpp.cpp and runanywhere_commons_jni.cpp, and removal of an unused include are all within scope.


@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed everything up to 3fb52db in 21 seconds.
  • Reviewed 53 lines of code in 3 files
  • Skipped 0 files when reviewing
  • Skipped posting 0 draft comments


@sakirr05 sakirr05 changed the title fix: clear KV cache and reset batch state between sequential decode c… fix: clear KV cache and reset batch state between sequential decode c Feb 19, 2026

@greptile-apps greptile-apps bot left a comment


3 files reviewed, 1 comment



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp (1)

850-853: racLlmComponentGenerateStream does not propagate errors as Java exceptions.

The non-streaming path now throws a RuntimeException on failure, but the streaming path (racLlmComponentGenerateStream, line 850-853) still silently returns nullptr on non-SUCCESS status. Kotlin callers using streaming receive null with no exception, creating inconsistent error-handling semantics across the two code paths.

Consider applying the same ThrowNew pattern at line 851 for consistency with the non-streaming path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp` around lines 850
- 853, In racLlmComponentGenerateStream, when status != RAC_SUCCESS, throw a
Java RuntimeException via env->ThrowNew (same pattern used in the non-streaming
path) instead of silently returning nullptr; compose the exception message to
include context (e.g., "rac_llm_component_generate_stream failed") and the
status code, call ThrowNew with "java/lang/RuntimeException", then return
nullptr after throwing to preserve JNI semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`:
- Around line 551-557: The post-generation KV cache clear currently calls
llama_memory_clear(llama_get_memory(context_), true) without checking for null;
change that call to mirror the pre-generation logic by retrieving llama_memory_t
mem = llama_get_memory(context_), then if (mem) { llama_memory_clear(mem, true);
} so that llama_get_memory(context_) is null-guarded before calling
llama_memory_clear (refer to the existing pre-generation block that uses
llama_get_memory, mem, and llama_memory_clear).

In `@sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp`:
- Around line 210-213: generate_stream currently returns a boolean that
conflates cancellation with a decode failure (called during the token loop when
llama_decode fails), so update generate_stream to propagate an explicit error
state (e.g., return an enum or set an internal flag like decode_failed_) when
llama_decode fails mid-loop instead of relying on cancel_requested_; adjust the
llama_cpp::generate caller to check this new error indicator and preserve/leave
finish_reason as "error" (do not override to "stop" or "length") when
decode_failed_ is set; ensure the initial prompt-decode path still returns false
directly and that cancel_requested_ remains only for user cancellation, with
clear handling in generate_stream and generate to distinguish cancel vs decode
failure.
- Around line 210-213: The new early-return for generation failure in
rac_llm_llamacpp_generate should set error details before returning and compare
the finish reason via the enum rather than a string: call
rac_error_set_details(...) with a descriptive message (e.g., "generation failed
(llama_decode error)") immediately before returning RAC_ERROR_GENERATION_FAILED,
replace the std::string literal check result.finish_reason == "error" with a
typed comparison against TextGenerationFinishReason::ERROR (or the appropriate
enum member) on the TextGenerationResult, and keep the RAC_LOG_ERROR call for
logging; this ensures callers see fresh diagnostics and removes the raw-string
comparison.

In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 568-579: The code path in racLlmComponentGenerate returns nullptr
without throwing if rac_error_message(status) returns null/empty; change the
error handling so that whenever status != RAC_SUCCESS you always throw a Java
exception via env->ThrowNew (use the same exClass "java/lang/RuntimeException")
with a fallback message when msg is null/empty (e.g. "rac error <status> (no
message)"), include the status integer in the message for diagnostics, and
ensure you still call env->DeleteLocalRef(exClass) before returning nullptr so
no Java exception case is silently suppressed.

---

Nitpick comments:
In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 850-853: In racLlmComponentGenerateStream, when status !=
RAC_SUCCESS, throw a Java RuntimeException via env->ThrowNew (same pattern used
in the non-streaming path) instead of silently returning nullptr; compose the
exception message to include context (e.g., "rac_llm_component_generate_stream
failed") and the status code, call ThrowNew with "java/lang/RuntimeException",
then return nullptr after throwing to preserve JNI semantics.

sakirr05 force-pushed the fix/issue-356-sigabrt-decode-arm64 branch from df9c939 to 2473e1b on February 20, 2026 at 00:19

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp (1)

1015-1019: ⚠️ Potential issue | 🟠 Major

racLlmComponentGenerateStreamWithCallback returns nullptr silently on failure — no Java exception thrown.

Both racLlmComponentGenerate (lines 568–584) and racLlmComponentGenerateStream (lines 876–890) now throw a RuntimeException on non-SUCCESS status. This variant only deletes the global ref and returns nullptr, leaving Kotlin callers with no recoverable error signal.

🛡️ Proposed fix — mirrors the streaming path at lines 876-890
     if (status != RAC_SUCCESS) {
         env->DeleteGlobalRef(globalCallback);
         LOGe("rac_llm_component_generate_stream failed with status=%d", status);
+        const char* msg = rac_error_message(status);
+        jclass exClass = env->FindClass("java/lang/RuntimeException");
+        if (exClass) {
+            char fallback[64];
+            if (!msg || !*msg) {
+                snprintf(fallback, sizeof(fallback),
+                         "LLM stream with callback failed (status=%d)", status);
+                msg = fallback;
+            }
+            env->ThrowNew(exClass, msg);
+            env->DeleteLocalRef(exClass);
+        }
         return nullptr;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp` around lines
1015 - 1019, When racLlmComponentGenerateStreamWithCallback returns a
non-SUCCESS status it currently only deletes the globalCallback ref and returns
nullptr; update this path to mirror racLlmComponentGenerate and
racLlmComponentGenerateStream by throwing a Java RuntimeException (use
env->ThrowNew with java/lang/RuntimeException) that includes the formatted
status message, ensure you still call env->DeleteGlobalRef(globalCallback)
before throwing, and keep the LOGe call for native logging so Kotlin callers
receive a recoverable exception instead of a silent nullptr.
🧹 Nitpick comments (2)
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.h (1)

137-137: decode_failed_ should be std::atomic<bool> for consistency and thread safety.

cancel_requested_ (line 136) is std::atomic<bool>, but decode_failed_ is a plain bool. Both are written in generate_stream() under mutex_ and read in generate() without any lock. While single-threaded sequential calls are safe (sequenced-before), concurrent calls to generate() create a formal C++ data race: the mutex release in generate_stream() does not establish a happens-before relationship with the unlocked read in generate().

♻️ Proposed fix
-    bool decode_failed_ = false;
+    std::atomic<bool> decode_failed_{false};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.h` at line
137, The field decode_failed_ is a plain bool causing a potential data race
because it's written under mutex_ in generate_stream() but read without locking
in generate(), whereas cancel_requested_ is std::atomic<bool>; change
decode_failed_ to std::atomic<bool> and update any initialization/uses to
operate atomically (keep assignments and reads compatible with
std::atomic<bool>) so both cancel_requested_ and decode_failed_ are consistent
and thread-safe across generate_stream() and generate().
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp (1)

532-534: Two optional improvements: string-based finish_reason and misleading generate_stream() return value on decode failure.

  1. String literals for finish_reason — Lines 532–538 and the surrounding logic use raw string comparisons ("error", "cancelled", "stop", "length"). Per the coding guidelines, structured types should replace raw strings for consistency and scalability. An enum class FinishReason would make exhaustive handling enforceable at compile time.

  2. generate_stream() returns true on decode failure — Line 746 returns !cancel_requested_.load(), which evaluates to true when decode_failed_ is set (since cancellation was not requested). generate() compensates by directly inspecting decode_failed_, but any other direct caller of generate_stream() would silently receive true while a decode has failed.

♻️ Sketch for improvement (2)
-    return !cancel_requested_.load();
+    return !cancel_requested_.load() && !decode_failed_;

As per coding guidelines: "Always use structured types, never use strings directly for consistency and scalability."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp` around
lines 532 - 534, The code uses raw string literals for result.finish_reason and
returns true from generate_stream() even when decode_failed_ is set; create an
enum class FinishReason { Error, Cancelled, Stop, Length, Unknown } and change
the type of result.finish_reason (or introduce a new field) to use FinishReason
instead of string literals, update all places that set or compare finish_reason
(e.g., the branches currently assigning "error", "cancelled", "stop", "length")
to use the enum values, and modify generate_stream() to return false if
decode_failed_ is true (i.e., return !cancel_requested_.load() &&
!decode_failed_.load()) so callers correctly observe decode failure; ensure
generate() and any other callers are updated to map/serialize the enum back to
the previous string representation where external interfaces require it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 1015-1019: When racLlmComponentGenerateStreamWithCallback returns
a non-SUCCESS status it currently only deletes the globalCallback ref and
returns nullptr; update this path to mirror racLlmComponentGenerate and
racLlmComponentGenerateStream by throwing a Java RuntimeException (use
env->ThrowNew with java/lang/RuntimeException) that includes the formatted
status message, ensure you still call env->DeleteGlobalRef(globalCallback)
before throwing, and keep the LOGe call for native logging so Kotlin callers
receive a recoverable exception instead of a silent nullptr.

---

Duplicate comments:
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`:
- Around line 739-741: The post-generation KV-cache clear should be null-guarded
like the pre-generation clear: call llama_get_memory(context_) into a local
llama_memory_t variable (post_mem) and only call llama_memory_clear(post_mem,
true) if post_mem is non-null; ensure you use the same pattern as the
pre-generation guard (the symbols to change/verify are llama_get_memory,
llama_memory_t post_mem, and llama_memory_clear with context_).

In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 568-584: There are duplicate error-handling blocks after
racLlmComponentGenerate causing double logging, double
rac_llm_result_free(&result) and a shadowed/unused msg; remove the outer
duplicate so there is a single error path that (1) logs the failure once, (2)
frees result exactly once via rac_llm_result_free(&result), (3) obtains const
char* msg = rac_error_message(status) and if msg is null or empty substitute a
clear fallback string (e.g., "Unknown error from rac_error_message"), (4) find
java/lang/RuntimeException and call env->ThrowNew with that non-empty message,
delete the local ref, and (5) return nullptr; ensure this logic is applied in
the non‑streaming path (the block around racLlmComponentGenerate) mirroring the
correct streaming implementation (lines ~876–890).

---

Nitpick comments:
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`:
- Around line 532-534: The code uses raw string literals for
result.finish_reason and returns true from generate_stream() even when
decode_failed_ is set; create an enum class FinishReason { Error, Cancelled,
Stop, Length, Unknown } and change the type of result.finish_reason (or
introduce a new field) to use FinishReason instead of string literals, update
all places that set or compare finish_reason (e.g., the branches currently
assigning "error", "cancelled", "stop", "length") to use the enum values, and
modify generate_stream() to return false if decode_failed_ is true (i.e., return
!cancel_requested_.load() && !decode_failed_.load()) so callers correctly
observe decode failure; ensure generate() and any other callers are updated to
map/serialize the enum back to the previous string representation where external
interfaces require it.

In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.h`:
- Line 137: The field decode_failed_ is a plain bool causing a potential data
race because it's written under mutex_ in generate_stream() but read without
locking in generate(), whereas cancel_requested_ is std::atomic<bool>; change
decode_failed_ to std::atomic<bool> and update any initialization/uses to
operate atomically (keep assignments and reads compatible with
std::atomic<bool>) so both cancel_requested_ and decode_failed_ are consistent
and thread-safe across generate_stream() and generate().

sakirr05 force-pushed the fix/issue-356-sigabrt-decode-arm64 branch from b5ecf04 to 8a163ca on February 20, 2026 at 04:26

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp (2)

902-905: ⚠️ Potential issue | 🟡 Minor

Streaming completion-error path silently returns null — Java exception not thrown.

Lines 877–887 now properly throw a RuntimeException when rac_llm_component_generate_stream fails at call time, but the async error path (when llm_stream_error_callback fires, or the 10-minute wait times out) at lines 902–905 still returns nullptr with no Java exception. The caller cannot distinguish this from a legitimate empty result.

The same gap exists in racLlmComponentGenerateStreamWithCallback: the initial-call failure at lines 1014–1018 is silent, and the has_error path at lines 1034–1037 is silent too.

Since both paths execute on the JNI calling thread (the function blocks on cv.wait_for), env->ThrowNew is valid there.

🛡️ Proposed fix for racLlmComponentGenerateStream lines 902–905
 if (ctx.has_error) {
     LOGe("Streaming failed: %s", ctx.error_message.c_str());
+    jclass exClass = env->FindClass("java/lang/RuntimeException");
+    if (exClass) {
+        env->ThrowNew(exClass, ctx.error_message.empty()
+            ? "LLM stream generation failed" : ctx.error_message.c_str());
+        env->DeleteLocalRef(exClass);
+    }
     return nullptr;
 }

Apply the same pattern to racLlmComponentGenerateStreamWithCallback at lines 1014–1018 and 1034–1037.
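The throw-before-returning-null pattern can be exercised outside a JVM with a stubbed environment. FakeEnv and finish_stream below are hypothetical stand-ins for JNIEnv and the real JNI function, which would use the actual FindClass / ThrowNew / DeleteLocalRef calls shown in the diff above:

```cpp
#include <string>

// Stand-in for JNIEnv so the error path can be shown without a JVM.
struct FakeEnv {
    std::string thrown;  // records the message passed to the stub ThrowNew
    void ThrowRuntimeException(const std::string& msg) { thrown = msg; }
};

// Mirrors the proposed pattern: when the stream context reports an error,
// raise a Java exception before returning null, with a generic fallback
// message when the callback supplied none.
const char* finish_stream(FakeEnv* env, bool has_error,
                          const std::string& error_message) {
    if (has_error) {
        env->ThrowRuntimeException(error_message.empty()
            ? "LLM stream generation failed" : error_message);
        return nullptr;  // caller still gets null, but Java now sees why
    }
    return "result";
}
```

The key point is that returning nullptr alone is ambiguous to the Java caller; pairing it with a pending exception makes the failure observable.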

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp` around lines
902-905: the async/error completion paths currently return nullptr silently;
update racLlmComponentGenerateStream to call
env->ThrowNew(env->FindClass("java/lang/RuntimeException"),
ctx.error_message.c_str()) when ctx.has_error is true (instead of returning
nullptr) so the Java caller sees an exception; apply the same pattern to
racLlmComponentGenerateStreamWithCallback for both the initial-call failure path
(use the result/error message returned from rac_llm_component_generate_stream)
and the post-wait ctx.has_error path, ensuring each thrown RuntimeException
includes the relevant error_message string.

17-24: ⚠️ Potential issue | 🟡 Minor

Add explicit #include <chrono> — code uses std::chrono without declaring the header.

The file uses std::chrono::minutes(10) at lines 894 and 1023 but lacks an explicit #include <chrono>. Although the code may compile on Android NDK/Clang/libc++ due to transitive inclusion through <condition_variable> or <mutex>, the C++17 standard does not guarantee this. Modern libc++ actively reduces transitive includes to enforce "include what you use." Explicitly including <chrono> ensures portability across toolchains and NDK versions.

Diff
 #include <condition_variable>
+#include <chrono>
 #include <cstring>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp` around lines
17-24: the file uses std::chrono (e.g. std::chrono::minutes(10)) but does not
explicitly include the <chrono> header; add an explicit #include <chrono> near
the top of runanywhere_commons_jni.cpp alongside the other includes so symbols
like std::chrono::minutes resolve portably across toolchains and NDK versions.
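To see why the explicit include matters, here is a minimal sketch of the timed-wait pattern involved (StreamWaiter is illustrative, not an SDK symbol); it compiles reliably only because <chrono> is pulled in explicitly rather than by transitive inclusion:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch of the wait-with-timeout pattern the JNI code relies on: block the
// calling thread until a callback signals completion, or the timeout expires.
struct StreamWaiter {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Returns true if completion was signalled within the timeout.
    // std::chrono::minutes is declared in <chrono>; modern libc++ does not
    // guarantee it arrives via <condition_variable> or <mutex>.
    bool wait_for_completion(std::chrono::minutes timeout) {
        std::unique_lock<std::mutex> lock(m);
        return cv.wait_for(lock, timeout, [this] { return done; });
    }
};
```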
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.


---

Duplicate comments:
In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 568-583: No change required: the error path in
racLlmComponentGenerate already frees result with rac_llm_result_free(&result),
constructs a safe fallback message using snprintf, retrieves the error string
via rac_error_message(status), and throws a Java RuntimeException via
env->FindClass + env->ThrowNew while deleting the local ref; keep the existing
logic as-is (no code modifications needed in racLlmComponentGenerate,
rac_llm_result_free, rac_error_message or the exception-throwing block).

@Siddhesh2377 Siddhesh2377 merged commit 7973dbc into RunanywhereAI:main Feb 20, 2026
7 of 15 checks passed
ManthanNimodiya pushed a commit to ManthanNimodiya/runanywhere-sdks that referenced this pull request Feb 23, 2026
…RunanywhereAI#393)

* fix: clear KV cache and reset batch state between sequential decode calls on arm64

* fix: address bot review comments - null guard, decode failure flag, error details, and JNI exception fallback

* fix: make decode_failed_ std::atomic for thread safety (review)

---------

Co-authored-by: sakirr <sakirahmed75531@gmail.com>


Development

Successfully merging this pull request may close these issues.

Android: ggml_abort / SIGABRT crash during llama_context::decode on arm64

2 participants