Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load Access Fault in nxsem_trywait Due to Invalid Semaphore Pointer on RISC-V NuttX #15178

Open
1 task done
GuoyuYin opened this issue Dec 13, 2024 · 6 comments
Open
1 task done
Labels
Arch: risc-v Issues related to the RISC-V (32-bit or 64-bit) architecture Area: Kernel Kernel issues OS: Linux Issues related to Linux (building system, etc) Type: Bug Something isn't working

Comments

@GuoyuYin
Copy link

GuoyuYin commented Dec 13, 2024

Expected Behavior

The program should execute without any memory access violations or crashes. Specifically, the nxsem_trywait function should correctly attempt to wait on a semaphore without causing system instability.

Actual Behavior

Instead of executing normally, the program crashes due to a load access fault when trying to read a half-word from what seems to be an invalid or inaccessible memory location during the execution of the nxsem_trywait function. The crash happens at the address 0x800078da.

Description

Instead of executing normally, the program crashes due to a load access fault when trying to read a half-word from what seems to be an invalid or inaccessible memory location during the execution of the nxsem_trywait function. The crash happens at the address 0x800078da.

  • MCAUSE: 0x5 (Indicates a load access fault)
  • EPC: 0x800078da (Program counter at the time of exception, within nxsem_trywait)
    0x800078da (Program counter on exceptions in nxsem_trywait
  • MTVAL: 0x2 (Value related to the exception, possibly the address offset)
    MTVAL: 0x2 (The value associated with the exception, possibly the address offset)

Querying the nut-img file reveals that the exception happened during the execution of the nxsem_trywait function, specifically at offset +0x54 from its start. The error log indicates that the system was attempting to perform a load half-word unsigned (lhu) operation from an address held in register s0, which appears to be causing the issue.

The assembly instruction at this location is lhu a1,0(s0), which tries to load an unsigned half-word into register a1 from the address pointed to by s0. Given that MTVAL contains 0x2, it suggests that there might be an alignment issue or that the address being accessed does not exist or there is no valid mapping for it in the page tables, leading to the access fault.

Debug Logs

The debug logs show that the system was executing various system calls before encountering the exception. These calls are executed sequentially by number (call_num) until finally a very large number #call_num = 18446744073709551615, which is the maximum value of an unsigned 64-bit integer that could mean an overflow or invalid argument.Upon reaching the nxsem_trywait function, it attempted to load a half-word from an address stored in s0. However, this address appears to be invalid or out of bounds, resulting in the load access fault.

syz_sem_timedwait(0x2, 0x0)

#call_num = 36
# in post call

riscv_exception: BUG: EXCEPTION: Load access fault. MCAUSE: 0000000000000005, EPC: 00000000800078da, MTVAL: 0000000000000002

VM DIAGNOSIS:
08:00:39  Registers:
Failed readin g regs: dial tcp 127.0.0.1:16715: connect: connection refused
Failed readin g regs: dial tcp 127.0.0.1:16715: connect: connection refused

Steps to Reproduce

To reproduce this issue, one can use Syzkaller to execute system calls against the NuttX kernel. The specific sequence leading up to the crash includes calls such as syz_sem_timedwait, syz_putenv, syz_setenv, and syz_sem_timedwait, culminating in the problematic call to nxsem_trywait.

The corresponding syscall specific implementation code is as follows:

//lab: H spec
#include "nuttx/sched.h"
// #include "sched/sched.h"
#include <sys/types.h>
#include <sched.h>
#include <time.h>
#include <signal.h>
#include <errno.h>
#include <sys/time.h>
#include "nuttx/irq.h"
#include "mqueue.h"
#include <fcntl.h>
#include <sys/stat.h>
#include <semaphore.h>
// #include <netinet/in.h>
// #include <arpa/inet.h>

#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static long syz_sem_timedwait(volatile long sem_ptr, volatile long abstime_ptr)
{
    sem_t *sem = (sem_t *)sem_ptr;
    const struct timespec *abstime = (const struct timespec *)abstime_ptr;
    return (long)sem_timedwait(sem, abstime);
}

Suggested Fix

The nxsem_trywait function is in sched/semaphore/sem_trywait.c with the following code:

int nxsem_trywait(FAR sem_t *sem)
{
  /* This API should not be called from the idleloop */

  DEBUGASSERT(sem != NULL);
  DEBUGASSERT(!OSINIT_IDLELOOP() || !sched_idletask() ||
              up_interrupt_context());

  /* If this is a mutex, we can try to get the mutex in fast mode,
   * else try to get it in slow mode.
   */

#if !defined(CONFIG_PRIORITY_INHERITANCE) && !defined(CONFIG_PRIORITY_PROTECT)
  if (sem->flags & SEM_TYPE_MUTEX)
    {
      int32_t old = 1;
      if (atomic_try_cmpxchg_acquire(NXSEM_COUNT(sem), &old, 0))
        {
          return OK;
        }

      return -EAGAIN;
    }
#endif

  return nxsem_trywait_slow(sem);
}

After analysis, the following recommendations were given

Pointer Validation: Ensure that the pointer passed to nxsem_trywait is properly initialized and aligned according to RISC-V's requirements. Check if the semaphore structure (sem) is corrupted or if there are issues with memory allocation or deallocation patterns that could lead to accessing invalid addresses.

Semaphore Initialization: Verify that all semaphores are correctly initialized before being used. Uninitialized or improperly initialized semaphores can cause undefined behavior, including invalid memory accesses.

Memory Alignment: Review the memory alignment of data structures used in conjunction with nxsem_trywait. Misaligned data can result in access faults on architectures like RISC-V, which have strict alignment requirements.

Runtime Checks: We can see that the problem is a load access fault when trying to access sem->flags or NXSEM_COUNT(sem). This indicates that the incoming sem pointer may be invalid (e.g., NULL, uninitialized, or pointing to an illegal memory address). To fix this, we need to make sure that the sem pointer is properly initialized and points to a valid memory location before actually using it.

int nxsem_trywait(FAR sem_t *sem)
{
  /* This API should not be called from the idleloop */

  if (sem == NULL) {
    _err("ERROR: NULL semaphore pointer provided to nxsem_trywait\n");
    return -EINVAL;  // EINVAL indicates an invalid argument
  }

  DEBUGASSERT(sem != NULL);
  DEBUGASSERT(!OSINIT_IDLELOOP() || !sched_idletask() ||
              up_interrupt_context());

  /* Ensure that the semaphore structure is valid before accessing it */
  if (!is_valid_semaphore(sem)) {
    _err("ERROR: Invalid semaphore structure\n");
    return -EFAULT;  // EFAULT indicates a bad address
  }

  /* If this is a mutex, we can try to get the mutex in fast mode,
   * else try to get it in slow mode.
   */

#if !defined(CONFIG_PRIORITY_INHERITANCE) && !defined(CONFIG_PRIORITY_PROTECT)
  if (sem->flags & SEM_TYPE_MUTEX)
    {
      int32_t old = 1;
      if (atomic_try_cmpxchg_acquire(NXSEM_COUNT(sem), &old, 0))
        {
          return OK;
        }

      return -EAGAIN;
    }
#endif

  return nxsem_trywait_slow(sem);
}

We introduce an auxiliary function is_valid_semaphore(), which is assumed to exist, to check if a semaphore structure is legal. This function can be defined on an implementation-specific basis, e.g., to check if the semaphore has been properly initialized, etc.

On which OS does this issue occur?

[OS: Linux]

What is the version of your OS?

Ubuntu 20.04

NuttX Version

2ff2b82

Issue Architecture

[Arch: risc-v]

Issue Area

[Area: Kernel]

Verification

  • I have verified before submitting the report.
@GuoyuYin GuoyuYin added the Type: Bug Something isn't working label Dec 13, 2024
@github-actions github-actions bot added Arch: risc-v Issues related to the RISC-V (32-bit or 64-bit) architecture Area: Kernel Kernel issues OS: Linux Issues related to Linux (building system, etc) labels Dec 13, 2024
@acassis
Copy link
Contributor

acassis commented Dec 16, 2024

@GuoyuYin what board are you using? Did you try to duplicate it on rv qemu?

@lupyuen did you see that?

@GuoyuYin
Copy link
Author

GuoyuYin commented Dec 17, 2024

@GuoyuYin what board are you using? Did you try to duplicate it on rv qemu?您用的是什么板子?您是否尝试在 rv qemu 上复制它?

@lupyuen did you see that?你看到了吗?

Yes, it's running on qemu.

@lupyuen
Copy link
Member

lupyuen commented Dec 17, 2024

@GuoyuYin Can you give the source code, build commands and NSH Commands you used? So I can reproduce it on QEMU RISC-V. (32-bit or 64-bit?) Thanks!

@Rrooach
Copy link

Rrooach commented Dec 17, 2024

@GuoyuYin Can you give the source code, build commands and NSH Commands you used? So I can reproduce it on QEMU RISC-V. (32-bit or 64-bit?) Thanks!

Hi @lupyuen the source code is a bit complicated, Basically, we are fuzz testing the NuttX. Our core idea is to predefine some functions (contain certain Nuttx's API) and expose specific arguments; then, our fuzzing tool can generate arguments for this function and test the NuttX for unknown bugs.

For this bug, the predefined function is:

static long syz_sem_timedwait(volatile long sem_ptr, volatile long abstime_ptr)
{
    sem_t *sem = (sem_t *)sem_ptr;
    const struct timespec *abstime = (const struct timespec *)abstime_ptr;
    return (long)sem_timedwait(sem, abstime);
}

and its triggering argument is:

syz_sem_timedwait(0x2, 0x0)

And during execution, we triggered this error:

riscv_exception: BUG: EXCEPTION: Load access fault. MCAUSE: 0000000000000005, EPC: 00000000800078da, MTVAL: 0000000000000002

According to the EPC value, it is something wrong within the function nxsem_trywait()

To reproduce it, I believe you can simply add the predefined function with its argument, this should trigger the error.

@acassis
Copy link
Contributor

acassis commented Dec 17, 2024

@Rrooach are you using your Rtkaller fuzzer to do that? Could you please send a guide how to use it with NuttX?

@Rrooach
Copy link

Rrooach commented Dec 17, 2024

@Rrooach are you using your Rtkaller fuzzer to do that? Could you please send a guide how to use it with NuttX?

No, Rtkaller is not capable of fuzzing OSs like Nuttx.
We wrote a new fuzzer to test embedded OS like Nuttx and found many potential bugs (the paper has not been published yet); we would love to share the bug findings and guide you to use it with Nuttx upon paper's acceptance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arch: risc-v Issues related to the RISC-V (32-bit or 64-bit) architecture Area: Kernel Kernel issues OS: Linux Issues related to Linux (building system, etc) Type: Bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants