Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error reading coredump from flash in ESP32-S3 (IDFGH-14443) #53

Closed
3 tasks done
gadget-man opened this issue Jan 16, 2025 · 28 comments
Closed
3 tasks done

Error reading coredump from flash in ESP32-S3 (IDFGH-14443) #53

gadget-man opened this issue Jan 16, 2025 · 28 comments

Comments

@gadget-man
Copy link

gadget-man commented Jan 16, 2025

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

I have an ESP32-S3 with secure boot and encryption enabled. I'm saving coredump data to flash. IDF version 5.4.0.

If I manually call abort() within the code, once the device has rebooted I can call a command to read the cordump partition to UART, which works successfully and I can see that the base64 encoded core dump starts with:
f0VMRgEBAQAAAAAAAAAAAAQAXgABAAAAAAAAADQAAAAAAAAAAAAAADQAIAAsACgAAAAAAAQAAAC0BQAAAAAAAAAAAADgMQAA4DEAAAYAAAAAAAAAAQAAAJQ3AADYycw/2MnMP1wBAABcAQAABgAAAAAAAAABAAAA8DgAALBaGzywWhs8QAMAAEADAAAGAAAAAAAAAAEAAAAwPAAA+GjLP/hoyz9cAQAAXAEAAAYAAAAAAAAAAQAAAIw9AAAAZss/AGbLP6ACAACgAgAABgAAAAAAAAABAAAALEAAAPBwyz/wcMs/XAEAAFwBAAAGAAAAAAAAAAEAAACIQQAAAG7LPwBuyz+gAgAAoAIAAAYAAAAAAAAAAQAAAChEAAAwI84/MCPOP1wBAABcAQAABgAAAAAAAAABAAAAhEUAAIBsGjyAbBo80AMAANADAAAGAAAAAAAAAAEAAABUSQAAxDDMP8QwzD9cAQAAXAEAAAYAAAAAAAAAAQAAALBKAACgK8w/oCvMPxADAAAQAwAABgAAAAAAAAABAAAAwE0AAFyYzD9cmMw/XAEAAFwBAAAGAAAAAAAAAAEAAAAcTwAAcJTMP3CUzD/gAgAA

However, if the firmware crashes for some other reason (e.g. Heap Memory poisoning) and after rebooting I then call the same command to read the coredump partition, I get a very different looking response, starting as follows:
630b1pcoDKbNIrI76sCdTgUUVt2r4/oYDPIn6rDrtCKhZomDTMDG62Cu/BobAezrz1WLZQzA9ogCy/75VLC9stc4BHg9GnaW/pwcb8ZoEZvmEYLUFy0shsZ4Lf66YslIIVBahj0+eqaMNAGM4jzNZn1EwJLg/kbxR8VGd0Ou7mcvw0n4cfE/TjjxjCbu6sKAbFhLDCtmfo1ADX5Nbay3wvwkegrfw3cOcSBy8cVMUh0RSsF/NqyL62PCDy1ZYyrDr2RFxpdwkqlB2S40Lhf8hQ20Qx4IMxIAm0YR6Zf+Gpe830ZSkIwLbYYh5Ll8ZcWNT2hsxRtf2ir80aVTL4/7wtaGqBy1VMilCwb9oXuSfmtLPERjOyZN5x80HxB8ShHJ5NZvjOZBJr/BqBsTnNWxwcdr5XiwnM6BwqwyfP5hAryLnyC+eYGc6UVTdx7Q3dv8xzIK99XTzZu8yPm66kTwejWcZQbYvWLgPmCXbg5jrNLd0ryBCQ39Vci7iL3zEfuUf/hfx2UxHtwk0IeHtKJscoA8fZCXYOYPBTKyurGyb7aYHTcidfIloySul4YHsZxdFJsmNBOQbR5RfK0LkL0Ixm0a8HdzA6

Note that currently I'm trimming the first 24 bytes of the partition data, so that when working correctly, the decoded data starts with \x7fELF

Here is the function I'm using to read the coredump data from the partition:

`
// Find the coredump partition
const esp_partition_t *core_dump_partition = esp_partition_find_first(
ESP_PARTITION_TYPE_DATA, ESP_PARTITION_SUBTYPE_DATA_COREDUMP, "coredump");

if (!core_dump_partition)
{
    ESP_LOGE(TAG, "Core dump partition not found!");
    args->result = ipb_strdup("Core dump partition not found!"); // freed in calling function
    xSemaphoreGive(args->semaphore);
    vTaskDelete(NULL);
}

ESP_LOGI(TAG, "Core dump partition: %s, size: %ld", core_dump_partition->label, core_dump_partition->size);

// Read the core dump size
size_t core_dump_size = core_dump_partition->size;
uint8_t *core_dump_data = malloc(core_dump_size + 1);
if (!core_dump_data)
{
    ESP_LOGE(TAG, "Failed to allocate memory for core dump");
    args->result = NULL;
    xSemaphoreGive(args->semaphore);
    vTaskDelete(NULL);
}

esp_err_t err = esp_partition_read(core_dump_partition, 0, core_dump_data, core_dump_size);
if (err != ESP_OK)
{
    ESP_LOGE(TAG, "Failed to read core dump data: %s", esp_err_to_name(err));
    free(core_dump_data);
    args->result = ipb_strdup("Failed to read core dump data"); // freed in calling function
    xSemaphoreGive(args->semaphore);
    vTaskDelete(NULL);
}

// Trim the first 20 bytes
uint8_t *trimmed_core_dump_data = core_dump_data + 24;
size_t trimmed_core_dump_size = core_dump_size - 24;

// Calculate Base64 encoded size
size_t encoded_size = 0;
mbedtls_base64_encode(NULL, 0, &encoded_size, trimmed_core_dump_data, trimmed_core_dump_size);
uint8_t *encoded_data = malloc(encoded_size + 1); // used as return, so freed in calling function
if (!encoded_data)
{
    ESP_LOGE(TAG, "Failed to allocate memory for encoded data");
    free(core_dump_data); // Free original buffer
    args->result = NULL;
    xSemaphoreGive(args->semaphore);
    vTaskDelete(NULL);
}

// Encode core dump to Base64
err = mbedtls_base64_encode(encoded_data, encoded_size, &encoded_size, trimmed_core_dump_data, trimmed_core_dump_size);
if (err != 0)
{
    ESP_LOGE(TAG, "Failed to encode core dump data to Base64");
    free(core_dump_data); // Free original buffer
    free(encoded_data);   // Free allocated memory
    args->result = NULL;
    xSemaphoreGive(args->semaphore);
    vTaskDelete(NULL);
}

encoded_data[encoded_size] = '\0'; // Null-terminate the Base64 string
ESP_LOGI(TAG, "Core dump encoded size: %d", encoded_size);

// Debug logs for verification
ESP_LOGI(TAG, "Encoded core dump: %.16s", encoded_data);

// Clean up and set the result
free(core_dump_data); // Free original buffer
args->result = (char *)encoded_data;

// Signal task completion
xSemaphoreGive(args->semaphore);

`

this is the relevant section of sdkconfig:
CONFIG_ESP_COREDUMP_ENABLE_TO_FLASH=y CONFIG_ESP_COREDUMP_DATA_FORMAT_ELF=y CONFIG_ESP_COREDUMP_CHECKSUM_CRC32=y CONFIG_ESP_COREDUMP_CHECK_BOOT=y CONFIG_ESP_COREDUMP_ENABLE=y CONFIG_ESP_COREDUMP_LOGS=y CONFIG_ESP_COREDUMP_MAX_TASKS_NUM=64 CONFIG_ESP_COREDUMP_USE_STACK_SIZE=y CONFIG_ESP_COREDUMP_STACK_SIZE=2048

Is this a bug?

@github-actions github-actions bot changed the title Error reading coredump from flash in ESP32-S3 Error reading coredump from flash in ESP32-S3 (IDFGH-14443) Jan 16, 2025
@erhankur
Copy link

You can verify the coredump data by reading it with OpenOCD. For example:

I (189) esp_core_dump_flash: Found partition 'coredump' @ 120000 65536 bytes
I (352) esp_core_dump_flash: Core dump data checksum is correct
I (353) esp_core_dump_flash: Found core dump 7780 bytes in flash @ 0x120000

Connect OpenOCD from telnet

flash read_bank 0 header.bin 0x120000 24
flash read_bank 0 data.elf 0x120018 7780

Then analyze the elf file

Print elf header

xtensa-esp32s3-elf-readelf -h data.elf

Print note section

xtensa-esp32s3-elf-readelf -n data.elf

Print program headers

xtensa-esp32s3-elf-readelf -l data.elf 

@gadget-man
Copy link
Author

I have Secure Boot v2 enabled so JTAG debugging is disabled - I'm not sure OpenOCD would work? As per the code above, I'm manually reading the first 16 bytes of the partition and printing them to console before sending the partition contents - and can verify that they match (and are wrong on both).

@erhankur
Copy link

No, Unfortunatelly, OpenOCD will not work.

In cases other than abort(), I wonder if there was another exception during the coredump, preventing the write operation from being completed. Were you able to reproduce it manually? If yes I can try it here also.

I haven't tried it before with the secure boot + flash encryption scenario. I need to check if there's something affecting the coredump process.

By the way, in your code above, why do you read/encode the entire partition instead of only the written coredump data? You can get the data length from the first 4 bytes of the header.

@gadget-man
Copy link
Author

Many thanks for this. Most of my 'real' crashes at the moment are due to heap issues - so I changed my panic function to be as follows rather than abort():

char *panic = "panic"; free(panic);

I was expecting this to fail, but it actually also work and I'm currently able to download the cordump file in the correct format. I'll try and leave my device connected via UART and hopefully catch an unplanned crash at some point to see if the log over UART provides any clues.

@gadget-man
Copy link
Author

gadget-man commented Jan 17, 2025

Here's an example of the output to UART for a core dump that just happened. When I then read the coredump partition content after a reboot, the first 16 bytes are AD/+5ioYvYyfipgV so something clearly wrong with this one.

@gadget-man
Copy link
Author

@erhankur
Copy link

Looking at the log nothing wrong with the CoreDump write. It is completed without an error.

25-01-17 19:18:58 D (12030) esp_core_dump_elf: =========== Data written size = 47472 bytes ==========
2025-01-17 19:18:58 D (12037) esp_core_dump_crc: Total length of hashed data: 47488
2025-01-17 19:18:58 I (12043) esp_core_dump_flash: Write end offset 0xb984, check sum length 4
2025-01-17 19:18:58 I (12049) esp_core_dump_common: Core dump used 1328 bytes on stack. 808 bytes left free.
2025-01-17 19:18:58 I (12057) esp_core_dump_common: Restoring stack @ 0x3c1ba260
2025-01-17 19:18:58 I (12063) esp_core_dump_flash: Core dump has been saved to flash.

Does your coredump partition is also encrypted, right?
As an extra debug step, you can call esp_core_dump_get_summary and print the summary after reboot.
And could you also print raw flash data to see what is wrong with it?

@gadget-man
Copy link
Author

gadget-man commented Jan 17, 2025

Ok, I'll try adding esp_core_dump_get_summary to see what it returns.

Yes, my coredump partition is encrypted.

I should also mention that I'm using esp_insights, and can see the crash details coming through there also. Does esp_insights retrieve the coredump data from the coredump partition also, or is it saved to nvs like the metrics data?

Image

@erhankur
Copy link

@gadget-man
Copy link
Author

OK. So does the summary rely on the coredump partition being valid? I assume that helps us identify whether the partition itself is correct and it's the reading / decrypting of this that may be the error?

Please let me know if you need any further logs / data from my side to support debugging?

@erhankur
Copy link

esp_core_dump_get_summary reads flash through cache. If that function works well (I believe it should work, since eps-insight works) you can also try in that way. Just check the implementation. https://github.com/espressif/esp-idf/blob/master/components/espcoredump/src/core_dump_elf.c#L969

Please share the full reboot logs after coredump write and the logs from your function so we can verify that the coredump-related address matches.

@gadget-man
Copy link
Author

Attached a file from a recent core dump (forced by feeing static allocated char*), which shows the UART output of the dump itself, the subsequent reboot and also the output from the esp_core_dump_get_summary and encoding of the core dump partition ready to send. I've also attached the associated coredump.bin and .txt files downloaded from the partition after reboot. Please let me know if you need anything else?

Archive.zip

@erhankur
Copy link

I didn't expect to see this.

#if CONFIG_ESP_INSIGHTS_COREDUMP_ENABLE
esp_core_dump_image_erase();
#endif // CONFIG_ESP_INSIGHTS_COREDUMP_ENABLE

https://github.com/espressif/esp-insights/blob/main/components/esp_insights/src/esp_insights.c#L301

You need to do coredump encoding before calling esp_insights_enable. Otherwise, the coredump data will be deleted after being sent to the host. Apologies for not noticing this earlier.

@gadget-man
Copy link
Author

Ah OK, that's a bit of a problem as I was hoping to use ESP_insights alongside being able to manually download the core dump data. Is there a specific reason that ESP_Insights needs to delete the coredump data once it's sent to the dashboard?

For the moment I'll try just manually comment out that line in esp_insights - will that work?

@erhankur
Copy link

I believe the reason could be to avoid sending old coredump data after every boot. As long as you delete the image after encoding, it should be fine

@vikramdattu @shahpiyushv Could you please comment?

@gadget-man
Copy link
Author

gadget-man commented Jan 21, 2025

Ok, so this may be a niche use-case, however my manual encoding of the core dump partition is on-demand using an RPC function over MQTT to allow secure remote debugging in VSCode. I don’t automatically encode and upload the partition data to a server.

In this case, I’m wanting at any time to be able to make and RPC call and download the most recent core dump for a device.

Perhaps my only option is to disable core dump download in ESP Insights? However, I do find this a very useful feature, particularly for a summary analysis of older firmware versions, where the .elf file on my development machine may be a later version.

@gadget-man
Copy link
Author

@erhankur On further review, I'm not sure this is the problem. In the example files I uploaded above, I called the coredump RPC function immediately after boot, and as you'll see from the archive file, the core dump summary was OK, but the encoding of the coredump partition failed. However, if I wait for Insights to transmit the coredump info, if I think call the coredump RPC function I get a simple Blank core dump partition! message and there is no coredump data available to encode.

So Insights is an issue for sure, but I think the encoding problem with the coredump partition may be a separate issue.

@erhankur
Copy link

In the example files I uploaded above, I called the coredump RPC function immediately after boot, and as you'll see from the archive file, the core dump summary was OK, but the encoding of the coredump partition failed.

I don't think so. I have checked your files from Archive.zip. The coredump.bin file has a 20-byte header instead of 24 bytes (it seems 4 bytes of length have been trimmed). However, the ELF part is fine, and I can parse it.

However, if I wait for Insights to transmit the coredump info, if I think call the coredump RPC function I get a simple Blank core dump partition! message and there is no coredump data available to encode.

Could you send me your latest version of coredump encoder function? In the first version above, you will not get Blank core dump partition! error.

@gadget-man
Copy link
Author

gadget-man commented Jan 22, 2025

Here is the updated function which outputs the coredump summary, then encodes the partition data for sending over MQTT / UART to the requestor. Interesting that you're able to parse the coredump.bin file - whenever I run idf.py -B build.nosync coredump-info -c coredump.bin I get the error that the coredump is not in the specified format.

Am I doing something stupid?

`
void rpc_coredump_worker_task(void *pvParameters)
{
rpc_coredump_task_args_t *args = (rpc_coredump_task_args_t *)pvParameters;

ESP_LOGI(TAG, "Generating core dump summary...");

esp_core_dump_summary_t *summary = malloc(sizeof(esp_core_dump_summary_t));
if (summary)
{
    esp_err_t err = esp_core_dump_get_summary(summary);
    if (err == ESP_OK)
    {
        printf("App ELF file SHA256: %s\n", (char *)summary->app_elf_sha256);
        printf("Crashed task: %s\n", summary->exc_task);
        printf("Exception cause: %ld\n", summary->ex_info.exc_cause);
        char panic_reason[200];
        err = esp_core_dump_get_panic_reason(panic_reason, sizeof(panic_reason));
        if (err == ESP_OK)
        {
            printf("Panic reason: %s\n", panic_reason);
        }
    }
    free(summary);
}

ESP_LOGI(TAG, "Generating core dump...");

// Find the coredump partition
const esp_partition_t *core_dump_partition = esp_partition_find_first(
    ESP_PARTITION_TYPE_DATA, ESP_PARTITION_SUBTYPE_DATA_COREDUMP, "coredump");

if (!core_dump_partition)
{
    ESP_LOGE(TAG, "Core dump partition not found!");
    args->result = ipb_strdup("Core dump partition not found!"); // freed in calling function
    xSemaphoreGive(args->semaphore);
    vTaskDelete(NULL);
}

// Read the core dump size from the header (first 4 bytes)
uint32_t core_dump_size = 0;
esp_err_t err = esp_partition_read(core_dump_partition, 0, &core_dump_size, sizeof(core_dump_size));
if (err != ESP_OK || core_dump_size == 0 || core_dump_size > core_dump_partition->size)
{
    ESP_LOGE(TAG, "Invalid core dump size or failed to read header: %s", esp_err_to_name(err));
    args->result = NULL;
    xSemaphoreGive(args->semaphore);
    vTaskDelete(NULL);
}

ESP_LOGI(TAG, "Core dump partition: %s, size: %ld", core_dump_partition->label, core_dump_size);

// Allocate memory for the core dump data
uint8_t *core_dump_data = malloc(core_dump_size);
if (!core_dump_data)
{
    ESP_LOGE(TAG, "Failed to allocate memory for core dump");
    args->result = NULL;
    xSemaphoreGive(args->semaphore);
    vTaskDelete(NULL);
}

// Read the core dump data
err = esp_partition_read(core_dump_partition, sizeof(core_dump_size), core_dump_data, core_dump_size);
if (err != ESP_OK)
{
    ESP_LOGE(TAG, "Failed to read core dump data: %s", esp_err_to_name(err));
    free(core_dump_data);
    core_dump_data = NULL;
    args->result = ipb_strdup("Failed to read core dump data"); // Freed in calling function
    xSemaphoreGive(args->semaphore);
    vTaskDelete(NULL);
}

// Calculate Base64 encoded size
size_t encoded_size = 0;
mbedtls_base64_encode(NULL, 0, &encoded_size, core_dump_data, core_dump_size);

uint8_t *encoded_data = malloc(encoded_size + 1); // Freed in calling function
if (!encoded_data)
{
    ESP_LOGE(TAG, "Failed to allocate memory for encoded data");
    free(core_dump_data);
    core_dump_data = NULL;
    args->result = NULL;
    xSemaphoreGive(args->semaphore);
    vTaskDelete(NULL);
}

// Encode core dump to Base64
err = mbedtls_base64_encode(encoded_data, encoded_size, &encoded_size, core_dump_data, core_dump_size);
free(core_dump_data); // Free original buffer
core_dump_data = NULL;
if (err != 0)
{
    ESP_LOGE(TAG, "Failed to encode core dump data to Base64");
    free(encoded_data);
    args->result = NULL;
    xSemaphoreGive(args->semaphore);
    vTaskDelete(NULL);
}

encoded_data[encoded_size] = '\0'; // Null-terminate the Base64 string
ESP_LOGI(TAG, "Core dump encoded size: %d", encoded_size);

// Debug logs for verification
ESP_LOGI(TAG, "Encoded core dump: %.16s", encoded_data);

// Clean up and set the result
args->result = (char *)encoded_data;

// Signal task completion
xSemaphoreGive(args->semaphore);

// Delete the task
vTaskDelete(NULL);

}
`

@erhankur
Copy link

Yes you are skipping only 4 bytes in the header. Skip full header sizeof(core_dump_header_t) and save only the elf part. That should resolve the issue.

// Read the core dump data
err = esp_partition_read(core_dump_partition, 24, core_dump_data, core_dump_size);
...

In summary, you can read the core dump data at boot time. However, on the 2nd attempt, "Invalid core dump size or failed to read header is expected since ESP-Insights will delete the image after boot.

@gadget-man
Copy link
Author

Thanks, I'll check this.

Is there anyway for Insights to save the SHA of the transmitted core dump to NVS, then only send a new summary if the SHA changes, rather than having to delete the cord ump partition after the data is sent to the dashboard?

@vikramdattu
Copy link
Contributor

Hi @gadget-man the problem with that is we will then miss on the crashes which are exactly same. e.g., if the board crashed at the same code point multiple times, the logic will skip this reporting thinking it's the same one.

@gadget-man
Copy link
Author

Hi @gadget-man the problem with that is we will then miss on the crashes which are exactly same. e.g., if the board crashed at the same code point multiple times, the logic will skip this reporting thinking it's the same one.

In which case, could the insights component not check esp_reset_reason() for ESP_RST_PANIC instead?

@vikramdattu
Copy link
Contributor

In that case, if we fail to send the dump in first attempt for whatever reason, may it be hard/soft reset after crash, crash dump will never be sent.

@gadget-man
Copy link
Author

OK, so in that case and if there's no option to use NVS flags etc and the only way is to send the partition data if it exists, can I suggest you update the ESP Insights documents to make it clear that, if CONFIG_ESP_INSIGHTS_COREDUMP_ENABLE=y (default), then any other uses of CONFIG_ESP32_ENABLE_COREDUMP_TO_FLASH elsewhere in the code are likely to fail, unless they are processed immediately after boot (before ESP Insights has a chance to delete the coredump data).

For the moment, I'll have to stop using ESP Insights for core dump reporting.

@vikramdattu
Copy link
Contributor

Hi @gadget-man thanks for the suggestion. We will add this in the README.

@Alvin1Zhang Alvin1Zhang transferred this issue from espressif/esp-idf Jan 23, 2025
@gadget-man
Copy link
Author

Thanks for updating the readme.

Just out of interest, is there a way I could manually call the function to send the core dump to the insights dashboard?

Although I may miss a few, I’m comfortable that the vast majority of crashes would be caught by checking boot reason, and I could then send the data across without having to subsequently delete the core dump partition.

@vikramdattu
Copy link
Contributor

vikramdattu commented Jan 27, 2025

Hi @gadget-man the crash dump is sent in the boot time data message, which is sent at the start. So, it is not straight forward to send the crash-dump at later than with the boot info.

You can, however, achieve your use-case, by modifying the esp-insight component to not erase the crash dump after it is sent. Re-enable core dump from menuconfig and remove calls to esp_core_dump_image_erase(); from esp_insights.c. This should do the trick.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants