Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TFTP update failed #60

Open
darkbyte-ru opened this issue Mar 24, 2015 · 46 comments
Open

TFTP update failed #60

darkbyte-ru opened this issue Mar 24, 2015 · 46 comments

Comments

@darkbyte-ru
Copy link
Contributor

After updating antares and SDK make a new firmware and write it to esp via tftp, all was ok, esp bootup.

Memory Layout:
data  : 0x3ffe8000 ~ 0x3ffe8f90, len: 3984
rodata: 0x3ffe8f90 ~ 0x3ffeb4c0, len: 9520
bss   : 0x3ffeb4c0 ~ 0x3fff4118, len: 35928
heap  : 0x3fff4118 ~ 0x3fffc000, len: 32488

 Flash layout:
Firmware size is     221 KiB (0x375b0)
Filesystem starts at 224 KiB (0x38000)

Then run tftp upgrade from new firmware and it's failed:

TFTP: Downloading tftp://192.168.1.1/antares.rom, Starting flash block 56

 ########################################################
 TFTP done, 226720 bytes transferred
Committing update in 2 seconds

Commiting update, 56 sectors 226720 bytes
##########################Fatal exception (0):
epc1=0x40219a3c, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):
epc1=0x40219318, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):
epc1=0x40219318, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000

Last message repeated N times before esp stop responding.

The same firmware successfully flushed via XTCOM.

@darkbyte-ru darkbyte-ru changed the title TFTP update TFTP update failed Mar 24, 2015
@susisstrolch
Copy link

I've never got tftp working inside a telnet session.

Last famous words: "Committing update in 2 seconds"
Did reset 1-5min after the commit message, but only got trash on serial
console,
telnet/ping wasn't avail anymore.

Rewriting the flash via esptool works - so it doesn't look like esp
config was borked
by tftp...

On 03/24/2015 06:46 PM, DarkByte wrote:

After updating antares and SDK make a new firmware and write it to esp
via tftp, all was ok, esp bootup.

|Memory Layout:
data : 0x3ffe8000 ~ 0x3ffe8f90, len: 3984
rodata: 0x3ffe8f90 ~ 0x3ffeb4c0, len: 9520
bss : 0x3ffeb4c0 ~ 0x3fff4118, len: 35928
heap : 0x3fff4118 ~ 0x3fffc000, len: 32488

Flash layout:
Firmware size is 221 KiB (0x375b0)
Filesystem starts at 224 KiB (0x38000)
|

Then run tftp upgrade from new firmware and it's failed:

|TFTP: Downloading tftp://192.168.1.1/antares.rom, Starting flash block 56

########################################################
TFTP done, 226720 bytes transferred
Committing update in 2 seconds

Commiting update, 56 sectors 226720 bytes
##########################Fatal exception (0):
epc1=0x40219a3c, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):
epc1=0x40219318, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):
epc1=0x40219318, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
|

Last message repeated N times before esp stop responding.

The same firmware successfully flushed via XTCOM.


Reply to this email directly or view it on GitHub
#60.

@nekromant
Copy link
Owner

Hm... I think I know why it doesn't work over telnet. It's easy to fix, but you won't get any progress report whatsoever when synchronising data.
@darkbyte-ru Does it work via serial? (e.g. NOT telnet)

@darkbyte-ru
Copy link
Contributor Author

@nekromant, I use serial to communicate with esp.

@nekromant
Copy link
Owner

@darkbyte-ru Got it, hold on will fetch the board and see if it works for me.

@susisstrolch
Copy link

I know that progress bar isn't possible because of blocked wdog and INTs.
So a short "Starting commit, rebooting in XXX sec" would be sufficient
via telnet.

On 03/24/2015 07:10 PM, Andrew wrote:

Hm... I think I know why it doesn't work over telnet. It's easy to
fix, but you won't get any progress report whatsoever.
@darkbyte-ru https://github.com/darkbyte-ru Does it work via serial
(e.g. NOT telnet)


Reply to this email directly or view it on GitHub
#60 (comment).

@nekromant
Copy link
Owner

Bad news. tftp updates crash as well as spi_wipe command. Looks like some spi flash code is now calling something that's NOT in RAM. Looks like we're in deep trouble with 1.0 SDK.
Unless someone pokes Espressif to fix their blobs we're mostly out of luck with 1.0 SDK this way. Ideas welcome.

@susisstrolch
Copy link

That's what I see when trying the tftp via serial:

Commiting update, 54 sectors 220484 bytes
#########################Fatal exception (0):
epc1=0x40218250, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):
epc1=0x40217bc4, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):
epc1=0x40217bc4.......

The latter one repeats until the module hangs...
0x40218250 is "system_get_time"
0x40217bc4 is "system_get_boot_version"

Module must be reflashed afterwards...

By reordering the "ets_wdt_disable()" / "ets_wdt_enable()" and commenting out the hash printout
I see exactly the same when updating during a telnet session.
Maybe only an additional "disable ints" is required?

@nekromant
Copy link
Owner

@susisstrolch I already tried running os_intr_lock() before doing the actual update, doesn't help much, things are the same. What's ever weirder, epc1 addresses of exception appear to be random on my board.

@susisstrolch
Copy link

Also, encapsulating in
system_upgrade_flag_set(UPGRADE_FLAG_START)
system_upgrade_flag_set(UPGRADE_FLAG_FINISH)
system_upgrade_reboot()

doesn't work. Exactly same - only differenc is, that wdt reset shows up
after all...

I'll ask Espressif...

On 03/27/2015 02:45 PM, Andrew wrote:

@susisstrolch https://github.com/susisstrolch I already tried
running os_intr_lock() before doing the actual update, doesn't help
much, things are the same. What's ever weirder, epc1 addresses appear
to be random on my board.


Reply to this email directly or view it on GitHub
#60 (comment).

@nekromant
Copy link
Owner

@susisstrolch Thanks! Hope they can give us some insight of what's happening inside the blobs. For I don't feel like spending a romantic evening with a disassembler right now.

@susisstrolch
Copy link

The SDK docu about "system_upgrade_flag_set" is completely misleading,
because of the "If you using 'spi_flash_write' to upgrade firmware
yourself, this flag need to be set to 'UPGRADE_FLAG_FINISH', then call
'system_upgrade_reboot' to reboot to run new firmware."

On 03/27/2015 03:06 PM, Andrew wrote:

@susisstrolch https://github.com/susisstrolch Thanks! Hope they can
give us some insight of what's happening inside the blobs. For I don't
feel like spending a romantic evening with a disassembler right now.


Reply to this email directly or view it on GitHub
#60 (comment).

@susisstrolch
Copy link

And what's about the idea to go the same route than Espressif - using the bootloader-blob 1.3 and providing two different roms (f.e. antares.rom1, antares.rom2), linked for the two (by Espressif) supported flash areas?
That would at least prevent hustle with os_intr_lock() and other undocumented functions.

Maybe the os_intr_lock() should go into the loop body, just in case one of those funcs will reenable ints..

@nekromant
Copy link
Owner

@susisstrolch I thought about that. However it all looks pretty nasty. If those things didn't change since first versions of SDK (and we still have a flash split, where one application updates the other mirror and reboots to it) we'll be limiting ourselves to 256KiB flash (for rom1 and rom2) not saying about complicating the build process. If we manage to bring tftp update back to life, with heatshrink compression we can get ~300-400KiB of flash without sacrificing OTA update functionality.

@susisstrolch
Copy link

Had a quick look with "objdump -dtr libmain.a"...
Guess what - spi_flash_erase_sector calls system_get_time on entrance and on leave...
And system_get_time accesses the irom_text section...

@nekromant
Copy link
Owner

@susisstrolch Looks like someone has been doing some profiling and forgot to remove portions of that profiling code. Do you think you can ask Espressif to kindly fix that in further SDK releases?

@susisstrolch
Copy link

Just verifying against 0.96...
For sure!

@susisstrolch
Copy link

Hmm... strange - sure it worked with 0.96?
Because system_get_time, which is located in irom, is called by nearly any spi_flash function.

@darkbyte-ru
Copy link
Contributor Author

Previous version of SDK was 0.9.6? I thought 0.9.5.

@susisstrolch
Copy link

So it was introduced with the 0.9.6 beta...
Definitly the 0.9.5 spi_flash_* does NOT call system_get_time!

On 03/28/2015 12:52 PM, DarkByte wrote:

Previous version of SDK was 0.9.6? I thought 0.9.5.


Reply to this email directly or view it on GitHub
#60 (comment).

@susisstrolch
Copy link

Hmm... an unsane workaround would be to remove system_get_time out of libmain and provide an own one, until SDK is fixed...

@darkbyte-ru
Copy link
Contributor Author

Maybe simple nop system_get_time calls in libmain?

@susisstrolch
Copy link

Jep - 'cause it only contains code w/o relocation:

40218250 <system_get_time>:
40218250: fffe31 l32r a3, 40218248
<system_timer_reinit+0x18> ; =0x3ffee874
40218253: fffe21 l32r a2, 4021824c
<system_timer_reinit+0x1c> ; =0x3ff20a00
40218256: 0338 l32i.n a3, a3, 0
40218258: 0020c0 memw
4021825b: 802222 l32i a2, a2, 0x200
4021825e: 223a add.n a2, a2, a3
40218260: f00d ret.n

On 03/28/2015 01:19 PM, DarkByte wrote:

Maybe simple nop system_get_time calls in libmain?


Reply to this email directly or view it on GitHub
#60 (comment).

@susisstrolch
Copy link

these are the bytepos in libmain to patch with a nop (0x0020f0 -> f0 20 00):

0xa29c, 0xa2be, 0xa308, 0xa32e, 0xa369, 0xa38f

ensure, that those locations contain "c0 00 00" before patching...
Affected are: spi_flash_read, spi_flash_write, spi_flash_erase_sector

@darkbyte-ru
Copy link
Contributor Author

@susisstrolch, seems not working

Commiting update, 57 sectors 232928 bytes
############################Fatal exception (0):
epc1=0x4021b194, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (28):
epc1=0x4021aa7a, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000004, depc=0x00000000

@susisstrolch
Copy link

Are you using the proper libmain.a?
Have a "xtensa-lx106-elf-objdump -drt images/antares.elf >tmp/antares.S". Search for "spi_flash_erase_sector>:"
It should look like:

401007f8 <spi_flash_erase_sector>:
401007f8: f0c112 addi a1, a1, -16
401007fb: 0129 s32i.n a2, a1, 0
401007fd: 2109 s32i.n a0, a1, 8
401007ff: fffc01 l32r a0, 401007f0 <spi_flash_write_status+0x34>
40100802: 0020f0 nop
40100805: ffc401 l32r a0, 40100718 <vPortInitialiseBlocks+0x54>
40100808: 0000c0 callx0 a0

If not, you've patched the wrong libmain.a...
libmain.a is two times in esp-open-sdk:

./esp-open-sdk/xtensa-lx106-elf/xtensa-lx106-elf/sysroot/usr/lib/libmain.a
./esp-open-sdk/esp_iot_sdk_v1.0.0/lib/libmain.a

./esp-open-sdk/xtensa-lx106-elf/xtensa-lx106-elf/sysroot/usr/lib/libmain.a is the right one to patch...

@darkbyte-ru
Copy link
Contributor Author

401007f8 <spi_flash_erase_sector>:
401007f8:   f0c112          addi    a1, a1, -16
401007fb:   0129        s32i.n  a2, a1, 0
401007fd:   2109        s32i.n  a0, a1, 8
401007ff:   fffc01          l32r    a0, 401007f0 <spi_flash_write_status+0x34>
40100802:   0000c0          callx0  a0

Seems the same, but "callx0 a0" it is call system_get_time. Must be "0020f0 nop". I'm really patch wrong file. At first I patch "/esp-open-sdk/esp_iot_sdk_v1.0.0/lib/libmain.a", but for linking was used "/esp-open-sdk/xtensa-lx106-elf/xtensa-lx106-elf/sysroot/usr/lib/libmain.a".

@darkbyte-ru
Copy link
Contributor Author

Now update commits okay, but module does not starts after that.

Commiting update, 57 sectors 232928 bytes
#########################################################
Firmware update completed (57 sectors), rebooting

 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset
ets_main.c

@susisstrolch
Copy link

So it's bricked? Does it work after a pwr cycle?

@darkbyte-ru
Copy link
Contributor Author

@susisstrolch, yep, bricked, after reset it send only

ets Jan  8 2013,rst cause:2, boot mode:(3,6)

ets_main.c

at baud 74880.

@susisstrolch
Copy link

Hmm, here it doesn't work also..
I dumped the flash after the tftp

esptool.py -p /dev/ttyUSB0 -b $(( 4 * 115200)) dump_mem 0x40100000 $((512 * 1024)) flash512-40100000.rom

and got a pretty strange result:

xxd images/antares.rom |head
0000000: e903 0000 0400 1040 0000 1040 a460 0000 .......@...@.`..
0000010: 040e 0040 12c1 f009 0105 3000 01fd ffc0 [email protected].....
0000020: 0000 0801 12c1 100d f000 0000 00fe ef3f ...............?
0000030: c8ae fe3f ecae fe3f cc24 0040 12c1 f009 ...?...?.$.@....

xxd flash512-40100000.rom |head
0000000: 040e 0040 12c1 f009 0105 3000 01fd ffc0 [email protected].....
0000010: 0000 0801 12c1 100d f000 0000 00fe ef3f ...............?
0000020: c8ae fe3f ecae fe3f cc24 0040 12c1 f009 ...?...?.$.@....

Any idea how to read the whole flash with esptool? Trying to read from 0x00 doesn't work...

@nekromant
Copy link
Owner

@susisstrolch Something tells me they might be using system_get_time() there for a reason. Perhaps swapped busy-wait for a system_get_time(). This way if you replace system_get_time with a nop it might misbehave, e.g. timeout immediately and therefore garble the flash contents in weird way.

@susisstrolch
Copy link

Crosschecked again and compared spi_flash-0.9.5 vs spi_flash-1.0.
Onyl difference is the additional call to system_get_time. And it's used
as void function...

Other variant - instead of patching:
ar -d libmain.a spi_flash.o
ar -x libmain-v0.95.a spi_flash.o
ar -r libmain.a spi_flash.o

Will try to call system_restart() instead of ets_wdt_enable();...

On 03/29/2015 01:35 PM, Andrew wrote:

@susisstrolch https://github.com/susisstrolch Something tells me
they might be using system_get_time() there for a reason. Perhaps
swapped busy-wait for a system_get_time(). This way if you replace
system_get_time with a nop it might misbehave, e.g. timeout
immediately and therefore garble the flash contents in weird way.


Reply to this email directly or view it on GitHub
#60 (comment).

@darkbyte-ru
Copy link
Contributor Author

Answer from espressif team:

Erasing and rewriting in the running program area are not our recommended approach and there are potencial risks.

We recommend you to use boot.bin+user1.bin/user2.bin and you can refer to our "SDK User Manual_V1.0.0" for details.

@susisstrolch
Copy link

Got exactly the same one with addition
"Next,we will optimize what may lead to crash problems arising in the
process of use."

On 03/31/2015 08:46 AM, DarkByte wrote:

Answer from espressif team:

|Erasing and rewriting in the running program area are not our recommended approach and there are potencial risks.

We recommend you to use boot.bin+user1.bin/user2.bin and you can refer to our "SDK User Manual_V1.0.0" for details.
|


Reply to this email directly or view it on GitHub
#60 (comment).

@susisstrolch
Copy link

The problem seems to be the system restart.
Have a look at
https://github.com/susisstrolch/esp8266-frankenstein/commit/1bd262354312189d88d117b170444e2e7d74f2e2
for my approach...
It results in

frankenstein > tftp
TFTP: Downloading tftp://192.168.254.20/volume1/public/tftp/antares.rom,
Starting flash block 54

######################################################
TFTP done, 220484 bytes transferred
Committing update in 2 seconds

Commiting update, 54 sectors 220484 bytes
111111111111111111111111111111111111111111111111111111
Firmwa▒
ets Jan 8 2013,rst cause:4, boot mode:(3,6)

wdt reset
ets_main.c

So, for me it looks like the bootloader assumes a given structure, which
is set by serial upload via esptool, but not by simple copy.

Unfortunately, the update related functions from SDK are based in irom, so, at least without elf patching, we can't use the suggested methods "system_upgrade_flag_set()", "system_upgrade_reboot()"

On 03/31/2015 08:46 AM, DarkByte wrote:

Answer from espressif team:

|Erasing and rewriting in the running program area are not our recommended approach and there are potencial risks.

We recommend you to use boot.bin+user1.bin/user2.bin and you can refer to our "SDK User Manual_V1.0.0" for details.
|


Reply to this email directly or view it on GitHub
#60 (comment).

@nekromant
Copy link
Owner

@susisstrolch Bad news. I didn't expect Espressif to really fix the problem - they don't look like the people who care about anything but their own weird approach. I'll give their bootloader thingie a spin, may be with heatshrink it won't be a huge overhead. Basically we need a basic app to jump to that will only do the syncronisation.
Regarging the .rom - it looks really weird. I remember I compared flash contents with .rom after flashing and they were byte to byte identical.

@darkbyte-ru
Copy link
Contributor Author

@nekromant, some good news:

As determined by our internal engineers,this problem can not be treated as a bug for the moment.

However, we will modify related functions to support your systems and new version of SDK will be released on this Friday.

"This not a bug, but we will fix it" :D

@susisstrolch
Copy link

Don't go directly to "Start"... :)

On 03/31/2015 10:17 AM, DarkByte wrote:

@nekromant https://github.com/nekromant, some good news:

|As determined by our internal engineers,this problem can not be treated as a bug for the moment.

However, we will modify related functions to support your systems and new version of SDK will be released on this Friday.
|

"This not a bug, but we will fix it" :D


Reply to this email directly or view it on GitHub
#60 (comment).

@darkbyte-ru
Copy link
Contributor Author

I think this is time to make own bootloader with blackjack and tftp support. Or, maybe arduino like for a first time.

@nekromant
Copy link
Owner

@darkbyte-ru o_O Either they are reading this very thread or I'm just paranoid. Nevertheless - good it will be fixed. Maybe I was even wrong to bash them like that.
Regarding the bootloader - I think the easiest and most portable - will be adding a linux-kernel-like piggyback decompressor to updates.
e.g.

  • We make a super-simple loader that only uses stack.
  • Compile it with -fPIC,
  • Append the compressed update image to it
  • When we've loaded the update to the internal flash - we make a jump there. The loader then takes care to decompress the update and write it to the application section.

Any comments?

@susisstrolch
Copy link

So we're at the same point than now:
the update is written to the flash, and we have to restart in a clean way.
Problem is, we don't know what else has to be done after flash is written.
A simple "software_reset()" isn't sufficient - it leads to

ets Jan 8 2013,rst cause:3, boot mode:(3,6)

ets_main.c

All others (system_reboot(), system_upgrade_*) can't be run because we
don't have the relocated
addresses...

In fact, we have to find out what the rom-based loader does after
flashing...

On 03/31/2015 11:53 AM, Andrew wrote:

@darkbyte-ru https://github.com/darkbyte-ru o_O Either they are
reading this very thread or I'm just paranoid. Nevertheless - good it
will be fixed. Maybe I was even wrong to bash them like that.
Regarding the bootloader - I think the easiest will be adding a
linux-kernel-like piggyback decompressor to updates.
e.g.

  • We make a super-simple loader that /only/ uses stack.
  • Compile it with -fPIC,
  • Append the compressed update image to it
  • When we've loaded the update to the internal flash - we make a
    jump there. The loader then takes care to decompress the update
    and write it to the application section.

Comments?


Reply to this email directly or view it on GitHub
#60 (comment).

@nekromant
Copy link
Owner

@susisstrolch Let me try it again next weekend. I don't think the ROM loader does any super-special magic there.

@susisstrolch
Copy link

Ok - found the reason...
recv_cb reads the image to "u->fblock * SPI_FLASH_SEC_SIZE" (which is 54 in my environment)
but commit_handler reads from (64*i) * SPI_FLASH_SEC_SIZE...
I'll extend the update_server struct with the start block of recv flash, so it can be handled dynamicly in commithandler...

@nekromant
Copy link
Owner

My bad, looks like I've left some hardcode after I've implemented the end-of-firware marker.

@susisstrolch
Copy link

New sdk is out: esp_iot_sdk_v1.0.1_b1
A quick look (and also the readme) shows that the unnecessary system_get_time calls are removed from spi_flash functions.
So I only have to push it into the open sdk to do further tests...

@dirkx
Copy link

dirkx commented May 2, 2015

Am trying to get 1.0.1 wired in to see if that resolves the issue (getting a hang similar to above on UART). Pulling in this version seems to require two changes; moving the rom0 segment back to 0B00 and changing the include path.

That makes things compile without ado -- but on flash & reboot (though UART) a hang is observed just after it prints out the Memory map on reset/reboot.

Any suggestions as to what I may be missing ? Or where to look.

Dw.

diff -ur ./ld/eagle.app.v6.ld sdk/esp_iot_sdk_v1.0.1/ld/eagle.app.v6.ld
@@ -5,8 +5,7 @@
   dport0_0_seg :                        org = 0x3FF00000, len = 0x10
   dram0_0_seg :                         org = 0x3FFE8000, len = 0x14000
   iram1_0_seg :                         org = 0x40100000, len = 0x8000
-  irom0_0_seg :                         org = 0x40240000, len = 0x3C000
+  irom0_0_seg :                         org = 0x4020B000, len = 0x3C000 
 }
@@ -179,5 +178,4 @@
 }

 /* get ROM code address */
-INCLUDE "../ld/eagle.rom.addr.v6.ld"  
+INCLUDE "antares/src/arch/esp8266/ld/eagle.rom.addr.v6.ld"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants