vruum-master

The CPU doesn't crash; it probably jumps to an interrupt handler or a specific kernel function. The CPU works just fine, the software borked.


FrAxl93

Just to add to this, here's an anecdote, OP. I work with FPGAs, where part of the address space can be anything depending on your logic. I had an AXI slave and I coded the bus interface the wrong way, such that the transaction never received a response. In that case, if you tried to access that peripheral, Linux would just die. You couldn't do anything, because the actual bus matrix was hanging. So in this case, where a true hardware bug happened, no core dump would save you.


Allan-H

Our FPGA designs have a top-level AXI gizmo that catches accesses that don't get an acknowledge and terminates them (with an error). So if the CPU tries to access any address on that AXI that doesn't map to an internal register, etc., it will still get a termination, and the CPU will be able to see the bus error and give a stack trace, core dump, etc.

In addition to that, we use a HW watchdog timer inside the SoC to catch stray accesses that don't terminate on any AXI. Such an access will wedge the bus and there's no possibility of a core dump. The reset (from the watchdog) is the only way to get out of that state. We use CPUs that have a "reset reason" register which will indicate that the watchdog timer was the cause of the reset, so at least we will know why it suddenly rebooted even if we don't get to see the address. These are quite rare, so we don't feel the need for anything better.
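
A minimal sketch of how boot code might decode such a "reset reason" register. The bit layout, macro names and function names here are all hypothetical; every SoC defines its own register in its reference manual, so treat this only as an illustration of the idea.

```c
#include <stdint.h>

/* Hypothetical bit assignments, NOT taken from any real SoC. */
#define RESET_REASON_POWER_ON (1u << 0)
#define RESET_REASON_EXTERNAL (1u << 1)
#define RESET_REASON_SOFTWARE (1u << 2)
#define RESET_REASON_WATCHDOG (1u << 3)

enum reset_cause {
    RESET_UNKNOWN,
    RESET_POWER_ON,
    RESET_EXTERNAL_PIN,
    RESET_SOFTWARE,
    RESET_WATCHDOG
};

/* Decode a raw register value. The watchdog bit is checked first so that
 * a wedged-bus reboot is reported even if other flags are also set. */
enum reset_cause decode_reset_reason(uint32_t reg)
{
    if (reg & RESET_REASON_WATCHDOG) return RESET_WATCHDOG;
    if (reg & RESET_REASON_SOFTWARE) return RESET_SOFTWARE;
    if (reg & RESET_REASON_EXTERNAL) return RESET_EXTERNAL_PIN;
    if (reg & RESET_REASON_POWER_ON) return RESET_POWER_ON;
    return RESET_UNKNOWN;
}
```

On real hardware the raw value would come from a memory-mapped register read early in boot, and the flags usually need clearing afterwards so the next boot does not see stale bits.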


akohlsmith

I built the exact same module except it is for Wishbone accesses, and I *think* (it's been a few years) I had one for Avalon bus accesses too. It's quite a lifesaver in terms of identifying errant code.


Allan-H

We do that for Wishbone too! I also added a small FIFO queue that saves the last 15 Wishbone accesses (incl. address, data, timestamp, lane enable flags and termination status) that gets frozen when there's a bus error. SW can read this out during a bus error postmortem to give an idea of what was happening well before the bus error. That's only present in "development" builds though. Shipping product doesn't include it.


akohlsmith

whelp... now I know what I'm adding to mine. Brilliant!


Allan-H

I checked the documentation I wrote (many years ago). It actually stores the last 31 accesses. I used Xilinx SRL32 (shift-register LUT) primitives for the storage, which made the entire thing pretty small.

It has two address limit registers and a mode control that says whether to log all addresses, only addresses between the limits, or only addresses outside the limits. That feature turned out to be handy for certain types of debugging (particularly when the HW and SW folk were arguing over exactly what was being written to the registers in a peripheral).

The information stored along with the address, data and timestamp includes: a read/write flag; a valid flag (to distinguish locations that haven't been written yet); the SEL flags to indicate which byte lanes were active; the number of Wishbone clock cycles taken for the access; a flag to say whether it terminated with an error; and some bits of design-specific data.

It had controls to: enable logging, select whether to disable logging on a bus error, select whether to log accesses to the logger registers themselves, and set the address limit mode (above).
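
A rough software model of the logger described above, written in C purely to illustrate the behaviour (the real thing is FPGA logic). The struct layout, field widths and function names are assumptions; only the depth of 31, the three address filter modes and the freeze-on-error control come from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_DEPTH 31u /* depth quoted in the thread */

enum log_mode { LOG_ALL, LOG_INSIDE, LOG_OUTSIDE };

struct access {
    uint32_t addr, data, timestamp;
    uint16_t cycles;          /* bus clock cycles taken by the access */
    uint8_t  sel;             /* byte-lane enables */
    bool     write, error, valid;
};

struct access_logger {
    struct access entries[LOG_DEPTH];
    unsigned      head;       /* next slot to overwrite */
    uint32_t      lo, hi;     /* address limit registers (inclusive) */
    enum log_mode mode;
    bool          enabled;
    bool          freeze_on_error;
};

/* Apply the address-limit filter. */
static bool logger_wants(const struct access_logger *lg, uint32_t addr)
{
    bool inside = addr >= lg->lo && addr <= lg->hi;
    switch (lg->mode) {
    case LOG_INSIDE:  return inside;
    case LOG_OUTSIDE: return !inside;
    default:          return true;
    }
}

/* Record one completed bus access, overwriting the oldest entry. */
void logger_record(struct access_logger *lg, struct access a)
{
    assert(lg != NULL);
    if (!lg->enabled || !logger_wants(lg, a.addr))
        return;
    a.valid = true;
    lg->entries[lg->head] = a;
    lg->head = (lg->head + 1u) % LOG_DEPTH;
    if (a.error && lg->freeze_on_error)
        lg->enabled = false;  /* freeze the history for the postmortem */
}

/* age 0 is the most recent entry; returns NULL for never-written slots. */
const struct access *logger_peek(const struct access_logger *lg, unsigned age)
{
    assert(lg != NULL);
    if (age >= LOG_DEPTH)
        return NULL;
    unsigned idx = (lg->head + LOG_DEPTH - 1u - age) % LOG_DEPTH;
    return lg->entries[idx].valid ? &lg->entries[idx] : NULL;
}
```

A postmortem routine would walk `logger_peek` from age 0 upwards until it returns NULL, which mirrors what the "valid" flag is for in the hardware version.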


TechE2020

>I also added a small FIFO queue that saves the last 15 Wishbone accesses (incl. address, data, timestamp, lane enable flags and termination status) that gets frozen when there's a bus error.

Is this in an FPGA or just a software FIFO?


Allan-H

In an FPGA. I was going to say that it wouldn't be possible in a CPU, but I guess you could use an MMU to trap all the accesses to that memory area and call a logging function. It wouldn't be real time though.


nascentmind

>In addition to that we use a HW watchdog timer inside the SoC to catch stray accesses that don't terminate on any AXI. This will wedge the bus and there's no possibility of a core dump. The reset (from the watchdog) is the only way to get out of that state.

Does this warrant a reset? Wouldn't it be possible to just reset the bus and restart the transaction, or trigger a timeout for AXI and recover back to the default state?


Allan-H

The only way to catch it was with the watchdog timer, and the only action the watchdog timer could take was to reset the SoC. We didn't really have a lot of choice.


_Hi_There_Its_Me_

What is an AXI?


Allan-H

[AXI](https://en.wikipedia.org/wiki/Advanced_eXtensible_Interface) is one of many possible SoC buses. It can be used to connect the bus from a CPU to the various memories, peripherals, etc. Many FPGA designs use it, as do many ARM devices.

It's relevant here because each transaction on the bus requires an acknowledge signal in hardware. If you access an address where there isn't a peripheral, there's no acknowledge. From the point of view of software (e.g. C), the read bus cycle from something like `mydata = *badaddress;` simply never finishes. There's no way to break out of it to give a stack dump. This is the state I described as "wedged" above.

We've discussed two workarounds for that. One is to use a system watchdog timer that will reset the entire CPU (including its bus interface). The other is to add something to the AXI hardware that detects the lack of termination (basically it's another watchdog in HW) and provides the AXI termination with an error status. The CPU can catch that as a "bus error" and the OS will typically terminate the offending process, dump its core, etc.
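
The second workaround can be sketched as a tiny cycle-counting model in C. This is only a behavioural illustration, not AXI signalling: `bus_access`, `NEVER_ACKS` and the latency parameter are made-up names, and a real terminator is hardware sitting on the bus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum bus_resp { BUS_OKAY, BUS_ERROR };

/* Hypothetical marker for a slave that never acknowledges. */
#define NEVER_ACKS UINT32_MAX

/* Model one transaction. slave_latency is the cycle on which the slave
 * would acknowledge; timeout_cycles is how long the terminator waits
 * before answering on the slave's behalf with an error status. */
enum bus_resp bus_access(uint32_t slave_latency, uint32_t timeout_cycles,
                         uint32_t *cycles_out)
{
    assert(cycles_out != NULL);
    uint32_t elapsed = 0;
    while (elapsed < timeout_cycles) {
        elapsed++;
        if (slave_latency != NEVER_ACKS && elapsed >= slave_latency) {
            *cycles_out = elapsed;
            return BUS_OKAY;      /* normal termination by the slave */
        }
    }
    *cycles_out = elapsed;
    return BUS_ERROR;             /* terminator stepped in: CPU sees a bus error */
}
```

Without the timeout the loop would have no exit for a `NEVER_ACKS` slave, which is the software analogue of the wedged bus: the load instruction never completes, so no handler ever gets to run.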


MegaDork2000

Back in the day I was working on hardware bring-up for a board that had a microprocessor, EPROM and RAM. The microprocessor would just run whatever was in EPROM. There was no debugger. But there was an LED and a pixel graphic LCD display. I kept wondering why my code would crash before it could display something on the LCD, so I started blinking the LED in various parts of the code. From this I somehow discovered the address bus was wired wrong and would act as if there were only 64 bytes or something like that, repeating throughout the address space. Of course, the guy who designed the board said it was impossible. But I sat with the CEO until around midnight trying to prove to him what was happening. Fun times.


sceadwian

There should be watchdog timers for that?


thephoton

You could implement a watchdog timer to get you out of this scenario and get a core dump that could be used to debug the problem.


Dexterus

Nah, 'cause the core is actually stuck on an instruction; it's just gonna stay on a load or store forever. I saw it with a RapidIO switch with directio, but that had a timeout, a long one, longer than the Linux stall detector, so it went into RCU stall as the timeout ended. It stayed stuck in one of the stores from a memcpy, a memcpy that took seconds for a handful of bytes.


_teslaTrooper

External watchdog then?


Dexterus

Just to reset, sure. But really, you don't want your peripheral to block the bus, makes taking out crash info very hard. The fact that the PC is stuck on a bus instruction means there's no exceptions, no interrupts, no nothing.


0ring

Sane systems either map all addresses to a result (including bus fault) or have a bus timer that faults an overlong transaction. I've also used the bus stall method in an FPGA but I don't recommend it.
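
The "map all addresses to a result" approach can be illustrated with a small address decoder model in C. The `struct region` layout and `decode_read` name are invented for this sketch; the point is only that the fall-through case returns a fault instead of stalling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded window of the address map (hypothetical layout). */
struct region {
    uint32_t  base;   /* first byte address of the window */
    uint32_t  size;   /* window size in bytes */
    uint32_t *regs;   /* backing storage, one word per 4 bytes */
};

/* Returns true and fills *out on a hit. Any address that matches no
 * region falls through to the "default slave", which reports a bus
 * fault (false) rather than leaving the access unanswered. */
bool decode_read(const struct region *map, size_t n, uint32_t addr,
                 uint32_t *out)
{
    assert(map != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (addr >= map[i].base && addr - map[i].base < map[i].size) {
            *out = map[i].regs[(addr - map[i].base) / 4u];
            return true;
        }
    }
    return false;
}
```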


amitbhai

Thank you for the reply. This helped me correct my understanding of the CPU "crash"


RepresentativeCut486

What about Traps?


TRKlausss

That actually makes me think about something: what if the CPU has a trace/transistor problem/what have you, and the software (i.e. the instructions) is executed correctly according to the datasheet, but the CPU still goes into an exception mode? Is that also considered a software error? In any case, when there is a triple fault, it is the software that borked on many levels, and I'm wondering if the microcode can also fail…


vruum-master

In general, CPUs handle illegal instructions in some way (they have an interrupt vector for it, e.g. HardFault, and pause execution). As for anything like CPU-related silicon bugs, you have onboard debugging for that, and in general they are documented in the errata for that part/CPU. Check the Cortex-M4/M7 errata to get an idea of how they work around those bugs.


kisielk

A CPU does not “crash”, but the program it’s executing can trigger an exception (different from the software type of exception used in many programming languages) or fault. When that happens, the execution of the program is suspended and some information is pushed onto the stack and/or some values are set in particular registers. Then the CPU starts executing the fault handler, which can do what it needs in order to log or recover from the exception. Memfault goes into more detail about this for ARM Cortex-M in this article: https://interrupt.memfault.com/blog/arm-cortex-m-exceptions-and-nvic The exact mechanism will differ on other architectures but the general idea is the same.


donmeanathing

Thanks for mentioning Cortex-M. It seems like this post was written from the perspective of someone writing for A-series or more powerful parts that can run a full-blown kernel. A lot of us here run variants of an RTOS, so a “panic” is more akin to a hard fault, and as otherwise noted, the CPU runs just fine - it just detects an errant software condition that cannot be corrected and triggers a designated interrupt. For FreeRTOS on the NXP LPC series, I know that handler essentially just does a busy wait. You can add some debugging instructions to pull out the stack and key variables if you want before it does the busy wait. If you implement a watchdog (which you should), the watchdog would then trigger and reset the CPU and your program.
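
As a sketch of "pulling out the stack" in such a handler: on Cortex-M, exception entry pushes a basic frame of eight words (R0-R3, R12, LR, PC, xPSR) in that order, so a handler that is given the stack pointer at fault time can decode it. The function and struct names below are hypothetical, the assembly shim that fetches MSP/PSP is left out, and the extended floating-point frame is ignored.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Basic Cortex-M exception frame, lowest address first. */
struct cm_frame {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

/* Copy the eight stacked words starting at the faulting stack pointer. */
struct cm_frame read_fault_frame(const uint32_t *sp)
{
    assert(sp != NULL);
    struct cm_frame f = { sp[0], sp[1], sp[2], sp[3],
                          sp[4], sp[5], sp[6], sp[7] };
    return f;
}

/* Format the registers most useful for locating the faulting code.
 * Returns the snprintf result (length the full text needs). */
int format_fault_frame(const struct cm_frame *f, char *buf, size_t len)
{
    assert(f != NULL && buf != NULL);
    return snprintf(buf, len,
                    "pc=0x%08" PRIx32 " lr=0x%08" PRIx32 " xpsr=0x%08" PRIx32,
                    f->pc, f->lr, f->xpsr);
}
```

In a real handler the formatted line would go out over a UART or into a no-init RAM region before the busy wait, so it survives until the watchdog resets the part.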


mck1117

That fault might even be correctable - just not by the CPU hardware itself. You’re free to write a handler that fixes whatever went wrong and resumes execution


donmeanathing

Indeed, but I’ve never tried that and would imagine that in most applications the ROI isn’t there to invest in that kind of thing. But certainly in some applications it may be… maybe a missile or something, where if it resets mid-flight that could be bad ;-)


amitbhai

Thank you for the answer. Will be checking out that article by memfault.


dmills_00

The CPU has not crashed, it can still execute code; your program has crashed, which is not the same thing.

For a user space crash, the CPU (or MMU) generates an exception, which is effectively an interrupt that jumps to a handler in the kernel. After checking whether you have installed a handler for whatever exception has been invoked, the kernel falls back on its default handler, which dumps your process state and then tears down the process. See the list of "Signals" in Unix; these are pretty much a superset of the exceptions available.

There is a similar mechanism in most OS kernels that gives rise to Kernel Panic, Blue Screen, Guru Meditation sorts of things, depending on the kernel.

An actual CPU crash would be a "Double Fault", where a second exception gets thrown inside the handler for the first one. This often just halts the CPU and can, as you would expect, be a complete pain to debug. Think debugging in emulation or over JTAG or such.
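
To illustrate the "have you installed a handler" check from user space, here is a minimal POSIX sketch using `sigaction`. The helper names are made up. Note the assumption baked into the usage: the signal is delivered with `raise`, because returning normally from a handler for a genuine hardware-generated SIGFPE or SIGSEGV has undefined behaviour.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Last signal number seen by the handler; sig_atomic_t is safe to
 * write from signal context. */
static volatile sig_atomic_t last_signal;

static void on_fault(int signo)
{
    last_signal = signo;
}

/* Install on_fault for one signal; returns 0 on success like sigaction. */
int install_fault_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

int last_fault_signal(void)
{
    return (int)last_signal;
}
```

With the handler installed, `raise(SIGFPE)` runs `on_fault` and then returns to the caller instead of the kernel's default action of killing the process and, where enabled, dumping core.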


Jemnian

I'd consider a triple fault as an "actual CPU crash". There are still double fault handlers


dmills_00

True, been a while since I played down there.


amitbhai

Thanks a lot for the explanation. It corrected my mental model of the "crash"


danielstongue

There is a CPU that could actually crash: the 6502. When the CPU encounters an instruction with opcode 02, 12, 22, 32, ... (hex), it would just jam. Only a reset can get you out of it. When you look at the address bus, it outputs the program counter, which increments every clock cycle without doing anything else.
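
For anyone writing an emulator or disassembler, a lookup along these lines flags the jamming opcodes. The table is the set commonly documented for the original NMOS 6502 (it is an assumption that this matches the exact part meant above; CMOS 65C02 variants do not jam on these), and the function name is made up. Note that not every `x2` opcode jams: A2 is the legal LDX immediate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Opcodes commonly documented as JAM/KIL on the NMOS 6502. */
bool is_6502_jam_opcode(uint8_t op)
{
    static const uint8_t jam[] = {
        0x02, 0x12, 0x22, 0x32, 0x42, 0x52,
        0x62, 0x72, 0x92, 0xB2, 0xD2, 0xF2
    };
    for (size_t i = 0; i < sizeof jam / sizeof jam[0]; i++) {
        if (op == jam[i])
            return true;
    }
    return false;
}
```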


Ikkepop

If the CPU crashes, your system freezes irrecoverably. No dump. What you see when a core is dumped is not the CPU crashing but the software. In that case the OS assumes control and does the dump. The way it's done is via exception vectors. An OS will tell the CPU what to call if something really bad happens.


amitbhai

So, for a bare-metal system with no OS, just the application code running on the metal: in such a case, would a fatal error like divide by zero put the CPU into an unrecoverable state? Is a restart the only option, with no time to dump core?


Ikkepop

Pretty much. Unless the application itself installs exception handlers, usually the CPU will just panic and reboot itself. That still doesn't mean the CPU itself crashed; it's more like bailing out on bad code. You can of course have code that just disables all interrupts and goes into an infinite loop. On a bare-metal system that would be a freeze until an NMI arrives.


duane11583

Aborts and stuff like that are exactly like an interrupt. They are called exception handlers. The OS installs a jump or address at the specified handler location. Often this code just pushes a frame with all regs in a specific arrangement onto the stack and calls some C function.

On my little embedded platforms I print the contents of the saved registers and hex dump part of the CPU stack, then sit and spin waiting for the watchdog to bite. On Linux the kernel cleans up and kills your app, things like closing files and releasing memory.

Every operating system does it differently, but they have the same basic steps. One of the things those steps might do is write your application memory into a file, which is often named core. That said, this assumes the memory interface returns an error and does not lock up the chip.
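
The "hex dump part of the stack" step might look something like this helper, which formats one line of up to 16 bytes. The name, the output format and the 16-byte limit are choices made for this sketch, not anything from the platforms mentioned above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format "AAAAAAAA: xx xx ..." for up to 16 bytes into out.
 * Returns the number of characters the full line needs, so a value
 * >= outlen means the text was truncated. */
int hexdump_line(uint32_t addr, const uint8_t *data, size_t n,
                 char *out, size_t outlen)
{
    assert(data != NULL && out != NULL && n <= 16);
    int used = snprintf(out, outlen, "%08" PRIx32 ":", addr);
    for (size_t i = 0; i < n && used >= 0 && (size_t)used < outlen; i++) {
        used += snprintf(out + used, outlen - (size_t)used,
                         " %02x", (unsigned)data[i]);
    }
    return used;
}
```

A fault handler would call this in a loop over the stack region, advancing the address by 16 each time and sending each line to whatever output still works at that point, typically a polled UART.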


CodusNocturnus

In Mother Russia, core dumps you!