The Quest for Bugs:
(Approximating Silicon with FPGA Prototyping)
Bryan Dickman, Valytic Consulting Ltd.,
Joe Convey, Acuerdo Ltd.
Verification is a resource-limited ‘quest’ to find as many bugs as possible before shipping. It's a long, difficult, costly search, constrained by cost, time and quality. For a multi-billion-gate ASIC,
the search space is, like a search of space itself, practically infinite.
In this article we talk about the quest for bugs at the system-level, which it turns out is even more difficult than finding another Earth-like planet humanity can survive on!
FPGA prototyping systems make ‘deep cycles’ possible: deep RTL bug searches, scaled up to sub-system or full-chip level, running real software payloads with realistic I/O stimulus. In fact, with scale-out, you can approximate the throughput of real silicon devices.
Terms of reference
Are we on the same page?
Modern FPGA prototyping systems are effective and performant platforms for software development (in advance of available hardware reference boards), system-level validation (demonstrating that the hardware with the target firmware/software delivers the required capabilities), and system-level verification (bug searching/hunting from a system context). Hardware and software bugs can be found in all use-cases. Remember…
Sadly, bugs do remain undetected
(even after rigorous searches such as simulation and formal)
Why do hardware bugs escape earlier verification levels such as simulation and formal, and what types of bugs can be found when using an FPGA prototyping platform as your verification environment? Hardware and software cannot be developed in isolation. How are you going to ‘validate’ the capabilities of hardware and software combined before committing to silicon? How are you going to validate the ‘target’ system in terms of real software and realistic hardware I/O?
What are the key capabilities that you will need from your FPGA prototyping environment to be successful? Deployed silicon will achieve exacycles (1e18) of usage over millions of devices. What are both the achievable and the aspirational targets for ‘deep cycles’ (at least a quadrillion cycles!) of pre-silicon verification, and how do you scale-up to meet these goals?
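To put the gap in perspective, here is a back-of-envelope sketch of fleet cycles versus a deep-cycle verification goal. The fleet size and per-device clock are illustrative assumptions, not figures from this article:

```python
# Sketch: aggregate cycles run by a deployed fleet in one day, versus a
# 'deep cycles' pre-silicon goal. Fleet size and clock are assumptions.

FLEET = 1_000_000          # hypothetical number of deployed devices
CLOCK_HZ = 2e9             # hypothetical per-device clock (2 GHz)
SECONDS_PER_DAY = 86_400

fleet_cycles_per_day = FLEET * CLOCK_HZ * SECONDS_PER_DAY
deep_cycle_goal = 1e15     # 'a quadrillion cycles' of pre-silicon testing

print(f"fleet cycles/day : {fleet_cycles_per_day:.2e}")  # ~1.73e+20
print(f"deep-cycle goal  : {deep_cycle_goal:.0e}")
```

Even an aspirational quadrillion-cycle verification campaign is a tiny fraction of one day of fleet usage, which is why the goal is a convincing road-test rather than lifetime coverage.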
Hitting the buffers with simulation
You’ve probably met all your coverage goals with simulation. Does this mean all of the bugs have been found? We know you wouldn’t bet your house on it! Consider this for a moment: if you are developing an ASIC or an IP core that will likely run at a target clock speed in the GHz range, consider how many cycles it will run in daily usage, and then try to figure out the total number of cycles that you have simulated. You might be surprised (or depressed!).
Between 1e9 and 1e11 cumulative scaled-out simulation cycles
is equivalent to only a few seconds of silicon activity
And that’s assuming you have the servers and licenses. You probably have not yet run the target firmware or software, or been able to simulate realistic I/O traffic. Are you confident to go to silicon on that basis? Don’t be fooled by great coverage results. As previously mentioned in The Origin of Bugs,
Covered != Verified
How many silicon respins can you afford to address critical design bugs?
If one thinks about actual devices operating at 2-3GHz or greater, and being deployed in the millions, those devices in aggregate are running exacycles (1e18) and greater every day. Now, it’s not the job of verification to simulate the entire lifetime of all devices but,
you need a convincing road-test.
Given that, what is the minimum cycle depth we would want to achieve if we want to sleep at night? Companies simply can’t afford to do this with simulation. Emulation goes a long way towards it, but FPGA takes performance and scalability up another level. The challenge is how to validate designs to this depth, pre-silicon.
The only viable solution is scaled-out FPGA prototyping
For a modest 1GHz device, consider how many simulation hours are needed to achieve just one hour of target speed equivalent cycles:
Total simulation time (@100Hz) = 3600 × 1e9 / 100
= 36 × 1e9 seconds (10M hours)
(or 10K elapsed hours if you can run 1,000 simulations in parallel)
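As a quick sanity check, the arithmetic above can be sketched in a few lines, using the illustrative clock and simulation speeds from the worked example:

```python
# Sketch: hours needed to reproduce one hour of 1 GHz silicon activity
# at a 100 Hz simulation speed. Figures match the worked example above.

TARGET_HZ = 1e9                      # target silicon clock
SIM_HZ = 100                         # single-simulator speed
PARALLEL_SIMS = 1000                 # farm size (illustrative)

cycles = 3600 * TARGET_HZ            # one hour of silicon cycles
sim_seconds = cycles / SIM_HZ
sim_hours = sim_seconds / 3600             # 10,000,000 CPU-hours
elapsed_hours = sim_hours / PARALLEL_SIMS  # 10,000 elapsed hours

print(f"{sim_hours:.0f} CPU-hours, {elapsed_hours:.0f} elapsed hours")
```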
The latest FPGA prototyping technologies can achieve speeds from tens of MHz to hundreds of MHz depending on the design parameters. Even here, engineering will need to scale-out the FPGA prototyping capacity to achieve meaningful deep cycle targets.
Approximating to Silicon
As an example, for an FPGA clock speed of say 50MHz, 20 prototype instances will give an equivalent of 1GHz throughput capability. In other words…
20 prototype instances get us a piece of silicon to play with!
Scale-out further, and it’s possible to approximate to the throughput of multiple pieces of silicon. So, it’s feasible to achieve deep cycles of pre-silicon verification in a practical timeframe. Of course, the level of scale-out needed depends on achievable FPGA clock speed, the target silicon clock speed, and the depth of testing being targeted.
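A minimal sketch of the scale-out arithmetic (the 50MHz and 1GHz figures come from the example above; the helper function is ours, for illustration only):

```python
import math

# Sketch: how many FPGA prototype instances give the aggregate cycle
# throughput of N target devices. Illustrative helper, not a real API.

def instances_for_throughput(target_hz, fpga_hz, devices=1):
    """Prototype instances whose combined cycle rate matches
    `devices` copies of the target silicon."""
    return math.ceil(devices * target_hz / fpga_hz)

print(instances_for_throughput(1e9, 50e6))             # 20
print(instances_for_throughput(1e9, 50e6, devices=3))  # 60
```

A faster FPGA build or a slower silicon target reduces the instance count proportionally, which is why achievable FPGA clock speed matters so much to the economics of scale-out.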
To find most of the bugs
(at least all of the critical ones)
Deep cycles of verification testing alone are not a smart enough goal for our quest.
Hardware acceleration unlocks the ability to run real firmware and software on the RTL.
Booting an operating system and then running some software, or rendering a few frames of video, demands long sequences of execution that are not practical with simulation – you can’t wait that long for the test to complete. Emulation takes a big step closer to this capability, and FPGA prototyping takes everything a stage further.
Finding bugs that escape simulation, formal and emulation
Both FPGA prototyping and emulation enable a software-driven approach to system-level verification. Just booting the OS alone is likely to take several billion cycles.
Booting Android takes around 30B cycles; that’s only about ten minutes for our 50MHz FPGA, versus several thousand hours for a fast 1kHz simulation! Running the target software, or other software payloads such as compliance suites, benchmarks or other test suites, can and does find bugs that will have escaped earlier stages of verification. You might not find a huge volume of such bugs, but when you find them, you know you have…
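The wall-clock difference is easy to tabulate. The 30B-cycle boot figure is from the text above; the platform speeds are illustrative assumptions:

```python
# Sketch: time to run ~30 billion cycles (an Android boot, per the text)
# at representative platform speeds. Speeds are illustrative assumptions.

BOOT_CYCLES = 30e9

platforms = [("silicon   @ 1 GHz", 1e9),
             ("FPGA      @ 50 MHz", 50e6),
             ("emulator  @ 1 MHz", 1e6),
             ("simulator @ 1 kHz", 1e3)]

for name, hz in platforms:
    seconds = BOOT_CYCLES / hz
    print(f"{name:20s}: {seconds:>12.0f} s ({seconds / 3600:.2f} h)")
```

The simulator row works out to roughly 8,300 hours, which is where the “several thousand hours” claim comes from.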
saved an escape that otherwise would have made it into silicon.
So, the relative value of these bug finds is high. If you then multiply this testing load by the configuration space of both the hardware and the software, you can quickly escalate towards an extremely large volume of testing demand.
When bugs occur, can you detect them?
A really great way to improve failure detection is to leverage the SVA assertions from your simulation and formal environments. When SVAs fire, they pinpoint the failure precisely, which is a huge advantage when you are running huge volumes of testing! You may be able to implement a home-grown flow for adding assertions to your FPGA prototyping environment, or better still, the FPGA prototyping vendor will have already provided a workflow to do this.
Recommendation: Leverage SVA assertions to enhance FPGA-prototyping checking capability.
When bugs occur, can you debug them?
Debug consumes a lot of engineering time, and this is equally true for an FPGA prototyping approach. There have been many studies of where verification engineers spend their time, and most of them show that over 40% of it is spent in debug. Debug is labor intensive, so,
efficient debug is critical.
FPGA prototyping systems necessitate a different approach to debug compared with emulation. You might start the debug process with software debug and trace tools, but from that point onwards you are going to need to debug the hardware, which requires visibility of internal state at the RTL code level. This requires trigger points for events of interest and the ability to extract signal waveforms for debug. It’s an iterative process as the designer zooms in on the suspicious timeframe and exposes enough of the internal hardware state and logic signals to diagnose the bug.
FPGA prototyping systems provide multiple complementary debug strategies to accomplish this. A logic analyzer is required to set up the necessary trigger points to start/stop waveform traces. Debug can be performed at speed, with visibility of a limited subset of chosen signals (probe points), which are then stored to either on-FPGA memory, on-board memory or off-board memory when deeper trace windows are required. This works well, especially when the engineer has a good idea of where to search in the RTL. The limited set of probe points may be more than sufficient to debug most problems.
More complex bugs require full-visibility waveform analysis, demanding a different debug approach where full RTL visibility can be reconstructed.
No, scale-up and scale-out!
Nothing is impossible, but certain things are hard problems to solve. With many ASIC designs trending to multi-billion gates, prototyping systems need to get bigger, and offer platform architectures that
scale-up to handle bigger systems and sub-systems,
by plugging multiple boards and systems together while minimizing the impact on achievable DUT clock speeds.
You may have started with a lab or desk-based setup, but quickly conclude that you need to deploy an FPGA prototyping cluster (or farm). Deep cycles can be achieved if you
scale-out to approximate to silicon levels of throughput.
Use real-world traffic on a system realistic platform
Advanced FPGA prototyping solutions offer strong build and compile workflows, debug and visibility solutions, and high degrees of flexibility to configure the system to the design needs and cope with devices of any size. Plug-in hardware such as daughter cards allow users to quickly configure the environment to be system-realistic and connect the DUT to real I/O with real traffic profiles. This invariably involves some stepping down of native I/O speeds to the achieved FPGA clock speeds, and it means that the prototyping system needs to be able to handle many asynchronous clocks.
Good automation of the flow (e.g., auto-partitioning) will get you to running your first cycles within a few days. You can then optimize to get the fastest possible throughputs needed to drive towards your approximated-silicon testing goals.
How many cycles to run, is itself, a dilemma; a ‘deep cycles’ dilemma!
In practice this is determined by budget and the capacity available, but the fact is that the risk of getting it badly wrong is increasing rapidly. Billion-gate designs that integrate a growing array of IP components are required in our modern world to perform ever more complex tasks for applications such as 5G, AI, ML, genomics, Big Data…the list goes on…
With scaled-out FPGA prototyping you can approximate to the throughput of the final silicon device. But how effective is this software-driven verification approach?
Bug data will be the best metric to measure effectiveness and ultimately the ROI of the investment. Be sure to document and analyze bugs found from FPGA prototyping and assess what the total impact cost would have been had those bugs escaped to be discovered later in costly silicon.
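One way to frame that assessment is a simple avoided-cost ROI sketch. Every number below is a hypothetical placeholder, not data from this article:

```python
# Sketch: ROI of an FPGA prototyping program framed as respin cost
# avoided. All figures are hypothetical placeholders.

program_cost = 2_000_000    # hypothetical: hardware + engineering
respin_cost = 5_000_000     # hypothetical: one full-mask respin
respins_avoided = 1         # critical escapes caught pre-silicon

savings = respins_avoided * respin_cost
roi = (savings - program_cost) / program_cost

print(f"ROI: {roi:.0%}")   # 150%
```

In a real analysis the savings side would also include avoided schedule slip, field-mitigation costs and reputational damage, all of which push the ROI higher.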
Even if no bugs were found, which is a highly unlikely scenario, there is a value calculation for the assurance that this level of extreme deep testing brings.
Think of it as the “help me sleep at night” verification metric
A true-negative result is just as valuable as a true-positive when it comes to confidence levels and the true value of the final product.
Innovation in technology always creates bigger, better, faster ways of completing the verification quest successfully.
Balance the investment between simulation, formal, emulation and FPGA prototyping in a way that reflects the complexity and size of the design and, critically, the nature of the risk should bugs escape. Justifying investment in the quest for bugs is predicated on many factors: the risk of reputational damage and huge rework/mitigation costs can build a strong ROI for investment in FPGA prototyping at scale.