Addressing hardware failures and silent data corruption in AI chips


Meta trained one of its AI models, called Llama 3, in 2024 and published the results in a widely covered paper. During a 54-day period of pre-training, Llama 3 experienced 466 job interruptions, 419 of which were unexpected. Upon further investigation, Meta found that 78% of those unexpected interruptions were caused by hardware issues such as GPU and host component failures.

Hardware issues like these don’t just cause job interruptions. They can also lead to silent data corruption (SDC), causing unwanted data loss or inaccuracies that often go undetected for extended periods.

While Meta’s pre-training interruptions were unexpected, they shouldn’t be entirely surprising. AI models like Llama 3 have massive processing demands that require colossal computing clusters. For training alone, AI workloads can require hundreds of thousands of nodes and associated GPUs working in unison for weeks or months at a time.

The intensity and scale of AI processing and switching create a tremendous amount of heat, voltage fluctuations and noise, all of which place unprecedented stress on computational hardware. The GPUs and underlying silicon can degrade more rapidly than they would under normal (or what used to be normal) conditions. Performance and reliability wane accordingly.

This is especially true for sub-5 nm process technologies, where silicon degradation and faulty behavior are observed both at manufacturing and in the field.

But what can be done about it? How can unanticipated interruptions and SDC be mitigated? And how can chip design teams ensure optimal performance and reliability as the industry pushes forward with newer, bigger AI workloads that demand even more processing capacity and scale?

Ensuring silicon reliability, availability and serviceability (RAS)

Certain AI players like Meta have established monitoring and diagnostics capabilities to improve the availability and reliability of their computing environments. But with processing demands, hardware failures and SDC issues on the rise, there is a distinct need for test and telemetry capabilities at deeper levels, all the way down to the silicon and multi-die packages within each XPU/GPU as well as the interconnects that bring them together.

The key is silicon lifecycle management (SLM) solutions that help ensure end-to-end RAS, from design and manufacturing to bring-up and in-field operation.

With better visibility, monitoring, and diagnostics at the silicon level, design teams can:

  • Gain telemetry-based insights into why chips are failing or why SDC is occurring.
  • Identify voltage or timing degradation, overheating, and mechanical failures in silicon components, multi-die packages, and high-speed interconnects.
  • Conduct more precise thermal and power characterization for AI workloads.
  • Detect, characterize, and resolve radiation, voltage noise, and mechanism failures that can lead to undetected bit flips and SDC.
  • Improve silicon yield, quality, and in-field RAS.
  • Implement reliability-focused techniques (like triple modular redundancy and dual-core lockstep) during the register-transfer level (RTL) design phase to mitigate SDC.
  • Establish an accurate pre-silicon aging simulation methodology to detect sensitive or vulnerable circuits and replace them with aging-resilient circuits.
  • Improve outlier detection on reliability models, which helps minimize in-field SDC.
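
The redundancy techniques named in the list above can be illustrated in software, even though in practice they are implemented in RTL. The following is a minimal Python sketch, not a hardware design: the function names `majority_vote` and `lockstep_mismatch` and the sample values are assumptions made for illustration. It shows the core difference between the two approaches: triple modular redundancy masks a single faulty copy, while dual-core lockstep only detects a disagreement.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of the same word (TMR)."""
    # Each output bit takes the value held by at least two of the three copies.
    return (a & b) | (a & c) | (b & c)


def lockstep_mismatch(primary: int, shadow: int) -> bool:
    """Dual-core lockstep detects but cannot correct: True means the cores disagree."""
    return primary != shadow


# Hypothetical example: one of three copies suffers a single-bit upset.
golden = 0b1011_0010
flipped = golden ^ (1 << 5)

voted = majority_vote(golden, flipped, golden)
print("TMR output matches golden value:", voted == golden)
print("Lockstep comparator raised mismatch:", lockstep_mismatch(golden, flipped))
```

With three copies the voter can return the correct word without software involvement, at roughly three times the logic cost. Lockstep costs about twice the logic and needs a recovery path (for example, a retry or reset) once the mismatch is flagged.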

Silicon lifecycle management (SLM) solutions help ensure end-to-end reliability, availability, and serviceability. Source: Synopsys

An SLM design example

SLM IP and analytics solutions help improve silicon health and provide operational metrics at each phase of the system lifecycle. This includes environmental monitoring for understanding and optimizing silicon performance based on the operating environment of the system; structural monitoring to identify performance variations from design to in-field operation; and functional monitoring to track the health and anomalies of critical system functions.

Below are the key features and capabilities that SLM IP provides:

  1. Process, voltage and temperature monitors
  • Help ensure optimal operation while maximizing performance, power, and reliability.
  • Highly accurate and distributed monitoring throughout the die, enabling thermal management via frequency throttling.
  2. Path margin monitors
  • Measure timing margin of 1000+ synthetic and functional paths (in-test and in-field).
  • Enable silicon performance optimization based on actual margins.
  • Automated path selection, IP insertion, and scan generation.
  3. Clock and delay monitors
  • Measure the delay between the edges of multiple signals.
  • Check the quality of the clock duty cycle.
  • Monitor memory read access time with built-in self-test (BIST).
  • Characterize digital delay lines.
  4. UCIe monitor, test and repair
  • Monitor signal integrity of die-to-die UCIe lane(s).
  • Generate algorithmic BIST patterns to detect interconnect fault types, including lane-to-lane crosstalk.
  • Perform cumulative lane repair with redundancy allocation (at manufacturing and in-field).
  5. High-speed access and test
  • Enable testing over functional interfaces (PCIe, USB and SPI).
  • For in-field operation as well as wafer sort, final test, and system-level test.
  • Can be used in conjunction with automated test equipment (ATE).
  • Help conduct in-field remote diagnoses and lower-cost test via reduced pin count.
  6. HBM external test and repair
  • Comprehensive, silicon-proven DRAM stack test, repair and diagnostics engine.
  • Support third-party HBM DRAM stack providers.
  • Provide high-performance die-to-die interconnect test and repair support.
  • Operate in conjunction with HBM PHY and support a range of HBM protocols and configurations.
  7. SLM hierarchical subsystem
  • Automated hierarchical SLM and test manageability solution for system-on-chips (SoCs).
  • Automated integration and access of all IP/cores with in-system scheduling.
  • Pre-validated, ready ATE patterns with pattern porting.
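
To make the first item in the list above more concrete, here is a minimal Python sketch of how distributed temperature readings could drive frequency throttling. It is an illustration under stated assumptions, not a description of any Synopsys IP interface: the `ThrottlePolicy` class, the `next_frequency` function, and every threshold and frequency value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThrottlePolicy:
    # All values are illustrative assumptions, not vendor numbers.
    hot_c: float = 95.0   # throttle when any sensor reaches this temperature
    cool_c: float = 85.0  # step back up once every sensor is below this
    step_mhz: int = 100
    min_mhz: int = 800
    max_mhz: int = 2000


def next_frequency(freq_mhz: int, sensor_temps_c: list[float],
                   policy: ThrottlePolicy = ThrottlePolicy()) -> int:
    """Pick the next clock frequency from the hottest on-die sensor reading."""
    hottest = max(sensor_temps_c)
    if hottest >= policy.hot_c:
        # Too hot somewhere on the die: step the clock down, but not below the floor.
        return max(policy.min_mhz, freq_mhz - policy.step_mhz)
    if hottest < policy.cool_c:
        # Every sensor is comfortably cool: recover performance, up to the ceiling.
        return min(policy.max_mhz, freq_mhz + policy.step_mhz)
    # Between the two thresholds: hold, so the clock does not oscillate.
    return freq_mhz


# Hypothetical readings from three sensors spread across the die.
print(next_frequency(2000, [70.0, 96.5, 80.0]))
```

Using the hottest sensor rather than an average reflects why distributed monitoring matters: a single hotspot can degrade locally while the rest of the die looks healthy. The gap between the two thresholds is a simple hysteresis band.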

Silicon test and telemetry in the age of AI

With the scale and processing demands of AI devices and workloads on the rise, system reliability, silicon health and SDC issues are becoming more common. While there is no single solution or antidote for avoiding these issues, deeper and more comprehensive test, repair, and telemetry at the silicon level can help mitigate them. The ability to detect or predict in-field chip degradation is particularly valuable, enabling corrective action before sudden or catastrophic system failures occur.
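
As a rough illustration of predicting in-field degradation, the Python sketch below fits a straight line to periodic timing-margin readings and estimates how long until the margin falls to a guard-band threshold. The linear aging model, the function names, and all sample numbers are assumptions made for illustration. Real aging behavior is typically nonlinear and would call for a more careful model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_threshold(hours, margins_ps, threshold_ps):
    """Estimate remaining hours before margin reaches the threshold.

    Returns None when the fitted trend is flat or improving.
    """
    slope, intercept = fit_line(hours, margins_ps)
    if slope >= 0:
        return None
    crossing_hour = (threshold_ps - intercept) / slope
    # Clamp at zero in case the trend says the threshold was already crossed.
    return max(0.0, crossing_hour - hours[-1])


# Hypothetical telemetry: timing margin (picoseconds) sampled every 1,000 hours.
hours = [0, 1000, 2000, 3000]
margins_ps = [50.0, 48.0, 46.0, 44.0]
remaining = hours_until_threshold(hours, margins_ps, threshold_ps=30.0)
print(f"Estimated hours until guard band is reached: {remaining:.0f}")
```

An estimate like this could feed a maintenance policy, for example draining workloads from a node before its margin runs out rather than waiting for a failure.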

Delivering end-to-end visibility through RAS, silicon test, repair, and telemetry will be increasingly important as we move toward the age of AI.

Shankar Krishnamoorthy is chief product development officer at Synopsys.

Krishna Adusumalli is an R&D engineer at Synopsys.

Jyotika Athavale is architecture engineering director at Synopsys.

Yervant Zorian is chief architect at Synopsys.


The post Addressing hardware failures and silent data corruption in AI chips appeared first on EDN.
