#Tesla's patent WO2024072934A1 discloses a performance monitoring system for #Dojo, its AI training supercomputer built on an array of systems on a wafer (SoWs).
By identifying problems in individual dies and acting on them, this technology could improve Dojo's reliability and limit performance degradation.
1️⃣ Problems to be Solved:
- Difficulty in real-time monitoring of individual dies within a high-density computing system
- Performance degradation of an entire SoW when only some of its dies have issues
- Inefficient utilization of computing resources within the system
2️⃣ Key Technologies and Their Effects:
- Telemetry data collection from individual dies:
: Each die in the SoW array generates telemetry data (e.g., temperature, voltage, current, usage, bandwidth, latency)
: Microcontrollers receive telemetry data from one or more dies
: Controller obtains and processes telemetry data from microcontrollers
- Real-time monitoring and performance optimization:
: Controller identifies each die and associates telemetry data with specific dies
: Performance metrics are determined for individual dies based on the processed telemetry data
: Corrective actions (e.g., throttling, deactivating, reinitializing, power control) are applied when performance metrics satisfy thresholds
- Graphical representation of performance metrics:
: Controller generates graphical representations of performance metrics at various resolutions (e.g., SoW level, die level)
: Visualization aids in debugging and performance enhancement
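The flow above (dies to microcontrollers to controller, then metric, threshold and corrective action) can be sketched roughly as follows. This is a minimal illustration, not Tesla's implementation: the class names, the choice of mean temperature as the performance metric, the 95 °C limit and the "throttle" action are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One telemetry reading from one die (hypothetical structure)."""
    sow_id: int
    die_id: int
    temperature_c: float


class Microcontroller:
    """Buffers telemetry received from one or more dies."""

    def __init__(self):
        self._buffer = []

    def receive(self, sample):
        self._buffer.append(sample)

    def drain(self):
        # Hand over everything buffered so far and start a fresh buffer.
        samples, self._buffer = self._buffer, []
        return samples


class Controller:
    """Obtains telemetry from microcontrollers and evaluates each die."""

    def __init__(self, microcontrollers, temp_limit_c=95.0):
        self.microcontrollers = microcontrollers
        self.temp_limit_c = temp_limit_c  # assumed threshold

    def collect(self):
        # Associate every reading with the specific die that produced it.
        by_die = {}
        for mcu in self.microcontrollers:
            for s in mcu.drain():
                by_die.setdefault((s.sow_id, s.die_id), []).append(s.temperature_c)
        return by_die

    def evaluate(self, by_die):
        # Metric (assumed): mean temperature per die.
        # A die whose metric satisfies the threshold gets a corrective action.
        actions = {}
        for die, temps in by_die.items():
            if mean(temps) >= self.temp_limit_c:
                actions[die] = "throttle"
        return actions


mcu = Microcontroller()
for t in (90.0, 98.0, 100.0):
    mcu.receive(Sample(sow_id=0, die_id=3, temperature_c=t))
mcu.receive(Sample(sow_id=0, die_id=4, temperature_c=60.0))

controller = Controller([mcu])
actions = controller.evaluate(controller.collect())
print(actions)  # {(0, 3): 'throttle'}
```

Keying everything by (SoW, die) keeps each decision local to one die, so a single hot die can be throttled without touching its neighbors.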
3️⃣ Key Figures:
- Fig. 2: Example computing system with an array of SoWs and an electronic module array
- Fig. 3A, 3B: SoW with an array of dies, each die containing compute nodes and global nodes for telemetry data generation
- Fig. 4: Interactions between the controller and SoWs/dies for telemetry data collection and processing
- Fig. 6A, 6B: Graphical representations of processed telemetry data at SoW and die levels
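To make the two resolutions in Fig. 6A/6B concrete, here is a small sketch that rolls per-die metrics up to one value per SoW and prints a die-level grid for a single SoW. The patent summary does not say how values are aggregated or drawn, so the worst-case (max) roll-up, the row-major die numbering, the text grid and the function names are all assumptions for illustration.

```python
def sow_level(metrics):
    """Roll per-die metrics up to one value per SoW (assumed: worst case)."""
    worst = {}
    for (sow_id, _die_id), value in metrics.items():
        worst[sow_id] = max(value, worst.get(sow_id, value))
    return worst


def render_grid(metrics, sow_id, rows, cols, limit):
    """Die-level view of one SoW: 'X' at or over the limit, '.' under, '?' no data."""
    lines = []
    for r in range(rows):
        cells = []
        for c in range(cols):
            # Assumed layout: dies numbered row by row, starting at 0.
            value = metrics.get((sow_id, r * cols + c))
            if value is None:
                cells.append("?")
            elif value >= limit:
                cells.append("X")
            else:
                cells.append(".")
        lines.append(" ".join(cells))
    return "\n".join(lines)


# Hypothetical per-die metrics keyed by (sow_id, die_id).
metrics = {(0, 0): 61.0, (0, 1): 97.5, (0, 2): 58.0, (0, 3): 63.0, (1, 0): 55.0}

print(sow_level(metrics))  # {0: 97.5, 1: 55.0}
print(render_grid(metrics, sow_id=0, rows=2, cols=2, limit=95.0))
# . X
# . .
```

The SoW-level view shows which wafer needs attention, and the die-level grid then pinpoints the die responsible.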
4️⃣ Claim 1:
A computing system comprising:
- an array of dies included on a system on a wafer (SoW), wherein the dies of the array are configured to output telemetry data;
- a microcontroller configured to receive telemetry data associated with at least one die of the array of dies; and
- a controller configured to obtain data that comprises the telemetry data from the microcontroller, determine a performance metric of a particular die of the array of dies by processing the obtained data, and apply a corrective action in response to determining that the performance metric satisfies a threshold.
💡 Patent: WO2024072934A1
- Title: Method and Apparatus for Telemetry of System on a Wafer
- Applicant: Tesla, Inc.
- Link:
patents.google.com/patent/WO…
$TSLA #SoW #SystemOnWafer #Dojo #TelemetryData #PerformanceMonitoring