NVIDIA and Oracle are joining forces with the U.S. Department of Energy (DOE) to build the nation’s largest AI supercomputer.
The Solstice System
The Solstice system will feature a record-breaking 100,000 NVIDIA Blackwell GPUs, while a companion machine, Equinox, will pack 10,000 GPUs. Both will be located at Argonne National Laboratory and interconnected through NVIDIA’s high-speed networking, delivering a combined 2,200 exaflops of AI performance and making them the most powerful AI infrastructure ever developed for the DOE.
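As a rough sanity check on these figures, the combined number can be divided across the GPUs of both machines. The sketch below assumes the performance is spread evenly over every GPU, which the announcement does not state, and it says nothing about the numeric precision or sparsity behind the quoted exaflops.

```python
# Back-of-the-envelope check using only the figures quoted above.
# Assumption: performance is spread evenly across every GPU in both systems.
SOLSTICE_GPUS = 100_000
EQUINOX_GPUS = 10_000
COMBINED_EXAFLOPS = 2_200

total_gpus = SOLSTICE_GPUS + EQUINOX_GPUS
# 1 exaflop = 1,000 petaflops
petaflops_per_gpu = COMBINED_EXAFLOPS * 1_000 / total_gpus

print(f"{total_gpus:,} GPUs, about {petaflops_per_gpu:.0f} petaflops each")
```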
The new supercomputers will enable researchers to train frontier AI and reasoning models using the NVIDIA Megatron-Core library and TensorRT™ inference software, creating what the company calls “agentic AI workflows” for open science.
“The Equinox and Solstice systems are designed to accelerate a broad set of scientific AI workflows,” said Paul K. Kearns, director of Argonne National Laboratory.
energy.gov/articles/energy-d…
Quantum Processors Powered by GPUs: NVQLink™
Developed with input from leading DOE laboratories, including Brookhaven, Los Alamos, Berkeley, and Oak Ridge, NVQLink™ is a new open system architecture that connects GPU supercomputers with quantum processors to build accelerated quantum supercomputers for next-generation hybrid computing.
nvidia.com/en-us/solutions/q…
NVQLink™ enables ultra-fast, low-latency data exchange between GPUs and quantum units. The technology is being adopted by 17 quantum hardware builders and 9 U.S. national labs, marking the next step toward scalable quantum-GPU hybrid systems.
nvidianews.nvidia.com/news/n…
interestingengineering.com/e…
Megatron-Core Library
Megatron-Core is an open-source PyTorch-based library that contains GPU-optimized techniques and cutting-edge system-level optimizations. It abstracts them into composable, modular APIs, giving developers and model researchers full flexibility to train custom transformers at scale on NVIDIA accelerated computing infrastructure.
The library is compatible with all NVIDIA Tensor Core GPUs and includes FP8 acceleration support for the NVIDIA Hopper architecture.
docs.nvidia.com/megatron-cor…
Multi-Storage Client (MSC) Integration
The Multi-Storage Client (MSC) provides a unified interface for reading datasets from, and storing checkpoints to, both filesystems (e.g., local disk, NFS, Lustre) and object storage providers such as S3, GCS, OCI, Azure, AIStore, and SwiftStack.
nvidia.github.io/multi-stora…
The base client supports POSIX file systems by default. Optional extras, one per storage service, install the package dependencies needed for the corresponding storage provider.
MSC uses a YAML configuration file to define how it connects to object storage systems. This design allows you to specify one or more storage profiles, each representing a different storage backend or bucket. To tell MSC where to find this file, set the MSC_CONFIG environment variable to the file's path before running your Megatron-LM script.
MSC uses a custom URL scheme to identify and access files across different object storage providers.
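To make the URL scheme concrete, here is a minimal sketch that splits such a URL into its parts using only the Python standard library. It assumes the form `msc://<profile-name>/<path/to/object>`, which is how the MSC documentation describes it as far as I can tell. The helper `split_msc_url` and the profile name `my-profile` are hypothetical, introduced only for illustration, and are not part of MSC itself.

```python
from urllib.parse import urlsplit


def split_msc_url(url: str) -> tuple[str, str]:
    """Split an MSC-style URL into (profile name, object path).

    Assumes the form msc://<profile-name>/<path/to/object>.
    This is an illustrative helper, not an MSC API.
    """
    parts = urlsplit(url)
    if parts.scheme != "msc" or not parts.netloc:
        raise ValueError(f"not an MSC URL: {url!r}")
    # The profile sits where a hostname normally would; the rest is the path.
    return parts.netloc, parts.path.lstrip("/")


profile, path = split_msc_url("msc://my-profile/datasets/train_data")
print(profile, path)
```

The profile name selects one of the storage profiles from the YAML configuration file, so the same path can be pointed at a different backend by changing only the profile.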
To train with datasets stored in object storage, use an MSC URL with the --data-path argument. In addition, Megatron-LM requires the --object-storage-cache-path argument when reading from object storage.
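The pieces above can be sketched together as a small launcher that prepares the environment and command line without running anything. The two flags come from the text above. The `MSC_CONFIG` variable name follows the MSC documentation as I understand it, and every path, the profile name, and the script name are placeholders for illustration, so check them against your own setup.

```python
import os
import shlex

# Hypothetical values, chosen only for illustration.
MSC_CONFIG_FILE = "/path/to/msc_config.yaml"
DATA_URL = "msc://my-profile/datasets/train_data"
CACHE_DIR = "/path/to/object_store_cache"


def build_launch(script: str) -> tuple[dict[str, str], list[str]]:
    """Return the environment and argv for a Megatron-LM training run."""
    # Copy the current environment and point MSC at its YAML config
    # (variable name assumed from the MSC documentation).
    env = dict(os.environ, MSC_CONFIG=MSC_CONFIG_FILE)
    argv = [
        "python", script,
        "--data-path", DATA_URL,                     # dataset in object storage
        "--object-storage-cache-path", CACHE_DIR,    # required for object storage reads
    ]
    return env, argv


env, argv = build_launch("pretrain_gpt.py")
print(shlex.join(argv))
```

Passing `env` and `argv` to `subprocess.run(argv, env=env)` would start the job. A real run would also need the model and training arguments that are omitted here.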
docs.nvidia.com/megatron-cor…
NVIDIA Hopper Architecture
nvidia.com/en-us/data-center…
PyTorch
PyTorch is a GPU-accelerated tensor computational framework. Functionality can be extended with common Python libraries such as NumPy and SciPy. Automatic differentiation is done with a tape-based system at the functional and neural network layer levels.
Numpy.org
scipy.org/
pytorch.org/
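To illustrate what "tape-based" automatic differentiation means, here is a toy sketch in plain Python. It is not PyTorch's implementation, and the `Tape` and `Var` classes are hypothetical names invented for this example. Each operation appends a backward step to a tape during the forward pass, and gradients are obtained by replaying the tape in reverse.

```python
class Tape:
    """Records one backward step per operation, in forward order."""

    def __init__(self):
        self.ops = []


class Var:
    """A scalar value that logs its operations on a shared tape."""

    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)

        def step():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        self.tape.ops.append(step)
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)

        def step():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad

        self.tape.ops.append(step)
        return out


def backward(output):
    """Seed the output gradient, then replay the tape in reverse."""
    output.grad = 1.0
    for step in reversed(output.tape.ops):
        step()


tape = Tape()
x, y = Var(3.0, tape), Var(4.0, tape)
z = x * y + x  # dz/dx = y + 1, dz/dy = x
backward(z)
print(x.grad, y.grad)  # 5.0 3.0
```

Because the tape is rebuilt on every forward pass, the recorded computation can change from one call to the next, which is what allows ordinary Python control flow inside a model.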
PyTorch Project
github.com/pytorch/pytorch
TensorRT™
TensorRT™ is a software development kit (SDK) for high-performance deep learning inference.
nvidia.github.io/Torch-Tenso…
Torch-TensorRT™
Torch-TensorRT™ operates as a PyTorch extension and compiles modules that integrate seamlessly into the JIT runtime.
nvidia.github.io/Torch-Tenso…