$NVDA Understanding the hierarchy of GPU service offerings is critical because profit pools, competitive dynamics, and risk profiles differ sharply at each layer of the stack. At the top, fully managed AI applications and model APIs capture value through software, data, and distribution, with GPUs as a largely invisible input. Margins are high, pricing power is strongest, and sensitivity to raw GPU-hour pricing is indirect. At the bottom, bare-metal GPU services, GPU-accelerated VMs, and colocation sit closest to Nvidia hardware economics. These businesses are capital intensive, highly exposed to GPU ASPs and utilization, and face intense price competition as more capacity comes online.
A clear taxonomy makes it possible to map each company’s true position in this stack, separate narrative from reality, and identify where economic rents are likely to accrue.
•Fully managed AI applications (SaaS) that hide GPUs entirely
•Managed foundation models and inference APIs (model-as-a-service)
•Managed training / fine-tuning platforms (“train a model, no infra”)
•Full ML / MLOps platforms with integrated GPU orchestration (PaaS)
•Managed GPU clusters (Kubernetes/Ray/Slurm as-a-service)
•GPU-accelerated VM instances (IaaS VMs with attached GPUs)
•Bare-metal GPU-as-a-service (dedicated GPU servers delivered via API)
•GPU colocation / hosted racks (customer-owned GPUs in provider facilities)
•On-premise self-managed GPU infrastructure (customer buys and runs everything)
The hierarchy above is ordered from the most abstracted, feature-rich, and “application-level” offerings at the top toward the most primitive, infrastructure-level offerings at the bottom. As one moves down the list, the provider delivers fewer higher-level services and the customer assumes more responsibility for software, orchestration, and operations. The underlying GPU product (H100, H200, B200, etc.) can appear at multiple layers; what changes is how much of the stack the provider bundles.
1.Fully managed AI applications (SaaS) that hide GPUs entirely
This category includes end-user products that embed GPUs but present themselves as business applications: copilots inside productivity suites, AI assistants in CRM, code copilots, AI search, AI-native office suites, and vertical AI tools (design, drug discovery, industrial simulation) that expose a pure SaaS interface. The customer never sees GPUs, instances, or clusters; pricing is per seat, per document, per project, or per usage metric (tokens, queries, tasks) but packaged as an application.
From a GPU standpoint, this layer is the furthest from the hardware. Providers hedge GPU capacity commitments against user growth, drive utilization via multi-tenancy, and optimize model architectures and inference stacks internally. Economically, this tier carries the highest margins and the greatest pricing power, since value is anchored in business outcomes rather than compute units. For infrastructure investors, GPU pricing enters only indirectly, through its effect on gross margin and the provider’s ability to subsidize AI features to drive user growth.
2.Managed foundation models and inference APIs (model-as-a-service)
The next layer is model APIs that expose generic or specialized models via HTTP or gRPC endpoints: text LLMs, vision models, multi-modal, embedding APIs, and sometimes domain-specific models. Customers pay per token, per 1k images, or per unit of model usage. The provider manages model hosting, scaling, versioning, A/B testing, safety, and often some monitoring and logging.
At this tier, GPUs are abstracted into a “token” or “inference” unit. Providers still actively manage GPU fleets, but customers think in terms of model latency, throughput, and cost per token. GPU choice (H100 vs L40S vs B200) is a provider decision, often invisible or only exposed as coarse tiers (standard vs “turbo” / “premium”). Economics are more sensitive to GPU pricing than at the SaaS layer but still benefit from aggregation: high utilization, batching, and kernel-level optimization can produce meaningful arbitrage between GPU-hour input costs and token-level output pricing. Competitive pressure in inference APIs is compressing margins, but differentiated models and ecosystem lock-in still offer room for attractive returns.
3.Managed training / fine-tuning platforms (“train a model, no infra”)
This layer offers “bring your data, get a model” services. Customers upload datasets and configuration; the platform orchestrates the entire training pipeline: data preprocessing, sharding, distributed training, checkpointing, evaluation, and sometimes deployment. Examples include managed fine-tuning products, AutoML services, and dedicated training PaaS that hide cluster-level complexity.
Conceptually, this tier is training-as-a-service (TaaS). The platform decides which GPU type, cluster size, parallelism strategy, and fabric to use, possibly across multiple cloud providers. Customers pay per training job, per GPU-hour with automated scaling, or per project. The provider must understand Nvidia’s product roadmaps (A100 vs H100 vs H200 vs B200), topology (SXM vs PCIe, NVLink vs Ethernet/InfiniBand), and optimizer/parallelism choices to minimize training cost while meeting SLAs.
This tier is operationally complex and capital intensive because it must handle long-duration jobs, complex failure modes, and heavy state (checkpoints). It sits closer to raw GPU economics than inference APIs because training jobs often fully occupy clusters for days or weeks. The ability to arbitrage between cloud GPU pricing options, exploit spot capacity, and route workloads to neoclouds or hyperscalers efficiently is a key differentiator. Providers at this tier effectively “compile” customer workloads to underlying GPU markets.
4.Full ML / MLOps platforms with integrated GPU orchestration (PaaS)
These are general-purpose ML platforms (e.g., managed notebooks, pipelines, experiment tracking, feature stores) that tightly integrate GPU scheduling and orchestration. The core product is not “API access to a model” but a managed development and deployment environment. GPUs are a resource managed by the platform’s scheduler, exposed via abstractions such as “GPU pool,” “compute profile,” or “accelerator type.”
Customers typically pay for a mix of platform fees and underlying compute/storage usage. The platform may support multiple workloads: data preprocessing, training, hyperparameter tuning, batch inference, and interactive experimentation. The provider’s responsibilities include user management, security, observability, and lifecycle management, in addition to GPU provisioning. This tier straddles PaaS and IaaS; it commonly runs on top of hyperscaler GPU instances, neocloud GPU clusters, or a hybrid of both.
From a GPU-market standpoint, this layer is where many enterprises “terminate” their abstraction. They want to select GPU families and sizes, but not manage Kubernetes, Slurm, or NCCL tuning. The platform provider’s value comes from masking heterogeneity in GPU hardware and pricing across clouds and exposing a stable developer experience. Economically, margins depend on the degree of lock-in and differentiated workflow tools. GPU cost is a major line item but often passed through with markup.
5.Managed GPU clusters (Kubernetes/Ray/Slurm as-a-service)
This layer exposes GPU clusters more directly but with managed control planes and orchestration frameworks: managed Kubernetes with GPU node pools, managed Ray clusters, Slurm-as-a-service, or specialized orchestration for distributed training. Customers see nodes, pods, and jobs, and they deploy their own containers or training code. The provider operates the cluster, control plane, autoscaling, and sometimes the storage and network fabric.
At this tier, customers are responsible for their own software stack (frameworks, libraries, distributed training logic) and performance tuning but can offload cluster provisioning, lifecycle management, upgrades, and failure handling. GPU choice, topology, and interconnect details may be selectable (e.g., “H100 NVLink InfiniBand cluster”), but the platform still abstracts away low-level host management.
Economically, this layer tends to be lower-margin than model APIs or training PaaS because it is closer to raw infrastructure. Differentiation comes from quality of implementation (latency, NCCL efficiency, topology guarantees), reliability, and ease of integration with customer CI/CD. From a pricing perspective, this is usually billed per GPU-hour plus possible control-plane or support fees, with customers closer to seeing the “true” price of H100/H200/B200.
6.GPU-accelerated VM instances (IaaS VMs with attached GPUs)
Here, the main product is virtual machines with attached GPUs. The provider offers SKUs like “8× H100 80 GB” or “1× L4,” with documented vCPU, memory, and network bandwidth. Customers manage the OS, drivers, frameworks, and application stack. This is the classic hyperscaler model (EC2, Compute Engine, Azure VMs) and is also exposed by many neoclouds.
The abstraction is relatively thin: the provider virtualizes hardware and networking, offers images and basic monitoring, and enforces quotas and security isolation. Customers choose GPU type, count, and region; they manage scaling via autoscaling groups, scripts, or external orchestrators. Pricing is nearly directly indexed to GPU-hour cost, plus a premium for VM overhead and network/storage. This tier is the baseline reference for H100/H200/B200/A100 pricing comparisons and is where price competition is most visible.
From a responsibility perspective, customers are responsible for cluster-wide concerns: placement, data locality, job scheduling, checkpointing, and resilience. The provider’s main levers are price, availability, and performance guarantees. Margins depend heavily on utilization and on the provider’s ability to secure favorable GPU procurement terms.
7.Bare-metal GPU-as-a-service (dedicated GPU servers via API)
Bare-metal GPU-as-a-service offers physical servers with GPUs and no virtualization. Customers receive full control over the node (BIOS-level in some cases), installing their own OS or bringing custom images. Some providers add an API layer for provisioning and basic lifecycle actions (power on/off, PXE boot, image deploy), but there is no hypervisor.
Bare metal is attractive for high-performance workloads, customers wanting custom kernels, or those needing very predictable latency and performance. It also allows providers to avoid hypervisor overhead and simplify some aspects of capacity management. However, customers must handle everything above the bare hardware: OS hardening, clustering, storage, networking topology awareness, and distributed job management. Interconnect (NVLink, InfiniBand vs Ethernet) and node configuration (HGX vs PCIe) are often key differentiators.
Pricing is typically per GPU-hour or per-node-hour and often lower than equivalent VM offerings per unit of GPU, because less virtualization overhead is present and the service is more commoditized. For neoclouds, this tier is capital intensive and exposes them most directly to Nvidia GPU ASPs and utilization risk. Unit economics are sensitive to GPU generation pricing, system capex, and rack-level power and cooling costs.
8.GPU colocation / hosted racks (customer-owned GPUs in provider facilities)
At this level, the customer owns the GPU servers and the provider supplies data-center services: power, cooling, physical security, network cross-connects, and sometimes remote-hands operations. The provider may offer cage space, managed PDUs, and optional services like monitoring and hardware replacement, but the hardware capex and much of the operational burden reside with the customer.
This model is analogous to traditional colocation but focused on GPU-dense racks. It is attractive for entities with large, stable workloads and access to capital, who wish to avoid cloud markups and retain full control over hardware, while leveraging provider expertise in power and cooling. Sovereign deployments and AI majors increasingly use such models for part of their fleet, sometimes operated by colocation specialists or repurposed HPC data centers.
In this arrangement, Nvidia and the hardware OEMs capture most of the hardware margin; the colocation provider earns relatively stable, utility-like returns on space and power. The “service” layer around GPUs is minimal. The economics hinge on long-term contracts and high rack occupancy, not on per-GPU-hour arbitrage.
9.On-premise self-managed GPU infrastructure (customer buys and runs everything)
At the bottom of the hierarchy is fully self-managed infrastructure. The customer purchases GPUs and servers, builds or repurposes facilities, manages power and cooling, implements networking and storage, and runs their own orchestration stacks (Kubernetes, Slurm, Ray, proprietary frameworks). This includes both traditional on-premise data centers and private cloud environments.
Here, GPUs are pure capital assets on the customer balance sheet. There is no external “GPU service provider” per se; the customer is both owner and operator. The upside is maximal control over performance, security, and long-term cost of compute, particularly for predictable, high-volume workloads. The downside is high upfront capex, operational complexity, and exposure to technology obsolescence (e.g., transitions from A100 to H100 to B200 and beyond).
From a market-structure perspective, this tier competes with all higher layers on a lifecycle TCO basis. For large, steady workloads, internal GPU fleets can be cost-advantaged over public cloud or neocloud offerings at current GPU-hour pricing, but they lack elasticity and require strong internal capabilities. For small or spiky workloads, higher-level services are generally more economical.
Taken together, this hierarchy illustrates how the same Nvidia GPU products underpin multiple economic layers. At the top of the stack, GPUs are an invisible input into SaaS and model APIs, where value is captured through application outcomes and proprietary models. In the middle, GPUs are mediated by platforms and managed clusters, where provider differentiation relies on orchestration, tooling, and integration. At the bottom, GPUs are exposed directly via VMs, bare metal, or colocation, where competition tends toward price-per-GPU-hour and utilization management. Investment analysis of any player in this ecosystem requires identifying where in this hierarchy it operates, how much of the stack it controls, and how exposed it is to shifts in GPU pricing, utilization, and Nvidia’s product cadence.