Getting a DevOps job in 2026 is harder than it was two years ago. Not because there are fewer jobs, but because interviewers have become much better at telling people who learned DevOps from courses apart from people who have actually operated systems in production.

The engineers getting hired are not the ones with the most certifications. They are the ones who can troubleshoot.

Why are only 10% of API calls failing? Is one replica running a bad configuration? Is a memory leak affecting only a single pod? Is a downstream dependency timing out intermittently?

Why did the node scale up but the pod stay Pending? Is it a taint? A resource request problem? Did Karpenter provision the wrong instance type because the NodePool constraints were too restrictive?

Why did the deployment pass CI and fail in production? Missing IAM permissions. Environment variables not configured. Database migrations executed in the wrong order.

These are not interview questions. These are Tuesday morning problems. Real DevOps work is not deploying applications all day. It is investigating incidents, understanding failure modes, debugging distributed systems, and making sure the same problem never happens twice.

In 14 years of running production systems, one thing has stayed consistent: the engineers who get hired fastest are usually the ones who can walk into a conversation about a broken system and contribute immediately. That ability does not come from collecting certificates. It comes from building projects, breaking things, fixing them, and understanding why systems behave the way they do.

If you are preparing for DevOps roles in 2026, you need to live the experience of a DevOps engineer. And this is what Living Devops is all about: real DevOps engineers teaching production environments.

👉 My new cohorts are starting soon; check them out. P.S. 30% early bird discount till June 15.
Real world Devops K8s SRE DevSecops livingdevops.com/courses/aws… Real world Devops K8s SRE DevSecops MLOps AIOps livingdevops.com/courses/28-…
Karpenter assumes that if EC2 can launch an instance in a subnet, that subnet is healthy. But infrastructure failures aren't always binary. Gray failures - elevated packet loss, increased latency, subtle control plane issues in a specific AZ - can mean nodes launch successfully but networking is compromised. Pods get stuck in ContainerCreating while Karpenter keeps provisioning into the bad zone. This article shows how to make Karpenter infrastructure-aware using AWS Application Recovery Controller (ARC) and EventBridge. When ARC detects an AZ impairment, an event triggers a Lambda that patches the Karpenter NodePool to exclude that zone automatically. No manual intervention during an incident. The article below from Shashi Shankar includes the full Lambda code, RBAC setup, and EventBridge configuration. lckhd.eu/tliCPb #Lambda #Karpenter #EventBridge
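As a rough illustration of what "excluding a zone" could look like on the NodePool itself, here is a minimal sketch of the state a patch might leave behind. It assumes the Karpenter v1 API; the NodePool name, the EC2NodeClass name, and the zone `us-east-1a` are placeholders, and the linked article's Lambda may express the exclusion differently (for example, by rewriting an `In` list of healthy zones).

```yaml
# Hypothetical NodePool after an AZ impairment has been detected.
# Names and the zone value are illustrative assumptions, not taken from the article.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        # Stop provisioning new nodes into the impaired zone.
        - key: topology.kubernetes.io/zone
          operator: NotIn
          values: ["us-east-1a"]
```

Removing the requirement again once the zone recovers would restore normal provisioning; nodes already running in the impaired zone are not moved by this change alone.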
Karpenter lessons I had to learn the painful way so you don't have to.

1/ Spot interruptions need a PodDisruptionBudget
Spot nodes get reclaimed. If your app has no PDB, Karpenter can evict all your pods at once when it drains a node. Protect yourself:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb  # placeholder name; the original snippet omitted metadata
spec:
  minAvailable: 1   # keep at least one replica running during voluntary evictions
  selector:
    matchLabels:
      app: my-api
```

2/ Don't mix spot and on-demand in one NodePool
Separate them. Use weights. Karpenter will try spot first and fall back to on-demand, but only if you set it up intentionally.

3/ Consolidation can be aggressive in dev
Set consolidateAfter: 5m in dev, longer in prod. Aggressive consolidation in production during traffic spikes = bad day.

4/ Tag your NodePools
Your cloud bill will thank you. Add tags to NodeClass and you'll know exactly what Karpenter provisioned vs your static nodes.

5/ Use karpenter.sh/do-not-disrupt annotation
For long-running jobs or stateful pods, add this annotation. Karpenter won't touch them.

Three days in. You now know more about Karpenter than 90% of engineers.
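Points 2, 3 and 5 above can be sketched as manifests. This is a minimal sketch assuming the Karpenter v1 API; the pool names, the weights, the `default` EC2NodeClass, and the sample pod are placeholders rather than anything from the thread.

```yaml
# Spot pool: the higher weight means Karpenter considers it first.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot
spec:
  weight: 100
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m   # the dev value from point 3; use a longer window in prod
---
# On-demand pool: lower weight, used when the spot pool cannot satisfy the pods.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand
spec:
  weight: 10
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
---
# Point 5: a pod that opts out of voluntary disruption.
apiVersion: v1
kind: Pod
metadata:
  name: long-running-job
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sleep", "3600"]
```

The annotation only guards against voluntary disruption such as consolidation; a spot reclaim by the cloud provider still terminates the node.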

Karpenter internals: let's make it concrete.

Two core concepts:

NodePool: Defines constraints for the nodes Karpenter can provision. Think of it as your "what am I allowed to create?" policy.

NodeClass (EC2NodeClass on AWS): The cloud-specific config: AMI, subnets, security groups, instance profile.

The genius bit: disruption + consolidation. Karpenter doesn't just scale up. It continuously asks: "can I pack these workloads onto fewer, cheaper nodes?" If yes, it drains and replaces automatically.

That consolidation line alone can cut your EC2 bill by 20–30%.
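To show how the two objects relate, here is a minimal sketch assuming the Karpenter v1 API on AWS. The cluster name `my-cluster`, the IAM role name, the discovery tags, and the limits are all placeholder assumptions.

```yaml
# NodePool: the "what am I allowed to create?" policy.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      nodeClassRef:            # points at the cloud-specific config below
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: general
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  limits:
    cpu: "200"                 # cap on total CPU this pool may provision
  disruption:
    # The consolidation setting: repack workloads onto fewer or cheaper nodes.
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
---
# EC2NodeClass: AMI, subnets, security groups, and node IAM role.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: general
spec:
  role: KarpenterNodeRole-my-cluster   # hypothetical role name
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

Several NodePools can reference the same EC2NodeClass, which keeps the AWS-specific settings in one place.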
Follow this sequence to create resources in Azure (K8s cluster & app deployment):
-> Resource Group Creation
-> Identity Mgmt
-> AKS Cluster Creation (Nodepool & autoscaling)
-> Networking setup (VNet, Subnet & CNI)
-> Storage Configurations (PV & PVC)
-> App Deployment (using YAML)
-> Ingress controller (Setup App Gateway Ingress)
-> Monitoring & Logging (Integrate Azure Monitor & Log Analytics)
-> Backup & DR
-> CI/CD Pipeline (to automate deployment using Azure DevOps or GitHub Actions)
You can schedule across heterogeneous GPU pools in Karpenter without writing a custom controller. Most teams don’t use this. They pin to a single GPU SKU and overpay.

The NodePool spec lets you define tiered GPU capacity in a single manifest:
• L4 for low-latency inference
• A100 for batch training
• H100 reserved for workloads that actually need the memory bandwidth

Karpenter handles bin-packing and instance selection. You set requirements and let the scheduler do the rest. No external scheduler. No custom operator. Just a NodePool and an EC2NodeClass.

If you’re running H100s at 40% utilization, start from here.
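One way the tiering above could be expressed, as a sketch rather than the post's actual manifest: a single NodePool that allows the EC2 families carrying these GPUs (g6 for L4, p4d for A100, p5 for H100), with each workload picking its tier through a node selector. It assumes the Karpenter v1 API on AWS; the pool name, the `gpu` EC2NodeClass, the GPU limit, and the container image are placeholders.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
      requirements:
        # g6 = L4, p4d = A100, p5 = H100
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g6", "p4d", "p5"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      taints:
        # Keep non-GPU pods off these nodes.
        - key: nvidia.com/gpu
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 16         # illustrative cap on total GPUs in this pool
---
# A low-latency inference pod that asks for the L4 tier only.
apiVersion: v1
kind: Pod
metadata:
  name: l4-inference
spec:
  nodeSelector:
    karpenter.k8s.aws/instance-family: g6
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: server
      image: registry.example.com/inference-server:latest  # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
```

Pods that omit the node selector may land on any allowed family, so workloads that need H100s should select p5 explicitly.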
Cluster Autoscaler on EKS works, but I would not pick it for most teams today. Node groups do not flex cleanly, scale-up lags during spikes, and matching instance types to workloads takes constant tuning. Karpenter handles all three better and saves real money on idle nodes. Instead of scaling pre-defined groups, Karpenter watches pending pods and provisions nodes that fit what those pods actually need. The article below covers the install via Helm, NodePool and EC2NodeClass setup for system and job workloads. The example splits system pods and job pods into separate pools with taints, mixes spot and on-demand, and uses consolidation to clean up underutilized nodes. Nenis Rudani also links a working Terraform repo so you can run the same EKS and Karpenter setup end to end. lckhd.eu/gAwhxC #aws #eks #karpenter #finops
🧪 Weekly Focus – Phase #3 Wrap-Up, Regression & Final E2E This week is the final stretch of Testnet Phase #3. Focus shifts from iteration → wrap-up, validation, and stabilization so we can close the phase cleanly and prepare for what’s next. 🔹 Phase #3 – Support, Monitoring & Final Stability - Continue active monitoring across routing, miners, validators, dashboards, and L3 stats. - Ensure stability holds as we move into final regression and wrap-up. 🔹 v3 /delegate /validate – Final Regression & E2E - Run another round of regression light full E2E testing across v3 flows. - Cover session configs (1 / 3 / 5), routing paths, and real execution/validation behavior. 🔹 MVP Data Management – Full E2E Validation - Run final validation across Privacy Feature 1.0 + Offchain Storage v3. - Ensure consistency across session/task encryption, dedicated/ephemeral flows, and router → miner → dashboard paths. 🔹 Inference Quality (SLA #3) – Validation Pass - Run regression on SLA #3 selection path with user-task quality stats. - Validate behavior across NodePool selection, quality stats table, and Node Perf signals. 🔹 Dashboard – UI/UX Refinement (Final Pass) - Continue final UI/UX refinement across privacy, offchain, and quality stats views. - Focus on usability and clarity before Phase #3 snapshot. 🔹 Phase #3 Snapshot & Wrap-Up - Prepare for phase snapshot toward end of this weekend. - Finalize stats, ensure data consistency, and get ready for phase closing process. This week is about closing Phase #3 cleanly - regression across all major systems, final E2E validation, dashboard polish, and preparing for the snapshot. #Cortensor #Testnet #Phase3 #AIInfra #DePIN #Corgent #Bardiel #Delegate #Validate #InferenceQuality #PrivateAI #L3
🗓️ Weekly Recap – Phase #3 v3 Iteration, Bardiel Updates & SLA #3 Testing This week’s planned Phase #3 items were completed, with additional prep work started for PyClaw model iteration. 🔹 Phase #3 – Support, Monitoring & Stats - Continued monitoring across routing, miners, validators, dashboards, and L3 stats. - Phase #3 remained stable while v3 flows and inference-quality signals moved deeper into testing. 🔹 v3 /delegate /validate – Continued Tests - Continued deeper testing across prepared session paths. - Focus stayed on real execution/validation behavior, routing consistency, and closing remaining logic gaps. 🔹 Bardiel Dashboard – v3 Adaptation - Continued refining Bardiel Dashboard views for v3 /delegate /validate. - Improved test data and UX around newer agentic surfaces. 🔹 Inference Quality – SLA #3 Rollout - Tested the newer SLA #3 path in NodePool + NodePoolUtils. - Continued validating the shape: SLA #1 = node-level, SLA #2 = node + network-task stats, SLA #3 = node + network-task + user-task stats. 🔹 Inference Quality – Dashboard & Regression - Quality stats are now surfaced in both the quality stats rank table and Node Perf quality columns. - Continued regression on testnet1a first, with testnet0 expansion next. 🔹 New Models for PyClaw Iteration - Built newer model images 73–77: Gemma 4 variants and Qwen 3.6 variants. - Next step is regression/testing with latest binary support, especially for PyClaw/tool-calling experiments. A productive Phase #3 week - v3 surfaces moved forward, SLA #3 testing progressed, dashboard visibility improved, and newer open models are now ready for PyClaw-side iteration. #Cortensor #Testnet #Phase3 #AIInfra #DePIN #Bardiel #Delegate #Validate #InferenceQuality #PyClaw #Models #L3
🧪 Testnet Phase #3 – Progress Recap So Far (1–2 Weeks Before Wrap-Up) Testnet Phase #3 has been running for a few weeks now. So far, Phase #3 has moved from initial setup → real testing loops → early product/data-layer alignment. We now have v3 agent surfaces, MVP data management, and inference-quality signals all actively exercised together. 🔹 v3 /delegate /validate (prep → testing phase) - Initial rollout session prep is done. - We’re now in deeper testing iteration on real delegation/validation behavior, routing consistency, consensus handling, and remaining logic gaps. 🔹 MVP Data Management (now fully in place) Privacy Feature 1.0 + Offchain Storage v3 are working end-to-end in MVP form: session/task encryption, dedicated/ephemeral flows, deferred write, and router → miner → dashboard integration. 🔹 Dashboard / UI / UX (closing the gap) Quality stats, privacy flows, offchain behavior, and task-status improvements are now surfaced in the dashboards - including the quality stats rank table, Node Perf quality columns, and clearer completed / processing / stale task views. 🔹 Inference Quality (SLA #3 live path) - NodePool + NodePoolUtils now include SLA #3 (user-task quality stats) on top of node-level + network-task signals. - The new quality path is deployed on testnet0 + testnet1a and is being observed in live selection behavior. 🔹 Inference Quality (oracle data layer) - Early E2E Quality Oracle data model is working in MVP form. - Real task probes now flow into stored quality stats, ranking views, and selection/reward-oriented surfaces. 🔹 Privacy Feature 1.0 (ready → testing stage) - Private inference is now live across the router surfaces and tested across dedicated + ephemeral paths. - Session-level and task-level encryption flows have both gone through matrix-style testing. 🔹 Offchain Storage v3 (usable MVP path) Dedicated-node and ephemeral-node offchain storage flows are both working in MVP form, including the router-assisted deferred-write path for ephemeral nodes. 🔹 Bardiel dashboard – v3 adaptation started - Dashboard iteration is now actively underway to align Bardiel with v3 /delegate /validate flows. - We’ve generated larger and heavier test datasets, added API examples, and started refining task cards, result rendering, consensus views, and longer-input UX. 🔹 New models for PyClaw iteration Added newer models (Gemma 4 + Qwen variants, model IDs 73–77), built images, updated runtime/dashboard support, and enabled selected Gemma 4 models on ephemeral-node runtime for near-term experiments. 🔹 PyClaw (early iteration alignment) - Initial iteration is starting around newer Gemma 4 models on dedicated sessions. - Focus is still on identifying agent/tool-calling gaps on newer open models and preparing for further experimentation. Phase #3 so far has successfully connected the pieces: agent surfaces (v3) + data management (privacy/offchain) + quality signals (SLA #3). The remaining stretch is less about adding missing foundations and more about testing, refinement, and tightening how these pieces behave together under real conditions. #Cortensor #Testnet #Phase3 #AIInfra #DePIN #Corgent #Bardiel #Delegate #Validate #InferenceQuality #PrivateAI #L3
🧪 Testnet Phase #3 Starts Now Testnet Phase #3 is now live. Phase #3 pushes Cortensor from “trust-layer hardening” into economic agentic hardening - payments/staking become production-shaped, rewards become more repeatable, and router surfaces move closer to agent primitives. 🔹 Privacy feature v0.5 Router key distribution (ACL/signature checks) dashboard privacy module dedicated sessions marked private/public 🔹 Payments (SessionPayment “production shape”) Tune unit costs/rebates/surcharges expand pricing metadata improve cost attribution failure handling 🔹 Dynamic pricing activation SLA / depth / redundancy-aware fees via SessionPaymentTable consistency manipulation-resistance checks 🔹 Staking-to-use Quotas → auto-deduction usage buckets cadence anti-abuse guardrails (dashboard-visible) 🔹 Rewards (semi-automation) Push reward outputs (CSV) further end-to-end to remove more manual steps (human still final send/batching) keep tuning level params (capacity/breadth) 🔹 Router surfaces (agent-ready) Iterate /delegate /validate v3/v4 with policy hints, tool traces, retries/fallbacks explicit redundancy/consensus attributes expand factcheck/new endpoints 🔹 Bardiel dashboard regular test jobs Keep views stable across schema changes scheduled smoke/regression jobs so examples stay fresh 🔹 Pre-mainnet experiments Mainnet-identical dry runs: ops/reliability, cost mechanics, payment/staking/reward flows, agent-facing behavior 🔹 Self-managed L3 as primary track (Testnet-1a) Ops hardening runbooks (backups, snapshots, disk estimates) validate procedures under real incidents 🔹 ERC8004 agent workflows Corgent Bardiel iteration via router MCP/prototype stack (real agent-facing trust rails) 🔹 Gas profiling iteration Measure deltas vs Phase #2 baseline keep trimming writes/log spam for mainnet viability 🔹 PyClaw (router-aligned local agent runtime) Phase #3 = design alignment for Phase #4 implementation/dogfooding 🔹 Core Docs Testnet Phase #3: docs.cortensor.network/commu… PyClaw – 
local-first agent runtime design: docs.cortensor.network/commu… /delegate & /validate – next router-node agentic surface - v3: docs.cortensor.network/commu… - v4: docs.cortensor.network/commu… Private / Encrypted Inference v0.5: docs.cortensor.network/techn… #Cortensor #TestnetPhase3 #DevLog #AgenticAI #RouterNode #DePIN #AIInfra #L3 #ERC8004 #Bardiel #Corgent #Privacy
🗓️ Weekly Focus – Phase #3 v3 Iteration, Bardiel Updates & SLA #3 Testing Phase #3 continues to move from setup into deeper iteration. This week is mainly about pushing the v3 agent surfaces further, refining Bardiel around those flows, and validating the newly deployed SLA #3 path in real selection behavior. 🔹 Phase #3 – Support, Monitoring & Stats - Continue active monitoring across routing, miners, validators, dashboards, and L3 stats. - Track stability while v3 flows and inference-quality signals are exercised more heavily. 🔹 v3 /delegate /validate – Continued Tests - Continue deeper testing on v3 /delegate /validate across the prepared session paths. - Focus on real execution/validation behavior, routing consistency, and closing remaining logic gaps. 🔹 Bardiel Dashboard – Refinement / Updates / v3 Adaptation - Continue refining the Bardiel Dashboard so it better reflects and supports v3 /delegate /validate flows. - Focus on adapting data views, test datasets, and UX around the newer agentic surfaces. 🔹 Inference Quality – SLA #3 Rollout - The latest NodePool + NodePoolUtils with SLA #3 is now deployed, so this week is about testing that newer selection path in practice. - Current shape: SLA #1 = node-level, SLA #2 = node-level + network-task stats, SLA #3 = node-level + network-task stats + user-task stats. 🔹 Inference Quality – Dashboard & Regression - Quality stats are now surfaced in two places: the quality stats rank table and the quality stats columns under Node Perf. - Focus this week is validating how those signals behave in real routing/selection, starting on testnet1a first and then expanding to testnet0. This week is about continuing the Phase #3 push: making v3 /delegate /validate more solid, bringing Bardiel closer to those surfaces, and testing SLA #3 as a more meaningful inference-quality signal across routing and dashboard layers. #Cortensor #Testnet #Phase3 #AIInfra #DePIN #Bardiel #Delegate #Validate #InferenceQuality #L3
🛠️ DevLog – Newer Models Now Enabled on Ephemeral Node Runtime Follow-up on the earlier dashboard/model rollout: we’ve now also enabled runtime support so these newer models can be used through the ephemeral-node / node-pool path as well. 🔹 What changed This is the runtime-side follow-up to the earlier dashboard update. The newer model support is no longer only visible in the UI - it is now also enabled through the ephemeral-node path. 🔹 Currently enabled on ephemeral path For now, we only enabled the 2 models that we think are the most relevant in the near term: - gemma4:e4b-gpu - gemma4:26b-gpu 🔹 Environment scope This should affect both: - testnet0 - testnet1a 🔹 Why these two first The main reason is that these are the more likely candidates we may need soon for upcoming iteration, especially around open-model and agent-related testing. 🔹 Current takeaway So the newer model rollout now has 3 layers in place for these selected models: - build/runtime support - dashboard support - ephemeral-node / node-pool enablement #Cortensor #DevLog #Models #Gemma4 #EphemeralNodes #NodePool
🛠️ DevLog – Dashboard Updated for the Newer Model Set We’ve now updated the dashboard to support the newer model additions, and those changes are now reflected on both testnet environments. 🔹 What changed - the newer model set is now available through the dashboard session flow - the model list/table was also updated so these newer entries are visible in the product surface - both environments now reflect the latest model-side dashboard changes 🔹 Why this matters This is the product-side follow-up to the newer model rollout, so the new model support is not only present at the runtime/build layer but is also exposed properly through the dashboard/UI. 🔹 Current scope This mainly covers: - newer model support in the dashboard flow - updated model listing/table - both testnet environments now aligned with the latest model additions 🔹 Current status So at this point, the dashboard side is now caught up with the newer model rollout as well, and both envs have the latest changes. #Cortensor #DevLog #Dashboard #Models #Gemma4 #Qwen
Recent K8s AI updates worth catching up on

vLLM v0.19.1 → Patch release fixing CVE-2026-0994 (protobuf deserialization). If you’re running vLLM in production, this is a required upgrade. Most major features (async scheduling by default, Gemma 4 support, KV cache optimizations) landed in v0.19.0 / v0.18; this is stabilization + security.

llm-d → Cloud Native Computing Foundation Sandbox (March 2026). Donated by IBM Research, Red Hat, and Google Cloud, with backing from NVIDIA, AMD, Hugging Face, Intel, etc. The important shift is treating distributed inference as a first-class Kubernetes workload. Concepts like prefill/decode disaggregation and prefix-cache-aware routing (EPP) are pushing infra closer to model-aware scheduling.

Karpenter v1.11.x → NodePool limits, topology-aware scheduling, and provider-level hooks are now part of the baseline. If your mental model of Karpenter is pre-1.0, you’re already behind.

Kubernetes AI Conformance v1.35 → Not another “drop,” but a directional signal:
- In-place pod resizing (scale inference without restarts)
- Workload-aware scheduling (reduce deadlocks in distributed training)

Takeaway: This isn’t hype cycles. The Kubernetes ecosystem is actively evolving to natively support AI workloads, from inference runtimes to scheduling semantics. If you’re building in this space, these aren’t optional reads.
🛠️ DevLog – Weekly Progress Check Across v3, Inference Quality, and Bardiel So far, the main workstreams this week are looking good overall. Data management, inference quality, and the newer v3 /delegate /validate paths are all in a better place now, and the focus continues to be more on testing, refinement, and product-side cleanup. 🔹 v3 /delegate /validate The newer v3 paths are holding up well enough that we’ll continue deeper testing from here, especially around real execution/validation behavior, routing consistency, and any remaining logic gaps that still show up under broader usage. 🔹 Bardiel dashboard Most of the core setup is already there, and we’ve now shifted more of the remaining week toward polishing the Bardiel dashboard so it better reflects the newer v3 /delegate /validate flow, datasets, and result attributes. 🔹 Inference quality / SLA #3 The newer inference-quality path and SLA #3 rollout also look good so far. With the latest NodePool + NodePoolUtils already deployed, we’ll keep validating how the newer selection path behaves in practice across testnet environments. 🔹 Dashboard / regression side Quality stats are now visible through both the rank table and the Node Perf view, so the current focus is making sure those signals behave reasonably in actual routing/selection and continuing to refine them as more real data accumulates. 🔹 Current takeaway At this point, most of these core workstreams are no longer just rough setup. Data management, inference quality, and v3 /delegate /validate all look to be in a solid enough place to continue testing and refinement, while the remaining week leans more toward polishing Bardiel and tightening the newer product surfaces. #Cortensor #DevLog #Bardiel #Delegate #Validate #InferenceQuality
🛠️ DevLog – Early Tests on SLA #3 Look Good So Far We’ve been running tests on testnet1a with SLA #3 enabled in the selection path, including some random/rough checks, and so far it looks good. 🔹 Current status The basic SLA #3 path is now in place and behaving reasonably from the first testnet1a pass. 🔹 What this means At a high level, the newer user-task quality signal is now actually participating in ephemeral-node selection instead of only being observed separately. 🔹 What’s next We’ll keep running more tests over this week to see how it behaves under more cases and repeated usage. 🔹 In parallel At the same time, we’ll keep polishing the Node Perf page as well, especially around how the Quality Stats data is shown and used operationally. #Cortensor #DevLog #NodePool #InferenceQuality #Oracle #EphemeralNodes
🛠️ DevLog – SLA #3 Filter Now Live on Testnet0 + Testnet1a We’ve now rolled out the latest node-pool changes on both testnet0 and testnet1a, and the SLA #3 filter is enabled there as well. So far, it looks like it is working as expected. The dashboard was also updated to reflect this new filter path. 🔹 Current SLA shape - SLA #1 = node-level - SLA #2 = node-level + network-task stats - SLA #3 = node-level + network-task stats + user-task stats 🔹 What’s now added We also added two controls around SLA #3: - enable / disable switch for the filter - threshold setting for the required success rate 🔹 Current threshold Right now, the threshold is set so only nodes with at least 80% success rate are allowed to pass the SLA #3 quality gate during ephemeral-node session selection. 🔹 Current status We’ll still be doing more testing from here, but so far the rollout on both testnet environments looks to be behaving as expected. #Cortensor #DevLog #NodePool #InferenceQuality #Oracle #EphemeralNodes
🛠️ DevLog – Latest Node Pool + Node Pool Utils with SLA #3 Now Deployed We’ve now deployed the latest NodePool and NodePoolUtils, including the newer SLA #3 filter path. 🔹 What changed This deployment includes the new selection path where user-task quality stats can now sit on top of the earlier node-level and network-task filters. 🔹 Current SLA shape - SLA #1 = node-level - SLA #2 = node-level + network-task stats - SLA #3 = node-level + network-task stats + user-task stats 🔹 What’s next We’ll start by testing this on testnet1a first, including some regression around the newer selection behavior. 🔹 After that Once the initial testnet1a pass looks okay, we’ll expand the testing into testnet0 as well. #Cortensor #DevLog #NodePool #InferenceQuality #Oracle #EphemeralNodes
🗓️ Weekly Recap – Phase #3 Testing, Inference Quality & v3 Surface Progress

This week’s focus items were completed, and progress went a bit further than planned. Phase #3 continues to move from setup into more real testing and integration work.

🔹 Phase #3 – Support, Monitoring & Stats
- Continued active monitoring across routing, miners, validators, dashboards, and L3 stats.
- Stability held while more Phase #3 features moved into deeper testing.

🔹 v3 /delegate /validate – Matrix Tests & Stress Tests
- Ran the planned matrix-style testing across session/routing paths and continued stress-style validation.
- This pushed the v3 surfaces further beyond prep and into more practical testing conditions.

🔹 Inference Quality – Quality Oracle Iteration
- Completed the planned iteration on the Inference Quality Oracle with light end-to-end execution.
- Focus stayed on real task-style checks so node functionality is measured by behavior, not just static availability.

🔹 Inference Quality – Data Layer Progress
- In parallel with oracle work, we also moved the quality-check data model forward.
- We now have data flowing, and a light implementation/integration is already wired into NodePoolUtil as a third filter.

🔹 MVP Data Management – Continuous Testing
- Continued testing across Privacy Feature 1.0 and Offchain Storage v3.
- Combined flows were exercised again across router → miner → dashboard paths to keep validating the MVP data stack.

🔹 Bardiel Dashboard – v3 Adaptation Started
- In addition to the original focus items, we began iterating the Bardiel Dashboard to better adapt to v3 /delegate and /validate.
- This is early UI/product-side work to match the newer agent surfaces as they mature.

A productive Phase #3 week overall: the original focus items were completed, inference quality now has both the oracle and the data layer moving in practice, and the first product-side adaptation for v3 /delegate /validate has started.

#Cortensor #Testnet #Phase3 #AIInfra #DePIN #Corgent #Bardiel #Delegate #Validate #PrivateAI #L3
🛠️ DevLog – SLA #3 Filter Path Is Now Deployable

After the recent optimization pass, the SLA #3 filter path is now at least in a deployable state.

🔹 Current status
The latest Node Pool Utils changes now fit well enough that we can move beyond only local iteration and start linking/testing the path properly.

🔹 What’s next
We’ll try linking and testing this first on testnet1a with ephemeral nodes, since that is the main place where the newer quality/user-task signal matters most.

🔹 Rollout path
If that first testnet1a pass looks okay, we’ll then continue rolling it out and do more testing across both testnet0 and testnet1a.

🔹 Current filter shape
At a high level, the selection path now looks like:
- SLA #1 = node-level
- SLA #2 = node-level + network-task stats
- SLA #3 = node-level + network-task stats + user-task stats

🔹 Why this matters
So this is the point where the newer Quality Stats / Quality Oracle signal can start moving from design/iteration into actual testnet behavior for ephemeral-node selection.

#Cortensor #DevLog #NodePool #InferenceQuality #Oracle #EphemeralNodes
🛠️ DevLog – SLA #3 Filter Path Now Fits After Optimization

We did some optimization and structural changes on the Node Pool Utils side, and the latest filter path now fits again. It is still under iteration, but the current shape is back in a workable place.

🔹 What changed
The recent optimization work was mainly about making room for the latest SLA filter path without forcing it in too heavily. After those changes, the path now fits again and looks more realistic to continue with.

🔹 Current SLA shape
At a high level, the selection flow is now shaping like this:
- SLA #1 = node-level filtering
- SLA #2 = node-level + network-task stats
- SLA #3 = node-level + network-task stats + user-task stats

🔹 Why this matters
That means the newer Quality Stats / Quality Oracle signal is now positioned as the third layer on top of the existing selection logic, instead of replacing the earlier filters.

🔹 Current status
- Still iterating, but the main optimization/gap there is now in a better place.
- The plan is to finish this up and try deploying the latest filter path later this week so we can start testing how effective the new quality signal actually is.

#Cortensor #DevLog #NodePool #InferenceQuality #Oracle #EphemeralNodes
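To make the layered SLA shape concrete, here is a minimal Python sketch of cumulative filtering, where each SLA level keeps every earlier check and adds one more on top. This is only an illustration and not the actual NodePoolUtils code: the `Node` fields, the 0.8 thresholds, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Node:
    # All fields are hypothetical stand-ins for the real signals.
    node_id: str
    node_level_ok: bool        # SLA #1: node-level check
    network_task_score: float  # SLA #2: network-task stats
    user_task_score: float     # SLA #3: user-task stats


NodeFilter = Callable[[Node], bool]

# Assumed thresholds, chosen only for this example.
NETWORK_TASK_MIN = 0.8
USER_TASK_MIN = 0.8


def passes_node_level(node: Node) -> bool:
    return node.node_level_ok


def passes_network_task(node: Node) -> bool:
    return node.network_task_score >= NETWORK_TASK_MIN


def passes_user_task(node: Node) -> bool:
    return node.user_task_score >= USER_TASK_MIN


# Ordered layers: SLA level N applies the first N filters.
SLA_LAYERS: List[NodeFilter] = [
    passes_node_level,
    passes_network_task,
    passes_user_task,
]


def select_nodes(nodes: Iterable[Node], sla_level: int) -> List[Node]:
    """Return the nodes that pass every filter up to and including sla_level."""
    if not 1 <= sla_level <= len(SLA_LAYERS):
        raise ValueError(f"sla_level must be between 1 and {len(SLA_LAYERS)}")
    active = SLA_LAYERS[:sla_level]
    return [n for n in nodes if all(check(n) for check in active)]


POOL = [
    Node("a", True, 0.9, 0.9),
    Node("b", True, 0.9, 0.5),
    Node("c", True, 0.4, 0.9),
    Node("d", False, 0.9, 0.9),
]
```

In this sketch, raising the SLA level can only shrink the selected set, which mirrors the idea that the user-task signal is a third layer on top of the earlier filters and does not replace them.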