roanak


Tweets

Trainy retweeted

roanak @roanakb

Feb 4

NeptuneAI shuts down March 5th. @TrainyAI just launched Pluto on @ycombinator, a drop-in replacement so you don't lose years of experiment data. Swap one import. Dual-log to validate. Export your history. Open source. On Neptune's official transition hub. ycombinator.com/launches/PLM…

Launch YC: 🪐 Pluto – OSS experiment tracker for Neptune users | Y Combinator

Neptune-compatible experiment tracking, built for teams running AI workloads

ycombinator.com


roanak @roanakb

29 Oct 2024

@TrainyAI's Konduktor platform helps bring the benefits of a leading research team to your GPU cluster. We provide a fault-tolerant scheduler, integrated observability, and more. Check out our docs: konduktor.readthedocs.io/en/…


roanak @roanakb

29 Oct 2024

This leads to significantly higher (>80%) GPU usage. Add in some fault-tolerance to the infrastructure, and we see: - No more manual restarts at 2am. - ML Engineers get to focus on their jobs, rather than becoming DevOps experts.


roanak @roanakb

29 Oct 2024

Top-tier AI research teams (Meta, OpenAI, etc.) have figured out the most efficient way to work with a cluster of GPUs. Instead of managing each GPU separately, they create pools of GPU nodes and let sophisticated schedulers manage GPU availability efficiently.


roanak @roanakb

24 Oct 2024

Is your team struggling with GPU failures? Let’s talk! Docs: konduktor.readthedocs.io/en/…


roanak @roanakb

24 Oct 2024

At @TrainyAI, we built a controller within Konduktor to monitor GPU node health and isolate unhealthy nodes. This way, if a job fails, no manual intervention is required. K8s does its magic of placing work only on healthy nodes, and we forward relevant GPU/NCCL logs to your CSP. 🚀
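The idea in this tweet, isolating an unhealthy node and letting the scheduler re-place the job on a healthy one, can be sketched in a few lines. This is only a toy illustration under stated assumptions: `Node`, `place`, and `handle_failure` are hypothetical names made up for the example, and it is not Konduktor's or Kubernetes' actual API or implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a GPU node tracked by a health controller."""
    name: str
    healthy: bool = True
    job: Optional[str] = None


def place(nodes: List[Node], job: str) -> Optional[str]:
    """Put the job on the first healthy, idle node; return its name or None."""
    for node in nodes:
        if node.healthy and node.job is None:
            node.job = job
            return node.name
    return None


def handle_failure(nodes: List[Node], failed_name: str) -> Optional[str]:
    """Mark the failed node unhealthy and re-place its job elsewhere."""
    failed = next(n for n in nodes if n.name == failed_name)
    failed.healthy = False          # isolate the node from future placement
    job, failed.job = failed.job, None
    if job is None:
        return None
    return place(nodes, job)        # only healthy nodes are considered
```

For example, with two nodes `a` and `b`, placing a job on `a` and then calling `handle_failure(nodes, "a")` moves the job to `b` without any manual step, and `a` is skipped by every later `place` call.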


roanak @roanakb

24 Oct 2024

ML engineers shouldn’t be wasting time debugging infrastructure — especially when H100s have a 25-30% fault rate. 🛠️ ML infrastructure should be able to handle bumps and bruises to the underlying hardware.


roanak @roanakb

21 Oct 2024

4/ Struggling with multinode setups on your cloud provider? We'll cut your setup time from weeks to minutes. Docs: konduktor.readthedocs.io/en/…


roanak @roanakb

21 Oct 2024

3/ One of the biggest value-adds of @TrainyAI's Konduktor platform is that we simplify this complexity. We abstract away network configurations, so you can launch multinode training with high-bandwidth networking across different clouds in the same way.


roanak @roanakb

21 Oct 2024

2/ At @TrainyAI, we've seen AI research teams lose over $10,000 trying to scale out due to misconfigured GPU fabrics. That's a costly mistake that can be avoided.


roanak @roanakb

21 Oct 2024

Setting up and validating GPU networking is a lot less trivial than you'd think. Here's why: 1/ GPU fabric technology varies a lot across cloud providers for the H100. For example, Google Cloud has TCP-X, while AWS uses EFA. Once you commit to one setup, it often locks you in.


roanak @roanakb

17 Oct 2024

He lays out the ARC-AGI benchmark, how it tests generalization abilities rather than memorization, and his thoughts on what kind of AI system will be necessary to improve on the SoTA. Watch here: youtube.com/watch?v=s7_NlkBw…

It's Not About Scale, It's About Abstraction

MLST is sponsored by Tufa Labs: Are you interested in working on AR...

youtube.com


roanak @roanakb

17 Oct 2024

3. Skill does not show intelligence. And displaying skill at any number of tasks does not show intelligence. - This misguided view of intelligence is what causes our current form of benchmarking to be inadequate.


roanak @roanakb

17 Oct 2024

2. For any LLM, for any query that seems to work, there exists an equivalent rephrasing of the query that will break. - This ties into LLMs' inability to handle deviations from a pattern - Highlights the modern LLM's lack of robustness


roanak @roanakb

17 Oct 2024

1. The core limitations of Transformer-based architectures have not changed in over 5 years. - Inability to adapt to small deviations from memorized patterns - Weak, patchy generalization


roanak @roanakb

17 Oct 2024

The latest Machine Learning Street Talk (MLST) episode, with François Chollet discussing inherent limitations of LLMs, was amazing. It was a breath of fresh air to hear some sound reasoning after all the usual Doomer/Acceleration talk on AGI. He makes some great points:


roanak @roanakb

15 Oct 2024

With the features above and more, AI teams using @TrainyAI's Konduktor platform see at least 2x the utilization out of their GPU cluster. Curious? Drop me a message or click here to check out our docs: konduktor.readthedocs.io/en/….


roanak @roanakb

15 Oct 2024

3. Enhanced Observability: Our platform offers comprehensive dashboards that provide a clear view of cluster usage and performance. Metrics like SM Efficiency help you understand how effectively your GPUs are being used, across different jobs and teams.


roanak @roanakb

15 Oct 2024

2. Minimize Downtime Disruptions: Traditional setups require manual intervention if a job fails. With H100 GPUs, these hardware faults are quite frequent (~30%). Konduktor detects hardware issues on failure, resumes jobs on healthy GPUs, and alerts your provider with logs.


roanak @roanakb

15 Oct 2024

@TrainyAI's Konduktor platform is here to change that. 1. Maximize GPU Utilization: With Konduktor, engineers can queue up a large number of jobs on their GPU cluster of varying priorities. This means the P0 workloads get run first, and your GPUs keep crunching numbers 24/7.
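
To make the priority-queue idea concrete, here is a minimal sketch of jobs being dequeued in priority order, with P0 first and ties broken by submission order. `Job`, `JobQueue`, and `drain` are hypothetical names invented for this illustration; this is not Konduktor's scheduler or API, just the general technique under those assumptions.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional


@dataclass(order=True)
class Job:
    priority: int                   # 0 means P0, the most urgent
    seq: int                        # submission order, breaks priority ties
    name: str = field(compare=False)


class JobQueue:
    """Toy priority queue: lowest priority number runs first."""

    def __init__(self) -> None:
        self._heap: List[Job] = []
        self._counter = count()

    def submit(self, name: str, priority: int) -> None:
        heapq.heappush(self._heap, Job(priority, next(self._counter), name))

    def next_job(self) -> Optional[str]:
        """Pop the most urgent job name, or None when the queue is empty."""
        return heapq.heappop(self._heap).name if self._heap else None


def drain(queue: JobQueue) -> List[str]:
    """Return job names in the order a single free GPU slot would run them."""
    order = []
    while (name := queue.next_job()) is not None:
        order.append(name)
    return order
```

Submitting `eval` at priority 2, `pretrain` at 0, `sweep` at 1, and `hotfix` at 0 and then draining the queue yields `pretrain`, `hotfix`, `sweep`, `eval`: both P0 jobs run first, in the order they were submitted, and lower-priority work fills the remaining time.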