Building open source tools for distributed training.

Joined June 2023
Trainy retweeted
NeptuneAI shuts down March 5th. @TrainyAI just launched Pluto on @ycombinator, a drop-in replacement so you don't lose years of experiment data. Swap one import. Dual-log to validate. Export your history. Open source. On Neptune's official transition hub. ycombinator.com/launches/PLM…
Trainy retweeted
29 Oct 2024
@TrainyAI's Konduktor platform helps bring the benefits of a leading research team to your GPU cluster. We provide a fault-tolerant scheduler, integrated observability, and more. Check out our docs: konduktor.readthedocs.io/en/…

Trainy retweeted
29 Oct 2024
This leads to significantly higher (>80%) GPU usage. Add in some fault-tolerance to the infrastructure, and we see: - No more manual restarts at 2am. - ML Engineers get to focus on their jobs, rather than becoming DevOps experts.
Trainy retweeted
29 Oct 2024
Top-tier AI research teams (Meta, OpenAI, etc.) have figured out the most efficient way to work with a cluster of GPUs. Instead of managing each GPU separately, they create pools of GPU nodes and let sophisticated schedulers manage GPU availability efficiently.
Trainy retweeted
24 Oct 2024
Is your team struggling with GPU failures? Let’s talk! Docs: konduktor.readthedocs.io/en/…

Trainy retweeted
24 Oct 2024
At @TrainyAI, we built a controller within Konduktor to monitor GPU node health and isolate unhealthy nodes. This way, if a job fails, no manual intervention is required. K8s does its magic of placing work only on healthy nodes, and we forward relevant GPU/NCCL logs to your CSP. 🚀
Trainy retweeted
24 Oct 2024
ML engineers shouldn’t be wasting time debugging infrastructure — especially when H100s have a 25-30% fault rate. 🛠️ ML infrastructure should be able to handle bumps and bruises to the underlying hardware.
Trainy retweeted
21 Oct 2024
4/ Struggling with multinode setups on your cloud provider? We'll cut your setup time from weeks to minutes. Docs: konduktor.readthedocs.io/en/…

Trainy retweeted
21 Oct 2024
3/ One of the biggest value-adds of @TrainyAI's Konduktor platform is that we simplify this complexity. We abstract away network configurations, so you can launch multinode training with high-bandwidth networking across different clouds in the same way.
Trainy retweeted
21 Oct 2024
2/ At @TrainyAI, we've seen AI research teams lose over $10,000 trying to scale out due to misconfigured GPU fabrics. That's a costly mistake that can be avoided.
Trainy retweeted
21 Oct 2024
Setting up and validating GPU networking is a lot less trivial than you'd think. Here's why: 1/ GPU fabric technology varies a lot across cloud providers for the H100. For example, Google Cloud has TCP-X, while AWS uses EFA. Once you commit to one setup, it often locks you in.
Trainy retweeted
17 Oct 2024
He lays out the ARC-AGI benchmark, how it tests generalization abilities rather than memorization, and his thoughts on what kind of AI system will be necessary to improve on the SoTA. Watch here: youtube.com/watch?v=s7_NlkBw…
Trainy retweeted
17 Oct 2024
3. Skill does not show intelligence. And displaying skill at any number of tasks does not show intelligence. - This misguided view of intelligence is what causes our current form of benchmarking to be inadequate.
Trainy retweeted
17 Oct 2024
2. For any LLM, for any query that seems to work, there exists an equivalent rephrasing of the query that will break it. - This ties into LLMs' inability to handle deviations from a pattern - Highlights the modern LLM's lack of robustness
Trainy retweeted
17 Oct 2024
1. The core limitations of Transformer-based architectures have not changed in over 5 years. - Inability to adapt to small deviations from memorized patterns - Weak, patchy generalization
Trainy retweeted
17 Oct 2024
The latest Machine Learning Street Talk (MLST) episode, with François Chollet discussing inherent limitations of LLMs, was amazing. It was a breath of fresh air to hear some sound reasoning after all the usual Doomer/Acceleration talk on AGI. He makes some great points:
Trainy retweeted
15 Oct 2024
With the features above and more, AI teams using @TrainyAI's Konduktor platform see at least 2x the utilization out of their GPU cluster. Curious? Drop me a message or click here to check out our docs: konduktor.readthedocs.io/en/….

Trainy retweeted
15 Oct 2024
3. Enhanced Observability: Our platform offers comprehensive dashboards that provide a clear view of cluster usage and performance. Metrics like SM Efficiency help you understand how effectively your GPUs are being used, across different jobs and teams.
Trainy retweeted
15 Oct 2024
2. Minimize Downtime Disruptions: Traditional setups require manual intervention if a job fails. With H100 GPUs, these hardware faults are quite frequent (~30%). Konduktor detects hardware issues on failure, resumes jobs on healthy GPUs, and alerts your provider with logs.
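The detect, resume, and alert flow in this post can be sketched as a toy in-memory model. This is only an illustration under stated assumptions, not Konduktor's implementation: `Node`, `Cluster`, `handle_job_failure`, and the log string are all hypothetical names invented for the example, and a real deployment would rely on the cluster scheduler rather than a Python dict.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    # Hypothetical in-memory stand-in for a GPU cluster.
    nodes: dict = field(default_factory=dict)
    alerts: list = field(default_factory=list)

    def add_node(self, name):
        self.nodes[name] = Node(name)

    def healthy_nodes(self):
        # Nodes are returned in the order they were added.
        return [n.name for n in self.nodes.values() if n.healthy]

    def handle_job_failure(self, job, node_name, log_excerpt):
        """Isolate the failed node, record an alert, and pick a healthy node.

        Returns the name of the node to resume the job on,
        or None if no healthy node is left.
        """
        self.nodes[node_name].healthy = False
        # The alert carries the logs that would be forwarded to the provider.
        self.alerts.append({"node": node_name, "job": job, "log": log_excerpt})
        candidates = self.healthy_nodes()
        return candidates[0] if candidates else None
```

For example, with nodes `a` and `b`, a failure of a job on `a` marks `a` unhealthy, stores one alert, and returns `b` as the node to resume on.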
Trainy retweeted
15 Oct 2024
@TrainyAI's Konduktor platform is here to change that. 1. Maximize GPU Utilization: With Konduktor, engineers can queue up a large number of jobs on their GPU cluster of varying priorities. This means the P0 workloads get run first, and your GPUs keep crunching numbers 24/7.
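The priority queueing idea in this post can be sketched with a small heap-based queue. This is a minimal sketch under assumptions, not Konduktor's API: `Job`, `PriorityJobQueue`, and every method name are hypothetical, priority 0 stands in for P0, and GPUs are modeled as a single counter.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    priority: int                      # 0 = P0 (highest); larger runs later
    seq: int                           # tie-breaker: submission order
    name: str = field(compare=False)
    gpus: int = field(compare=False)


class PriorityJobQueue:
    def __init__(self, total_gpus):
        self.free_gpus = total_gpus
        self._heap = []
        self._counter = itertools.count()

    def submit(self, name, priority, gpus):
        # Queue a job; nothing starts until schedule() is called.
        heapq.heappush(self._heap, Job(priority, next(self._counter), name, gpus))

    def schedule(self):
        """Start queued jobs in priority order while the head job fits."""
        started = []
        while self._heap and self._heap[0].gpus <= self.free_gpus:
            job = heapq.heappop(self._heap)
            self.free_gpus -= job.gpus
            started.append(job.name)
        return started

    def release(self, gpus):
        # Call when a job finishes to return its GPUs to the pool.
        self.free_gpus += gpus
```

In this sketch the queue stops at the first job that does not fit, so a lower-priority job never jumps ahead of a waiting P0 job. A real scheduler might backfill idle GPUs instead; that trade-off is a design choice this example does not attempt to settle.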