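The official account sharing information in Japanese about DeepSpeed, a library for optimizing deep learning. DeepSpeed makes large-scale distributed training and inference fast and easy to run. This account covers the latest DeepSpeed news, including new features and papers. English Twitter account: @DeepSpeedAI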

Joined March 2023
DeepSpeed (日本語アカウント) retweeted
May 18
Don't miss @DeepSpeedAI virtual office hours on May 26 at 12:00 PM America/New_York to ask questions of @toh_tana, a member of the DeepSpeed TSC, and get the latest key updates, including AutoSP (sequence parallel), AutoEP (expert parallel), and AutoTP (tensor parallel).
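The official PyTorch blog post on AutoSP, a new DeepSpeed feature, has been published! - Compiler-level optimization applies Sequence Parallel to existing models with only a configuration change - Sequence-aware AC (activation checkpointing) optimized for long-sequence training This makes long-sequence training easy to achieve with higher GPU efficiency. pytorch.org/blog/introducing…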

Great News! Thanks to DeepSpeed AutoSP, efficient long context LLM training is now easily accessible.
DeepSpeed (日本語アカウント) retweeted
Good news! Ulysses Sequence Parallelism from the Snowflake AI Research and DeepSpeed teams has been integrated into @huggingface Trainer, Accelerate and TRL. For extensive details please see this writeup: huggingface.co/blog/ulysses-… Thanks a lot to @krasul for helping make it happen, and to the others on the HF team who helped with the integration.
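The latest DeepSpeed updates were featured on the PyTorch blog! - PyTorch-compatible backward API: makes large-scale multimodal training with Ray simpler to implement - Memory-efficient BF16/FP16 mode: combined with torch.autocast, reduces peak memory (by up to 40%) We welcome your feedback and requests. Issues and PRs are welcome too!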
Feb 25
New @DeepSpeedAI updates make large-scale multimodal training simpler and more memory-efficient. Our latest blog introduces a PyTorch-identical backward API that makes multimodal training loops easy to code, plus low-precision model states (BF16/FP16) that can reduce peak memory by up to 40% when combined with torch.autocast. 🖇️ Read the full post for details: hubs.la/Q044yYVs0 #DeepSpeed #PyTorch #MemoryEfficiency #MultimodalTraining #OpenSourceAI
DeepSpeed (日本語アカウント) retweeted
12 Dec 2025
Zhipeng (Jason) Wang, PhD (@PKUWZP) explains how @DeepSpeedAI supports ML training research and why joining PyTorch Foundation benefits researchers and developers working on AI training workloads. 🔗youtu.be/67719mlOSp0 #PyTorch #DeepSpeed #OpenSourceAI #AIInfrastructure
DeepSpeed (日本語アカウント) retweeted
9 Oct 2025
UIUC, AnyScale, and Snowflake significantly enhanced LLM offloading for the Superchip era!
🚀 SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
Superchips like the NVIDIA GH200 offer tightly coupled GPU-CPU architectures for AI workloads. But most existing offloading techniques were designed for traditional PCIe-based systems. Are we truly tapping into their full potential for LLM training?
🎯 SuperOffload is our answer to this challenge, a new DeepSpeed component rethinking offloading from the ground up, specially designed for LLM training on Superchips.
✨ SuperOffload is exact -- no approximation, no heuristics, and no changes to your training algorithm. Just faster training of larger models with longer sequences using the same code, made possible by system-level optimizations exploiting the Superchip architecture.
🧪 What SuperOffload offers:
- Finetune models like GPT-OSS-20B, Qwen3-14B, and Phi-4 on a single GH200
- Up to 4X faster speed than previous approaches like ZeRO-Offload
- Effortlessly scales to:
-- Qwen3-30B-A3B and Seed-OSS-36B on 2 x GH200s
-- LLaMA2-70B on 4 x GH200s
-- 1M sequence length on 8x GH200 with 55% MFU
- Easy-to-use: Fully integrated and open-sourced in DeepSpeed. Just a few lines of code to enable!
📚 Read more on the official PyTorch blog: pytorch.org/blog/superoffloa…
🧠 For more technical details, please read our technical report: arxiv.org/abs/2509.21271
🛠️ SuperOffload is fully open-sourced through DeepSpeed. Try it now: github.com/deepspeedai/DeepS…
📄 SuperOffload has been accepted to ASPLOS 2026! Kudos to Xinyu Lian (@Alexlian0806), Masahiro Tanaka (@toh_tana), and Olatunji Ruwase.
🎤 Featured at PyTorch Conference 2025: SuperOffload will be featured in the DeepSpeed & vLLM keynote at this year's PyTorch Conference in San Francisco.
🔥 Come see how we're rethinking large-scale LLM training for the Superchip era: events.linuxfoundation.org/p…
The DeepSpeed team will give a keynote speech at PyTorch Conference, held in San Francisco on 10/22-23. If you are attending PyTorch Conference, please come and listen. events.linuxfoundation.org/p…
9 Sep 2025
Step into the future of AI at #PyTorchCon 2025, Oct 22–23 in San Francisco 🔥 Join the DeepSpeed keynote and technical talks. Register: events.linuxfoundation.org/p… Oct 21 co-located events: Measuring Intelligence, Open Agent & AI Infra Summits / Startup Showcase & PyTorch Training
The paper on DeepSpeed's Universal Checkpointing was presented at ATC, a top conference in the software systems field.
📢 Yesterday at USENIX ATC 2025, Xinyu Lian from UIUC SSAIL Lab presented our paper on Universal Checkpointing (UCP).
UCP is a new distributed checkpointing system designed for today's large-scale DNN training, where models often use complex forms of parallelism, including data, tensor, pipeline, and expert parallelism. Existing checkpointing systems struggle in this setting because they are tightly coupled to specific training strategies (e.g., ZeRO-style data parallelism or 3D model parallelism), which break down when the training configs need to dynamically reconfigure over time. This makes it difficult to have resilient and fault-tolerant training.
UCP solves this by decoupling distributed checkpointing from parallelism strategies. Our design introduces a unified checkpoint abstraction, the atomic checkpoint, and a full pattern matching-based transformation pipeline, which together enable scalable and low-overhead checkpointing with reconfigurable parallelism across arbitrary model sharding strategies. We show that UCP supports state-of-the-art models trained with hybrid 3D/4D parallelism (ZeRO, TP, PP, SP) while incurring less than 0.001% overhead of the total training time.
UCP is fully open-sourced in DeepSpeed. It has been adopted by Microsoft, BigScience, UC Berkeley and others for large-scale model pre-training and fine-tuning, including Phi-3.5-MoE (42B), BLOOM (176B), and many more. It has also been selected for presentation at PyTorch Day 2025 and FMS 2025 (the Future of Memory and Storage).
Big thanks to the amazing collaborators from Microsoft and Snowflake: @samadejacobs, @LevKurilenko, @MasahiroTanaka, @StasBekman, and @TunjiRuwase.
🔗 Project: lnkd.in/gG6j4vJe
📄 Paper: lnkd.in/gUiC5kcR
💻 Code: lnkd.in/g6uS29nH
📚 Tutorial: lnkd.in/gi_zWSWh
#ATC2025 #LLM #Checkpointing #SystemsForML #DeepLearning #DistributedTraining #UIUC #DeepSpeed
DeepSpeed (日本語アカウント) retweeted
8 May 2025
PyTorch Day France marked the launch of a global PyTorch Day series—and the announcement of a major milestone: PyTorch Foundation is now an umbrella foundation. First new projects: @vllm_project @DeepSpeedAI. Next Stop: PyTorch Day China, June 7 🇨🇳 hubs.la/Q03lJvHh0 #PyTorch #OpenSourceAI #vLLM #DeepSpeed
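It has been announced that the DeepSpeed project is joining the PyTorch Foundation. We will contribute even more to the community through open collaboration with a broad range of stakeholders. Official announcement: pytorch.org/blog/pytorch-fou… pytorch.org/projects/deepspe…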

7 May 2025
PyTorch Foundation has expanded into an umbrella foundation. @vllm_project and @DeepSpeedAI have been accepted as hosted projects, advancing community-driven AI across the full lifecycle. Supporting quotes provided by the following members: @AMD, @Arm, @AWS, @Google, @Huawei, @huggingface, @IBM, @Intel, @LightningAI, @Meta, @NVIDIA, and @Snowflake. 🔗💡 Read the full announcement: hubs.la/Q03lmJNH0 #PyTorchFoundation #PyTorch #OpenSourceAI #vLLM #DeepSpeed
DeepSpeed (日本語アカウント) retweeted
17 Apr 2025
This is pretty neat. They insert passes into torch.compile and add some profile-guided optimizations as well as a bunch of other specific optimizations like offloading. Since torch.compile is all in Python, all their compiler passes are fairly accessible too! github.com/deepspeedai/DeepS…
16 Apr 2025
Introducing 🚀DeepCompile🚀: compiler-based distributed training optimizations. - Automatic parallelization & profile-guided optimizations - Enable ZeRO1, ZeRO3, Offloading, etc. via compiler passes - 1.2X-7X speedups over manual ZeRO1/ZeRO3/Offloading tinyurl.com/8cys28xk
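We have released "DeepCompile", a new DeepSpeed feature! ✅ Automatic, profile-based optimization of parallel processing ✅ ZeRO and offloading implemented as compiler optimization passes ✅ 1.2x to 7x speedups over ZeRO1 / ZeRO3 / offloading See the link for details. Blog (English): tinyurl.com/8cys28xk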
Thank you, we hope you make good use of it!
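Apparently DeepSpeed can now combine tensor parallelism with the ZeRO optimizer 🎉 With ZeRO alone, even if you wanted to add nodes to speed up training, the constraint per_device_micro_batch_size * gpu_per_node * num_nodes <= 1536 tended to become the bottleneck; with tp=8, the number of nodes can in theory be increased 8x.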
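The arithmetic behind that 8x claim can be sketched as follows. This is only an illustration: it treats 1536 as the batch-size limit quoted in the post, and it assumes gradient accumulation of 1, tensor parallelism that stays inside a node, and one data-parallel replica per group of `tp` GPUs. The function name `max_nodes` and its parameters are hypothetical and not part of DeepSpeed.

```python
def max_nodes(batch_cap: int, micro_batch: int, gpus_per_node: int, tp: int = 1) -> int:
    """Largest node count that keeps the global batch within batch_cap.

    Assumptions (not stated in the post): gradient accumulation is 1,
    tensor parallelism stays inside a node, and each group of `tp` GPUs
    acts as a single data-parallel replica.
    """
    if tp < 1 or gpus_per_node % tp != 0:
        raise ValueError("tp must be a positive divisor of gpus_per_node")
    # Only data-parallel replicas consume distinct samples.
    replicas_per_node = gpus_per_node // tp
    samples_per_node = micro_batch * replicas_per_node
    return batch_cap // samples_per_node


if __name__ == "__main__":
    cap = 1536  # limit quoted in the post
    for tp in (1, 8):
        nodes = max_nodes(cap, micro_batch=1, gpus_per_node=8, tp=tp)
        print(f"tp={tp}: up to {nodes} nodes")
```

Under these assumptions, with a micro batch of 1 and 8 GPUs per node, the limit moves from 192 nodes at tp=1 to 1536 nodes at tp=8, which matches the "8x in theory" estimate.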
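A feature that automatically applies tensor parallelism (TP) to HuggingFace models has been released! - Train large models from the HuggingFace model hub with larger batch sizes and sequence lengths - 4x faster Llama3 fine-tuning - No code changes required from users! Blog (English): tinyurl.com/5n8nfs2w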

1 Apr 2025
AutoTP ZeRO Training for HF Models - Enhance HF post-training with larger models, batches, & contexts - 4x faster LLAMA3 fine-tuning with TP=2 vs TP=1 - No code changes needed Blog: tinyurl.com/5n8nfs2w
DeepSpeed (日本語アカウント) retweeted
🚀 Excited to introduce DeepSpeed, a deep learning optimization library from @Microsoft! It simplifies distributed training and inference, making AI scaling more efficient and cost-effective. Learn more 👉 hubs.la/Q0351DJC0 #DeepSpeed #AI #OpenSource #LFAIData
DeepSpeed (日本語アカウント) retweeted
Microsoft Research congratulates Yasuyuki Matsushita on being named a 2025 IEEE Fellow for his outstanding contributions to photometric 3D modeling and computational photography. msft.it/6018oINMG
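We have released Ulysses-Offload, a new feature for training on very long sequences with limited GPU resources! - Train LLaMA3-8B with a sequence length of 2M tokens on just four A100-80GB GPUs - Achieves over 55% MFU Blog: shorturl.at/Spx6Y Tutorial: shorturl.at/bAWu5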
5 Dec 2024
🚀Introducing Ulysses-Offload🚀 - Unlock the power of long context LLM training and finetuning with our latest system optimizations - Train LLaMA3-8B on 2M tokens context using 4xA100-80GB - Achieve over 55% MFU Blog: shorturl.at/Spx6Y Tutorial: shorturl.at/bAWu5
DeepSpeed (日本語アカウント) retweeted
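[Microsoft Research Asia - Tokyo established] We are announcing the establishment of "Microsoft Research Asia-Tokyo", a new research base in Tokyo, to strengthen the advancement of artificial intelligence research and innovation in the Asia-Pacific region. msft.it/6018WqGNw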