PhD candidate @CIS_Penn. Prev: applied scientist @AWS agentic AI, MS @JohnsHopkinsAMS. Pushing data science foundations & efficient and trustworthy algorithms.

Joined June 2011
16 Photos and videos
Pinned Tweet
Glad to see the renaissance/revival of sparse structures brought by AI bigheads, from Anthropic to OpenAI! Instead of training extra AEs and manually interpreting AE features, our latest paper decomposes the activations along concept vectors that have semantic meanings by design via classic compressed sensing. Not only is the decomposition fast, but it also reveals the extent of concepts in each activation, allowing for effectively steering LLMs’ behavior. SOTA performance on detoxification and other alignment tasks. Paper: arxiv.org/abs/2406.04331 TLDR/Code/Dataset: peterljq.github.io/project/p…
6 Jun 2024
We're sharing progress toward understanding the neural activity of language models. We improved methods for training sparse autoencoders at scale, disentangling GPT-4’s internal representations into 16 million features—which often appear to correspond to understandable concepts. openai.com/index/extracting-…
2
15
91
44,129
A simple trick is to explicitly tell Fable that subagents should use Opus/Sonnet otherwise I go bankrupt. Be that as it may, I still need to watch the 5-hour limit and ^C before the usage goes wild.
Fable eats rate limit like crazy esp when it does dynamic workflows. Any tips? Maybe breaking the tasks down so that it does one part in each 5-hour window? Or explicitly asking it to use opus/sonnet as subagents?
88
Fable eats rate limit like crazy esp when it does dynamic workflows. Any tips? Maybe breaking the tasks down so that it does one part in each 5-hour window? Or explicitly asking it to use opus/sonnet as subagents?
124
1. flash 2. visual basic 3. amxx scripting (i did counter-strike mods in high school/2010s) 4. html 5. python 6,7. c/c (they are different for sure, i just don't remember the order) 8. assembly 9. rust 10. matlab
20 Feb 2023
1. html 2. php 3. irc scripting 4. visual basic 5. flash 6. c 7. c 8. java 9. ruby 10. python 11. objective c 12. haskell 13. matlab 14. julia
58
Tianjiao Ding retweeted
A linear autoencoder learns PCA. This has been understood since Baldi–Hornik ’89. But what happens if you make the model nonlinear by adding just a ReLU in the middle? Surprisingly, the training dynamics of even this minimal nonlinear variant remain poorly understood. In our paper, we take a stab at this by characterizing how a one-hidden-layer ReLU network, trained from random initialization while updating both layers, recovers a linear model. arxiv.org/abs/2606.04476 No fixed features. No preprocessing tricks. No “train only the last layer” gimmicks. Just standard gradient descent on the empirical loss, with both layers trained end-to-end. At first glance this may sound almost comically simple: “Congrats, you proved neural nets can learn linear functions?!” But the dynamics turn out to be surprisingly subtle, and the proof techniques ended up being some of my favorite ones we’ve developed in the last few years. Quick thread below on why this problem is challenging 🧵 I also genuinely think the techniques here may be useful for understanding more contemporary models and training phenomena, including reasoning and post-training. Stay tuned!
2
9
46
3,406
Tianjiao Ding retweeted
Presenting this work at #CVPR2026 - Sat June 6, 11:45 AM–1:45 PM MDT, ExHall F (poster/booth 557) along with @ZancatoLuca and @PengLiangzu.
#CVPR2026 is around the corner and we're excited to share Gated KalmanNet: A Fading Memory Layer through Test-Time Ridge Regression. Looking forward to meeting everyone who wants to learn more. Gated KalmaNet (GKA, pronounced "gee-ka") generalizes Mamba-2 and Gated DeltaNet, and outperforms both under identical training conditions. It also works beyond language: swapping the Mamba layer in MambaVision for GKA improves ImageNet accuracy with no vision-specific tuning. 1/4
1
2
196
Not sure about order, but tech giants are surely bringing orders to AI
In the 1940s, von Neumann brought order to computer architecture. in the 1970s, Codd brought order to database management. In the 2020s, who will bring order to AI?
1
1
119
Jokes aside, my take is scientific understanding and large scale engineering are both essential; neither alone can cut it for order.
52
♨️Your new heater for summer has just arrived. Isn't it more resource efficient and 🌏 friendly to have power beasts in data centers?
The laptop hasn't changed in 30 years. NVIDIA just changed it RTX Spark is their first PC chip ever. - RTX 5070 level GPU - 128GB unified memory - 1 petaflop of local AI - thin, light, barely throttles unplugged Your AI agent lives on the machine. 24/7. No cloud. This is step one of the agentic AI PC, and everyone else is about to copy it.
1
58
Pre-AI, mathematicians sit on a problem, go with a few intuitions, hit walls, and if lucky they make the proof. The explanation is a result of the whole process and gives faithful insights. If AI does the proof and mathematicians only explain the proof, does the explanation still contain insights?
Replying to @axiommathai
3/ Math journal review cycles can famously take years. Why so fast now? Every proof is generated in machine-verified Lean, then paired with a human exposition. The mathematician authors are there to explain the theorem, not to prove it. AxiomProver produces math one can trust.
1
106
Tianjiao Ding retweeted
Replying to @anshulkundaje
Those are orthogonal concepts. - World models trained on highly diverse data become foundation models: their encoders can be used for a wide variety of downstream tasks. - "World" refers to two things: (1) predicting the evolution of a complex system or environment, (2) predicting the evolution of a system under control and its effect on the environment (action-conditioned world model) which is a necessary component of planning.
39
103
1,163
81,450
Wow nice! The 3D-printed geometry reminds me of @prof_grimmer 's optimization courses, likely 3 or 4 years ago when I was at Hopkins.
I spent a year of my PhD stuck on a 2002 problem of Schechtman. GPT 5.5-Pro helped me finish: vector balancing for zonotopes (shadows of a cube)! For any zonotope Z ⊂ ℝᵈ, v₁,...,vₙ ∈ Z, there are signs x₁,...,xₙ ∈ {-1, 1} with x₁v₁ ... xₙvₙ ∈ O(√d) Z, sharp. [1/4]
1
2
332
Tianjiao Ding retweeted
I'm excited to share some joint work done with @TaeHo_Y00N. We considered algorithm design for fixed-point problems. This area models gradient descent, minimax optimization, and more. Below I give the wild ride of this paper. Mathematically, it is gorgeous.
4
26
132
10,461
Tianjiao Ding retweeted
A fun experiment comparing a random step with one gradient step: With a small CNN on CIFAR-10, a random step is basically a disaster. (A gradient step is a ~185σ event.) That makes sense if you expect a random direction in R^d to be ~sqrt(d) standard deviations worse than the optimal one. So scaling up to a larger model should make things even worse. But with a 7B model (test on GSM8k), random steps have a good chance of outperforming a gradient step. (The gradient norm of one PPO update is 1.94, while the L2 norm of the Gaussian perturbation is 85.6. The figure below rescales the Gaussian perturbation to match the PPO update norm, so the random step and gradient step have the same radius.) We should really rethink the parameter-function map.
15
21
138
14,309
Tianjiao Ding retweeted
There's a lot of controversy brewing around arXiv's decision to penalize authors who post unchecked AI generated content. The impulse is correct, IMO, simply on grounds of efficiency: it is much cheaper to insist the authors vet their work first, rather than distributing the cost of that work to EVERY reader/agent who subsequently downloads the work. I believe the mechanism is likely the wrong one, however. Unfortunately, suggestions to use github are even worse, IMO, because they lose the (effective) immutability of the scientific record, which arXiv upholds.
20
7
188
25,975
Last time i ordered food in Shanghainese was in HK, not Shanghai
Very awkward that the last time I ordered food in Cantonese was in Hamburg, not any recent HK or Guangdong trips 🙃
70
Tianjiao Ding retweeted
This is so sad. 😞 You have at reach an omniscient personalised teacher and you end up delegating your work to it rather than growing from having its support. Like looking at the solution keys before doing maths exercises. What is the point? 🥺
I'm 22 years old and Claude Code is deteriorating my brain. Every single day for the last 6 months I've had 6 to 8 Claude Code terminals open, waiting for a response just so I can hit 'enter' 75% of the time. And it's doing something to me. In convos with a couple of friends, it's been a point that's been brought up pretty frequently. None of us feel as sharp as we used to. I don't know if it's just us, or others in their 20s are feeling the same thing, but it's something I've been thinking about a lot. P.S. I know this is a problem with my reliability/usage of it, not Claude Code itself, but the effects are real nonetheless
9
5
112
18,280
Tianjiao Ding retweeted
A mathematician who shared an office with Claude Shannon at Bell Labs gave one lecture in 1986 that explains why some people win Nobel Prizes and other equally smart people spend their whole lives doing forgettable work. His name was Richard Hamming. He won the Turing Award. He invented error-correcting codes that made modern computing possible. And he spent 30 years at Bell Labs sitting in a cafeteria at lunch watching which scientists became legendary and which ones faded into nothing. In March 1986, he walked into a Bellcore auditorium in front of 200 researchers and told them exactly what he had seen. Here's the framework that has been quoted by every serious scientist for the last 40 years. His opening line landed like a punch. He said most scientists he worked with at Bell Labs were just as smart as the Nobel Prize winners. Just as hardworking. Just as credentialed. And yet at the end of a 40-year career, one group had changed entire fields and the other group was forgotten by the time they retired. He wanted to know what the difference actually was. And he said it wasn't luck. It wasn't IQ. It was a specific set of habits that almost nobody is willing to follow. The first habit was the one that hurts the most to hear. He said most scientists deliberately avoid the most important problem in their field because the odds of failure are too high. They pick a safe adjacent problem, solve it cleanly, publish it, and move on. And because they never swing at the hard problem, they never hit it. He said if you do not work on an important problem, it is unlikely you will do important work. That is not a motivational line. That is a logical one. The second habit was about doors. Literal doors. He noticed that the scientists at Bell Labs who kept their office doors closed got more done in the short term because they had no interruptions. But the scientists who kept their doors open got more done over a career. The open-door scientists were interrupted constantly. They also absorbed every new idea passing through the hallway. Ten years in, they were working on problems the closed-door scientists did not even know existed. The third habit was inversion. When Bell Labs refused to give him the team of programmers he wanted, Hamming sat with the rejection for weeks. Then he flipped the question. Instead of asking for programmers to write the programs, he asked why machines could not write the programs themselves. That single inversion pushed him into the frontier of computer science. He said the pattern repeats everywhere. What looks like a defect, if you flip it correctly, becomes the exact thing that pushes you ahead of everyone else. The fourth habit was the one that hit me the hardest. He said knowledge and productivity compound like interest. Someone who works 10 percent harder than you does not produce 10 percent more over a career. They produce twice as much. The gap doesn't add. It multiplies. And it compounds silently for years before anyone notices. He finished the lecture with a line I have never been able to shake. He said Pasteur's famous quote is right. Luck favors the prepared mind. But he meant it literally. You don't hope for luck. You engineer the conditions where luck can land on you. Open doors. Important problems. Inverted questions. Compounded hours. Those are not traits. Those are choices you make every single day. The transcript has been sitting on the University of Virginia's computer science website for almost 30 years. The video is free on YouTube. Stripe Press reprinted the full lectures as a book in 2020 and Bret Victor wrote the foreword. Hamming died in 1998. He gave his final lecture a few weeks before. He was 82. The lecture that explains why some careers become legendary and others disappear is still free. Most people who could benefit from it will never open it.
144
1,875
8,183
1,171,732
The claude code GUI has been so buggy... Come on, it's just a local session, no ssh, no tmux, nothing fancy.
120
Tianjiao Ding retweeted
CAPTCHA for 3D vision people
wait wait wait. So a new kind of 3D representation (gaussian splats) was invented and they decided to use Y-DOWN as the standard up-direction ?? In 2023 ?!?!!! Need to have some words w/ my old friends at INRIA...and I guess we need a new chart...
34
332
1,766
172,077