[Paper Reading: Your Spending Needs Attention]
I've just finished reading the "Your Spending Needs Attention" paper by Nubank, and not only are the results impressive, but the ML and engineering approach is also very interesting. It shows the power of self-supervised representation learning to automatically understand user behavior from raw (transaction) data, which made me think about how many insightful representations we are missing by not using it, and why (engineering and money trade-offs come to mind).
Here's the research breakdown: causal self-attention + tabular feature embeddings + fine-tuning for RecSys.
Transformer-based model:
> Text is All You Need: Individual transactions are tokenized, concatenated into a transaction string, and fed through a Transformer [0] to produce a transaction sequence embedding.
> No Positional Embeddings (NoPE) [1]: skip explicit positional encodings; with causal attention the model can still pick up token order implicitly
> FlashAttention [2] + NoPE = Efficient Long Contexts (one transaction is ~14 tokens, so the sequence gets large very fast): the model can train on much longer context lengths
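To make the "transaction string" idea concrete, here is a minimal sketch. The field names, special-token format, and bucket edges are assumptions for illustration, not the paper's tokenizer. The point is only that each transaction becomes a short run of tokens and the runs are concatenated in time order.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical bucket edges for the absolute amount (illustration only).
AMOUNT_EDGES = [10, 50, 100, 500, 1000]


def tokenize_transaction(amount: float, day: date, description: str) -> list[str]:
    """Turn one transaction into a flat list of string tokens."""
    sign = "<CREDIT>" if amount >= 0 else "<DEBIT>"
    bucket = bisect_right(AMOUNT_EDGES, abs(amount))
    tokens = [
        sign,
        f"<AMT_{bucket}>",
        f"<MONTH_{day.month}>",
        f"<DOM_{day.day}>",
        f"<DOW_{day.weekday()}>",
    ]
    # A whitespace split stands in for a real subword tokenizer.
    tokens.extend(description.lower().split())
    return tokens


def tokenize_history(transactions) -> list[str]:
    """Concatenate per-transaction tokens, oldest transaction first."""
    ordered = sorted(transactions, key=lambda t: t[1])
    sequence = []
    for amount, day, description in ordered:
        sequence.extend(tokenize_transaction(amount, day, description))
    return sequence


history = [
    (-23.90, date(2025, 3, 2), "Coffee Shop Downtown"),
    (1500.00, date(2025, 3, 1), "Salary Transfer"),
]
sequence = tokenize_history(history)
print(len(sequence), sequence)
```

Even in this toy version two transactions already take 15 tokens, which is why long histories need long contexts.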
Tabular Features:
> Feature embeddings for numerical and categorical variables
> LightGBM: gradient-boosted tabular modeling
> Deep Cross Network V2 (DCNv2) [3]: learn feature interactions
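For intuition on how DCNv2 learns feature interactions, here is a minimal sketch of a single cross layer, x_next = x0 * (W @ x_l + b) + x_l, in plain Python. The weights and inputs are made-up toy values, and real implementations use learned parameters and stack several layers next to a deep network.

```python
def cross_layer(x0, xl, weight, bias):
    """One DCN-V2 style cross layer: x0 * (W @ xl + b) + xl (elementwise product)."""
    d = len(x0)
    projected = [
        sum(weight[i][j] * xl[j] for j in range(d)) + bias[i]
        for i in range(d)
    ]
    return [x0[i] * projected[i] + xl[i] for i in range(d)]


# Toy input with two features; W swaps them, so each output
# coordinate contains the explicit product of the two features.
x0 = [1.0, 2.0]
W = [[0.0, 1.0], [1.0, 0.0]]
b = [0.0, 0.0]

x1 = cross_layer(x0, x0, W, b)  # first layer: x_l is x0 itself
x2 = cross_layer(x0, x1, W, b)  # second layer raises the interaction order
print(x1, x2)
```

Each extra layer multiplies by x0 again, so stacking layers increases the polynomial degree of the interactions the model can represent.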
Fine-Tuning — classification task for RecSys:
> Low-Rank Adaptation (LoRA) [4]: injecting trainable low-rank matrices into attention layers to handle the "overfitting and catastrophic forgetting" issues.
> Late Fusion: freeze the transformer embeddings and use them as static features passed into LightGBM or DCNv2 independently.
> Joint Fusion (nuFormer): keep the transformer embeddings trainable end-to-end alongside the tabular features.
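A minimal sketch of the LoRA idea in plain Python, assuming the standard formulation from [4]: the pretrained weight W stays frozen and only a low-rank update (alpha / r) * B @ A is trained. The class name, sizes, and initialization scale are illustrative choices, not the paper's configuration.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pretrained, kept frozen
        # A starts as small random noise, B starts at zero, so the
        # adapted layer initially behaves exactly like the frozen one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1)
x = [1.0, 2.0, 3.0, 4.0]
print(layer.forward(x), layer.trainable_parameter_count())
```

Here only 8 parameters are trainable against 16 in the full 4x4 weight. Keeping W untouched is what limits forgetting of the pretrained behavior.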
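It's very insightful how joint fusion trains the entire system end-to-end with a DNN head, so gradients can flow back through the transformer embeddings. That is not possible with a GBT such as LightGBM, which can only consume the embeddings as fixed inputs.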
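The difference can be sketched with a toy scalar model and hand-written gradients. Everything here (the one-weight "encoder", the values, the learning rate) is a made-up illustration, not nuFormer itself. In late fusion the encoder weight never receives an update. In joint fusion the chain rule carries the loss gradient back into it.

```python
def train(joint: bool, steps: int = 200, lr: float = 0.05):
    """Fit prediction = w_emb * (w_enc * s) + w_tab * t to a single target y."""
    w_enc, w_emb, w_tab = 0.5, 0.5, 0.5  # encoder weight, head weights
    s, t, y = 1.0, 2.0, 3.0  # sequence input, tabular input, target
    for _ in range(steps):
        embedding = w_enc * s  # stand-in for the transformer embedding
        prediction = w_emb * embedding + w_tab * t
        error = prediction - y  # gradient of 0.5 * error ** 2
        grad_emb = error * embedding
        grad_tab = error * t
        grad_enc = error * w_emb * s  # chain rule through the embedding
        w_emb -= lr * grad_emb
        w_tab -= lr * grad_tab
        if joint:
            # Only joint fusion lets the loss reshape the encoder.
            w_enc -= lr * grad_enc
    final_error = w_emb * (w_enc * s) + w_tab * t - y
    return w_enc, final_error


late_w_enc, late_error = train(joint=False)
joint_w_enc, joint_error = train(joint=True)
print(late_w_enc, joint_w_enc)
```

Both variants fit the toy target, but only the joint run changes the encoder, which is the property a tree-based head cannot offer.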
Other insightful ideas from the paper:
> Context window problem: adding more data sources (e.g. financial products) can lead to worse results because each data source will "compete" for the available tokens for a fixed context window.
> Scaling laws: larger model size, context lengths, and data volume lead to improved performance.
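A quick back-of-the-envelope sketch of the context window problem. The 2048-token context and the 50/50 split are assumed numbers for illustration. Only the ~14 tokens per transaction comes from the breakdown above.

```python
def items_that_fit(context_len: int, share: float, tokens_per_item: int) -> int:
    """How many items of one data source fit in its share of the context."""
    budget = int(context_len * share)
    return budget // tokens_per_item


CONTEXT_LEN = 2048  # assumed fixed context window
TOKENS_PER_TRANSACTION = 14  # approximate figure quoted above

transactions_only = items_that_fit(CONTEXT_LEN, 1.0, TOKENS_PER_TRANSACTION)
transactions_shared = items_that_fit(CONTEXT_LEN, 0.5, TOKENS_PER_TRANSACTION)
print(transactions_only, transactions_shared)
```

Giving half of the window to a second source cuts the visible transaction history from 146 to 73 transactions, so the new source has to be informative enough to pay for the history it displaces.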
There are still many interesting avenues left for them to explore, especially scaling laws and extending the application to other products. It was also insightful to see that they are not just following the state of the art, but doing research to find new ideas [5].
---
Paper:
arxiv.org/abs/2507.23267
---
[0] arxiv.org/abs/1706.03762
[1] arxiv.org/abs/2305.19466
[2] arxiv.org/abs/2205.14135
[3] arxiv.org/pdf/2008.13535
[4] arxiv.org/abs/2106.09685
[5] open.spotify.com/episode/11v…