OK, so I get that there are two main approaches to solving RL problems: value-based methods and policy-based methods. But is there a rule of thumb for choosing between them?

mlfromscratch @mlfromscratch

12 Aug 2025

What I just read: huggingface.co/learn/deep-rl…

Two main approaches for solving RL problems · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

mlfromscratch @mlfromscratch

12 Aug 2025

Let me explain the attention mechanism in Transformers really simply. Let's say there is this sentence: "The animal didn't cross the street because it was too tired." We want the model to focus on "the animal" when processing "it" - like humans do.

mlfromscratch @mlfromscratch

12 Aug 2025

Finally, we use these weights to combine the Value vectors from all words. This gives "it" a context vector: a weighted blend of the relevant info. In our example, the weight for "animal" will be the largest, so most of the context vector for "it" comes from the Value vector of "animal".

mlfromscratch @mlfromscratch

12 Aug 2025

And that’s attention:
1. Q asks the question
2. K says “this is what I am”
3. The dot product of Q and K gives a score
4. Softmax decides who to listen to most
5. V provides the answer
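
The recap above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real Transformer layer: the 2-d vectors are hand-picked toy numbers rather than learned ones, the names `attend` and `query_it` are made up for this example, and the division by the square root of the key dimension is the scaling from the original Transformer paper, which the thread does not mention.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention for a single query over lists of key/value vectors."""
    d_k = len(query)
    # Q "asks", each K "answers": their dot product gives one score per word.
    # The sqrt(d_k) scaling is an addition taken from the Transformer paper.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Softmax decides who to listen to most.
    weights = softmax(scores)
    # V provides the answer: a weighted blend of all Value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Toy, hand-picked vectors (assumption): the query for "it" is chosen to
# point in the same direction as the key for "animal".
words = ["animal", "street", "tired"]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_it = [2.0, 0.0]

weights, context = attend(query_it, keys, values)
best = words[weights.index(max(weights))]
print(best, [round(w, 3) for w in weights])
```

With these toy numbers, "animal" gets the largest weight, so the context vector for "it" leans toward the Value vector of "animal". In a real model, Q, K, and V come from learned projections of the word embeddings.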

mlfromscratch @mlfromscratch

11 Aug 2025

The Hugging Face Deep RL course is awesome, and I am going to go through it over the next few days :) huggingface.co/learn/deep-rl…

Welcome to the 🤗 Deep Reinforcement Learning Course · Hugging Face

huggingface.co

mlfromscratch @mlfromscratch

11 Aug 2025

Why do we need a value function when we can just try to maximize the reward? tl;dr:
- Unfairness: someone who worked hard and improved a lot should not be penalized for a low absolute reward.
- Instability: we should aim for a stable 90, not a one-time 100 from an extreme strategy.
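
The "unfairness" point can be shown with a tiny sketch: instead of judging the raw reward, compare it to what was expected. The numbers and the names `advantage`, `improver`, and `top_scorer` are made up for illustration. In actor-critic methods such as PPO, the expected score would come from a learned value function, not a hand-written table like this one.

```python
def advantage(reward, baseline):
    """How much better (or worse) the outcome was than expected."""
    return reward - baseline

# Hypothetical numbers: (actual reward, expected reward from a value estimate)
students = {"improver": (60, 40), "top_scorer": (90, 95)}

adv = {name: advantage(r, b) for name, (r, b) in students.items()}
print(adv)
```

Here the improver ends up with a positive signal (+20) despite the lower absolute score, while the top scorer gets a slightly negative one (-5) for falling short of expectations.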

mlfromscratch @mlfromscratch

11 Aug 2025

very insightful article from @zyh2022: huggingface.co/blog/NormalUh…

DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge

A Blog post by Yihua Zhang on Hugging Face

huggingface.co