One year later, with Omni, this test can pass. I saw it getting pretty close, so I tweaked the prompt:
> A video of a man counting to 10 on his fingers, show the number in the corner. A new number every 1s, no dialogue other than the numbers he says. He uses two hands for numbers bigger than 5.
- the model does 1 to 5 consistently well
- struggles more when two hands are used, usually on 7 and 8
- if you ask it to count faster, errors increase
- it keeps a good cadence
A new prompt to add to the fofr-benchmark:
> a man counts out loud from 1 to 10, using his fingers and holding them up as he goes
> a man counts out loud from 1 to 10, "1, 2, 3, 4, 6, 7, 8, 9, 10", he counts using his fingers and holds them up as he goes