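the cost of a virtual call was never really the vtable lookup. that's just a couple of dependent pointer loads (the object's vptr, then the function pointer in the vtable), and modern CPUs are very good at loads that hit cache.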
what actually matters is the indirect branch that comes after it. the CPU has to guess where execution goes next, before those loads resolve, so it can keep fetching and speculating ahead. when the branch predictor gets that target right, a virtual call is often surprisingly cheap.
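the trouble starts when the same call site sees many different targets. prediction becomes harder, mispredictions increase, and each one forces the CPU to throw away speculative work and restart from the right target, which typically costs somewhere around 10 to 20 cycles on current cores.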
that's why a monomorphic virtual call site can be almost free while a megamorphic one can become expensive.
"virtual calls are slow" was always an incomplete explanation.
the real question is how predictable the call target is.
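
a rough way to see the difference for yourself: run the same call site over two inputs, one where every object has the same dynamic type and one where the types are shuffled. this is only a sketch, and every name in it (`Node`, `A` through `D`, `sum_weights`, `make_monomorphic`, `make_mixed`) is made up for illustration. four target types is also closer to "polymorphic" than truly megamorphic, so treat it as a starting point rather than a benchmark.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <random>
#include <vector>

// Hypothetical hierarchy: four concrete types behind one virtual function.
struct Node {
    virtual ~Node() = default;
    virtual std::uint64_t weight() const = 0;
};
struct A final : Node { std::uint64_t weight() const override { return 1; } };
struct B final : Node { std::uint64_t weight() const override { return 2; } };
struct C final : Node { std::uint64_t weight() const override { return 3; } };
struct D final : Node { std::uint64_t weight() const override { return 4; } };

using Nodes = std::vector<std::unique_ptr<Node>>;

// The call site under study: one indirect branch, executed once per element.
// The instructions are identical for both inputs; only the sequence of
// branch targets the predictor sees is different.
std::uint64_t sum_weights(const Nodes& nodes) {
    std::uint64_t total = 0;
    for (const auto& n : nodes) {
        total += n->weight();
    }
    return total;
}

// Every element is an A, so the call site always jumps to the same target.
Nodes make_monomorphic(std::size_t count) {
    Nodes out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        out.push_back(std::make_unique<A>());
    }
    return out;
}

// Types are picked pseudo-randomly, so the next target is hard to guess.
Nodes make_mixed(std::size_t count, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, 3);
    Nodes out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        switch (pick(rng)) {
            case 0:  out.push_back(std::make_unique<A>()); break;
            case 1:  out.push_back(std::make_unique<B>()); break;
            case 2:  out.push_back(std::make_unique<C>()); break;
            default: out.push_back(std::make_unique<D>()); break;
        }
    }
    return out;
}
```

to compare, build each input once, then time repeated `sum_weights` calls on it with `std::chrono::steady_clock`. any gap you measure will depend on the CPU, the compiler, and the optimization level. an optimizer that can prove the dynamic type may devirtualize the call entirely, so it's worth checking the generated assembly before trusting the numbers.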