Large pre-trained Transformers are great, but expensive to run. Making them more efficient (e.g., with early exits) can come at an undesirable cost in predictive performance.
In our new work, we speed up inference while guaranteeing consistency with the original model up to a user-specified tolerance.