Random idea to run bigger models on consumer GPUs: stream every second layer, keep every other one resident. That way you can overlap weight transfers with compute: while a resident layer runs, the next streamed layer loads.
It's going to be slow, since you're still bounded by how fast the streamed layers can be transferred... but still fast enough to be useful!
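To get a rough feel for how much the overlap could buy, here's a toy timing model. It's a sketch, not a benchmark: the function names are made up, and it assumes uniform layers, a single copy engine, and one staging buffer for streamed weights. The layer count and timings you plug in are hypothetical.

```python
def serial_ms(num_layers, compute_ms, transfer_ms):
    """No overlap: each streamed (odd-indexed) layer is copied, then computed."""
    streamed = num_layers // 2
    return num_layers * compute_ms + streamed * transfer_ms


def overlapped_ms(num_layers, compute_ms, transfer_ms):
    """Even layers are resident, odd layers are streamed into one staging buffer.

    A copy may start once the copy engine is idle and the staging buffer
    is no longer in use by the previous streamed layer.
    """
    copy_done = 0.0      # when the copy engine finishes its latest transfer
    compute_done = 0.0   # when the GPU finishes its latest layer
    buffer_free = 0.0    # when the staging buffer can be overwritten
    for layer in range(num_layers):
        if layer % 2 == 1:
            # Streamed layer: the transfer runs in the background while
            # the preceding resident layer computes.
            copy_done = max(copy_done, buffer_free) + transfer_ms
            start = max(compute_done, copy_done)
            compute_done = start + compute_ms
            buffer_free = compute_done
        else:
            # Resident layer: weights are already on the GPU.
            compute_done += compute_ms
    return compute_done
```

In this model each resident/streamed pair costs max(compute, transfer) + compute, so the transfer is fully hidden only when it takes no longer than one layer's compute. A second staging buffer would let the next transfer also overlap with the streamed layer's own compute.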
If someone prototypes this, let me know!