Some super cool nerdy information in this post.
I recently got this portable triple monitor setup. It works fine on macOS and Windows, but it turns out the Linux driver was extremely unstable as well as very CPU intensive
Obviously, this nerdsniped me into reversing it and making a better one. In doing so, I managed to figure out some device internals/quirks that I could leverage to make optimizations that none of the vendor drivers for any platform could do :)
Turns out that one of the screens (the leftmost one) presents as a standard USB-C DisplayPort (DP Alt Mode) screen, which means it has more than enough bandwidth available for efficient, uncompressed video
For the other two screens, the device presents as a USB 2.0 Hub, with a USB device that is handled by a custom userspace helper application
Turns out that what essentially happens is that it's streaming JPEG-encoded data (with a custom header, however) for every frame presented on the two other screens. To be specific, two JPEGs per frame, one for the upper half and one for the lower half, for each screen
On macOS, Windows and Linux alike, the vendor driver was always sending JPEGs with a resolution of 1920x544 for each half, with an adaptive quality setting ranging from 98 down to 70. The reasons are that JPEG encoding is relatively expensive, and that the device is only able to use a bandwidth of 30 MB/s in total
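To get a feel for why the quality setting has to be adaptive, here is a rough back-of-the-envelope calculation. It assumes the 30 MB/s is shared by both USB screens, that 1 MB is 10^6 bytes, that the source pixels are 24-bit RGB, and it ignores the custom header and any USB protocol overhead, so treat the numbers as ballpark figures only:

```python
# Back-of-the-envelope budget for the USB link (assumptions noted inline).
LINK_BYTES_PER_S = 30 * 1000 * 1000   # 30 MB/s in total, assuming 1 MB = 10^6 bytes
SCREENS = 2                           # the two USB-attached screens
HALVES_PER_FRAME = 2                  # upper and lower half, one JPEG each
HALF_W, HALF_H = 1920, 544            # resolution of each JPEG
BYTES_PER_PIXEL = 3                   # assumption: 24-bit RGB before compression


def jpeg_budget_bytes(fps: int) -> float:
    """Average bytes available per JPEG if both screens update at `fps`."""
    return LINK_BYTES_PER_S / (SCREENS * HALVES_PER_FRAME * fps)


raw_half = HALF_W * HALF_H * BYTES_PER_PIXEL  # uncompressed size of one half

for fps in (30, 60):
    budget = jpeg_budget_bytes(fps)
    print(f"{fps} fps: {budget / 1000:.0f} kB per JPEG, "
          f"needs about {raw_half / budget:.1f}:1 compression")
```

Under these assumptions each JPEG gets roughly 250 kB at 30 FPS and 125 kB at 60 FPS, while an uncompressed half is a bit over 3 MB, so busy frames at high quality simply won't fit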
As a side effect of this, the actual frame rate you're getting on each of the USB screens is generally closer to 25-30 FPS than anything near 60
I wasn't all that concerned with the frame rate for my purposes though, and even though JPEG is technically using lossy compression, I wouldn't even really have noticed that in general use
What I did notice, however, is the CPU usage required for all that JPEG encoding, and I wanted to minimize that as much as possible
The vendor's Linux driver was using 2-3 CPU cores continuously on full blast, and even then it was frequently either crashing or slowing down to a crawl
I was able to get the worst case CPU usage down by a factor of about 5 (i.e. to about 60% of a single CPU), and for the cases where the screens are mostly static, terminals and such, the CPU usage was only 1-3% of a single CPU
Besides the obvious stuff, such as parallelizing the JPEG encoding etc, I thought it was pretty wasteful to send full frames in cases where the actual changes from the previous frame were small
Since the packet header contained what seemed to correspond to the width, height and x/y coordinates of where the JPEG should be rendered, I thought it would be possible for me to send a JPEG with just the part that was different from the previous frame
That method, however, ended up getting very strange results where it seemed to loop parts of previous frames on the parts of the screen that were not being updated
After making a test program that rendered a simple animation against a static background with a low FPS (FPS as in the frames actually sent to the device per second), I could see that it seemed to repeat previous frames with a cycle of 3
I.e. if I tried to animate a box by painting over the previous frame's box as well as the part where I was placing the slightly moved box, instead of a smooth animation I would see the box I painted as well as part of the box from 3 frames back
By this observation, I could deduce that the device internally uses three framebuffer slots, and by mirroring this with three "shadow buffers" in my custom userspace driver I could minimize the work required for each frame
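The ghosting is easy to reproduce in a toy model. The sketch below is not the real protocol: `FakeDevice`, the one-row "screen" and the round-robin slot order are all assumptions made up for illustration. It only shows why patching against the previous frame fails, and why patching against a shadow copy of the slot about to be reused works:

```python
WIDTH = 8  # toy one-row "screen"; "." is background, "#" is the box


class FakeDevice:
    """Hypothetical model: 3 framebuffer slots, patched in round-robin order."""
    SLOTS = 3

    def __init__(self):
        self.slots = [["."] * WIDTH for _ in range(self.SLOTS)]
        self.count = 0

    def present(self, x, pixels):
        """Patch the next slot at offset x and return what ends up on screen."""
        slot = self.slots[self.count % self.SLOTS]
        slot[x:x + len(pixels)] = pixels
        self.count += 1
        return "".join(slot)


def frame(n):
    """Frame n of the test animation: a 1-pixel box at position n."""
    row = ["."] * WIDTH
    row[n % WIDTH] = "#"
    return row


def run_naive(frames):
    """Patch only what changed since the PREVIOUS frame (ghosts appear)."""
    dev, shown = FakeDevice(), []
    for n in range(frames):
        lo = max(n - 1, 0)  # erase the old box and draw the new one
        shown.append(dev.present(lo, frame(n)[lo:n + 1]))
    return shown


def run_shadowed(frames):
    """Patch against a shadow copy of the slot that is about to be reused."""
    dev, shown = FakeDevice(), []
    shadows = [["."] * WIDTH for _ in range(FakeDevice.SLOTS)]
    for n in range(frames):
        new, old = frame(n), shadows[n % FakeDevice.SLOTS]
        changed = [i for i in range(WIDTH) if old[i] != new[i]]
        lo, hi = (changed[0], changed[-1] + 1) if changed else (0, 0)
        shown.append(dev.present(lo, new[lo:hi]))
        shadows[n % FakeDevice.SLOTS] = new  # remember what that slot now holds
    return shown


print(run_naive(5))
print(run_shadowed(5))
```

In the naive run, the fourth frame comes out as `#..#....`, i.e. the new box plus the box from 3 frames back, which is the same kind of artifact as in my test animation. In the shadowed run it comes out as `...#....`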
First off, I would compare my new frame N with frame N-3 in order to see the minimum rectangular area that would patch up frame N-3 to become my new frame N
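A minimal sketch of that comparison, assuming frames are stored as one `bytes` object per row with a fixed number of bytes per pixel. The function name `dirty_rect` and the frame layout are hypothetical, and a real driver would want something much faster than a per-byte Python loop:

```python
from typing import Optional, Sequence, Tuple

Rect = Tuple[int, int, int, int]  # x, y, width, height in pixels


def dirty_rect(old: Sequence[bytes], new: Sequence[bytes],
               bpp: int = 4) -> Optional[Rect]:
    """Smallest rectangle covering every pixel that differs between two frames.

    `old` would be the shadow copy of frame N-3 and `new` is frame N.
    Returns None when the frames are identical.
    """
    top = bottom = left = right = None
    for y, (a, b) in enumerate(zip(old, new)):
        if a == b:
            continue  # whole row unchanged, cheap to skip
        # First and last differing byte in this row, converted to pixel columns.
        first = next(i for i in range(len(a)) if a[i] != b[i]) // bpp
        last = next(i for i in range(len(a) - 1, -1, -1) if a[i] != b[i]) // bpp
        if top is None:
            top = y
        bottom = y
        left = first if left is None else min(left, first)
        right = last if right is None else max(right, last)
    if top is None:
        return None
    return (left, top, right - left + 1, bottom - top + 1)


# Tiny 4x3 example with 1 byte per pixel.
old = [b"\x00\x00\x00\x00"] * 3
new = [b"\x00\x00\x00\x00", b"\x00\x01\x01\x00", b"\x00\x00\x01\x00"]
print(dirty_rect(old, new, bpp=1))  # (1, 1, 2, 2)
```

Only the pixels inside the returned rectangle need to be JPEG encoded and sent, with the x/y/width/height going into the packet header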
While doing so I would also compute a very simple hash of frame N and compare it with the hash of frame N-1, since if frame N and frame N-1 are the same I can just skip presenting that frame (I still need to send a periodic heartbeat in order for the screen to not turn off, but I don't need to send a new frame unless something has actually changed)
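The skip logic can be sketched like this. `zlib.crc32` is just a stand-in for "a very simple hash", and the `FrameGate` name, the 1 second heartbeat interval and the string return values are assumptions for illustration, not what the device actually requires:

```python
import zlib


class FrameGate:
    """Decide per frame whether to send a frame, a heartbeat, or nothing."""

    def __init__(self, heartbeat_interval: float = 1.0):
        self.heartbeat_interval = heartbeat_interval  # seconds, assumed value
        self.last_hash = None
        self.last_sent = float("-inf")

    def decide(self, frame: bytes, now: float) -> str:
        h = zlib.crc32(frame)  # stand-in for the simple hash
        if h != self.last_hash:
            # Frame N differs from frame N-1: it has to be presented.
            self.last_hash = h
            self.last_sent = now
            return "frame"
        if now - self.last_sent >= self.heartbeat_interval:
            # Nothing changed, but the screen must be kept awake.
            self.last_sent = now
            return "heartbeat"
        return "skip"


gate = FrameGate(heartbeat_interval=1.0)
print(gate.decide(b"terminal", now=0.0))        # frame
print(gate.decide(b"terminal", now=0.5))        # skip
print(gate.decide(b"terminal", now=1.0))        # heartbeat
print(gate.decide(b"terminal moved", now=1.1))  # frame
```

One caveat with any short hash is that a collision would make a changed frame look unchanged, so a real implementation may want to fall back to a full comparison when the hashes match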
So instead of encoding and sending two 1920x544 JPEGs for each frame on each screen, I would generally be encoding and sending far less, which lets me use far less CPU and be a lot more responsive, especially for coding and "office work" in general
As a bonus, I also got a lot more intimately familiar with the EVDI subsystem in Linux (EVDI = Extensible Virtual Display Interface), which turns out to be quite useful in the context of my QVM and GRAFIT projects that I've mentioned from time to time in other posts ;)