Spent the last 48 hours tracking down a massive memory leak in one of our production services. Node.js processes were climbing to 12-14GB within 5 hours. Went down 6 or 7 rabbit holes trying to recreate the leak locally: disabling various instrumentation, metrics, and logging, and preparing the service so we could safely take heap snapshots.
Of course, to buy ourselves some time, we wrote a script that slowly recycled the service's processes every 4 hours, before they could crash.
Finally decided to look at our logs again, and that's when I spotted the dreaded "Failed to convert rust `String` into napi `string`" error coming from a Prisma query.
It turned out to be two stuck processes that, starting about 2 weeks ago, had been polling the service every 5 seconds with a query that returned too much data to be serialized between the Rust and JS layers in Prisma.
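For anyone hitting the same error, one way to keep a single result from getting that large is to read in bounded pages. This is only a rough sketch, not what we actually shipped: `Row`, `FetchPage`, and `readInPages` are made-up names, and it assumes rows have a numeric `id` you can order by.

```typescript
// Hypothetical row shape and page-fetcher type; neither comes from the real service.
type Row = { id: number };
type FetchPage<T> = (afterId: number | undefined, take: number) => Promise<T[]>;

// Reads rows in pages of at most `pageSize`, so no single query result has to
// cross the Rust/JS boundary as one huge string. Returns the total row count.
async function readInPages<T extends Row>(
  fetchPage: FetchPage<T>,
  pageSize: number,
  onPage: (page: T[]) => void | Promise<void>,
): Promise<number> {
  let afterId: number | undefined = undefined;
  let total = 0;
  while (true) {
    const page = await fetchPage(afterId, pageSize);
    const last = page[page.length - 1];
    if (last === undefined) return total; // empty page: nothing left to read
    await onPage(page);
    total += page.length;
    if (page.length < pageSize) return total; // short page: this was the last one
    afterId = last.id; // continue after the last row we saw
  }
}
```

With Prisma, `fetchPage` could wrap a `findMany` call that uses `take`, `orderBy: { id: "asc" }`, and (after the first page) `cursor: { id: afterId }` with `skip: 1`. The model behind that call would be whatever table the poller was reading.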
The first arrow below marks when we killed the stuck processes, and you can see memory flatlining right after. A few hours later we redeployed the service with new tasks, and now they are flat, each using about 1GB of memory at most.
Few things in life come close to the satisfaction of finally finding and fixing this stupid memory leak.