Hey, thanks for writing this out. You make valid points across the board. I have touched on some of these in tweets before, but for context: I joined a few months ago as VP of Engineering, DeFi, to set standards for how we build at Polymarket. As a team, we have hired many new senior people over the last few months, and we can feel the difference internally in how we are starting to work together.
The CLOB was built over 4 years ago and carries technical debt from some wrong low-level core decisions; fixing those core parts isn't easy when there are so many moving pieces to consider and we have to keep serving a huge amount of traffic. Note that the success of Polymarket would not have been possible without the speed of development, so no blame on these systems at all; it happens. We now have a dedicated trading team working solely on the CLOB, fixing and rebuilding those core components from the ground up. Every issue you raised was already known to us, and has either been fixed or is being fixed.
A few things which have been improved over the last 2 months:
1. Full observability into the systems with traces and alerts
2. Eliminated ghost fills for takers that were causing a high failure rate for trades
3. Improved core matching engine latency by 41.2% via faster summary generation
4. Introduced a new validation layer that helps us keep the core engine healthy; with it, the p90 improved by 32% during highly volatile periods. The engine also now supports 50% more user requests.
Here are the things coming in the next weeks and months:
1. Feature flags in the CLOB that let us turn features on and off during critical deployments, eliminating unnecessary downtime for users while still allowing the team to ship safely
2. Revamped rate limit system that is maker-focused, improving latencies for both takers and makers. As we improve latency on the core engine, we're seeing more requests, which is causing degradation, so revamping the rate limits is key.
3. We are moving to an async flow on our current engine to support higher throughput.
4. New order types that allow makers to preserve their queue positions
5. Shorter startup time, so deployments can be rolled out quickly and, if needed, reverted just as quickly
6. Many more improvements to the internal systems; we can't list them all
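To illustrate the feature-flag idea from item 1, here is a minimal sketch, not the production implementation. `FlagStore`, `match_order`, and the `async_matching` flag name are all hypothetical; a real system would read flags from shared config rather than an in-memory dict.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory flag store (assumption: real flags live in shared config)."""
    flags: dict = field(default_factory=dict)

    def set(self, name: str, enabled: bool) -> None:
        self.flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code paths stay dark until enabled.
        return self.flags.get(name, False)


def match_order(order: dict, flags: FlagStore) -> str:
    # The flag is checked on every call, so flipping it needs no restart.
    if flags.is_enabled("async_matching"):
        return f"async:{order['id']}"  # new code path
    return f"sync:{order['id']}"       # existing code path


flags = FlagStore()
before = match_order({"id": "o1"}, flags)       # flag off: existing path
flags.set("async_matching", True)
after = match_order({"id": "o1"}, flags)        # flag on: new path
flags.set("async_matching", False)
rolled_back = match_order({"id": "o1"}, flags)  # rollback without a deploy
```

The point of the pattern is that the risky path ships disabled and is switched on, or back off, at runtime.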
Taker delay bypass exploits - This was implemented incorrectly at first: orders that were already in the delay window could still be cancelled. We now block these cancellations and properly reserve the capital for all taker-delayed orders.
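The two rules described above (no cancellation once an order is in the delay window, and capital reserved for the whole window) can be sketched roughly as follows. This is an illustration under assumptions, not the CLOB's actual code: `Account`, `DelayedTakerBook`, and the integer amounts are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    balance: int       # amounts in smallest units (hypothetical)
    reserved: int = 0  # capital locked by delayed taker orders

    @property
    def available(self) -> int:
        return self.balance - self.reserved


@dataclass
class DelayedTakerBook:
    """Hypothetical holder for taker orders waiting out the delay."""
    accounts: dict
    delayed: dict = field(default_factory=dict)  # order_id -> (user, cost)

    def submit(self, order_id: str, user: str, cost: int) -> bool:
        acct = self.accounts[user]
        if cost > acct.available:
            return False        # capital is already committed elsewhere
        acct.reserved += cost   # lock capital for the whole delay window
        self.delayed[order_id] = (user, cost)
        return True

    def can_cancel(self, order_id: str) -> bool:
        # Orders already in the delay window cannot be cancelled.
        return order_id not in self.delayed

    def execute(self, order_id: str) -> None:
        # Delay elapsed: spend the reserved capital.
        user, cost = self.delayed.pop(order_id)
        acct = self.accounts[user]
        acct.reserved -= cost
        acct.balance -= cost


book = DelayedTakerBook(accounts={"alice": Account(balance=100)})
first = book.submit("t1", "alice", 80)   # accepted, 80 reserved
second = book.submit("t2", "alice", 50)  # rejected, only 20 available
cancellable = book.can_cancel("t1")      # False while delayed
book.execute("t1")
```

Reserving at submit time is what stops the same capital from backing a second order while the first one is still delayed.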
Ghost fill exploits - The CLOB is now synced with the user's deposit wallet state: whenever a user is about to perform a CLOB action, the CLOB is informed in advance, which helps it consolidate its state. There is still a small number of ghost fills (0.001%), because we still support the old wallet types and have edge-case race conditions, but these will be fully eliminated once all users are on deposit wallets and the new async architecture is in place.
Order spam exploits - At the moment we use an open rate limit model, which means users get the same limit across markets. We want to introduce a token-based system to make it harder for bad actors to exceed current limits.
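One common way to build a token-based limiter is a token bucket; whether that is the design meant here is an assumption, and the sketch below only illustrates the general technique. `TokenBucket`, `RateLimiter`, and the capacity and refill numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size
    refill_per_sec: float  # sustained request rate
    tokens: float          # tokens currently available
    last: float            # timestamp of the last update

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RateLimiter:
    """Hypothetical per-user limiter; each user gets an independent bucket."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}

    def allow(self, user: str, now: float, cost: float = 1.0) -> bool:
        bucket = self.buckets.get(user)
        if bucket is None:
            # New users start with a full bucket.
            bucket = TokenBucket(self.capacity, self.refill_per_sec, self.capacity, now)
            self.buckets[user] = bucket
        return bucket.allow(now, cost)


limiter = RateLimiter(capacity=2, refill_per_sec=1)
# Three requests at t=0, then one more a second later.
results = [limiter.allow("spammer", t) for t in (0.0, 0.0, 0.0, 1.0)]
other_user = limiter.allow("maker", 0.0)  # unaffected by the spammer's bucket
```

Timestamps are passed in explicitly to keep the example deterministic; a live service would use a monotonic clock instead.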
RTDS: We missed observability on this, and that's on us. We fixed it, and now we have proper alerts. Sorry about that.
Stability is something we are actively working on; as said, it's the core focus of the trading team. We're growing the team by 2x and bringing in very senior ex-exchange engineers to rethink and rebuild a bunch of things. Our plan is to be the best exchange in the world, and we have a long way to go, but we will get there with the core improvements highlighted above.
Most of our trading changes at the moment are fixes, and we announce them on Discord because, alongside some Signal groups, it's the main place our traders are; most don't require documentation, because they are fixes for tech debt with no breaking changes. We do need to work on communication; we're trying to improve this with Friday updates and improvements to our core documentation and Discord/Signal channels. We are also working on announcing restarts and downtimes in advance. Also, we have a direct line to you on Signal; we should chat more there. The team is always on call to help with any issues you face.
Taker tiers was something we slipped on, and we should have communicated better about when it was coming. We have started splitting tasks like this into smaller pods, with ownership and accountability for each. A lot has got better internally, but there is loads of room to improve, and this is a good example of where better communication would have avoided the problem; sorry about that.
We, of course, have staging, but most, if not all, of our issues only occur under production load. We just finished a full, dedicated environment powered by a Tenderly VNET, allowing us to run proper load tests and simulate production load before we put stuff out. We have hired a head of QA who is also working on proper e2e automated tests on the CLOB to catch these issues in CI before they go out.
Moral of the story: we really do care about building the best systems in the world. Polymarket moves very fast, and it wouldn't be here today without it, but we know that improving core stability is key to us thriving, and we're working on that right now, step by step. You are a smart person; you know that fixing these foundations cannot happen overnight, but we are on it and hear you.