Half of the benchmark tasks do use BREP for the input ground truth, and all of them use BREP for the output ground truth. The BREP-input tasks ask for a very specific edit on existing geometry, so maybe you can just evaluate your system on those.
Also, drawings to ASME standards imply a lot. Your claim that a drawing made to standard would always produce the same model might be true in theory, but in practice it is not true at all, especially for more complex parts. Most of the time, for complex parts, engineers will use profile tolerances, which don’t work for this test case when there is no reference CAD. ASME Y14.5 also implies a lot, for example RFS (regardless of feature size). But the whole world doesn’t use Y14.5, and these benchmark tasks aren’t meant for manufacturing anyway! Because we needed the objective to have zero “implication,” because the parts aren’t going to be inspected after manufacturing, and because we wanted to avoid relying only on American standards, we explicitly controlled every single feature, which made the drawings more complete at the cost of readability.