I've turned this into a blog post with an interactive visualization: "TurboQuant, or: why are random compression schemes often nearly as good as carefully-designed ones?". Link in reply.
The first time you hear about the JL (Johnson-Lindenstrauss) lemma, it will seem too good to be true. And it kind of is; I'll explain. The idea: if you have n points in a high-dimensional (d-dimensional) space, a RANDOM projection to a much smaller k-dimensional subspace will be "nearly optimal" "in the general case." More specifically: with high probability, all pairwise distances between the points are preserved up to a factor of (1 ± ε), provided k is large enough, on the order of log(n)/ε². Notably, that requirement depends on the number of points and the tolerance, not on d.
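Here's a minimal sketch of that claim, if you want to poke at it yourself. It's plain Python with no libraries; the sizes (12 points, d = 500, k = 200), the seeds, and the function names `random_projection` and `distance_ratios` are all just my choices for illustration. It projects Gaussian points with a random Gaussian matrix scaled by 1/sqrt(k) and looks at how much each pairwise distance changed.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Project d-dim points to k dims using a Gaussian matrix scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    # Each of the k rows is an independent random direction in d dimensions.
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]


def distance_ratios(original, projected):
    """Return projected distance / original distance for every pair of points."""
    ratios = []
    for i in range(len(original)):
        for j in range(i + 1, len(original)):
            before = math.dist(original[i], original[j])
            after = math.dist(projected[i], projected[j])
            ratios.append(after / before)
    return ratios


# Illustrative sizes only: 12 random points in 500 dims, squeezed into 200 dims.
data_rng = random.Random(1)
n, d, k = 12, 500, 200
points = [[data_rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

ratios = distance_ratios(points, random_projection(points, k))
print(f"pairs: {len(ratios)}, min ratio: {min(ratios):.3f}, max ratio: {max(ratios):.3f}")
```

A ratio of 1.0 means the distance survived the projection unchanged. Try shrinking k and watch the min and max drift away from 1.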
So why don't we just use random projections all the time instead of carefully constructed ones? This is the most common misunderstanding of the JL lemma, and the one thing to really understand about it: in many (most?) datasets that are meaningful to humans, you actually CAN do better with something like PCA. The lemma is a worst-case guarantee that holds for any set of points; it doesn't say random is the best you can do for YOUR points. If your dataset is pathological, e.g., the points all lie on a plane even though they're technically in 3 dimensions, then clearly some planes you project onto will be better than others. The JL lemma doesn't say anything useful in 2 or 3 dimensions, but you can imagine the same thing being true in large numbers of dimensions too. (See screenshot 1, I hope you like it because I made it myself lol.)
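The plane example can be sketched in code too. This is a toy under loud assumptions: I plant the data on a 2-dim plane (the first two axes) inside 50 dimensions, and I use those two axes directly as the "carefully constructed" projection, standing in for what PCA would recover on this particular data. It is not an actual PCA implementation. The names `project` and `worst_distortion` are made up for the example.

```python
import math
import random

rng = random.Random(0)
d, k, n = 50, 2, 30

# Assumption: the data secretly lives on a 2-dim plane (first two axes) in 50 dims.
points = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] + [0.0] * (d - 2) for _ in range(n)]


def project(points, matrix):
    """Multiply each point by the k x d projection matrix."""
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]


def worst_distortion(original, projected):
    """Largest |projected distance / original distance - 1| over all pairs."""
    worst = 0.0
    for i in range(len(original)):
        for j in range(i + 1, len(original)):
            ratio = math.dist(projected[i], projected[j]) / math.dist(original[i], original[j])
            worst = max(worst, abs(ratio - 1.0))
    return worst


# Structured projection: keep exactly the two axes the data lives on.
structured = [[1.0 if col == row else 0.0 for col in range(d)] for row in range(k)]
# Random projection: Gaussian entries scaled by 1/sqrt(k), same target dimension.
random_matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)] for _ in range(k)]

structured_err = worst_distortion(points, project(points, structured))
random_err = worst_distortion(points, project(points, random_matrix))
print(f"structured worst distortion: {structured_err:.3g}")
print(f"random worst distortion:     {random_err:.3g}")
```

With k = 2 the structured projection loses nothing, because the data only ever had two dimensions of real variation, while the random one has no way to know which two dimensions matter.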
If you know just those two facts, you'll be pretty well prepared to answer most questions about how the JL lemma gets used. Most of the papers Delip mentions presuppose that you know this. At least when I was a student, I found it non-obvious.