i sometimes realize that no matter how much people work on something, there will always be some who don’t really “understand” what they are doing. this can happen to professionals too, who often use well known facts or tricks without knowing how they really work, and without ever questioning why they work in the first place. some, however, keep asking and trying to fully understand what they do. i have put some examples of such well known but rarely understood tricks here in this blog before. today i got to answer a question about another one of those tricks. here i won’t show why it works, but how it works. you do the rest of the homework.
so, the situation is this. you need some linear eye space z values or some eye space 3d points from a regular hardware zbuffer to implement your effect, like depth of field or ssao. say you can’t afford rendering or storing an extra linear eye space zbuffer in your gbuffers or forward engine, so you are going to try to recover those eye space z values and points directly from the regular hardware 24 bit zbuffer.
good graphics programmer as you are, you will tell me to read from the zbuffer, transform the point back to eye space with the inverse of the projection matrix, and then do a division by w. that’s the correct answer, indeed. furthermore, perhaps you even understand why this works. or perhaps you have learnt this by heart… cause, why does the division by w happen after the transformation? furthermore, why a division? shouldn’t we be undoing a perspective division??
thing is, if you think about it, even though it works, the process doesn’t really make that much sense at first glance. bear with me, and let’s go through the regular rasterization/rendering process:
we start with an arbitrary point p in eye space with coordinates (x,y,z), and a projection matrix M, which implements a regular perspective projection.

first thing we do in our (vertex, tessellation or geometry) shader is to transform p by M to get our point in clip space.
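to make this concrete, here is a sketch that assumes M has the usual opengl style layout. the names a, b, c and d for its four nonzero entries are just labels i'm assuming for illustration, not something the hardware or the api defines:

```latex
M = \begin{pmatrix} a & 0 & 0 & 0 \\ 0 & b & 0 & 0 \\ 0 & 0 & c & d \\ 0 & 0 & -1 & 0 \end{pmatrix},
\qquad
M \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
= \begin{pmatrix} a x \\ b y \\ c z + d \\ -z \end{pmatrix}
```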

perhaps you are wondering now why on earth we should use an actual expensive matrix multiplication when you can transform your point with a single fma() instruction. if you are asking yourself that question, it simply means you are not into 4 or 1 kilobyte intro coding.
anyway, at this stage the hardware can do polygon clipping against the (-w,-w,-w,+w,+w,+w) frustum cube. once that’s done, and before the rasterization can proceed, the perspective division happens. that’s the one responsible for making the far distant polygons look smaller than those nearby (the projection matrix isn’t). note how indeed the w component has been conveniently set to the eye linear z (signed) distance. after the perspective division, we get our point p in ndc space, ranging from -1 to 1 (hence the name normalized device coordinates).
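as a sketch, take an opengl style M whose nonzero entries i'll call a, b, c and d (assumed names), with last row (0,0,-1,0). the clip space point is then (ax, by, cz+d, -z), and the division by w = -z gives:

```latex
p_{ndc} = \left( -\frac{a x}{z},\; -\frac{b y}{z},\; -c - \frac{d}{z} \right)
```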

indeed the x and y components follow a simple proportion rule with eye z that produces the intended perspective effect of pinhole cameras. on the other hand, the z component encodes an inverse distance shape, which devotes more of its range to eye z values near the camera and less to those far from it. again, how convenient, huh.
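to see that inverse distance shape in numbers, here is a minimal sketch. it assumes the usual opengl style depth mapping, and the function name ndc_z and the near/far values are made up for the example:

```python
def ndc_z(z_eye, near, far):
    """ndc z for an eye space z (negative in front of the camera), opengl style."""
    c = -(far + near) / (far - near)
    d = -2.0 * far * near / (far - near)
    return -c - d / z_eye  # (c*z + d) divided by w = -z

near, far = 0.1, 100.0  # hypothetical clip planes
for dist in (0.1, 0.2, 1.0, 10.0, 100.0):
    print(dist, ndc_z(-dist, near, far))
```

with these planes, ndc z has already crossed 0 at a distance of 0.2, twice the near plane. everything from there to the far plane has to share the other half of the range.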
so far so good, all old news. now the real deal, proceeding backwards:
we start from a pixel in our zbuffer, which has coordinates (u,v) ranging from 0 to 1. we sample the zbuffer there and build a point (u,v,zbuffer(u,v)), which we rescale and bias to map into the -1..1 range, meaning we got our 3D point already in ndc space. sweet, and easy.
to get to clip space, you would have to undo the perspective division now. and that’s where the problem arises, cause we don’t have w at this stage. the original pre-perspective division w has never been stored in the hardware’s zbuffer, so we just don’t have enough information to proceed. panic.
the trick we all use, which most people don’t seem to understand or even question, is to proceed as if nothing had happened: just transform back directly with the inverse projection matrix into some sort of weird space, and perform a perspective (un)division afterwards… what? i know! why on earth would that work at all, right? ok, see this:
let’s assume that w was just 1, and so effectively build a fake clip space point with (-1+2u, -1+2v, -1+2zbuffer(u,v), 1). its first three components must match the ndc point we had rasterized to the zbuffer before.
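the rescale and bias step is tiny, but here it is as a sketch. the function name is hypothetical, and it assumes the depth value comes back in the 0..1 range as described above:

```python
def ndc_from_uv_depth(u, v, depth):
    """map (u, v, depth), each in 0..1, to an ndc point in -1..1."""
    return (-1.0 + 2.0 * u, -1.0 + 2.0 * v, -1.0 + 2.0 * depth)

# the center of the screen at mid depth lands at the ndc origin
center = ndc_from_uv_depth(0.5, 0.5, 0.5)
```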

now, if we transform it with the inverse projection matrix, without first performing any perspective (un)division, we get p in some sort of weird space.
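as a sketch, again assume an opengl style M with nonzero entries a, b, c and d (assumed names) and last row (0,0,-1,0), so that the ndc point is (-ax/z, -by/z, -c-d/z). the inverse matrix and the resulting weird space point would then be:

```latex
M^{-1} = \begin{pmatrix} 1/a & 0 & 0 & 0 \\ 0 & 1/b & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & 1/d & c/d \end{pmatrix},
\qquad
M^{-1} \begin{pmatrix} -a x / z \\ -b y / z \\ -c - d / z \\ 1 \end{pmatrix}
= \begin{pmatrix} -x / z \\ -y / z \\ -1 \\ -1 / z \end{pmatrix}
```

the last row is where the c terms cancel: (1/d)(-c - d/z) + c/d = -1/z.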

now the “magic” happens: we can divide the whole vector by its w component (negative one over z), to get (x, y, z, 1), which is nothing but our original point p in eye space! ta-daaaaaa, magic!
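if you don’t trust the algebra, you can check the round trip numerically. this is only a sketch under assumptions: an opengl style perspective matrix, made up camera parameters, and helper names (perspective, inverse_perspective, transform and so on) invented for the example:

```python
import math

def perspective(fovy_deg, aspect, near, far):
    """opengl style perspective matrix as row-major nested lists."""
    b = 1.0 / math.tan(math.radians(fovy_deg) * 0.5)
    a = b / aspect
    c = -(far + near) / (far - near)
    d = -2.0 * far * near / (far - near)
    return [[a, 0.0, 0.0, 0.0],
            [0.0, b, 0.0, 0.0],
            [0.0, 0.0, c, d],
            [0.0, 0.0, -1.0, 0.0]]

def inverse_perspective(m):
    """closed form inverse, valid only for the sparse layout built above."""
    a, b, c, d = m[0][0], m[1][1], m[2][2], m[2][3]
    return [[1.0 / a, 0.0, 0.0, 0.0],
            [0.0, 1.0 / b, 0.0, 0.0],
            [0.0, 0.0, 0.0, -1.0],
            [0.0, 0.0, 1.0 / d, c / d]]

def transform(m, v):
    """4x4 matrix times 4 component column vector."""
    return [sum(m[r][k] * v[k] for k in range(4)) for r in range(4)]

def project_to_uvz(m, p_eye):
    """forward path: eye space -> clip -> ndc -> (u, v, zbuffer value) in 0..1."""
    clip = transform(m, [p_eye[0], p_eye[1], p_eye[2], 1.0])
    ndc = [clip[i] / clip[3] for i in range(3)]  # perspective division
    return [0.5 + 0.5 * n for n in ndc]

def unproject_from_uvz(m_inv, uvz):
    """backward path: fake clip point with w = 1, inverse matrix, then divide by w."""
    fake_clip = [-1.0 + 2.0 * t for t in uvz] + [1.0]
    weird = transform(m_inv, fake_clip)
    return [weird[i] / weird[3] for i in range(3)]

m = perspective(60.0, 16.0 / 9.0, 0.1, 100.0)  # hypothetical camera
p = [1.5, -0.75, -12.0]                        # eye space point in front of the camera
uvz = project_to_uvz(m, p)
q = unproject_from_uvz(inverse_perspective(m), uvz)
print(p, q)
```

q should match p up to floating point error. keep in mind that a real 24 bit zbuffer quantizes the depth value, so the recovered point is only as precise as the stored depth.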