You should be able to do better than just aligning and averaging frames. (Edit: looks like MauranKilom knows what they're talking about here, and expresses it in their comment more clearly than I could.)

Imagine you were running averages on successive windows of a 1D array: when the average changes, that tells you the difference between the value that just entered your window and the one that just left. That's information about a sliver of data much smaller than the overall window. It's weirder in 2D with random-ish movement, but if your averaging (pixelation) filter is moving across text due to camera wobble or the like, the moments when the average goes up or down tell you something about where the edges are in the content underneath.
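To make the 1D intuition concrete, here's a rough sketch in Python. The `signal` values and the `window_means` helper are made up for illustration. It only shows that the change between neighbouring window averages, scaled by the window width, equals the entering value minus the leaving value.

```python
def window_means(values, width):
    """Mean of each length-`width` window, sliding one step at a time."""
    return [sum(values[i:i + width]) / width
            for i in range(len(values) - width + 1)]

# Made-up 1D "content" with a rising edge and a falling edge.
signal = [0, 0, 0, 5, 5, 5, 0, 0]
width = 4
means = window_means(signal, width)

# Change between successive averages, scaled back up by the window width.
deltas = [round(width * (b - a)) for a, b in zip(means, means[1:])]

# What actually entered the window minus what left it at each step.
entering_minus_leaving = [signal[i + width] - signal[i]
                          for i in range(len(signal) - width)]

print(deltas)
print(entering_minus_leaving)
```

The two printed lists agree, so each step of the sliding average leaks one difference between two individual samples, even though every average on its own covers four of them.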

I'm butchering the words because this isn't my thing, but this feels like it might be related to some actual signal-processing task (i.e. undoing some kind of signal-mangling that happens in the wild) which increases the chance that there's some good or at least well-studied solution.

The brute-force-ish approach for text reconstruction would also probably be more effective if it checked against a few shifted-around blurred copies of the text, rather than just one.
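A toy 1D version of that idea, as a sketch rather than anything resembling the real tool: `pixelate`, `best_match`, and the candidate arrays are all hypothetical names and data, and block averaging stands in for whatever blur is actually applied. Each candidate is pixelated at every possible grid offset and the best-scoring (candidate, offset) pair is kept.

```python
def pixelate(values, block, offset=0):
    """Average `values` over consecutive blocks of size `block`,
    with the block grid shifted right by `offset` samples."""
    shifted = values[offset:]
    return [sum(shifted[i:i + block]) / block
            for i in range(0, len(shifted) - block + 1, block)]

def best_match(observed, candidates, block):
    """Return (error, name, offset) for the candidate and grid offset whose
    pixelated form is closest to `observed` (mean squared error)."""
    best = None
    for name, values in candidates.items():
        for offset in range(block):
            guess = pixelate(values, block, offset)
            n = min(len(guess), len(observed))
            if n == 0:
                continue  # candidate too short to compare at this offset
            err = sum((g - o) ** 2 for g, o in zip(guess, observed)) / n
            if best is None or err < best[0]:
                best = (err, name, offset)
    return best

# Made-up candidate "texts" as 1D intensity rows.
candidates = {
    "a": [0, 9, 9, 0, 0, 9, 0, 0, 9, 9, 0, 0],
    "b": [9, 0, 0, 9, 9, 0, 9, 9, 0, 0, 9, 9],
}
# Pretend the camera's pixel grid was shifted by one sample.
observed = pixelate(candidates["a"], 3, offset=1)
result = best_match(observed, candidates, block=3)
print(result)
```

In this toy case candidate "a" only matches exactly at offset 1. Comparing against the unshifted copy alone would leave a nonzero error even for the right answer.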