Can you learn to see without images?

May 10, 2026 · Paper

Hypothesis & Idea:

Transformers have become the default backbone for text, image, and video tasks. For language tasks, it has been shown that pretraining on procedural data (some patterns) can improve performance. Can you do the same for vision tasks? -> Generate patterns from formal grammars, encode the symbols with an embedding layer (as a replacement for the visual patch embedding), and train with masked token prediction as supervision.
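
To make the data side of this concrete, here is a minimal sketch of what "generate patterns from a formal grammar, then mask tokens" could look like. Everything specific in it is my assumption and not taken from the paper: the grammar (balanced brackets, i.e. a Dyck-style language), the vocabulary, the masking probability, and the helper names `sample_dyck`, `mask_tokens` and `is_balanced`. The resulting token ids would go through an embedding layer in place of the patch embedding; that part is left out here.

```python
import random

# Hypothetical vocabulary (an assumption, not the paper's): two bracket
# types plus a mask symbol.
PAIRS = {"(": ")", "[": "]"}
VOCAB = {"(": 0, ")": 1, "[": 2, "]": 3, "<mask>": 4}


def sample_dyck(n_pairs, rng):
    """Sample a balanced bracket sequence containing n_pairs bracket pairs."""
    out, stack, opened = [], [], 0
    while opened < n_pairs or stack:
        can_open = opened < n_pairs
        # Open a new bracket if allowed; always open when nothing can be closed.
        if can_open and (not stack or rng.random() < 0.5):
            tok = rng.choice(list(PAIRS))
            out.append(tok)
            stack.append(PAIRS[tok])
            opened += 1
        else:
            # Close the most recently opened bracket.
            out.append(stack.pop())
    return out


def mask_tokens(ids, mask_id, p, rng):
    """Mask each id with probability p and return (inputs, targets).

    targets holds the original id at masked positions and -1 elsewhere,
    so that unmasked positions can be excluded from the loss.
    """
    inputs, targets = [], []
    for i in ids:
        if rng.random() < p:
            inputs.append(mask_id)
            targets.append(i)
        else:
            inputs.append(i)
            targets.append(-1)
    return inputs, targets


def is_balanced(tokens):
    """Check that every bracket is closed by its matching partner, in order."""
    stack = []
    for t in tokens:
        if t in PAIRS:
            stack.append(PAIRS[t])
        elif not stack or stack.pop() != t:
            return False
    return not stack


if __name__ == "__main__":
    rng = random.Random(0)
    seq = sample_dyck(8, rng)
    ids = [VOCAB[t] for t in seq]
    inputs, targets = mask_tokens(ids, VOCAB["<mask>"], 0.3, rng)
    print("".join(seq))
    print(inputs)
    print(targets)
```

The point of a bracket language here is only that predicting a masked token requires looking at distant, nested context, which is the kind of structure the method is supposed to teach.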

Results

It works; otherwise it would not be a paper :) The interesting part is the evaluations they do to “prove”, or at least show, that this actually works.

Does it help? Pretrain + “standard” training -> pretraining leads to better performance
How does it help?
- Additive setting: first pretrain on their formal grammars, then pretrain on ImageNet, then finetune. -> Shows that the gains are complementary.
- Substitutive setting: given a fixed total pretraining budget, assign some of it to procedural data and the rest to real-world data.
What other factors could lead to the improvement?
- Use procedural data from different levels of the Chomsky hierarchy: simple grammars give negligible improvement, too complex ones are also not good.
- Destroy the structure of the grammar by shuffling while keeping the distribution intact -> removes the improvement.
- Pretraining length: if trained for too long, it can decrease performance again. Catastrophic overtraining?!
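
The shuffling control from the list above can be sketched as follows. This is my reading of the ablation, not the paper's code: I assume "keeping the distribution intact" means that the token counts of each sequence stay the same while the order is randomized, and the helper names `shuffled_control` and `is_balanced` are made up for illustration.

```python
import random
from collections import Counter

# Hypothetical bracket grammar used only for illustration.
PAIRS = {"(": ")", "[": "]"}


def shuffled_control(tokens, rng):
    """Return a shuffled copy: same token counts, grammatical order destroyed."""
    out = list(tokens)
    rng.shuffle(out)
    return out


def is_balanced(tokens):
    """Check that every bracket is closed by its matching partner, in order."""
    stack = []
    for t in tokens:
        if t in PAIRS:
            stack.append(PAIRS[t])
        elif not stack or stack.pop() != t:
            return False
    return not stack


if __name__ == "__main__":
    rng = random.Random(1)
    structured = list("(([]))[]")
    control = shuffled_control(structured, rng)
    # Token statistics are identical by construction ...
    print(Counter(structured) == Counter(control))
    # ... but the shuffled sequence usually no longer follows the grammar.
    print(is_balanced(structured), is_balanced(control))
```

If the gain came only from token statistics, this control should help as much as the structured data; according to the notes above, it does not.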

Thoughts

A quote from the paper: “The problem of reasoning over images is primarily a reasoning problem, not an image problem.”
It's interesting that both too simple and too complex formal grammars yield no improvement. Kind of weird. The complex formal grammar does not seem that complex?
They claim their pretraining is complementary, also because most improvements are in the later parts of the network (whereas standard vision pretraining mostly helps the earlier parts). Could this also just come from the fact that the network has to repurpose the earlier parts for low-level vision tasks when fine-tuning?
Figure 2’s curve indeed looks very distinct from that of the non-pretrained variant!