The transformer may be the universal cortical microcircuit
Yash Shah · April 24, 2026

The primate ventral visual system is constrained by two evolutionary pressures: (i) supporting strong visual behavior, as observed via high-accuracy readout of object and action categories, and (ii) maintaining wiring efficiency. The resulting cortical architecture exhibits putative hierarchical linear-nonlinear processing layers, systematic growth of receptive fields across areas, and functional organization of cortical cells. A longstanding challenge has been to identify a cortical microcircuit that satisfies both constraints. Convolutional neural networks (CNNs), built around the convolution operation, have served as the de facto model of the ventral stream. However, previous work has shown that CNNs suffer a small drop in task performance when optimized for wiring efficiency. In this work, we analyze a taxonomy of computational microcircuits and propose that the transformer currently provides the best model of the cortical microcircuit. Although transformers are commonly assumed to rely on global, long-range, lateral-like connections, we show that vision transformers (ViTs) in fact develop a hierarchy of receptive field sizes with depth. When trained under the TDANN framework, topographic ViTs exhibit strong functional similarity to neural data, predict the micro- and meso-scale organization of early and late visual areas, minimize both feedforward and intra-layer wiring length, and maintain high task performance after spatial optimization. Critically, it is the presence of multiplicative operations within the transformer microcircuit, rather than the globality of lateral interactions per se, that drives the transformer's alignment with the cortex.
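The claim that ViTs develop a hierarchy of receptive field sizes can be probed with a standard proxy: the attention-weighted mean spatial distance between each query token and the key tokens it attends to, computed per layer. The sketch below is illustrative only (not the paper's code): `mean_attention_distance` is a hypothetical helper, and the two hand-built attention patterns simply show that local attention yields a small effective receptive field while global uniform attention yields a large one.

```python
import numpy as np

def mean_attention_distance(attn, grid_size):
    """Attention-weighted mean Euclidean distance (in token units) between
    query and key positions on a grid_size x grid_size token grid.

    attn: (N, N) attention weights, N = grid_size**2, each row summing to 1.
    A common proxy for a transformer layer's effective receptive field size.
    """
    coords = np.stack(
        np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij"),
        axis=-1,
    ).reshape(-1, 2)
    # Pairwise Euclidean distances between all token positions.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Weight each query-key distance by its attention and average over queries.
    return float((attn * dist).sum(axis=1).mean())

g = 8
n = g * g
local = np.eye(n)                   # each token attends only to itself
uniform = np.full((n, n), 1.0 / n)  # fully global, uniform attention
print(mean_attention_distance(local, g))    # 0.0
print(mean_attention_distance(uniform, g))  # much larger
```

Applied per layer to a trained ViT's averaged attention maps, an increase of this quantity with depth would constitute the receptive-field hierarchy the abstract describes.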