Can Norms Make Transformers Smarter, Leaner, and Faster?
The promise of transformers is seductive: a single architecture that can read a line of text, parse a scene in an image, and even reason through a problem across languages. But the price of that power is steep. The self-attention mechanism at the heart of modern transformers demands quadratic time and memory as sequence…