In the laboratory hum of the University of Alberta, a team led by Abhineet Singh and Nilanjan Ray is reimagining how machines see video. They don’t just tweak the old detectors or stack a bigger CNN on top of a sequence of frames. They ask a deeper question: what if objects in a video aren’t just boxes in images, but a stream of discrete tokens that unfold over time? The answer, they suggest, could make video understanding more flexible, more end-to-end differentiable, and better aligned with how language models think about sequences. This isn’t a sci‑fi leap; it’s a concrete proposal that extends a token-based object detector called Pix2Seq to the dynamic world of video, producing a single, end-to-end handle on moving scenes.
Traditional detectors churn out fixed-size, real-valued outputs, like coordinates and confidence scores, that are stitched together post hoc to form a video interpretation. Singh and Ray flip the script. They represent each object not as a single two-dimensional box, but as a sequence of tokens that encodes its presence, location, and even its movement across a window of frames. The essential idea is to treat videos as collections of tracklets—three-dimensional boxes that stretch across time—expressed as discrete tokens the model can predict one by one. The appeal is twofold: it sidesteps the lossy, heuristic postprocessing that plagues many detectors, and it scales more gracefully to longer video horizons as computing power catches up. The work stands as a reminder that the best way to teach a machine to see a video might be to let it speak in a language it already knows well—the language of tokens and sequences.
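To make the idea concrete, here is a minimal Python sketch of what a tokenized tracklet might look like. It is an illustration in the spirit of Pix2Seq's quantized coordinate vocabulary, not the authors' exact scheme: the bin count, token ordering, and class-token offset are all assumptions chosen for the example.

```python
# Illustrative sketch: flatten a tracklet (one box per frame) into a discrete
# token sequence. Bin count, token order, and class offset are assumptions.
NUM_BINS = 1000  # hypothetical size of the coordinate vocabulary


def quantize(coord, extent, num_bins=NUM_BINS):
    """Map a continuous coordinate in [0, extent] to a discrete bin index."""
    return min(num_bins - 1, int(coord / extent * num_bins))


def tracklet_to_tokens(boxes, class_id, width, height):
    """boxes: list of (x1, y1, x2, y2), one per frame in the video window.
    Returns coordinate tokens for every frame followed by a class token."""
    tokens = []
    for x1, y1, x2, y2 in boxes:
        tokens += [
            quantize(y1, height), quantize(x1, width),
            quantize(y2, height), quantize(x2, width),
        ]
    tokens.append(NUM_BINS + class_id)  # class ids live above the coordinate bins
    return tokens


# A three-frame tracklet of one object in a 640x480 video window.
boxes = [(100, 80, 220, 200), (110, 85, 230, 205), (120, 90, 240, 210)]
print(tracklet_to_tokens(boxes, class_id=7, width=640, height=480))
```

Read back as a sentence, the sequence is the object's motion story: its coordinates, frame by frame, closed by its class.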
Singh and Ray’s paper, from the University of Alberta, tackles a familiar problem in computer vision: how to make discrete, variable-sized outputs, like the number of objects in a frame, compatible with end-to-end training. The team builds on Pix2Seq, a framework that treats detection and related vision tasks as language modeling problems. The leap here is to extend that tokenization to video, arguing that a video object can be captured not by a handful of frame-by-frame boxes, but by a contiguous, frame-spanning sequence of tokens that describes its footprint across time. The result is a detector that can, in principle, grow with the video length—provided hardware catches up with the data.
For curious readers, what matters is less the minute math and more what this approach promises: fewer ad hoc rules, end-to-end learning of spatiotemporal structure, and a framework that can adapt as we feed it longer, richer video. If this line of thinking holds, the future of vision could look less like “spotting objects in each frame” and more like “reading the motion story as a sentence.” The paper is a tour through a jury-rigged but increasingly capable architecture that banks on one simple conjecture: if you want a model to understand a video, give it a language it already knows how to speak—tokens that unfold autoregressively, frame by frame, across time.
At the core is a basic but powerful shift in how we think about outputs. Classical detectors say: here is a fixed set of boxes, with a fixed number of predictions per image. The new approach says: here is a variable-length sequence of tokens per object, per video window. The tokens encode location through coordinates, then the class label, and crucially they do so not for a single moment but as a tubelet—an object’s footprint across N frames. If the object disappears for a frame or two, a special NA token signals its absence, and the sequence continues with the frame where it reappears or with a different object in its place. This elegant mechanism lets the model gracefully handle objects that enter, leave, become occluded, or reappear, without the brittle postprocessing that often bedevils video detectors.
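Continuing the earlier sketch, the fragment below shows one way such an absence could be encoded: when the object is missing from a frame, a single NA token stands in for that frame's coordinates. The NA token id, the class vocabulary size, and the per-frame layout are again assumptions made for illustration, not the paper's actual vocabulary.

```python
# Illustrative sketch: tokenize one object across an N-frame window, emitting a
# special NA token for frames where the object is absent. Token ids are assumed.
NUM_BINS = 1000                     # coordinate bins, as in the earlier sketch
NUM_CLASSES = 80                    # hypothetical class vocabulary size
NA_TOKEN = NUM_BINS + NUM_CLASSES   # reserved id: "object absent in this frame"


def quantize(coord, extent, num_bins=NUM_BINS):
    """Map a continuous coordinate in [0, extent] to a discrete bin index."""
    return min(num_bins - 1, int(coord / extent * num_bins))


def tubelet_to_tokens(per_frame_boxes, class_id, width, height):
    """per_frame_boxes: one entry per frame, either (x1, y1, x2, y2) or None
    when the object is not visible in that frame."""
    tokens = []
    for box in per_frame_boxes:
        if box is None:
            tokens.append(NA_TOKEN)  # object missing from this frame
            continue
        x1, y1, x2, y2 = box
        tokens += [
            quantize(y1, height), quantize(x1, width),
            quantize(y2, height), quantize(x2, width),
        ]
    tokens.append(NUM_BINS + class_id)  # close the tubelet with its class token
    return tokens


# Object visible in frames 1 and 3, occluded in frame 2.
window = [(100, 80, 220, 200), None, (120, 90, 240, 210)]
print(tubelet_to_tokens(window, class_id=7, width=640, height=480))
```

Because absence is just another token in the output vocabulary, the model can learn to predict it the same way it predicts coordinates and classes, rather than relying on a separate postprocessing rule to decide when a track has ended.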
Ultimately, the authors are wrestling with a truth of modern AI perception: as we push toward more capable, end-to-end systems, the best path forward may require rethinking not just model architectures, but also the very language we use to describe what a model is predicting. If you can tokenize a video’s moving objects, the model has a common, differentiable target to optimize, from detection to tracking, with a single output space. The potential is not just incremental gains on a benchmark; it is a step toward more unified, end-to-end perception that can scale as compute improves and as data grows.