A Short Video’s Long Shadow on Global Networks

The digital planet is wired with tiny accelerators of attention: 15-second clips that leap from feed to feed, across platforms and languages. A video’s journey isn’t just about one platform; it’s a web of echoes across Douyin, Kuaishou, Xigua, Toutiao, and Bilibili. A new study asks not just how many views a video racks up in its first hour, but how its influence accumulates across the entire ecosystem over days and weeks. The researchers describe a task called Short-video Propagation Influence Rating (SPIR), and they built a cross-platform dataset and a new model to measure it.

Led by Dizhan Xue and colleagues from the Institute of Automation, Chinese Academy of Sciences, with collaborators from the University of Chinese Academy of Sciences, Tianjin University of Technology, and Qihoo 360 AI Lab, the team stitched together a dataset that feels like a map of an enormous, multi-platform social network. They crawled 117,720 short videos across five Chinese platforms, gathered 381,926 samples, and annotated each video with a propagation influence level from 0 to 9 over a two-week horizon. And they built a graph with 5.5 million nodes and 1.7 billion directed edges, a scale that makes most social-media studies look like a rough sketch in chalk.

Cross-platform Propagation and the SPIR Concept

Traditional popularity metrics tend to boil a video’s impact down to a single number, often views or likes, and usually within a short window. SPIR, by contrast, asks a bigger question: what is the long-run influence of a video when you consider the whole ecosystem of interactions across multiple platforms? The task is to predict a propagation influence rating on a scale from 0 to 9 that captures not just how many people watched, but how many engaged, shared, collected, commented, and discussed it over weeks. It is the digital version of forecasting an idea’s social gravity rather than just its pulse in the moment.

To fuel SPIR, the team built XS-Video, a large-scale, cross-platform short-video dataset that stands apart from prior collections. It spans five of the largest Chinese platforms: Douyin, Kuaishou, Xigua, Toutiao, and Bilibili. The dataset contains 381,926 video samples drawn from 117,720 videos and covers 535 topics, with the full suite of interactions recorded, including views, likes, shares, collects, fans, comments, and even the actual content of comments. Each video is annotated with a long-term propagation influence level, assessed by the accumulation of these multi-dimensional indicators over roughly two weeks after publication. And the team went a step further by aligning cross-platform indicators to make apples-to-apples comparisons across platforms with very different audience scales and engagement norms.

The XS-Video dataset is more than a catalog. It lets researchers see the connective tissue between platforms as a single network. Across the dataset, the authors describe a propagation graph that contains about 5.5 million nodes—think videos, topics, comments, and the many flavors of engagement—and roughly 1.7 billion directed edges that embody the relationships among those nodes. It is a living, cross-platform snapshot of how a single creative act ripples through a sprawling social ecosystem, rather than a single-platform case study. The result is a rare resource that promises to push forward our understanding of online influence in the real world, not just in a lab or a single app.

NetGPT: A Large Graph Model that Speaks Graph and Language

Turning a graph this large into something a computer can reason about is the core challenge. Graph neural networks (GNNs) are natural for this kind of data, but graphs of this scale, with heterogeneous node types and cross-modal signals, stretch ordinary models to the limit. The authors answer with a new large graph model named NetGPT, a three-stage training pipeline that blends the strengths of graph reasoning with the broad-world knowledge of large language models (LLMs).

The first stage, heterogeneous graph pretraining, uses a two-layer RGCN to extract node features from the multi-modal graph. Video nodes borrow ViT-derived features, text nodes use a Chinese RoBERTa representation, timestamps are encoded with sinusoidal signals, and scalar interactions like views and likes are log-transformed. This creates a rich, high-dimensional representation of the nodes and their relations, capturing both content and context. A supervised pretraining loss nudges the network to predict the propagation level from the node features, establishing a baseline encoder for the graph.
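The graph encoder itself lives inside a deep-learning pipeline, but the two scalar transforms mentioned above are simple enough to sketch in plain Python. This is a minimal illustration, not the authors’ code: the function names, the feature dimension, and the time unit are assumptions.

```python
import math

def sinusoidal_time_encoding(t: float, dim: int = 8, base: float = 10000.0) -> list:
    """Encode a timestamp (e.g., hours since publication) as interleaved
    sine/cosine features, in the style of Transformer positional encodings.
    Each pair of outputs uses a geometrically decreasing frequency."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        enc.append(math.sin(t * freq))
        enc.append(math.cos(t * freq))
    return enc

def encode_interactions(counts: dict) -> dict:
    """Log-transform heavy-tailed engagement counts (views, likes, shares...)
    so one viral outlier doesn't swamp the feature scale; log1p keeps
    zero counts at exactly zero."""
    return {name: math.log1p(value) for name, value in counts.items()}
```

A video posted at time zero maps to alternating zeros and ones (sin 0 and cos 0), and a million-view video lands within an order of magnitude of a thousand-view one after the log transform, which is exactly the compression these pipelines rely on.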

The second stage, supervised language fine-tuning, is where NetGPT learns to put the graph into words. A small, trainable projection maps the GNN features into the token space of an open-source LLM, and the model is fed a carefully crafted instruction that frames the graph as a set of structured information about the short video. The ground truth is the propagation level, and the objective is to maximize the probability that the LLM outputs the correct label. Importantly, the authors freeze the backbone LLM and the GNN during this stage, training only the projection layer. It is a delicate ballet between structured graph knowledge and flexible language reasoning, designed to avoid corrupting pretrained knowledge while teaching the model to read the graph in human-friendly terms.
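In practice that projection is a trained linear layer inside a deep-learning framework; the stdlib sketch below only makes the shape-shuffling concrete. The function name, the framing of the output as “soft tokens” prepended to the prompt, and all dimensions are illustrative assumptions, not details from the paper.

```python
def project_to_soft_tokens(gnn_feat, weight, num_tokens, llm_dim):
    """Map one graph embedding into `num_tokens` pseudo-token vectors that
    can be prepended to the LLM's input embeddings. `weight` is a
    (num_tokens * llm_dim) x len(gnn_feat) matrix and, in the scheme
    described above, the ONLY trained parameters at this stage: the GNN
    and the LLM backbone both stay frozen."""
    # Linear map: one output scalar per weight row.
    flat = [sum(w * x for w, x in zip(row, gnn_feat)) for row in weight]
    # Reshape the flat vector into num_tokens vectors of size llm_dim.
    return [flat[i * llm_dim:(i + 1) * llm_dim] for i in range(num_tokens)]
```

Freezing both backbones means gradient updates touch only this small matrix, which is why the stage is cheap to train and cannot erase what either pretrained model already knows.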

The third stage, task-oriented predictor fine-tuning, adds a lightweight regression head that translates the LLM’s last-token state into a propagation level. This step leverages the LLM’s capacity to attend to many tokens and to fuse textual and graph-derived signals into a single, calibrated score. The entire pipeline—graph encoder plus language conditioning plus a small predictor—works end-to-end to produce the SPIR rating for a newly posted short video.
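The paper describes this head only at a high level; one plausible minimal form, sketched here with illustrative names, is a linear readout of the last-token hidden state clamped onto the discrete 0-9 SPIR scale. The real predictor is trained jointly and may be more elaborate.

```python
def predict_level(last_hidden, head_weight, head_bias=0.0):
    """Map the LLM's last-token hidden state to a propagation level.
    A dot product gives a raw score, which is rounded and clamped to
    the discrete 0-9 SPIR scale."""
    raw = sum(w * h for w, h in zip(head_weight, last_hidden)) + head_bias
    return max(0, min(9, round(raw)))
```

Treating the output as a bounded regression rather than free-form text generation is what makes the final score calibrated: the model cannot emit an out-of-range or malformed label.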

When tested against a broad benchmark that includes four GNN baselines, four LLM baselines, and two multimodal LLMs, NetGPT wins on every front. It outperforms the strongest graph baselines by substantial margins and surpasses language-only models that have no direct access to the propagation graph. The ablation studies underscore two messages: content matters, and the graph structure matters. Removing video content or severing the links between video nodes and their interactive metadata both degrade performance; removing the cross-stage training also hurts accuracy. NetGPT’s gains come from a principled fusion of graph structure and language reasoning, not from any single component alone.

Why This Matters: The Promise and Peril of Predicting Long-Term Influence

Why should we care about predicting long-term propagation influence for short videos? Because this is where the rubber meets the road in how information spreads, how communities form, and how platforms monetize attention. A robust SPIR framework can inform better recommendations that reward genuinely engaging content rather than shallow bursts of attention. It can help advertisers target messages more responsibly by distinguishing videos that will sustain discussion and value from those that merely go viral for a moment. And it opens a window into how cross-platform dynamics shape public discourse, allowing researchers and policymakers to study how ideas travel through the entire social graph rather than within a single app.

At the same time, the work surfaces the cultural and technical challenges of measuring influence in a cross-platform world. The long-tail distributions observed in the XS-Video data—a small number of topics and videos dominating engagement while a long tail languishes—mirror the real world: popularity is not evenly shared, and the outliers often pull the conversation along. The cross-platform alignment method, which uses influential creators as a normalization anchor to scale indicators across platforms, is a clever workaround for platform-by-platform normalization problems. It also hints at a broader truth: understanding influence in a global, multi-platform ecosystem requires that we see the patterns across many surfaces, not just a single feed.
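The paper’s exact alignment formula is not reproduced here, but the anchoring idea can be sketched as a simple exchange rate between platforms: if the same influential creators average very different engagement on two platforms, that ratio rescales everything else. The function and its arguments are illustrative assumptions.

```python
def align_indicator(value, platform_anchor_mean, reference_anchor_mean):
    """Rescale an engagement count from one platform onto a reference
    platform's scale, using anchor creators' mean engagement on each
    platform as the exchange rate. A sketch of the anchoring idea, not
    the paper's exact formula."""
    scale = reference_anchor_mean / platform_anchor_mean
    return value * scale
```

For instance, if anchor creators average 500 likes per video on a smaller platform but 2,000 on the reference platform, a 1,000-like video on the smaller platform is treated as roughly equivalent to a 4,000-like video on the reference one, making the two counts comparable despite very different audience sizes.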

Another takeaway is methodological humility. LLMs are mighty, but they do not automatically understand the structure of a heterogeneous social graph. You cannot paste raw graph data into a vanilla language model and expect smart reasoning to emerge. NetGPT’s three-stage approach acknowledges the limits of current AI while pushing the envelope—showing that when graph structure and language-capable models cooperate, predictions become materially better. The researchers quantify the difference clearly: even strong LLMs without graph inputs lag far behind NetGPT on the SPIR task, illustrating that the graph is not just decoration but essential context for long-horizon social dynamics.

What This Means for the Near Future of AI and Social Platforms

First, the XS-Video dataset and SPIR task provide a blueprint for how to study influence in the wild, across platforms that rarely speak to each other. If this approach generalizes beyond the five Chinese platforms studied, it could become a standard for cross-platform analytics, across languages and geographies. The potential impact is practical: better content discovery that balances novelty with sustained value, more responsible advertising that respects user experience across ecosystems, and improved tools for content moderation and public-sphere health that can detect when a message is likely to propagate harmful narratives over time.

Second, NetGPT and its three-stage training scheme offer a pathway for future AI systems that must reason over large, heterogeneous graphs. As AI systems increasingly touch the real world—finance networks, supply chains, scientific collaboration graphs—the lesson is clear: to leverage large-scale knowledge and reasoning, we must ground language models in structure. Language alone, without a map of relationships and temporal dynamics, misses the deeper story of how things spread, evolve, and influence one another.

Third, the study also invites reflection on ethics and governance. If we can predict long-term influence, should platforms use these predictions to shape what users see? How do we guard against manipulation, bias, or the amplification of fragile misinformation? The XS-Video work doesn’t answer these questions, but it foregrounds them. By making the cross-platform propagation graph legible to AI, researchers—and, by extension, platforms and policymakers—gain new levers to steer, study, and, when necessary, curb harmful dynamics. The values we bake into these systems will matter as much as the technical prowess behind them.

Towards a More Curious, More Responsible Digital Ecology

In the end, this research is as much about curiosity as it is about engineering. It invites us to see short videos not as isolated sparkles of entertainment but as living threads in a vast, cross-platform tapestry. The XS-Video dataset acts like a new kind of map, one that integrates content with the social currents that carry it. NetGPT then offers a language for that map—an AI that can converse with the graph, reason about its shape, and translate that reasoning into a score that captures something as elusive as influence across time.

As technology spectators and participants, we should celebrate the ambition here: to quantify and understand a phenomenon that touches millions of lives every day. At the same time, we should demand thoughtful stewardship as these tools evolve. The science is exciting, but the social responsibility is equally urgent. A long shadow over a short video is not necessarily a problem; it becomes a problem if we forget to ask who benefits, who gets harmed, and how we can shape a healthier digital commons for the future.

The XS-Video project, a collaboration led by the Institute of Automation at the Chinese Academy of Sciences, highlights a practical truth about modern AI research: real-world data, real-world scale, and a real appetite for bridging disciplines—graph theory, language modeling, and social science—can together illuminate the hidden choreography of online life. For readers who track the next step in AI as it touches media, democracy, and everyday curiosity, this work offers a compelling glimpse of what a more informed, more accountable digital era could look like.