Padding Shifts the Ground Under Function Detectors in PE Files

In the hidden world of binary analysis, researchers hunt for the invisible rails that carry a program from start to finish. A compiled program carries no explicit record of where its functions begin and end; those boundaries dissolve into a tangle of bytes, calls and jumps, and crafty padding between blocks. For security researchers, whether malware analysts, vulnerability researchers, or anyone who wants to understand what a binary really does, finding where one function begins is the first real step toward deciphering the whole map. A new study from the Institute for Internet Security at Westphalian University of Applied Sciences in Gelsenkirchen tackles this problem head-on for Windows PE files, the format behind Windows executables. It builds a bridge between old-school heuristics and modern learning-based detectors, all while revealing a stubborn wrinkle: the padding between functions can mislead even the best detectors.

The authors behind FuncPEval, a Windows PE dataset built specifically for this challenge, include Raphael Springer and colleagues at the Westphalian University of Applied Sciences. Their work is notable not just for lining up eight different function start detectors (five heuristics-based and three machine-learning-based) but for revealing how these tools perform on real Windows binaries that mix benign software with dangerous samples like the Conti ransomware. The study makes a candid, almost human case for why it matters to stress test detectors on Windows PE code with ground truth covering 1,092,820 function starts. In other words, they ask a simple but hard question: when the padding between functions changes, do learning-based detectors still see the same shapes in the code that humans expect to find? The short answer is a nuanced yes, with a warning label hanging from the other side of the door.

A new dataset and a clear target

FuncPEval is the centerpiece of the paper. It is a Windows PE dataset that covers both 32-bit and 64-bit binaries and includes ground truth for function starts drawn from real samples such as Chromium components and a Conti ransomware sample. The ground truth comes from symbol information in the accompanying debug data, which the authors extract with standard tooling. The scale is striking: the dataset maps just over a million function starts, 1,092,820 in total, to concrete byte addresses across x86 and x64 code drawn from Chromium and Conti, with tens of millions of bytes in play. As a reference point, the paper notes that FuncPEval spans more than twice the PE-focused data previously available, offering a more demanding testbed for detectors that have often been trained on Linux-focused ELF samples. FuncPEval thus becomes not just a dataset but a new standard for evaluating how well detectors generalize to PE files that reflect real-world toolchains and malware realities.
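
To make the idea of address-level ground truth concrete, here is a minimal sketch of how function start addresses can be enumerated from a 64-bit PE. It is not the authors' symbol-based pipeline: instead of debug data it reads the x64 exception directory (.pdata), which lists entry points for most, though not all, functions; the file name is a placeholder and the snippet assumes a pefile version that parses that directory.

```python
import pefile  # third-party PE parser (pip install pefile)

def pdata_function_starts(path: str) -> set[int]:
    """Approximate function start addresses from a 64-bit PE's exception directory.

    Illustration only: the FuncPEval ground truth comes from debug symbols,
    which are more complete and also cover 32-bit binaries.
    """
    pe = pefile.PE(path)
    image_base = pe.OPTIONAL_HEADER.ImageBase
    starts = set()
    for entry in getattr(pe, "DIRECTORY_ENTRY_EXCEPTION", []):
        # Each RUNTIME_FUNCTION record gives the RVA at which a function begins.
        starts.add(image_base + entry.struct.BeginAddress)
    return starts

print(len(pdata_function_starts("chromium_component.dll")))  # hypothetical file name
```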

On the ground, the authors describe two families of detectors they test: heuristics-based systems that rely on hand-crafted patterns or signatures, and machine-learning-based systems that try to learn what function starts look like from sequences of bytes or disassembled instructions. The eight tools in total cover a spectrum from classic disassembly workhorses to modern, data-driven learners. The study does not pretend that one method rules them all; instead it lays bare the tradeoffs between precision, recall, speed, and robustness to subtle changes in the binary, especially padding. The human side of the story, why these starts matter, receives strong emphasis as well: misdetect a function start and you can derail entire pipelines that rely on function boundaries for decompilation, binary similarity, and malware lineage tracking.
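
To give a feel for the heuristics family, here is a toy sketch in the spirit of the signature-and-padding rules the paper describes. It is not any of the evaluated tools, and the padding byte values and run length are illustrative assumptions; real detectors layer recursive traversal of call targets, prologue signatures, and control-flow analysis on top of rules like this.

```python
# Toy heuristic: treat the first non-padding byte after a run of padding bytes
# as a candidate function start. Real tools apply many more checks than this.
PADDING_BYTES = {0xCC, 0x90}  # int3 and nop, common inter-function filler on x86/x64

def candidate_starts(code: bytes, base_va: int, min_run: int = 2):
    """Yield virtual addresses that immediately follow a padding run."""
    run = 0
    for offset, byte in enumerate(code):
        if byte in PADDING_BYTES:
            run += 1
        else:
            if run >= min_run:
                yield base_va + offset  # first non-padding byte after the run
            run = 0
```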

The Westphalian team also shows that even small changes in the way compilers pad functions can ripple through detectors. Learners trained on binaries with ordinary, unmodified padding patterns can suddenly misfire when the padding is altered. In a field where automation must scale to millions of samples, that kind of brittleness matters as much as raw accuracy.

Eight tools, two families, one clash with padding

The core experiment pits eight detectors against PE samples in the FuncPEval dataset: five heuristics-based approaches and three learning-based systems. The heuristics include stalwarts like Ghidra and IDA Pro, whose function boundary detection relies on signature patterns and static analysis, plus other static analysis tools such as Nucleus, SMDA, and rev.ng that blend different strategies to locate function starts. On the learning side, the study tests DeepDi, Shin et al.'s RNN-based approach, and XDA, a transfer-learning-based model. Between them they cover a broad range of what researchers have proposed over the last decade to wrestle with the function start problem.

The results reveal consistent patterns. For a representative Windows PE sample like Chromium x64, IDA emerges as the top performer with an F1 score around 98.4 percent, followed closely by DeepDi at about 97 percent. The rest of the learning-based and heuristic tools cluster a bit lower, though several still attain strong F1 scores in the 80s and 90s. When the authors widen the lens to include both a benign target (Chromium) and a malicious sample (Conti), the same theme holds: the best detectors blend robust disassembly with solid pattern recognition, and the fastest options slide in behind them with slightly lower accuracy but much greater throughput. The speed-versus-accuracy tradeoff is laid bare in a striking way: DeepDi, the fastest of the learning-based tools, closes the gap with the best detectors on Chromium while delivering near-real-time performance that makes it attractive for large-scale analysis. The human takeaway is that there is no single tool that is best for every job; the best choice depends on whether you value precision, speed, or a balance of both.
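
For readers keeping score, the F1 values reported throughout are the usual harmonic mean of precision and recall over detected function start addresses, where TP counts reported addresses that match the ground truth, FP counts spurious reports, and FN counts true starts that were missed:

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```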

Beyond raw numbers, the authors use the FuncPEval results to probe a deeper question: how much do machine-learning-based detectors actually rely on the padding that sits between functions? Their analysis shows that the padding carries a surprising amount of information for several ML-based detectors, especially the RNN and XDA. In other words, these models sometimes learn to recognize a padding pattern that tends to precede a function start rather than the intrinsic cues that truly signal a function boundary. The result is a vulnerability: if padding is altered or randomized, the learning-based tools can collapse in performance. The authors quantify this with a sobering set of controlled experiments in which they replace padding bytes with random data. For Chromium x86, the RNN's F1 score plunges from around 78 percent to a shade above 6 percent. For Chromium x64, XDA drops from near 87 percent to roughly 12 percent. DeepDi, which operates at a higher level of granularity in the instruction stream, also suffers, but not as dramatically. Nucleus, a tool that leans on control flow graphs, suffers the most among the non-learning-based approaches when padding is randomized. The upshot is clear: padding is not just a nuisance; it can be a crutch for learning-based detectors, a phenomenon the authors call out as a spurious correlation.
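
The shape of that controlled experiment can be sketched in a few lines. The version below is a simplified approximation, not the authors' code: it assumes ground-truth (start, end) offsets are available for each function within a code section and simply overwrites the gaps between consecutive functions with random bytes before the section is handed back to a detector.

```python
import random

def randomize_padding(code: bytes, func_ranges, seed: int = 0) -> bytearray:
    """Overwrite inter-function gaps with random bytes.

    `code` is the raw content of an executable section; `func_ranges` is a list
    of (start, end) offsets, relative to the section, for the ground-truth
    functions it contains.
    """
    rng = random.Random(seed)
    patched = bytearray(code)
    ordered = sorted(func_ranges)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        for off in range(prev_end, next_start):  # bytes between two functions
            patched[off] = rng.randrange(256)
    return patched
```

If a detector's output on the patched section diverges sharply from its output on the original, the detector was leaning on the padding rather than on the functions themselves.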

To build confidence and push the field forward, the authors do something notable: they don't just report results, they actively refine models. They reproduce Shin et al.'s RNN pipeline, introduce a variant with a single output neuron that yields improved F1 scores, and, on the XDA side, diagnose a labeling quirk that made adjacent functions hard to separate, then fix it with a new encoding. The net effect is a demonstration that small, well-reasoned tweaks to training data and labeling can yield meaningful gains in the messy real world. The team even retrains XDA on the revised encoding and observes a meaningful improvement in F1 on their target data, though they also show that the gains do not fully erase the generalization gaps when unseen toolchains appear. The theme here is not triumphalism but careful, reproducible, incremental improvement in a field where such discipline matters.
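
For orientation, the general shape of such a byte-level recurrent detector looks roughly like the sketch below: a bidirectional recurrent network reads raw bytes and emits one logit per byte meaning "a function starts here." This is a minimal illustration of the idea, not the authors' reproduction; the choice of GRU cell, layer sizes, and training details are assumptions.

```python
import torch
import torch.nn as nn

class ByteBoundaryRNN(nn.Module):
    """Bidirectional GRU over raw bytes with a single output neuron per position."""

    def __init__(self, embed_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(256, embed_dim)   # one embedding per byte value
        self.rnn = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)        # the single output neuron

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len) integers in [0, 255]
        x = self.embed(byte_ids)
        out, _ = self.rnn(x)
        return self.head(out).squeeze(-1)           # (batch, seq_len) logits

# Training pairs each byte with a 0/1 label ("function starts here") and uses
# nn.BCEWithLogitsLoss, typically with a positive-class weight since starts are rare.
```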

Fragile learners, sturdier defenders

The padding experiments are not an exercise in despair but a beacon for how to design more trustworthy detectors. The study makes two core claims about robustness. First, the ML-based approaches are highly sensitive to the specific padding pattern they were trained on. When padding is changed to random bytes, their accuracy can fall by 30 percentage points or more, turning what looked like reliable detectors into brittle tools. Second, non-learning-based tools show more resilience to padding changes, though they are not immune. Tools like IDA and Ghidra show only modest drops when padding is randomized, while Nucleus and DeepDi fall more in line with the ML-based detectors. The practical implication is sobering: relying solely on learning-based detectors for large-scale malware triage could create blind spots if attackers exploit padding as a form of obfuscation.

And yet the study is not all warning signs. It also gives a clear path forward. The FuncPEval dataset itself becomes a baseline for future work, enabling researchers to test across more samples, other toolchains, and a wider variety of compiler configurations. The authors also demonstrate that careful adjustments to model design and data representation can yield tangible improvements, pushing XDA and the RNN-based approach toward more robust operation. They publish code and data to invite independent replication and extension, a move that has become essential for making progress in binary analysis as a practical science rather than a collection of clever tricks. In a field where reproducibility can be as hard as the problem itself, this openness matters as much as the numbers.

The study also keeps the human stake in view. It underscores how a single misidentified function start can derail a chain of analyses: decompilers, code similarity measures, or risk assessments that rely on accurate function boundaries. In the real world, malware researchers want pipelines that can scale to millions of samples while staying honest about what they know and what they do not. Padding matters because it is not just a cosmetic feature of compilers; it is a marker that can either help or hinder automated systems that keep the software ecosystem safer.

Padding as a lens on generalization and defense

The Westphalian team keeps a keen eye on the broader implications of their results. One big theme is generalization: ML-based detectors trained on one set of binaries and one padding regime often stumble when faced with a different toolchain or different compiler options. In practical terms, if researchers want detectors to work across Windows updates, across different malware families, and across the many ways a binary can be built, they need datasets that reflect that diversity. The authors provide such a dataset in FuncPEval, including Chromium variants and Conti, and they compare detectors across x86 and x64 samples built with widely used toolchains. This is an important step toward more credible, broadly applicable tooling.

Another theme is the tug of war between speed and precision. For large-scale malware analysis or incident response, throughput can be the deciding factor. The study highlights that DeepDi, a learning-based approach, is the fastest among the ML tools while preserving strong accuracy. For analysts who must chew through enormous corpora, that kind of speed with acceptable accuracy makes a big difference. But the paper also reminds us that speed should not come at the expense of robustness. In other scenarios, IDA's precision remains the gold standard, even if it takes longer. The best choice depends on the job at hand.

Finally, the authors lean into the philosophy of reproducible research. They not only release their dataset and models but also provide transparent analysis of where previous work diverged from their own findings. They reveal labeling quirks in XDA that affected reported results and demonstrate how a revised encoding can restore much of the missing ground truth signal. Such openness matters for a field that increasingly depends on learned models to guide decisions that affect security, risk, and safety.

It is worth noting that the study is anchored in a very real institution: the Westphalian University of Applied Sciences in Gelsenkirchen, Germany, and its Institute for Internet Security. The team is led by Raphael Springer with coauthors Alexander Schmitz, Artur Leinweber, Tobias Urban, and Christian Dietrich. Their work embodies a pragmatic blend of rigorous evaluation and a willingness to expose weaknesses in the tools we rely on. In a domain where the gap between algorithm and consequence can be measured in lives and systems, this kind of careful, transparent, and human-aware research matters.

The broader takeaway is nuanced: there is no silver bullet for function start detection in PE files. The best performers combine solid static analysis with a careful reconstruction of binary structure, and even they can be sensitive to the exact way a binary was built. The padding between functions, a detail that might seem trivial to the untrained eye, is in fact a hinge that can swing results dramatically. This is not just a niche concern for researchers; it is a reminder that software realities (the way compilers lay out code, the variability of toolchains, and the ever-shifting strategies of malware authors) shape the capabilities of the security tools we depend on.

In the end, the paper offers both a map and a warning. It maps out the terrain of function start detectors on Windows PE files with a clean, well-documented dataset and an open invitation to reproduce and extend. It warns that padding is not a cosmetic feature but a real adversary for ML-based detectors, and that defense in depth will require combining human insight, solid engineering, and data that represents the diversity of real-world binaries. The result is a healthier setting for progress: a community that knows its own limits, shares its data, and builds tools that can stand up to the unpredictable padding that real code always carries with it.

Key takeaway: the padding between functions is a critical and surprisingly fragile anchor for modern detectors, and the field is moving toward datasets and models that recognize this rather than pretend it does not exist. The study from Westphalia shows that robust, scalable function start detection in PE files will likely come from a blended approach: leverage the speed of learned models where possible, but anchor them with strong, tool-aware static analysis and a commitment to reproducibility and diverse training data. In that sense, the padding debate is not a sideshow; it is a litmus test for how ready we are to trust the software that helps keep other software safe.