Why Your Java Project Might Be Hiding More Than You Think
Software developers often rely on third-party libraries to speed up building applications. These libraries are like ready-made Lego blocks—pieces of code crafted by others that you can snap into your project instead of building everything from scratch. But what if some of these Lego blocks are sneaking into your project without proper labels or documentation? That’s the tricky problem tackled by a team of researchers from Nanyang Technological University, Singapore, and Huazhong University of Science and Technology, China, led by Lida Zhao and Yueming Wu.
They developed JC-Finder, a tool designed to uncover hidden third-party libraries in Java projects—libraries that have been copied and pasted directly into the code rather than imported through official package managers. This kind of code reuse, known as clone-based reuse, is surprisingly common but often invisible to traditional tools that track dependencies.
The Invisible Library Problem
Most software composition analysis (SCA) tools detect third-party libraries by looking at package managers—tools like Maven or Gradle that explicitly declare which libraries a project depends on. But developers sometimes copy code directly from libraries into their projects, bypassing package managers entirely. This practice can cause headaches for maintenance, security, and licensing compliance.
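To make the blind spot concrete, here is a minimal sketch of the declared-dependency view that a package-manager-based check works from. It is an illustration only, not how any particular SCA product or JC-Finder is implemented. The class name `DeclaredDependencies`, both method names, and the `groupId:artifactId` string format are assumptions made for this example. It reads every `<dependency>` element in a Maven `pom.xml` and ignores details such as scopes, profiles, and inherited parent POMs.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects the libraries a project *declares* in its pom.xml.
final class DeclaredDependencies {

    // Returns "groupId:artifactId" for every <dependency> element in the POM text.
    static Set<String> fromPom(String pomXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pomXml.getBytes(StandardCharsets.UTF_8)));
            NodeList deps = doc.getElementsByTagName("dependency");
            Set<String> result = new LinkedHashSet<>();
            for (int i = 0; i < deps.getLength(); i++) {
                Element dep = (Element) deps.item(i);
                result.add(childText(dep, "groupId") + ":" + childText(dep, "artifactId"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse POM", e);
        }
    }

    // Text of the first descendant element with the given tag, or "" if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // Libraries found by some other analysis (e.g. clone detection) but never declared.
    static Set<String> undeclared(Set<String> detected, Set<String> declared) {
        Set<String> missing = new LinkedHashSet<>(detected);
        missing.removeAll(declared);
        return missing;
    }
}
```

Anything copied into the source tree never shows up in `fromPom`, so it can only surface through a separate source-level analysis, which is what `undeclared` stands in for here.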
Imagine you’re trying to audit a Java project for security vulnerabilities or license violations. If some libraries are hidden inside the code as clones, your tools might miss them entirely. Worse, this can lead to outdated or vulnerable code lingering unnoticed, or even unintentional license infringements.
Why Java Needs a Different Lens
Existing clone-detection tools often work at the function or file level, but Java’s object-oriented design makes these approaches less effective. Java code is organized into classes, which bundle data and methods together. Methods inside a class often interact closely, both with each other and across classes through inheritance and polymorphism, creating a web of relationships that function-level analysis misses.
The JC-Finder team realized that to truly capture the essence of cloned libraries in Java, they needed to analyze code at the class level, preserving the relationships between methods inside a class. This approach respects Java’s design principles and captures the full functionality of cloned code segments.
How JC-Finder Sees What Others Miss
JC-Finder works by parsing Java source code into abstract syntax trees (ASTs) at the class level, stripping away superficial details like variable names and formatting to focus on the structural skeleton of the code. It then links related methods within classes to maintain their interconnections, which is crucial for understanding how the code actually works.
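The idea of ignoring names and formatting can be sketched in a few lines. What follows is a deliberately simplified stand-in, not JC-Finder's algorithm: it works on raw tokens rather than a real AST, does not handle comments or string literals, and the class name `ClassFingerprint`, its methods, and the keyword list are all assumptions made for this example. Each method body is normalized by replacing every non-keyword identifier with the placeholder `ID`, and the normalized methods of one class are then hashed together so the class is treated as a single unit.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a name- and layout-insensitive fingerprint for one class.
final class ClassFingerprint {

    // A small subset of Java keywords that are kept as-is (not exhaustive).
    private static final Set<String> KEYWORDS = Set.of(
            "if", "else", "for", "while", "return", "new", "this", "null",
            "true", "false", "int", "long", "double", "boolean", "void");

    // Identifiers, integer literals, or any single non-whitespace character.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*|\\d+|\\S");

    // Replaces every non-keyword identifier with "ID" and collapses whitespace.
    static String normalize(String methodBody) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(methodBody);
        while (m.find()) {
            String token = m.group();
            boolean isIdentifier = Character.isJavaIdentifierStart(token.charAt(0));
            out.append(isIdentifier && !KEYWORDS.contains(token) ? "ID" : token).append(' ');
        }
        return out.toString().trim();
    }

    // Hashes all normalized methods of a class together; sorting makes the
    // result independent of the order in which the methods are declared.
    static String fingerprint(List<String> methodBodies) {
        List<String> normalized = new ArrayList<>();
        for (String body : methodBodies) {
            normalized.add(normalize(body));
        }
        Collections.sort(normalized);
        return sha256(String.join("\n", normalized));
    }

    private static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

Under this scheme, two copies of a class that differ only in variable names, spacing, or method order produce the same fingerprint, while a change to an operator or statement produces a different one. A real tool would do this on the parsed tree, which also copes with comments, literals, and type information.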
To avoid being overwhelmed by trivial or duplicated code—like simple getters and setters that don’t reveal much about library reuse—JC-Finder filters out these noise elements using complexity metrics and dependency analysis. It also uses timestamps and version histories to identify the original sources of cloned classes, helping to distinguish between genuine reuse and coincidental similarity.
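As a rough illustration of these two steps, the sketch below drops classes that are too small to be meaningful and then attributes each remaining class fingerprint to the library where it appeared first. It is a simplification under stated assumptions: a plain statement count stands in for the complexity metrics and dependency analysis described above, and the names `OriginIndex`, `Candidate`, `likelyOrigins`, and the threshold `MIN_STATEMENTS` are hypothetical, chosen only for this example.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: filter trivial classes, then pick the earliest library per fingerprint.
final class OriginIndex {

    // One occurrence of a class inside some library release.
    static final class Candidate {
        final String library;
        final String fingerprint;   // e.g. a structural hash of the class
        final Instant firstSeen;    // when this library first contained the class
        final int statementCount;   // crude stand-in for a complexity metric

        Candidate(String library, String fingerprint, Instant firstSeen, int statementCount) {
            this.library = library;
            this.fingerprint = fingerprint;
            this.firstSeen = firstSeen;
            this.statementCount = statementCount;
        }
    }

    // Assumed cutoff: classes below it (mostly getters and setters) are ignored.
    static final int MIN_STATEMENTS = 5;

    static boolean isTrivial(Candidate c) {
        return c.statementCount < MIN_STATEMENTS;
    }

    // Maps each non-trivial fingerprint to the library that contained it earliest.
    static Map<String, String> likelyOrigins(List<Candidate> candidates) {
        Map<String, Candidate> earliest = new HashMap<>();
        for (Candidate c : candidates) {
            if (isTrivial(c)) {
                continue;
            }
            Candidate current = earliest.get(c.fingerprint);
            if (current == null || c.firstSeen.isBefore(current.firstSeen)) {
                earliest.put(c.fingerprint, c);
            }
        }
        Map<String, String> origins = new HashMap<>();
        for (Map.Entry<String, Candidate> entry : earliest.entrySet()) {
            origins.put(entry.getKey(), entry.getValue().library);
        }
        return origins;
    }
}
```

The earliest-timestamp rule matters because libraries copy from each other too: without it, a class that originated in one library and was later vendored into three others would be reported as four separate matches.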
Proof in the Numbers
The researchers tested JC-Finder on nearly 10,000 popular Java libraries and 1,000 GitHub projects. Compared to the best existing function-level clone detection tool adapted for Java, JC-Finder scored more than twice as high on accuracy, achieving an F1-score of 0.818 versus 0.391. It was also about nine times faster at scanning projects.
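For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision (how many reported libraries were correct) and recall (how many real libraries were found). The sketch below shows the standard formula. The class name `F1Score` and the counts used in the usage note are made up for illustration and are not figures from the study, which is quoted here only by its final scores.

```java
// Standard F1 computation from raw detection counts.
final class F1Score {

    // truePositives: correctly reported libraries
    // falsePositives: reported libraries that were not actually present
    // falseNegatives: present libraries that were missed
    static double f1(int truePositives, int falsePositives, int falseNegatives) {
        if (truePositives == 0) {
            return 0.0; // avoids division by zero when nothing correct was found
        }
        double precision = truePositives / (double) (truePositives + falsePositives);
        double recall = truePositives / (double) (truePositives + falseNegatives);
        return 2 * precision * recall / (precision + recall);
    }
}
```

With the hypothetical counts `F1Score.f1(8, 2, 2)`, precision and recall are both 0.8, so the F1-score is also about 0.8. For the scores reported above, 0.818 divided by 0.391 is roughly 2.09.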
When applied to almost 8,000 GitHub projects, JC-Finder found that about 10% contained cloned third-party libraries not declared in package managers. Even more striking, it uncovered 26% more third-party libraries than traditional package-manager-based tools alone. This means a significant portion of reused code is flying under the radar.
Why Developers Copy Instead of Importing
The study sheds light on why developers sometimes prefer copying code rather than using package managers. Sometimes the needed library isn’t available in a package repository or is too large and unwieldy, so developers cherry-pick just the parts they need. Other times, older or specialized code is only available as source snippets on websites or blogs. Unfortunately, this practice can lead to poor version control and security risks.
What This Means for Software Security and Ethics
JC-Finder’s ability to reveal hidden library reuse has important implications. For security teams, it means better detection of vulnerable code lurking inside projects. For legal and compliance officers, it helps identify potential license violations where copied code lacks proper attribution. For developers, it’s a reminder to prefer official package imports over copy-pasting, to keep projects maintainable and secure.
Looking Ahead
JC-Finder is a pioneering step toward more transparent and comprehensive software composition analysis in Java. By respecting the language’s unique structure and filtering out noise, it opens a clearer window into how code is reused across the ecosystem. The researchers have made their data and tool publicly available, inviting the community to build on their work.
In a world where software supply chains are increasingly complex and security-critical, tools like JC-Finder help us see the hidden threads woven into our codebases—threads that might otherwise unravel our trust in the software we build and use.