aboutcode-org/back2source-data

back2source-data: Checking if a package's sources and binaries match.

back2source is designed to provide accurate information about whether, and which, source code of a package was used to build its binaries. It can also help determine whether a source archive matches the version control checkout. For instance, it was used to positively detect that xz-utils contained potentially malicious scripts needing review that were present only in the release source archives and missing from the source code repositories.

The goal of back2source is to effectively and automatically help software development teams trust but verify that the packages they use do not contain unknown statically linked or other third-party packages, whether these come from a trustworthy origin or are malicious.

back2source consists of a set of pipelines and pipeline options in ScanCode.io, together with command line tools. It supports the analysis of binaries including byte-compiled Java, JavaScript with map files, ELF with DWARF debug symbols, and Go binaries, as well as plain source archives.

This repository contains the scan results of running back2source on many packages.

To validate the accuracy of back2source at scale, we collected a list of about 1000 open source packages from the Fedora Linux distribution.

For each of these packages, we collected the source archive and the built binary package URLs.

Then, we used a command line client to run scancode.io's back2source analysis on each pair of package URLs.

Finally, we ran a script to summarize the results and produce a table of the key results from these analyses.

This repository contains both the summary and the detailed results of these analyses: a summary JSON file together with the detailed JSON results of the back2source analysis for each pair of source and binary packages.

There are a couple of interesting trends that emerged from this analysis:

First, with the current code, only a few packages are not reported as missing some source code. After investigation, most of these reports turn out to be false positives related to the use of the standard libraries. We plan to improve back2source to weed out these incorrect reports. The true false positives are bugs to be fixed.

For the true but noisy positives pointing to the standard library, we will implement a feature to accurately detect whether a problematic file path belongs to the standard library, the toolchain, or other common files that are not present in the source code when the package is built but come instead from a system-wide package installation. Such packages are commonly not considered explicit dependencies because they are thought to be used only in development, but in practice their code is commonly reused and injected into the build of native binaries.
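As a rough illustration of the kind of check described above, the following sketch separates unmapped paths that sit under system-wide directories from those that still need review. It is not the planned back2source feature: the prefix list, the function names and the sample paths are all hypothetical, and a real list would depend on the distribution and toolchain used for the build.

```python
from pathlib import PurePosixPath

# Hypothetical prefixes, chosen for illustration only. They are NOT taken
# from back2source; adjust them to the build environment being analyzed.
SYSTEM_PREFIXES = (
    "/usr/include",
    "/usr/lib/gcc",
    "/usr/lib/golang/src",
)


def is_system_path(path, prefixes=SYSTEM_PREFIXES):
    """Return True if `path` is one of the prefixes or sits below one."""
    candidate = PurePosixPath(path)
    ancestors = (candidate, *candidate.parents)
    return any(PurePosixPath(prefix) in ancestors for prefix in prefixes)


def split_unmapped(paths, prefixes=SYSTEM_PREFIXES):
    """Split unmapped paths into (system_wide, needs_review) lists."""
    system_wide, needs_review = [], []
    for path in paths:
        target = system_wide if is_system_path(path, prefixes) else needs_review
        target.append(path)
    return system_wide, needs_review
```

The comparison is done on whole path components, so a directory such as /usr/includes is not mistaken for /usr/include. Paths are compared as written, without resolving ".." segments or symlinks.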

Second, there are several insightful findings that seem to be mostly oversights, rather than malicious cases.

For instance, all required Go dependencies are statically linked in a single executable (ELF, Mach-O or Windows PE). The corresponding source code of these third-party packages is seldom included in the source archive. They may represent a large number of "ghost" packages that are silently ignored by most analysis tools and most package manifests. As a result, these binary packages harbor "unknown unknowns" problems that go unnoticed:

  • open source license compliance violations, where required license notices, copyright statements and other due credits are completely missing.

  • security vulnerability risks, where packages with known vulnerabilities may be included unknowingly.
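One quick way to glimpse such "ghost" packages, separate from the back2source analysis itself, is the module list that Go binaries built in module mode embed, which `go version -m <binary>` prints. The sketch below parses that text output. It assumes dependencies appear on lines starting with `dep` followed by the module path and version; the module names in the sample are made up.

```python
def parse_go_buildinfo(text):
    """Return (module_path, version) pairs for each `dep` line.

    `text` is assumed to be the output of `go version -m <binary>`.
    """
    deps = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "dep":
            deps.append((fields[1], fields[2]))
    return deps


# Hypothetical sample output; the paths, versions and hashes are invented.
SAMPLE = (
    "./tool: go1.21.0\n"
    "\tpath\texample.org/tool\n"
    "\tmod\texample.org/tool\tv1.0.0\th1:aaaa=\n"
    "\tdep\texample.org/liba\tv0.3.1\th1:bbbb=\n"
    "\tdep\texample.org/libb\tv2.0.0\th1:cccc=\n"
)

for module, version in parse_go_buildinfo(SAMPLE):
    print(module, version)
```

This only lists what the binary declares about itself. It does not map compiled files back to sources.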

Another case is C and C++ code built with CMake and Qt, as is common for KDE-based utilities, where a significant volume of the compiled code comes from actual third-party dependencies. Typical C++ "includes" contain full function definitions that are inlined and statically linked into the resulting binaries. These are reported as missing sources because they are never part of what is considered the actual source of the package, but are rather build-time dependencies. We will need to account for these build-time dependencies to ensure they are part of the source code side of the analysis. As with Go, these dependencies may be ignored by most detection tools, and they may be subject to vulnerabilities.

To understand the magnitude of the issue: a Go package like asnmap consists of ten Go source files but is compiled from 350 files, 340 of which come from unreported dependencies. This is not always the case, though: the source RPM of the Go package aerc contains vendored code for all of its compiled dependencies.
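The asnmap figures can be turned into a share with a couple of lines. The helper name below is hypothetical; the numbers are the ones quoted above.

```python
def dependency_share(total_compiled, from_package):
    """Return (files_from_dependencies, share_of_total) for a binary."""
    if total_compiled <= 0:
        raise ValueError("total_compiled must be positive")
    from_deps = total_compiled - from_package
    return from_deps, from_deps / total_compiled


# asnmap: 350 compiled files, ten of which are the package's own sources.
from_deps, share = dependency_share(350, 10)
print(f"{from_deps} of 350 files ({share:.1%}) come from dependencies")
# prints: 340 of 350 files (97.1%) come from dependencies
```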

Content of this repository

  • d2d-summary.csv: the summary of the analysis. This is the main attraction. The interesting columns are the following, and ideally all of these counters should have a value of zero:

    • codebase_resources_not_deployed: these are source files not part of the binaries
    • codebase_resources_requires_review: these are deployed binary files for which we could not find one or more source files
    • codebase_resources_discrepancies_total: this is the total number of files with an issue
    • For DWARF-based analysis we have these columns:
      • dwarf_compiled_paths_not_mapped_total: the total number of DWARF compilation unit paths found in an ELF for which we could not find the corresponding source file in the source archive.
      • dwarf_included_paths_not_mapped_total: the total number of DWARF "include" paths found in an ELF for which we could not find the corresponding source file in the source archive.
  • etc/scripts: the scripts used to execute the back2source analysis

  • data: a directory with a d2d-details.json and d2d-summary.json file for each pair of packages. The file tree is organized as a mirror of the original web site tree.

  • package-pairs.csv: the list of current download URLs for each analyzed package pair
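A minimal sketch for finding the rows of d2d-summary.csv where any of the counters listed above is non-zero. The column names are the ones named in this README. The rest of the file layout is an assumption: the sketch expects a header row and plain integer counter values, and the function names are hypothetical.

```python
import csv

# Counter columns named in this README; all of them should ideally be zero.
COUNTERS = (
    "codebase_resources_not_deployed",
    "codebase_resources_requires_review",
    "codebase_resources_discrepancies_total",
    "dwarf_compiled_paths_not_mapped_total",
    "dwarf_included_paths_not_mapped_total",
)


def nonzero_counters(row):
    """Return {column: value} for counters that are present and non-zero.

    Assumes counter cells hold plain integers; empty or missing cells
    are skipped.
    """
    found = {}
    for name in COUNTERS:
        raw = (row.get(name) or "").strip()
        if raw and int(raw) != 0:
            found[name] = int(raw)
    return found


def rows_needing_review(csv_path):
    """Yield (row, issues) for every row with at least one non-zero counter."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            issues = nonzero_counters(row)
            if issues:
                yield row, issues
```

For example, `for row, issues in rows_needing_review("d2d-summary.csv"): print(issues)` would print the non-zero counters for each flagged package pair.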

Instructions to re-run this experiment:

  1. Clone this git repository using git clone https://github.com/aboutcode-org/back2source-data
  2. Create a virtualenv and install requirements using pip install --requirement requirements.txt.
  3. Optionally, run the script using python3 etc/scripts/get_fedora_urls.py. This generates a file named package-pairs.csv. This file is already present in this repo.

  4. Install purldb following the instructions at https://github.com/nexB/purldb
  5. Run python3 etc/scripts/run_d2d.py. This will run the analysis and generate a summary file named d2d-summary.csv

License

SPDX-License-Identifier: Apache-2.0

The ScanCode.io and PurlDB software is licensed under the Apache License version 2.0. Data generated with ScanCode.io is provided as-is without warranties. ScanCode is a trademark of nexB Inc.

Funding

This project is funded through NGI0 Entrust - https://nlnet.nl/entrust, a fund established by NLnet - https://nlnet.nl with financial support from the European Commission's Next Generation Internet https://ngi.eu program. Learn more at the NLnet project page https://nlnet.nl/project/Back2source

It also receives ongoing support from nexB and other contributors.
