What Treasure is Hid?

1. Case Study

It is estimated that around one trillion lines of code have already been written, with an additional 35 billion created every year. Programmers frequently organize their code into one or more widely available, open-source archives. SourceForge.net has over 30,000 registered Java programs. The FreeBSD ports collection consists of over 20,000 software projects organized for rapid deployment. The Ultimate Debian Database holds all software from the Debian and Ubuntu Linux distributions, including source code, documentation, and bug reports. The K Desktop Environment is an intricate system of over 4 million lines of code with a history of change logs and bug data.

Yet there are many challenges facing empirical research into the composition of all this existing code. First, search engines typically treat source files in large repositories as unstructured text, which masks many of the programming language-specific features of the code. Second, the format of the repositories varies substantially: FreeBSD ports insists that programmers follow certain organization patterns, while in SourceForge developers can place files freely. Finally, many repositories are polluted with poorly functioning or out-of-date projects.

We addressed these difficulties by creating an infrastructure for empirical research into source code artifacts. We randomly chose 2,080 Java applications from SourceForge and posed 28 research questions to obtain various facts about the composition of the subject source code.

The companion paper to this website explains the rationale for the questions and their results. We divided the work of examining the projects among 16 nodes, each of which produced a MySQL database of extracted information. View the schema for this database. You may download the resulting databases from our analysis of the 2,080 projects here: treasure_db.tar.gz (962 MB compressed, 9.2 GB uncompressed)
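
As a rough illustration of the setup described above, the sketch below draws a random sample of 2,080 projects and splits it evenly among 16 nodes. It is not the infrastructure's actual code. The page does not say how projects were sampled or assigned to nodes, so the round-robin split, the fixed seed, the candidate pool size, the project identifiers, and the function names sample_projects and partition are all assumptions made for this example.

```python
import random


def sample_projects(all_projects, k, seed=0):
    """Pick k projects uniformly at random, without replacement.

    The fixed seed is an assumption so that the sketch is repeatable.
    """
    rng = random.Random(seed)
    return rng.sample(all_projects, k)


def partition(projects, num_nodes):
    """Split projects among num_nodes workers in round-robin order.

    Round-robin is an assumed strategy; any even split would do.
    """
    return [projects[i::num_nodes] for i in range(num_nodes)]


if __name__ == "__main__":
    # Hypothetical identifiers standing in for a pool of candidate projects.
    candidates = [f"project-{i}" for i in range(30000)]
    subjects = sample_projects(candidates, 2080)
    shards = partition(subjects, 16)
    # Each of the 16 nodes receives 2080 / 16 = 130 projects.
    print(len(shards), {len(shard) for shard in shards})
```

With this split, each node would process its own share of projects independently and write its results to its own database, which matches the one-database-per-node layout of the download.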

2. The Team

Current Members

  • Mark Grechanik

    E-mail: drmark at uic dot edu
    Affiliation: Accenture Technology Labs and University of Illinois at Chicago

  • Collin McMillan

    E-mail: cmc at cs dot wm dot edu
    Affiliation: College of William & Mary

  • Luca DeFerrari

    E-mail: luca dot deferrari at mail dot polimi dot it
    Affiliation: Politecnico di Milano

  • Marco Comi

    E-mail: marco dot comi at mail dot polimi dot it
    Affiliation: Politecnico di Milano

  • Stefano Crespi

    E-mail: stefano dot crespi at mail dot polimi dot it
    Affiliation: Politecnico di Milano

  • Chen Fu

    E-mail: chen dot fu at accenture dot com
    Affiliation: Accenture Technology Labs

  • Denys Poshyvanyk

    E-mail: denys at cs dot wm dot edu
    Affiliation: College of William & Mary

  • Qing Xie

    E-mail: qing dot xie at accenture dot com
    Affiliation: Accenture Technology Labs

  • Carlo Ghezzi

    E-mail: carlo dot ghezzi at mail dot polimi dot it
    Affiliation: Politecnico di Milano

3. Support

We gratefully acknowledge financial support from the NSF for this research project.