Online Appendix

This web page is a companion to our ICSM 2011 submission entitled "Categorizing Software Applications for Maintenance".
  1. Projects
     Here is the initial list of projects and categories that were extracted from the repositories.

     Data                             Sharejar                 Sourceforge
     Projects                         sharejar_projects.csv    sourceforge_projects.zip
     Categories                       sharejar_categories.csv  sourceforge_categories.csv
     Goldset (Categories & Projects)  sharejar_goldset.csv     sourceforge_goldset.csv

  2. WEKA Datasets
     The dataset files are in ARFF format for WEKA.

     Attributes  Sharejar                    Sourceforge*                    Sourceforge**
     Packages    sharejar-packages-arff.zip  sourceforge2-packages-arff.zip  sourceforge-packages-arff.zip
     Classes     sharejar-classes-arff.zip   sourceforge2-classes-arff.zip   sourceforge-classes-arff.zip
     Terms       -                           sourceforge2-terms-arff.zip     sourceforge-terms-arff.zip
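
     For readers who want to inspect an ARFF file outside WEKA, the Python sketch below reads a small dense ARFF document using only the standard library. The sample content, the attribute names, and the function name read_dense_arff are made up for illustration and are not taken from the datasets above. The sketch assumes unquoted attribute names and dense, comma-separated rows; sparse rows in {index value} form are not handled.

```python
import io

# Hypothetical sample; the real attribute names and categories may differ.
SAMPLE = """\
% toy example, not taken from the published datasets
@relation toy
@attribute java.util numeric
@attribute java.io numeric
@attribute category {Games,Database}
@data
1,0,Games
0,1,Database
"""


def read_dense_arff(handle):
    """Return (attribute_names, rows) from a dense ARFF file-like object.

    Assumes attribute names contain no spaces or quotes and that values
    contain no embedded commas.
    """
    attributes = []
    rows = []
    in_data = False
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # "@attribute <name> <type>": keep only the name
                attributes.append(line.split(None, 2)[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        values = [value.strip() for value in line.split(",")]
        rows.append(dict(zip(attributes, values)))
    return attributes, rows


attributes, rows = read_dense_arff(io.StringIO(SAMPLE))
print(attributes)
print(rows[0]["category"])
```

     Values are returned as strings, so numeric attributes need an explicit conversion before any analysis.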

  3. Metrics results
     True Positive Rate and False Positive Rate for Sourceforge*, Sourceforge**, and Sharejar.

  4. Statistical validation
     We validate our research questions with Friedman and Nemenyi tests. Click here to see the results and the Excel files with the tests.
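
     As a rough illustration of how Friedman and Nemenyi tests are typically applied, the Python sketch below computes the Friedman statistic and the Nemenyi critical difference on made-up scores. The scores, the three treatments, and the function names are assumptions for illustration only and are not the data or results of the paper. The value 2.343 is the commonly tabulated Nemenyi critical value for three treatments at alpha = 0.05.

```python
import math


def average_ranks(row):
    """Rank one block's scores from best (1) to worst; ties get the mean rank."""
    order = sorted(range(len(row)), key=lambda i: -row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """Return (chi-square statistic, mean rank per treatment).

    scores[b][t] is the score of treatment t on block b.
    Uses the basic formula without the correction for ties.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for t, rank in enumerate(average_ranks(row)):
            rank_sums[t] += rank
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, [r / n for r in rank_sums]


def nemenyi_critical_difference(k, n, q_alpha):
    """Two treatments differ if their mean ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


# Made-up scores: 4 blocks (rows) by 3 treatments (columns).
scores = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.70, 0.75, 0.50],
    [0.95, 0.90, 0.80],
]
chi2, mean_ranks = friedman_statistic(scores)
cd = nemenyi_critical_difference(k=3, n=4, q_alpha=2.343)
print(chi2, mean_ranks, cd)
```

     The Friedman statistic is compared against a chi-square distribution with k - 1 degrees of freedom. The Nemenyi comparison is normally carried out only when the Friedman test rejects the hypothesis that all treatments perform alike.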

* This dataset does not include unlabeled projects. It contains 3286 projects in total.
** This dataset includes unlabeled projects. It contains 8310 projects in total.

Collin McMillan, Mario Linares-Vásquez, Denys Poshyvanyk, Mark Grechanik