Automatically Classifying Source Code and Byte Code into Domain Categories - ESE Online Appendix


This web page is a companion to our Empirical Software Engineering (ESE) submission entitled "Automatically Classifying Source Code and Byte Code into Domain Categories".

Applications and Categories
We extracted applications from Sharejar and Sourceforge. The categories for both repositories are listed here.
Below is the initial list of projects and categories extracted from the repositories.

Data                                   Sharejar                 Sourceforge
Applications                           sharejar_projects.csv    sourceforge_projects.zip
Categories                             sharejar_categories.csv  sourceforge_categories.csv
Goldset (Categories and Applications)  sharejar_goldset.csv     sourceforge_goldset.csv



Expected Entropy Loss (EEL) files
The zip files listed below contain the EEL values for each category. Each file has four columns:
EEL, number of applications in the category with the attribute, total number of projects with the attribute, and the attribute itself.
For the Sourceforge packages and classes, the last column holds identifiers of the attributes rather than their names. The description of how to retrieve the real packages and classes from the identifiers in the EEL files is here.
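As an illustration of how an EEL value relates to the two count columns, the sketch below computes expected entropy loss for one attribute and one category. It assumes the common definition (prior entropy of category membership minus the expected entropy after observing whether the attribute is present). The parameters `in_cat_total` and `total_projects` are hypothetical inputs that are not stored in the EEL files, and the function names are illustrative only, so this may differ in detail from how the published values were produced.

```python
from math import log2


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no event with probability p (0 when p is 0 or 1)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def expected_entropy_loss(in_cat_with_attr: int, total_with_attr: int,
                          in_cat_total: int, total_projects: int) -> float:
    """EEL of one attribute for one category, from project counts.

    in_cat_with_attr and total_with_attr correspond to the second and third
    columns of an EEL file; in_cat_total and total_projects are assumed to be
    known from the dataset (they are not columns of the EEL files).
    """
    total_without = total_projects - total_with_attr
    in_cat_without = in_cat_total - in_cat_with_attr

    # Entropy of category membership before looking at the attribute.
    prior = binary_entropy(in_cat_total / total_projects)

    # Entropy of category membership given the attribute is present or absent.
    h_with = binary_entropy(in_cat_with_attr / total_with_attr) if total_with_attr else 0.0
    h_without = binary_entropy(in_cat_without / total_without) if total_without else 0.0

    p_attr = total_with_attr / total_projects
    return prior - (p_attr * h_with + (1.0 - p_attr) * h_without)
```

Under this definition, an attribute that occurs in exactly the projects of a category removes all uncertainty about membership, while an attribute that is equally frequent inside and outside the category has an EEL of zero.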

Dataset (Attribute)  Sharejar                   Sourceforge
Packages             sharejar-packages-eel.zip  sourceforge-packages-eel.zip
Classes              sharejar-classes-eel.zip   sourceforge-classes-eel.zip
Methods              sharejar-methods-eel.zip   sourceforge-methods-eel.zip
Terms                -                          sourceforge-terms-eel.zip



WEKA Datasets
Dataset files are in the ARFF format for WEKA. A description of the ARFF format is here. Each column is an attribute and each row is a project. The last column in the ARFF file holds the ids of the project's categories. These ids are listed in "Applications and Categories" (see above).
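For readers who want to inspect the files outside WEKA, the following is a minimal sketch of reading a dense ARFF file and pulling the category ids from the last column. It assumes comma-separated data rows and attribute names without spaces; it does not handle the sparse ARFF row syntax or single-quoted values, and the sample attribute names and values are made up for illustration, not taken from the datasets.

```python
import csv


def read_dense_arff(text: str):
    """Return (attribute_names, rows) from dense ARFF text.

    Simplified reader: skips comments and blank lines, collects the name of
    every @attribute declaration, and splits each @data row on commas.
    """
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or ARFF comment
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # The attribute name is the second whitespace-separated token.
                names.append(line.split(None, 2)[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        rows.append(next(csv.reader([line], skipinitialspace=True)))
    return names, rows


# Hypothetical three-column dataset: two attributes plus the category column.
SAMPLE = """\
% illustrative content only
@relation example
@attribute java.io numeric
@attribute java.util numeric
@attribute category {1,2}
@data
1,0,2
0,1,1
"""

names, rows = read_dense_arff(SAMPLE)
category_ids = [row[-1] for row in rows]  # last column holds the category id
```

The ids collected in `category_ids` can then be matched against the category lists in "Applications and Categories".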

Dataset (Attribute)  Sharejar                    Sourceforge
Packages             sharejar-packages.arff.zip  sourceforge-packages.arff.zip
Classes              sharejar-classes.arff.zip   sourceforge-classes.arff.zip
Methods              sharejar-methods.arff.zip   sourceforge-methods.arff.zip
Terms                -                           sourceforge-terms.arff.zip



Algorithms
We used the following algorithms implemented in WEKA:

Metrics Results and Statistical Validation
To address our research questions, we used boxplots and non-parametric tests. The spreadsheets with the results and the boxplot figures are here.

Tools


Authors