Generating Reproducible and Replayable Bug Reports from Android Application Crashes - ICPC 2015 Online Appendix

This web page is a companion to our ICPC 2015 paper entitled Generating Reproducible and Replayable Bug Reports from Android Application Crashes.




Data


Original Source and APKs

  • HERE is a ZIP file containing the original source code for the AUTs.
  • HERE is a ZIP file containing the original APKs for the AUTs.

Modified Source and APKs

  • HERE is a ZIP file containing the modified source code for the AUTs.
  • HERE is a ZIP file containing the modified APKs for the AUTs.



Scripts

  • HERE is a GitHub repository of scripts for tasks such as partitioning event traces, validating event traces and scoring profiles.



Study Design


Collecting Natural Language Descriptions and Event Traces

  • HERE is the README we provided to participants to guide the collection of natural language descriptions and event traces. It specifies how to form scenario descriptions and how to record, validate, and replay event traces; a sketch of the recording step appears after this list.
  • HERE is an exemplar archive.
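
The record step amounts to capturing low-level input events from the device while a participant performs the scenario. The following is a minimal sketch of that recording step, assuming a single connected device, adb on the PATH, and the standard getevent interface; the actual tooling and trace format used in the study may differ.

    import subprocess
    import sys

    def record_event_trace(output_path, duration_s=30):
        """Capture raw input events from the attached device for duration_s seconds.

        Assumes one device is connected and adb is on the PATH. The output of
        `getevent -t` (timestamped raw events) is used here for illustration and
        may not match the trace format CrashDroid consumes.
        """
        with open(output_path, "w") as out:
            proc = subprocess.Popen(
                ["adb", "shell", "getevent", "-t"],  # -t prefixes each event with a timestamp
                stdout=out,
                stderr=subprocess.DEVNULL,
            )
            try:
                proc.wait(timeout=duration_s)  # let the participant perform the scenario
            except subprocess.TimeoutExpired:
                proc.terminate()               # stop recording when the time window ends

    if __name__ == "__main__":
        record_event_trace(sys.argv[1] if len(sys.argv) > 1 else "trace.txt")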

Measuring the Effectiveness and Expressiveness of Bug Reports

  • HERE is an instance of a user study template.



Study Results


RQ1: Is CrashDroid an effective approach for identifying scenarios that are relevant to a sequence of methods describing crashes in Android apps?

To measure effectiveness, we timed how long CrashDroid took to crash the AUT and produce a bug report with a replay script, then compared this time to the times two experienced developers needed to crash the AUT and produce a bug report. We also measured the time Monkey took to crash the AUT and included it in the comparison.

In many cases, the normalized length of the longest common subsequence did not prioritize the profiles in the document database effectively: the profiles' scores were not strictly decreasing. Given the call stack from a crash report, there were typically several profiles with the same score. For example, if 10 profiles score 1.0, any one of the 10 corresponding scenarios can be the first placed in the queue to validate whether it crashes the AUT before being presented to the user. Although these scenarios share the same score, they generally take different amounts of time to replay, and not every scenario necessarily produces the bug. Therefore, to measure the time for CrashDroid to crash the AUT and produce a bug report with a replay script, we ran a Monte Carlo simulation to estimate the expected time. The simulation shuffled the ranks of equally scored profiles in the prioritized list, running 100 trials for each bug and counting the number of seconds to crash the AUT as well as the number of scenarios replayed.
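
The following is a simplified sketch of this expected-time estimate, restricted to the set of top-scoring (tied) scenarios; the per-scenario replay times and the crash predicate are hypothetical placeholders for what the replay harness actually measures on the device.

    import random

    def expected_time_to_crash(tied_scenarios, replay_time, reproduces_crash, trials=100):
        """Estimate the expected seconds and number of scenarios replayed before
        the AUT crashes when equally scored scenarios are queued in arbitrary order.

        tied_scenarios   - scenarios sharing the top similarity score
        replay_time      - dict: scenario -> seconds to replay it (hypothetical values)
        reproduces_crash - dict: scenario -> True if replaying it crashes the AUT (hypothetical)
        """
        total_seconds = total_replayed = 0
        for _ in range(trials):
            order = tied_scenarios[:]
            random.shuffle(order)          # break ties arbitrarily, as in the simulation
            seconds = replayed = 0
            for scenario in order:
                seconds += replay_time[scenario]
                replayed += 1
                if reproduces_crash[scenario]:
                    break                  # the first crashing scenario ends the trial
            total_seconds += seconds
            total_replayed += replayed
        return total_seconds / trials, total_replayed / trials

Averaging over the shuffled orderings approximates the expectation over the arbitrary order in which tied profiles would be queued.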

  • HERE are the CrashDroid results from the Monte Carlo simulation.
  • HERE are the Human Participant 1 (P1) results. HERE are the Human Participant 2 (P2) results.
  • HERE are the Monkey results.

The similarity measure's poor performance is best illustrated with a ROC graph, which represents the tradeoff between the hit rate and the false alarm rate. The normalized length of the longest common subsequence clearly provides no lift in this context: the slope of the plot in the bottom left-hand corner indicates that the top profiles all have the same score. Indeed, in this particular case, several profiles have one and only one method in common with the call stack, yet each yields a score of 1.0. This means the heuristic is no better than randomly sampling from these "top" profiles without replacement until the AUT crashes.
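
For reference, the following is a sketch of the similarity computation: the length of the longest common subsequence of method names, normalized here by the length of the shorter sequence. The normalization choice is an assumption for illustration; the paper defines the exact score CrashDroid uses.

    def lcs_length(a, b):
        """Length of the longest common subsequence of two method-name sequences."""
        table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
        return table[len(a)][len(b)]

    def score_profile(call_stack, profile_methods):
        """Normalized LCS score of a profile's method trace against the crash call stack."""
        denom = min(len(call_stack), len(profile_methods))
        return lcs_length(call_stack, profile_methods) / denom if denom else 0.0

    def rank_profiles(call_stack, profiles):
        """Return (profile_id, score) pairs in decreasing score order; ties at the
        top of this list are exactly what the Monte Carlo simulation above handles."""
        scores = [(pid, score_profile(call_stack, methods)) for pid, methods in profiles.items()]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)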

While there is anecdotal evidence of the similarity measure providing some lift in particular contexts, e.g., some GameMaster Dice bugs, we plan to examine more robust functions for scoring artifacts in the document database against information in a crash report.


RQ2: How do CrashDroid bug reports compare to human-written bug reports in terms of readability, conciseness, and reproducibility?



Additional Tools

  • The Android SDK contains the Android Debug Bridge (adb), Android Activity Manager (am), and dmtracedump tools; a rough usage sketch follows.
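
As an illustration of how these tools fit into the workflow (the package name, activity name, and APK file below are hypothetical), the AUT can be installed and launched through the Activity Manager while logcat is watched for the uncaught-exception report whose call stack CrashDroid scores profiles against.

    import subprocess

    AUT_PACKAGE = "com.example.aut"    # hypothetical package name
    AUT_ACTIVITY = ".MainActivity"     # hypothetical launcher activity

    # Install the APK and launch the AUT through the Activity Manager.
    subprocess.run(["adb", "install", "-r", "aut.apk"], check=True)
    subprocess.run(["adb", "shell", "am", "start", "-n", AUT_PACKAGE + "/" + AUT_ACTIVITY], check=True)

    # Watch logcat for an uncaught-exception report from the Android runtime.
    log = subprocess.Popen(["adb", "logcat", "-s", "AndroidRuntime:E"],
                           stdout=subprocess.PIPE, text=True)
    for line in log.stdout:
        print(line, end="")
        if "FATAL EXCEPTION" in line:
            break
    log.terminate()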


Authors

  • Martin White - The College of William and Mary, VA, USA.
    E-mail: mgwhite at cs dot wm dot edu
  • Mario Linares-Vásquez - The College of William and Mary, VA, USA.
    E-mail: mlinarev at cs dot wm dot edu
  • Peter Johnson - The College of William and Mary, VA, USA.
    E-mail: pj at cs dot wm dot edu
  • Carlos Bernal-Cárdenas - The College of William and Mary, VA, USA.
    E-mail: cebernal at cs dot wm dot edu
  • Denys Poshyvanyk - The College of William and Mary, VA, USA.
    E-mail: denys at cs dot wm dot edu