SEMERU Reading Group

Next Meeting

  • Thursday, October 25, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Bogdan, Mario, Daniel, Qi, Evan, Michael, Andrew)
    • Update on current status of projects (All)

Future Meetings

  • Thursday, November 2, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: TBD)
    • TBD
  • Thursday, November 9, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: TBD)
    • TBD

Previous Meetings

  • Thursday, March 21, 2013, 3:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan)
    • Technology update
  • Wednesday, February 27, 2013, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Bogdan, Mario, Daniel, Qi, Evan, Michael, Andrew)
    • Update on current status of projects (All)
  • Friday, February 22, 2013, 9:30am-11:00am, McGlothlin-Street Hall 128
  • Thursday, February 21, 2013, 3:00pm, McGlothlin-Street Hall 128 (Leaders: Bogdan, Mario, Daniel, Qi, Evan, Michael, Andrew)
    • Update on current status of projects (All)
  • Friday, February 1, 2013, 9:30am-11:00am, McGlothlin-Street Hall 128
  • Thursday, January 31, 2013, 3:00pm, McGlothlin-Street Hall 128 (Leaders: Bogdan, Mario, Daniel, Qi, Evan, Michael, Andrew)
    • Update on current status of projects (All)
  • Thursday, November 15, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Bogdan, Mario, Daniel, Qi, Evan, Michael, Andrew)
    • Update on current status of projects (All)
  • Thursday, October 25, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Bogdan, Mario, Daniel, Qi, Evan, Michael, Andrew)
    • Update on current status of projects (All)
  • Thursday, October 11, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Evan and Michael)
  • Thursday, September 13, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Bogdan, Mario, Daniel, Qi, Evan, Michael, Andrew)
    • Hosting SEMERU projects on ProjectLocker (Bogdan)
    • Update on current status of projects (All)
  • Tuesday, June 26, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Collin, Malcom, Bogdan, Daniel, Qi, Evan, Michael, Andrew)
    • Projects update
  • Friday, May 11, 2012, 11:00am, McGlothlin-Street Hall 128 (Leaders: Malcom, Collin, Bogdan, Evan, Daniel)
    • Projects update
  • Wednesday, April 25, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Collin, Malcom, Bogdan, Daniel and Evan)
    • Projects update
  • Wednesday, April 4, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Malcom, Collin, Bogdan, Evan, Daniel)
    • Projects update
  • Wednesday, March 21, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Malcom, Collin, Bogdan, Evan, Daniel)
    • Projects update
  • Wednesday, March 21, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Evan, Bogdan)
    • TraceLab update (Evan)
    • Preliminary results for the contextual link analysis project (Bogdan)
  • Wednesday, February 22, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Malcom, Daniel)
    • Introduction to R (Malcom)
    • Introduction to RapidMiner (Daniel)
  • Wednesday, February 8, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Bogdan)
    • Project update
  • Wednesday, January 25, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Evan, Daniel)
    • Projects update
  • Wednesday, January 18, 2012, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Mario, Collin)
    • Projects update
  • Wednesday, November 23, 2011, 2:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
    • Project update
  • Wednesday, November 9, 2011, 2:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
    • Project update
  • Wednesday, October 26, 2011, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Evan, Mario)
    • Projects update
  • Wednesday, October 19, 2011, 2:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan)
    • Project update
  • Wednesday, October 12, 2011, 2:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan)
    • Project update
  • No meetings due to preparations for hosting ICSM'11, Williamsburg, VA
  • Wednesday, August 17, 2011, 3:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan)
    • Project update
  • Wednesday, August 10, 2011, 3:00pm, McGlothlin-Street Hall 128 (Leaders: Evan, Collin)
  • Wednesday, August 3, 2011, 3:00pm, McGlothlin-Street Hall 128 (Leaders: Sam, Malcom)
    • Projects update
  • Wednesday, July 27, 2011, 3:00pm, McGlothlin-Street Hall 128 (Leader: Sam)
  • Foutse Khomh, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol, An exploratory study of the impact of antipatterns on class change- and fault-proneness, in the Empirical Software Engineering Journal (EMSE), 2011.
    • Abstract: Antipatterns are poor design choices that are conjectured to make object-oriented systems harder to maintain. We investigate the impact of antipatterns on classes in object-oriented systems by studying the relation between the presence of antipatterns and the change- and fault-proneness of the classes. We detect 13 antipatterns in 54 releases of ArgoUML, Eclipse, Mylyn, and Rhino, and analyse (1) to what extent classes participating in antipatterns have higher odds to change or to be subject to fault-fixing than other classes, (2) to what extent these odds (if higher) are due to the sizes of the classes or to the presence of antipatterns, and (3) what kinds of changes affect classes participating in antipatterns. We show that, in almost all releases of the four systems, classes participating in antipatterns are more change- and fault-prone than others. We also show that size alone cannot explain the higher odds of classes with antipatterns to undergo a (fault-fixing) change than other classes. Finally, we show that structural changes affect more classes with antipatterns than others. We provide qualitative explanations of the increase of change- and fault-proneness in classes participating in antipatterns using release notes and bug reports. The obtained results justify a posteriori previous work on the specification and detection of antipatterns and could help to better focus quality assurance and testing activities.
  • Yann-Gaël Guéhéneuc and Giuliano Antoniol, DeMIMA: A Multilayered Approach for Design Pattern Identification, in IEEE Transactions on Software Engineering, Vol. 34, No. 5, pp. 667-684, September-October 2008.
    • Abstract: Design patterns are important in object-oriented programming because they offer design motifs, elegant solutions to recurrent design problems, which improve the quality of software systems. Design motifs facilitate system maintenance by helping maintainers to understand design and implementation. However, after implementation, design motifs are spread throughout the source code and are thus not directly available to maintainers. We present DeMIMA, an approach to semiautomatically identify microarchitectures that are similar to design motifs in source code and to ensure the traceability of these microarchitectures between implementation and design. DeMIMA consists of three layers: two layers to recover an abstract model of the source code, including binary class relationships, and a third layer to identify design patterns in the abstract model. We apply DeMIMA to five open-source systems and, on average, we observe 34 percent precision for the 12 design motifs considered. Through the use of explanation-based constraint programming, DeMIMA ensures 100 percent recall on the five systems. We also apply DeMIMA on 33 industrial components.
  • Wednesday, April 27, 3:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
  • Gabriele Bavota, Andrea De Lucia, and Rocco Oliveto, Identifying Extract Class refactoring opportunities using structural and semantic cohesion measures, in the Journal of Systems and Software (JSS), Volume 84, Issue 3, March 2011, Pages 397-414.
    • Abstract: Approaches for improving class cohesion identify refactoring opportunities using metrics that capture structural relationships between the methods of a class, e.g., attribute references. Semantic metrics, e.g., C3 metric, have also been proposed to measure class cohesion, as they seem to complement structural metrics. However, until now semantic relationships between methods have not been used to identify refactoring opportunities. In this paper we propose an Extract Class refactoring method based on graph theory that exploits structural and semantic relationships between methods. The empirical evaluation of the proposed approach highlighted the benefits provided by the combination of semantic and structural measures and the potential usefulness of the proposed method as a feature for software development environments.
  • Wednesday, April 20, 3:00pm, McGlothlin-Street Hall 128 (Leader: Evan)
  • Bram Adams, Zhen Ming Jiang, and Ahmed E. Hassan, Identifying Crosscutting Concerns Using Historical Code Changes, in Proceedings of 32nd ACM/IEEE International Conference on Software Engineering (ICSE'10), pages 125-134, Cape Town, South Africa, May 2-8, 2010.
    • Abstract: Detailed knowledge about implemented concerns in the source code is crucial for the cost-effective maintenance and successful evolution of large systems. Concern mining techniques can automatically suggest sets of related code fragments that likely contribute to the implementation of a concern. However, developers must then spend considerable time understanding and expanding these concern seeds to obtain the full concern implementation. We propose a new mining technique (COMMIT) that reduces this manual effort. COMMIT addresses three major shortcomings of current concern mining techniques: 1) their inability to merge seeds with small variations, 2) their tendency to ignore important facets of concerns, and 3) their lack of information about the relations between seeds. A comparative case study on two large open source C systems (PostgreSQL and NetBSD) shows that COMMIT recovers up to 87.5% more unique concerns than two leading concern mining techniques, and that the three techniques complement each other.
  • Wednesday, April 6, 3:00pm, McGlothlin-Street Hall 128 (Leader: Collin/Malcom)
  • Joel Brandt, Mira Dontcheva, Marcos Weskamp, and Scott R. Klemmer, Example-Centric Programming: Integrating Web Search into the Development Environment, in Proceeding of the 28th ACM Conference on Human Factors in Computing Systems (CHI'10), pages 513-522, Atlanta, Georgia, 2010.
    • Abstract: The ready availability of online source code examples has fundamentally changed programming practices. However, current search tools are not designed to assist with programming tasks and are wholly separate from editing tools. This paper proposes that embedding a task-specific search engine in the development environment can significantly reduce the cost of finding information and thus enable programmers to write better code more easily. This paper describes the design, implementation, and evaluation of Blueprint, a Web search interface integrated into the Adobe Flex Builder development environment that helps users locate example code. Blueprint automatically augments queries with code context, presents an example-centric view of search results, embeds the search experience into the editor, and retains a link between copied code and its source. A comparative laboratory study found that Blueprint enables participants to write significantly better code and find example code significantly faster than with a standard Web browser. Log analysis from a three-month field deployment with 2,024 users suggested that task-specific search interfaces may cause a fundamental shift in how and when individuals search the Web.
  • Gerardo Canfora and Luigi Cerulo, Fine Grained Indexing of Software Repositories to Support Impact Analysis, in Proceedings of the 2006 International Workshop on Mining Software Repositories (MSR'06), pages 105-111, Shanghai, China, 2006.
    • Abstract: Versioned and bug-tracked software systems provide a huge amount of historical data regarding source code changes and issues management. In this paper we deal with impact analysis of a change request and show that data stored in software repositories are a good descriptor of how past change requests have been resolved. A fine grained analysis method of software repositories is used to index code at different levels of granularity, such as lines of code and source files, with free text contained in software repositories. The method exploits information retrieval algorithms to link the change request description and code entities impacted by similar past change requests. We evaluate such an approach on a set of three open-source projects.
  • Wednesday, March 23, 3:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan/Collin/Malcom)
  • Project Reports
  • Wednesday, March 16, 3:00pm, McGlothlin-Street Hall 128 (Leader: Collin/Bogdan)
  • Julius Davies, Daniel M. German, Michael W. Godfrey, Abram Hindle, Software Bertillonage: Finding the Provenance of an Entity, in Proceeding of the 8th Working Conference on Mining Software Repositories (MSR'11), Waikiki, Honolulu, Hawaii, USA, 2011.
    • Abstract: Deployed software systems are typically composed of many pieces, not all of which may have been created by the main development team. Often, the provenance of included components — such as external libraries or cloned source code — is not clearly stated, and this uncertainty can introduce technical and ethical concerns that make it difficult for system owners and other stakeholders to manage their software assets. We liken our provenance goals to that of Bertillonage, a simple and approximate forensic analysis technique based on biometrics that was developed in 19th century France before the advent of fingerprints. As an example, we have developed a fast, simple, and approximate technique called anchored signature matching for identifying library version information within a given Java application. This technique involves a type of structured signature matching performed against a database of candidates drawn from the Maven2 repository, a 150GB collection of open source Java libraries. An exploratory case study using a proprietary e-commerce Java application illustrates that the approach is both feasible and effective.
  • Wednesday, February 9, 3:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Richard Chow, Philippe Golle, and Jessica Staddon, Detecting privacy leaks using corpus-based association rules, in Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'08), pages 893-901, Las Vegas, Nevada, USA, 2008.
    • Abstract: Detecting inferences in documents is critical for ensuring privacy when sharing information. In this paper, we propose a refined and practical model of inference detection using a reference corpus. Our model is inspired by association rule mining: inferences are based on word co-occurrences. Using the model and taking the Web as the reference corpus, we can find inferences and measure their strength through web-mining algorithms that leverage search engines such as Google or Yahoo!. Our model also includes the important case of private corpora, to model inference detection in enterprise settings in which there is a large private document repository. We find inferences in private corpora by using analogues of our Web-mining algorithms, relying on an index for the corpus rather than a Web search engine. We present results from two experiments. The first experiment demonstrates the performance of our techniques in identifying all the keywords that allow for inference of a particular topic (e.g. "HIV") with confidence above a certain threshold. The second experiment uses the public Enron e-mail dataset. We postulate a sensitive topic and use the Enron corpus and the Web together to find inferences for the topic. These experiments demonstrate that our techniques are practical, and that our model of inference based on word co-occurrence is well-suited to efficient inference detection.
  • Wednesday, February 2, 3:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan/Collin)
    • Feature Location with History (Bogdan)
    • ExPort (Collin)
  • Wednesday, January 12, 3:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan/Malcom)
  • Tuesday, November 16, 2:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan)
  • Lucia, David Lo, Lingxiao Jiang and Aditya Budi, Comprehensive Evaluation of Association Measures for Fault Localization, in Proceedings of 26th International Conference on Software Maintenance (ICSM'10), Timişoara, Romania, 2010.
    • Abstract: In statistics and data mining communities, there have been many measures proposed to gauge the strength of association between two variables of interest, such as odds ratio, confidence, Yule-Y, Yule-Q, Kappa, and gini index. These association measures have been used in various domains, for example, to evaluate whether a particular medical practice is associated positively to a cure of a disease or whether a particular marketing strategy is associated positively to an increase in revenue, etc. This paper models the problem of locating faults as association between the execution or non-execution of particular program elements with failures. There have been special measures, termed as suspiciousness measures, proposed for the task. Two state-of-the-art measures are Tarantula and Ochiai, which are different from many other statistical measures. To the best of our knowledge, there is no study that comprehensively investigates the effectiveness of various association measures in localizing faults. This paper fills in the gap by evaluating 20 well-known association measures and comparing their effectiveness in fault localization tasks with Tarantula and Ochiai. Evaluation on the Siemens programs shows that a number of association measures perform statistically comparably to Tarantula and Ochiai.
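    • Discussion aid (not from the paper): the two baseline suspiciousness measures named above, Tarantula and Ochiai, can be computed from per-statement pass/fail coverage counts. A minimal Python sketch follows; the variable names and the toy example are ours.

        import math

        def tarantula(failed_s, passed_s, total_failed, total_passed):
            # Tarantula suspiciousness of a statement s.
            # failed_s / passed_s: failing / passing tests that execute s.
            # total_failed / total_passed: totals over the whole test suite.
            if failed_s == 0:
                return 0.0
            fail_ratio = failed_s / total_failed
            pass_ratio = passed_s / total_passed if total_passed else 0.0
            return fail_ratio / (fail_ratio + pass_ratio)

        def ochiai(failed_s, passed_s, total_failed):
            # Ochiai suspiciousness: failed_s / sqrt(total_failed * (failed_s + passed_s)).
            denom = math.sqrt(total_failed * (failed_s + passed_s))
            return failed_s / denom if denom else 0.0

        # A statement executed by 3 of 4 failing tests and 1 of 10 passing tests.
        print(tarantula(3, 1, 4, 10))  # ~0.88
        print(ochiai(3, 1, 4))         # 3 / sqrt(4 * 4) = 0.75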
  • Tuesday, November 2, 2:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
  • Gerardo Canfora, Michele Ceccarelli, Luigi Cerulo and Massimiliano Di Penta, Using Multivariate Time Series and Association Rules to Detect Logical Change Coupling: an Empirical Study, in Proceedings of 26th International Conference on Software Maintenance (ICSM'10), Timişoara, Romania, 2010.
    • Abstract: In recent years, techniques based on association rules discovery have been extensively used to determine change-coupling relations between artifacts that often changed together. Although association rules worked well in many cases, they fail to capture logical coupling relations between artifacts modified in subsequent change sets. To overcome such a limitation, we propose the use of multivariate time series analysis and forecasting, and in particular the use of the Granger causality test, to determine whether a change that occurred on a software artifact was consequentially related to changes that occurred on some other artifacts. Results of an empirical study performed on four Java and C open source systems show that the Granger causality test is able to provide a set of change couplings complementary to association rules, and a hybrid recommender built combining recommendations from association rules and Granger causality is able to achieve a higher recall than the two single techniques.
  • Tuesday, October 26, 2:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
  • Jafar M. Al-Kofahi, Ahmed Tamrawi, Tung Thanh Nguyen, Hoan Anh Nguyen and Tien N. Nguyen, Fuzzy Set Approach for Automatic Tagging in Evolving Software, in Proceedings of 26th International Conference on Software Maintenance (ICSM'10), Timişoara, Romania, 2010.
    • Abstract: Software tagging has been shown to be an efficient, lightweight social computing mechanism to improve different social and technical aspects of software development. Despite the importance of tags, there exists limited support for automatic tagging for software artifacts, especially during the evolutionary process of software development. We conducted an empirical study on IBM Jazz's repository and found that there are several missing tags in artifacts and more precise tags are desirable. This paper introduces a novel, accurate, automatic tagging recommendation tool that is able to take into account users' feedbacks on tags, and is very efficient in coping with software evolution. The core technique is an automatic tagging algorithm that is based on fuzzy set theory. Our empirical evaluation on the real-world IBM Jazz project shows the usefulness and accuracy of our approach and tool.
  • Tuesday, October 19, 2:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan)
  • Susan Elliott Sim, Steve Easterbrook, Richard C. Holt, Using Benchmarking to Advance Research: A Challenge to Software Engineering, in Proceedings of 25th International Conference on Software Engineering (ICSE'03), Portland, Oregon, 2003.
    • Abstract: Benchmarks have been used in computer science to compare the performance of computer systems, information retrieval algorithms, databases, and many other technologies. The creation and widespread use of a benchmark within a research area is frequently accompanied by rapid technical progress and community building. These observations have led us to formulate a theory of benchmarking within scientific disciplines. Based on this theory, we challenge software engineering research to become more scientific and cohesive by working as a community to define benchmarks. In support of this challenge, we present a case study of the reverse engineering community, where we have successfully used benchmarks to advance the state of research.
  • Tuesday, October 19, 2:00pm, McGlothlin-Street Hall 128 (Leader: Harry)
  • Benjamin Hummel, Elmar Juergens, Lars Heinemann and Michael Conradt, Index-Based Code Clone Detection: Incremental, Distributed, Scalable, in Proceedings of 26th International Conference on Software Maintenance (ICSM'10), Timişoara, Romania, 2010.
    • Abstract: Although numerous different clone detection approaches have been proposed to date, not a single one is both incremental and scalable to very large code bases. They thus cannot provide real-time cloning information for clone management of very large systems. We present a novel, index-based clone detection algorithm for type 1 and 2 clones that is both incremental and scalable. It enables a new generation of clone management tools that provide real-time cloning information for very large software. We report on several case studies that show both its suitability for real-time clone detection and its scalability: on 42 MLOC of Eclipse code, average time to retrieve all clones for a file was below 1 second; on 100 machines, detection of all clones in 73 MLOC was completed in 36 minutes.
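    • Discussion aid (a simplified sketch, not the authors' implementation): the index idea can be illustrated by hashing fixed-length chunks of normalized statements into an inverted index, so that a clone lookup for a changed file queries only the index instead of re-scanning the code base. The names, chunk length, and example below are ours.

        from collections import defaultdict
        from hashlib import md5

        CHUNK = 5  # consecutive normalized statements per index entry

        def normalize(stmt):
            # Placeholder normalization; a real detector would also strip
            # identifiers and literals to catch type-2 clones.
            return " ".join(stmt.split())

        def chunks(statements):
            norm = [normalize(s) for s in statements]
            for i in range(len(norm) - CHUNK + 1):
                yield i, md5("\n".join(norm[i:i + CHUNK]).encode()).hexdigest()

        def index_file(index, path, statements):
            # Add one file's chunk hashes to the global index: hash -> [(file, position)].
            for pos, h in chunks(statements):
                index[h].append((path, pos))

        def clones_of(index, statements, path):
            # Locations in other files sharing at least one chunk with `statements`.
            hits = set()
            for _, h in chunks(statements):
                hits.update(loc for loc in index.get(h, []) if loc[0] != path)
            return hits

        index = defaultdict(list)
        index_file(index, "A.java", ["int a = 0;", "a += 1;", "print(a);",
                                     "a += 2;", "print(a);", "return a;"])
        print(clones_of(index, ["int a = 0;", "a += 1;", "print(a);",
                                "a += 2;", "print(a);"], "B.java"))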
  • Tuesday, October 5, 2:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
  • Andrea De Lucia, Rocco Oliveto, and Paola Sgueglia, Incremental Approach and User Feedbacks: a Silver Bullet for Traceability Recovery, in Proceedings of 22nd International Conference on Software Maintenance (ICSM'06), Philadelphia, Pennsylvania, 2006.
    • Abstract: Several authors apply Information Retrieval (IR) techniques to recover traceability links between software artefacts. Recently, the use of user feedbacks (in terms of classification of retrieval links as correct or false positives) has been proposed to improve the retrieval performances of these techniques. In this paper we present a critical analysis of using feedbacks within an incremental traceability recovery process. In particular, we analyse the trade-off between the improvement of the performances and the link classification effort required to train the IR-based traceability recovery tool. We also present the results achieved in case studies and show that even though the retrieval performances generally improve with the use of feedbacks, IR-based approaches are still far from solving the problem of recovering all correct links with a low classification effort.
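    • Discussion aid (an assumption on our part, not spelled out in the abstract): feedback in vector-space traceability tools of this kind is commonly realized as Rocchio-style query reformulation, moving the query toward links classified as correct and away from those classified as false positives. A hedged Python sketch with illustrative weights:

        import numpy as np

        def rocchio(query_vec, relevant_vecs, nonrelevant_vecs,
                    alpha=1.0, beta=0.75, gamma=0.15):
            # One round of feedback on term-vector representations of artifacts.
            q = alpha * query_vec
            if len(relevant_vecs):
                q = q + beta * np.mean(relevant_vecs, axis=0)
            if len(nonrelevant_vecs):
                q = q - gamma * np.mean(nonrelevant_vecs, axis=0)
            return np.clip(q, 0.0, None)  # keep term weights non-negative

        # Toy example with four index terms.
        query = np.array([1.0, 0.0, 0.5, 0.0])
        correct_links = [np.array([0.9, 0.1, 0.6, 0.0])]
        false_positives = [np.array([0.0, 1.0, 0.0, 0.8])]
        print(rocchio(query, correct_links, false_positives))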
  • Tuesday, October 5, 2:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Project Report
  • Tuesday, September 28, 2:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Thanh H. D. Nguyen, Bram Adams and Ahmed E. Hassan, Studying the Impact of Dependency Network Measures on Software Quality, in Proceedings of 26th International Conference on Software Maintenance (ICSM'10), Timişoara, Romania, 2010.
    • Abstract: Dependency network measures capture various facets of the dependencies among software modules. For example, betweenness centrality measures how much information flows through a module compared to the rest of the network. Prior studies have shown that these measures are good predictors of post-release failures. However, these studies did not explore the causes for such good performance and did not provide guidance for practitioners to avoid future bugs. In this paper, we closely examine the causes for such performance by replicating prior studies using data from the Eclipse project. Our study shows that a small subset of dependency network measures have a large impact on post-release failure, while other network measures have a very limited impact. We also analyze the benefit of bug prediction in reducing testing cost. Finally, we explore the practical implications of the important network measures.
  • Tuesday, September 21, 2:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
  • ICSM 2010 - Informal Report
  • Tuesday, September 14, 2:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan)
  • Feature Location Project
  • Tuesday, September 14, 2:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Portfolio Project and Exemplar Project
  • Tuesday, August 31, 2:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Portfolio Project
  • Monday, June 21, 2:00pm, McGlothlin-Street Hall 002 (Leader: Collin)
  • Andrew Begel, Khoo Yit Phang, and Thomas Zimmermann, Codebook: Discovering and Exploiting Relationships in Software Repositories, in Proceedings of 32nd ACM/IEEE International Conference on Software Engineering (ICSE'10), pages 125-134, Cape Town, South Africa, May 2-8, 2010.
    • Abstract: Large-scale software engineering requires communication and collaboration to successfully build and ship products. We conducted a survey with Microsoft engineers on inter-team coordination and found that the most impactful problems concerned finding and keeping track of other engineers. Since engineers are connected by their shared work, a tool that discovers connections in their work-related repositories can help. Here we describe the Codebook framework for mining software repositories. It is flexible enough to address all of the problems identified by our survey with a single data structure (graph of people and artifacts) and a single algorithm (regular language reachability). Codebook handles a larger variety of problems than prior work, analyzes more kinds of work artifacts, and can be customized by and for end-users. To evaluate our framework’s flexibility, we built two applications, Hoozizat and Deep Intellisense. We evaluated these applications with engineers to show effectiveness in addressing multiple inter-team coordination problems.
  • Monday, June 21, 2:00pm, McGlothlin-Street Hall 002 (Leader: Bogdan)
  • Markus M. Geipel, and Frank Schweitzer, Software change dynamics: evidence from 35 java projects, in Proceedings of The 7th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE'09), pages 203-212, Amsterdam, The Netherlands, August 24-28, 2009.
    • Abstract: In this paper we investigate the relationship between class dependency and change propagation in Java software. By analyzing 35 large Open Source Java projects, we find that in the majority of the projects more than half of the dependencies are never involved in change propagation. Furthermore, our analysis shows that only a few dependencies are transmitting the majority of change propagation events. An additional analysis reveals that this concentration cannot be explained by the different ages of the dependencies. The conclusion is that the dependency structure alone is a poor measure for the change dynamics. This contrasts with current literature.
  • Monday, June 21, 2:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
  • Thomas Fritz, Jingwen Ou, Gail C. Murphy, and Emerson Murphy-Hill, A degree-of-knowledge model to capture source code familiarity, in Proceedings of 32nd ACM/IEEE International Conference on Software Engineering (ICSE'10), to appear, Cape Town, South Africa, May 2-8, 2010.
    • Abstract: The size and high rate of change of source code comprising a software system make it difficult for software developers to keep up with who on the team knows about particular parts of the code. Existing approaches to this problem are based solely on authorship of code. In this paper, we present data from two professional software development teams to show that both authorship and interaction information about how a developer interacts with the code are important in characterizing a developer's knowledge of code. We introduce the degree-of-knowledge model that computes automatically a real value for each source code element based on both authorship and interaction information. We show that the degree-of-knowledge model can provide better results than an existing expertise finding approach and also report on case studies of the use of the model to support knowledge transfer and to identify changes of interest.
  • Wednesday, June 16, 2:00pm, McGlothlin-Street Hall 002 (Leader: Bogdan)
  • Brendan Cleary, Chris Exton, Jim Buckley, and Michael English, An empirical analysis of information retrieval based concept location techniques in software comprehension, in Empirical Software Engineering 14(1), pages 93-130, February 2009.
    • Abstract: Concept location, the problem of associating human oriented concepts with their counterpart solution domain concepts, is a fundamental problem that lies at the heart of software comprehension. Recent research has attempted to alleviate the impact of the concept location problem through the application of methods drawn from the information retrieval (IR) community. Here we present a new approach based on a complementary IR method which also has a sound basis in cognitive theory. We compare our approach to related work through an experiment and present our conclusions. This research adapts and expands upon existing language modelling frameworks in IR for use in concept location in software systems. In doing so it is novel in that it leverages implicit information available in system documentation. Surprisingly, empirical evaluation of this approach showed little performance benefit overall and several possible explanations are forwarded for this finding.
  • Wednesday, June 16, 2:00pm, McGlothlin-Street Hall 002 (Leader: Bogdan)
  • Zachary M. Saul, Vladimir Filkov, Premkumar Devanbu, and Christian Bird, Recommending random walks, in Proceedings of The 6th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE'07), pages 15-24, Dubrovnik, Croatia, September 3-7, 2007.
    • Abstract: We improve on previous recommender systems by taking advantage of the layered structure of software. We use a random-walk approach, mimicking the more focused behavior of a developer, who browses the caller-callee links in the callgraph of a large program, seeking routines that are likely to be related to a function of interest. Inspired by Kleinberg's work [10], we approximate the steady-state of an infinite random walk on a subset of a callgraph in order to rank the functions by their steady-state probabilities. Surprisingly, this purely structural approach works quite well. Our approach, like that of Robillard's "Suade" algorithm [15], and earlier data mining approaches [13] relies solely on the always available current state of the code, rather than other sources such as comments, documentation or revision information. Using the Apache API documentation as an oracle, we perform a quantitative evaluation of our method, finding that our algorithm dramatically improves upon Suade in this setting. We also find that the performance of traditional data mining approaches is complementary to ours; this leads naturally to an evidence-based combination of the two, which shows excellent performance on this task.
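    • Discussion aid (a generic sketch, not the authors' exact formulation): the ranking described above, steady-state probabilities of a random walk over a call-graph neighborhood, can be approximated with a PageRank-style power iteration. The graph and function names below are illustrative.

        def random_walk_rank(callgraph, damping=0.85, iters=50):
            # Approximate steady-state probabilities of a random walk on a call graph.
            # callgraph: dict mapping each function to its neighbors (edges treated
            # as undirected here for simplicity).
            nodes = list(callgraph)
            rank = {n: 1.0 / len(nodes) for n in nodes}
            for _ in range(iters):
                new = {n: (1.0 - damping) / len(nodes) for n in nodes}
                for n in nodes:
                    neighbors = callgraph[n]
                    if not neighbors:
                        continue
                    share = damping * rank[n] / len(neighbors)
                    for m in neighbors:
                        new[m] += share
                rank = new
            return sorted(rank.items(), key=lambda kv: -kv[1])

        # Tiny call-graph neighborhood around a function of interest.
        g = {
            "parse_request": ["read_header", "decode_body"],
            "read_header": ["parse_request", "log"],
            "decode_body": ["parse_request", "log"],
            "log": ["read_header", "decode_body"],
        }
        print(random_walk_rank(g))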
  • Wednesday, June 16, 2:00pm, McGlothlin-Street Hall 002 (Leader: Collin)
  • Fan Long, Xi Wang, and Yang Cai, API hyperlinking via structural overlap, in Proceedings of The 7th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE'09), pages 203-212, Amsterdam, The Netherlands, August 24-28, 2009.
    • Abstract: This paper presents a tool Altair that automatically generates API function cross-references, which emphasizes reliable structural measures and does not depend on specific client code. Altair ranks related API functions for a given query according to pair-wise overlap, i.e., how they share state, and clusters tightly related ones into meaningful modules. Experiments against several popular C software packages show that Altair recommends related API functions for a given query with remarkably more precise and complete results than previous tools, that it can extract modules from moderate-sized software (e.g., Apache with 1000+ functions) at high precision and recall rates (e.g., both exceeding 70% for two modules in Apache), and that the computation can finish within a few seconds.
  • Wednesday, June 9, 2:00pm, McGlothlin-Street Hall 002 (Leader: Malcom)
  • Carlos Castro-Herrera and Jane Cleland-Huang, Utilizing Recommender Systems to Support Software Requirements Elicitation, in Proceedings of 2nd ICSE 2010 International Workshop on Recommendation Systems for Software Engineering (RSSE'10), Cape Town, South Africa, May 4, 2010.
    • Abstract: Requirements Engineering involves a number of human intensive activities designed to help project stakeholders discover, analyze, and specify the functional and non-functional needs for a software intensive system. Recommender systems can support several different areas of this process including identifying potential subject matter experts for a topic, keeping individual stakeholders informed of relevant issues, and even recommending possible features for stakeholders to consider and explore. This position paper summarizes an extensive series of experiments that were conducted to identify best-of-breed algorithms for recommending forums to stakeholders and recommending unexplored topics to project managers.
  • Wednesday, June 9, 2:00pm, McGlothlin-Street Hall 002 (Leader: Malcom)
  • Jane Cleland-Huang, Adam Czauderna, John Emenecker, Marek Gibiec, A Machine Learning Approach for Tracing Regulatory Codes to Product Specific Requirements, in Proceedings of 32nd ACM/IEEE International Conference on Software Engineering (ICSE'10), pages 155-164, Cape Town, South Africa, May 2-8, 2010.
    • Abstract: Regulatory standards, designed to protect the safety, security, and privacy of the public, govern numerous areas of software intensive systems. Project personnel must therefore demonstrate that an as-built system meets all relevant regulatory codes. Current methods for demonstrating compliance rely either on after-the-fact audits, which can lead to significant refactoring when regulations are not met, or else require analysts to construct and use traceability matrices to demonstrate compliance. Manual tracing can be prohibitively time-consuming; however, automated trace retrieval methods are not very effective due to the vocabulary mismatches that often occur between regulatory codes and product level requirements. This paper introduces and evaluates two machine-learning methods, designed to improve the quality of traces generated between regulatory codes and product level requirements. The first approach uses manually created traceability matrices to train a trace classifier, while the second approach uses web-mining techniques to reconstruct the original trace query. The techniques were evaluated against security regulations from the US government's Health Insurance Portability and Accountability Act (HIPAA) traced against ten healthcare related requirements specifications. Results demonstrated improvements for the subset of HIPAA regulations that exhibited high fan-out behavior across the requirements datasets.
  • Wednesday, June 9, 2:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Theory of relevance
  • Wednesday, May 19, 2:00pm, McGlothlin-Street Hall 002 (Leader: Malcom)
  • Per Runeson and Martin Höst, Guidelines for conducting and reporting case study research in software engineering, in Empirical Software Engineering 14(2), pages 131-164, April 2009.
    • Abstract: Case study is a suitable research methodology for software engineering research since it studies contemporary phenomena in its natural context. However, the understanding of what constitutes a case study varies, and hence the quality of the resulting studies. This paper aims at providing an introduction to case study methodology and guidelines for researchers conducting case studies and readers studying reports of such studies. The content is based on the authors' own experience from conducting and reading case studies. The terminology and guidelines are compiled from different methodology handbooks in other research domains, in particular social science and information systems, and adapted to the needs in software engineering. We present recommended practices for software engineering case studies as well as empirically derived and evaluated checklists for researchers and readers of case study research.
  • Wednesday, May 19, 2:00pm, McGlothlin-Street Hall 002 (Leader: Bogdan)
  • Martin P. Robillard, Topology Analysis of Software Dependencies, in ACM Transactions on Software Engineering and Methodology (TOSEM), 17(4):Article 18 (36 pages), August 2008.
    • Abstract: Before performing a modification task, a developer usually has to investigate the source code of a system to understand how to carry out the task. Discovering the code relevant to a change task is costly because it is a human activity whose success depends on a large number of unpredictable factors, such as intuition and luck. Although studies have shown that effective developers tend to explore a program by following structural dependencies, no methodology is available to guide their navigation through the thousands of dependency paths found in a nontrivial program. We describe a technique to automatically propose and rank program elements that are potentially interesting to a developer investigating source code. Our technique is based on an analysis of the topology of structural dependencies in a program. It takes as input a set of program elements of interest to a developer and produces a fuzzy set describing other elements of potential interest. Empirical evaluation of our technique indicates that it can help developers quickly select program elements worthy of investigation while avoiding less interesting ones.
  • Monday, April 26, 2:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Mark Grechanik, Chen Fu, Qing Xie, Collin McMillan, Denys Poshyvanyk, and Chad Cumby, Exemplar: EXEcutable exaMPLes ARchive, in Proceedings of 32nd ACM/IEEE International Conference on Software Engineering (ICSE'10), Formal Research Tool Demonstration, Cape Town, South Africa, May 2-8, 2010.
    • Abstract: Searching for applications that are highly relevant to development tasks is challenging because the high-level intent reflected in the descriptions of these tasks doesn’t usually match the low-level implementation details of applications. In this demo we show a novel code search engine called Exemplar (EXEcutable exaMPLes ARchive) to bridge this mismatch. Exemplar takes natural-language query that contains high-level concepts (e.g., MIME, data sets) as input, then uses information retrieval and program analysis techniques to retrieve applications that implement these concepts.
  • Monday, April 26, 2:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Collin McMillan, Denys Poshyvanyk, and Mark Grechanik, Recommending Source Code Examples via API Call Usages and Documentation, in Proceedings of 2nd ICSE 2010 International Workshop on Recommendation Systems for Software Engineering (RSSE'10), to appear, Cape Town, South Africa, May 4, 2010.
  • Monday, April 26, 2:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan)
  • Evaluating SITIR with other IR-based techniques
  • Wednesday, April 21, 12:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Mark Grechanik, Chen Fu, Qing Xie, Collin McMillan, Denys Poshyvanyk, and Chad Cumby, Exemplar: EXEcutable exaMPLes ARchive, in Proceedings of 32nd ACM/IEEE International Conference on Software Engineering (ICSE'10), Formal Research Tool Demonstration, Cape Town, South Africa, May 2-8, 2010.
    • Abstract: Searching for applications that are highly relevant to development tasks is challenging because the high-level intent reflected in the descriptions of these tasks doesn’t usually match the low-level implementation details of applications. In this demo we show a novel code search engine called Exemplar (EXEcutable exaMPLes ARchive) to bridge this mismatch. Exemplar takes natural-language query that contains high-level concepts (e.g., MIME, data sets) as input, then uses information retrieval and program analysis techniques to retrieve applications that implement these concepts.
  • Wednesday, April 21, 12:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Collin McMillan, Denys Poshyvanyk, and Mark Grechanik, Recommending Source Code Examples via API Call Usages and Documentation, in Proceedings of 2nd ICSE 2010 International Workshop on Recommendation Systems for Software Engineering (RSSE'10), to appear, Cape Town, South Africa, May 4, 2010.
  • Wednesday, April 21, 12:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
  • LDA-based coupling metric
  • Wednesday, April 21, 12:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
  • Combining MSR and IR
  • Monday, April 12, 12:00pm, McGlothlin-Street Hall 002 (Leader: Meghan)
  • Supporting Feature-level Software Maintenance
  • Tuesday, April 6, 3:00pm, McGlothlin-Street Hall 128 (Leader: Bogdan)
  • Brian de Alwis and Gail C. Murphy, Answering conceptual queries with Ferret, in International Conference on Software Engineering (ICSE'08), pages 21-30, Leipzig, Germany, 2008.
    • Abstract: Programmers seek to answer questions as they investigate the functioning of a software system, such as "which execution path is being taken in this case?" Programmers attempt to answer these questions, which we call conceptual queries, using a variety of tools. Each type of tool typically highlights one kind of information about the system, such as static structural information or control-flow information. Unfortunately for the programmer, the tools seldom directly answer the programmer's conceptual queries. Instead, the programmer must piece together results from different tools to determine an answer to the initial query. At best, this process is time consuming and at worst, this process can lead to data overload and disorientation. In this paper, we present a model that supports the integration of different sources of information about a program. This model enables the results of concrete queries in separate tools to be brought together to directly answer many of a programmer's conceptual queries. In addition to presenting this model, we present a tool that implements the model, demonstrate the range of conceptual queries supported by this tool, and present the results of use of the conceptual queries in a small field study.
  • Tuesday, April 6, 3:00pm, McGlothlin-Street Hall 128 (Leader: Denys)
  • Thomas Fritz and Gail C. Murphy, Using Information Fragments to Answer the Questions Developers Ask, in Proceedings of 32nd ACM/IEEE International Conference on Software Engineering (ICSE'10), to appear, Cape Town, South Africa, May 2-8, 2010.
    • Abstract: Each day, a software developer needs to answer a variety of questions that require the integration of different kinds of project information. Currently, answering these questions, such as "What have my co-workers been doing?", is tedious, and sometimes impossible, because the only support available requires the developer to manually link and traverse the information step-by-step. Through interviews with eleven professional developers, we identified 78 questions developers want to ask, but for which support is lacking. We introduce an information fragment model (and prototype tool) that automates the composition of different kinds of information and that allows developers to easily choose how to display the composed information. In a study, 18 professional developers used the prototype tool to answer eight of the 78 questions. All developers were able to easily use the prototype to successfully answer 94% of questions in a mean time of 2.3 minutes per question.
  • Tuesday, April 6, 3:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • CLAN/Exemplar-new user study results
  • Tuesday, March 30, 3:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Secil Ugurel, Robert Krovetz, C. Lee Giles, David M. Pennock, Eric Glover, and Hongyuan Zha, What's the Code? Automatic Classification of Source Code Archives, in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'02), pages 632-638, Edmonton, Alberta, Canada, 2002.
    • Abstract: There are various source code archives on the World Wide Web. These archives are usually organized by application categories and programming languages. However, manually organizing source code repositories is not a trivial task since they grow rapidly and are very large (on the order of terabytes). We demonstrate machine learning methods for automatic classification of archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and the Sourceforge archives. Support vector machine (SVM) classifiers are trained on examples of a given programming language or programs in a specified category. We show that source code can be accurately and automatically classified into topical categories and can be identified to be in a specific programming language class.
  • Tuesday, March 30, 3:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
  • LDA-based coupling metric
  • Tuesday, March 23, 3:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
  • Raymond P.L. Buse and Westley R. Weimer, Learning a Metric for Code Readability, in IEEE Transactions on Software Engineering, to appear.
    • Abstract: In this paper, we explore the concept of code readability and investigate its relation to software quality. With data collected from 120 human annotators, we derive associations between a simple set of local code features and human notions of readability. Using those features, we construct an automated readability measure and show that it can be 80% effective, and better than a human on average, at predicting readability judgments. Furthermore, we show that this metric correlates strongly with three measures of software quality: code changes, automated defect reports, and defect log messages. We measure these correlations on over 2.2 million lines of code, as well as longitudinally, over many releases of select projects. Finally, we discuss the implications of this study on programming language design and engineering practice. For example, our data suggests that comments, in and of themselves, are less important than simple blank lines to local judgments of readability.
  • Tuesday, February 23, 3:00pm, McGlothlin-Street Hall 128 (Leader: Meghan)
  • Supporting Feature-level Software Maintenance
  • Tuesday, February 16, 3:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Raphael Hoffmann, James Fogarty, and Daniel S. Weld, Assieme: Finding and Leveraging Implicit References in a Web Search Interface for Programmers, in Proceedings of the 20th annual ACM symposium on User interface software and technology (UIST'07), pages 13-22, Newport, Rhode Island, USA, 2007.
    • Abstract: Programmers regularly use search as part of the development process, attempting to identify an appropriate API for a problem, seeking more information about an API, and seeking samples that show how to use an API. However, neither general-purpose search engines nor existing code search engines currently fit their needs, in large part because the information programmers need is distributed across many pages. We present Assieme, a Web search interface that effectively supports common programming search tasks by combining information from Web-accessible Java Archive (JAR) files, API documentation, and pages that include explanatory text and sample code. Assieme uses a novel approach to finding and resolving implicit references to Java packages, types, and members within sample code on the Web. In a study of programmers performing searches related to common programming tasks, we show that programmers obtain better solutions, using fewer queries, in the same amount of time spent using a general Web search interface.
  • Thursday, February 11, 2:00pm, McGlothlin-Street Hall 128 (Leaders: Meghan and Bogdan)
  • Andy Zaidman and Serge Demeyer, Automatic Identification of Key Classes in a Software System Using Webmining Techniques, in Journal of Software Maintenance and Evolution: Research and Practice, 20(6), pages 387-417, Wiley, November/December 2008.
    • Abstract: Software engineers new to a project are often stuck sorting through hundreds of classes in order to find those few classes that offer a significant insight into the inner workings of the software project. To help stimulate this process, we propose a technique that can identify the most important classes in a system or the key classes of that system. Software engineers can use these classes to focus their understanding efforts when starting to work on a new software project. Those key classes are typically characterized with having a lot of 'control' within the application. In order to find these controlling classes, we present a detection approach that is based on dynamic coupling and webmining. We demonstrate the potential of our technique using two open-source software systems that have a rich documentation set. During the case studies we use dynamically gathered coupling information that vary between a number of coupling metrics. The case studies show that we are able to retrieve 90% of the classes deemed important by the original maintainers of the systems, while maintaining a level of precision of around 50%.
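    • Discussion aid (hedged sketch): the webmining step referred to above is typically a HITS-style hub/authority computation over a dynamic coupling graph between classes; classes with high hub scores orchestrate many others and are candidates for the "key classes". The Python sketch below uses an illustrative graph of our own, not data from the paper.

        import math

        def hits(coupling, iters=50):
            # HITS on a directed coupling graph: A -> B if A calls into B at run time.
            nodes = set(coupling) | {t for ts in coupling.values() for t in ts}
            hub = {n: 1.0 for n in nodes}
            auth = {n: 1.0 for n in nodes}
            for _ in range(iters):
                auth = {n: sum(hub[s] for s, ts in coupling.items() if n in ts) for n in nodes}
                norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
                auth = {n: v / norm for n, v in auth.items()}
                hub = {n: sum(auth[t] for t in coupling.get(n, ())) for n in nodes}
                norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
                hub = {n: v / norm for n, v in hub.items()}
            return hub, auth

        # Illustrative runtime coupling between classes.
        calls = {
            "Scheduler": ["TaskQueue", "Worker", "Logger"],
            "Worker": ["TaskQueue", "Logger"],
            "TaskQueue": ["Logger"],
        }
        hub, auth = hits(calls)
        print(sorted(hub, key=hub.get, reverse=True))  # likely key/controlling classes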
  • Tuesday, February 2, 3:00pm, McGlothlin-Street Hall 128 (Leader: Collin)
  • Portfolio Project
  • Tuesday, January 26, 3:00pm, McGlothlin-Street Hall 128 (Leader: Malcom)
  • G. Capobianco, A. De Lucia, R. Oliveto, A. Panichella, and S. Panichella, On the Role of the Nouns in IR-based Traceability Link Recovery, in Proceedings of the 17th International Conference on Program Comprehension (ICPC'09), pages 148-157, Vancouver, British Columbia, Canada, 2009.
    • Abstract: The intensive human effort needed to manually manage traceability information has increased the interest in utilising semi-automated traceability recovery techniques. This paper presents a simple way to improve the accuracy of traceability recovery methods based on information retrieval techniques. The proposed method acts on the artefact indexing considering only the nouns contained in the artefact content to define the semantics of an artefact. The rationale behind such a choice is that the language used in software documents can be classified as a sectorial language, where the terms that provide more indication on the semantics of a document are the nouns. The results of a reported case study demonstrate that the proposed artefact indexing significantly improves the accuracy of traceability recovery methods based on the probabilistic or vector space based IR models.