Header image
Reserach Group @ William and Mary
line decor
line decor

Data Locality Enhancement of Dynamic Simulations for Exascale Computing

Sponsor: DOE (Early Career)

Memory performance is key to maximizing computing efficiency in the era of Chip Multiprocessors (CMP) due to the growing disparity between the slowly expanded memory bandwidth and the rapidly increased demands for data by processors. The importance is underlined by the trend towards exascale computing, in which, the processors are expected to each contain hundreds or thousands of (heterogeneous) cores.

Computer simulation is important for scientific research in many disciplines. Many such programs are complex, and transfer a large amount of data in a dynamically changing pattern. Unfortunately, today's computer systems lack support for high degree of memory transfer. Although hardware cache can reduce the amount of memory access, scientific programs---especially those with dynamic reference patterns---are among the most difficult to cache. The issue have become more critical than before in CMP systems, where certain on-chip storage and off-chip pin bandwidth are shared among cores on a single chip. The sharing causes co-running threads to compete for the limited cache and bandwidth, further worsening the issue of memory performance. Many traditional optimization techniques, being oblivious to the new features, lose their effectiveness.

The goal of this research is to advance existing locality optimizations to match scientific simulations with the new features of the memory hierarchy in a heterogeneous and massively parallel system. It will produce a robust tool for scientific users to enhance program locality on multi- and many-core systems that is not possible to achieve with existing tools. It is the next step from the PI's previous work in memory optimizations for scientific programs. It will contribute to the advance of computational sciences and promote academic research and education in the challenging field of scientific computing.

Input-Centric Program Behavior Analysis and Adaptation


adaptivity-proactivity diilemma
(Input-centric paradigm overcomes an adaptivity-proactivity dilemma facing existing program optimizations.

Program behavior analysis lies at the core of software-hardware matching. By analyzing and predicting program dynamic behaviors, it offers the fundamental support for program transformation by compilers, resource manage- ment by operating systems (OS), and runtime adaptation by virtual machines. Its effectiveness is crucial for the maximization of computing efficiency.

In this project, we propose to include program inputs—a so far virtually ignored dimension— into the focus of program behavior analysis, cultivating a new paradigm, namely input-centric program behavior analysis and adaptation.

In the last decades, program behavior analysis has developed from static code analysis in compilers, to offline profiling, and dynamic sampling in runtime systems. But none of them has systematically explored and exploited program inputs. Although these techniques have been remarkably successful in supporting software-hardware matching, their lack of explorations in the inputs dimension has resulted in a proactivity-adaptivity dilemma. On the one hand, compilers and offline profiling-based techniques analyze and predict the behaviors of the entire program before the start of its production runs (proactivity), but cannot adapt well to the changes in inputs and execution environments (adaptivity). On the other hand, the lack of proactivity in runtime adaptive behavior analysis causes delays in dynamic optimizations, limits the scope of software-hardware matching, and results in local optimum traps in adaptation. With the continuous increase of the parallelism and complexity in both modern hardware and soft- ware, the limitations that the dilemma imposes have become increasingly prominent in software-hardware matching.

Triggered by the strong and predictable correlations between program inputs and behaviors that recent studies have uncovered, we propose to add explorations on program inputs into the core of program behavior analysis. This input-centric paradigm will create many new opportunities to program behavior prediction, most prominently, the co-existence of the proactivity and adaptivity, resolving the inherent dilemma in existing techniques.

Input-centric program behavior analysis and adaptation consists of three components. The fundamental compo- nent is program input characterization. It tries to extract important features from raw program inputs, which may contain involved syntactic structures, semantics, and other complexities. The second component, input-behavior modeling, concentrates on the recognition and modeling of the correlations between characterized input features and program behaviors. These two components constitute input-centric program behavior analysis, which (ideally) will predict the large-scope behaviors of a program’s execution directly from its inputs as soon as the execution starts. The third component is input-centric adaptation. In this component, dynamic optimizations, by capitalizing on the novel opportunities that the first two components create, become proactive and holistic, but without losing the adaptivity to inputs and environmental changes.

Together, the three components make evolvable programming systems more feasible than before. In such a system, the input-behavior models embody the central knowledge base, which grows incrementally across program production runs. As the knowledge base becomes larger, behavior prediction becomes more accurate, stimulating better software-hardware matching and making the program and runtime systems perform increasingly better.

Because program behavior analysis is fundamental to software-hardware matching, this research is expected to create many new opportunities for optimizations in various layers in the software execution stack (compilers, virtual machines, OS, etc.). In this project, we will particularly concentrate on three uses that are among the most needed in the current and foreseeable future computing: dynamic optimizations in runtime systems (e.g., Java Virtual Machine), adaptive program parallelization, and shared-cache management on CMP. The third one, in particular, will exemplify the benefits for cross-execution-layer optimizations.

Exploration and Exploitation of Cache Shareing on CMP


(Illustration of a divide-conquer approach for CMP scheduling through graph theory.)

The increasing problems of power, heat dissipation, and design complexity have caused a shift in processor technology to favor multicore multiprocessors. Along with that shift, the sharing of memory hierarchy becomes deeper, heterogeneous and more complex, causing cache contention, increased conflicts, and also, synergy sharing. Without understanding the implications of this change, current multicore systems suffer from considerable performance degradation, poor performance isolation and inferior fairness guarantees. The urgency of these issues increases as the degree of processor-level parallelism increments rapidly.

Prior studies, mostly in areas of architecture and operating systems, rely on simple heuristics to estimate cache requirement of corunning programs; the inaccuracy and overhead limits their scalability and effectiveness. This work tackles these challenges uniquely from the compiler aspect by constructing predictive behavior models for corunning processes, developing cache-sharing-aware program transformations and loop scheduling, and combining the program-level knowledge of programming systems with the proactive resource management by runtime systems. In particular, this research develops the following two techniques.

Inclusive Locality and Affinity Analysis: This work proposes inclusive reuse signatures to characterize inclusive locality---the memory behavior of corunning programs on shared caches, and inter-thread affinity models to capture data locality among parallel threads. We will develop a set of techniques to tackle the challenges facing the measurement, prediction and exploitation of inclusive locality, producing a better understanding to the implications of heterogeneous cache sharing. The techniques include a statistical model that bridges the memory behavior of programs' single-runs and co-runs, a three-stage framework unifying profiling-based learning and runtime sampling for the prediction of inclusive locality, a set of solutions designed to handle various factors---program inputs, corunners, phases and architectures---that determine the impact of shared caches on inclusive locality, and an integrative cost/benefit model for considerations of various effects of shared caches.

Cache-sharing Aware Optimizations: The analysis of inclusive locality and affinity opens new opportunities for shared-cache optimizations by both compilers and runtime systems. This work will develop some program transformations, such as inter-thread memory reorganization and cache-sharing aware loop scheduling, to increase inter-thread spacial locality and ameliorate conflicts, contention and false sharing. For runtime systems, this work proposes {\em proactive cache management} which partitions caches or schedules processes according to predicted inclusive locality. It overcomes the limitations of current reactive schemes on scalability, accuracy and effectiveness. To efficiently incorporate inclusive locality models into proactive cache management, we will explore the tradeoff between accuracy and efficiency, developing a set of lightweight models and hybrid strategies to make the prediction fast and accurate enough. We will develop randomized and approximation algorithms to tackle the complexities in determining the optimal solutions in proactive cache management.


BOP (Behavior-Oriented Parallelization)


Collaborators: Chen Ding (U of Rochester), Michael L. Scott (U of Rochester)

(Illustration of the speculation scheme in behavior-oriented parallelization.)

The increasing problems of power, heat dissipation, and design complexity have caused a shift in processor technology to favor multicore multiprocessors. Most programs, however, cannot yet fully utilize all CPUs in a system. The redundant computing resource opens new opportunities for software speculation, where program code is speculatively executed to improve speed at the cost of having to monitor for errors and the risk of having to re-execute code when an error happens.

Many existing programs have dynamic high-level parallelism, such as, a compression tool that processes data buffer by buffer, an English parser parsing sentence by sentence, and an interpreter interpreting expression by expression. They are complex and may make extensive use of bit-level operations, unrestricted pointers, exception handling, custom memory management, and third-party libraries. The unknown data access and control flow make such applications difficult if not impossible to parallelize in a fully automatic fashion. On the other hand, manual parallelization is a daunting task for complex programs, pre-existing ones in particular. Moreover, many programs have input-dependent behavior where both the degree and the granularity of parallelism are not guaranteed or even predictable. The complexity and the uncertain performance gain make it difficult to justify the investment of time and the risk of error.

This work will address this paralyzing dilemma with a new, behavior-based approach. Unlike traditional code-based techniques, the new approach allows a user or a profiling tool to parallelize or optimize a program based on partial information about the program code and the input. It uses modern parallel processors to reduce and hide the overhead of dynamic correctness checking and error recovery. In particular, the research will develop Behavior-oriented parallelization (BOP) technique. It lets a user identify possible parallel regions (PPR), which the system speculatively executes in parallel on available processors. To protect against unpredictable control flow and data access, BOP uses heavy-weight Unix processes for incremental data privatization and full error recovery. To control the cost of speculation, the research will develop new compiler and run-time techniques to detect and manage speculative parallelism, reduce the cost of monitoring, avoid unprofitable speculation attempts, and bound the worse-case running time. The speculation will be programmable with a language interface to take user hints and provide run-time feedback. Unlike most parallel languages, BOP will support parallelism that is likely but not definite, handle unpredictable behavior in unknown code, and require no parallel debugging.

Links to Software Used in the Projects

Jikes RVM





      Last updated on 8/2010