My research lies in the area of
Program Analysis and Compiler Technology. I focus on modeling and
predicting large-scale dynamic behavior patterns to improve
performance, control memory size, and support coarse-grain
parallelization through offline program transformation and online
program adaptation. The ultimate goal is an intelligent programming
system, which injects into a program the ability to automatically adapt
and evolve its code and data and configure its running environment such
that a better version of the program could dynamically match its input,
behavior and running environment.
Program adaptation is not possible without accurately forecasting a
program's behavior. However, traditional modular program design and
analysis are ill-fitted for finding large-scale composite patterns from
increasingly complicated code, dynamically allocated data, and
multi-layered execution environment (e.g. interpreters, virtual
machines, operating systems and computer architecture.) My
research views a program as a composition of large-scale behavior
patterns, each of which may span a large number of loops and procedures
statically and billions of instructions dynamically. I apply statistical
learning techniques to automatically recognize the patterns, build models of
program behavior, and exploit them in offline program transformation and online program adaptation
to improve the efficiency, reliability, and fairness of computing.
Behavior-oriented parallelization
Many sequential applications are difficult to parallelize because of unpredictable control flow, indirect data access, and input-dependent parallelism. These difficulties led us to build a software system for behavior-oriented parallelization (BOP), which allows a program to be parallelized based on partial information about program behavior, for example, a user reading just part of the source code, or a profiling tool examining merely one or few executions.
The basis of BOP is programmable software speculation, where a user or an analysis tool marks possibly parallel regions in the code, and the run-time system executes these regions speculatively. It is imperative to protect the entire address space during speculation. The main goal of the paper is to demonstrate that the general protection can be made cost effective by three novel techniques: programmable speculation, critical-path minimization, and value-based correctness checking. On a recently acquired multi-core, multi-processor PC, the BOP system reduced the end-to-end execution time by integer factors for a Lisp interpreter, a data compressor, a language parser, and a scientific library, with no change to the underlying hardware or operating system [PLDI'07, LCPC'05].
Input-centric optimization
Program dynamic optimization, for being adaptive to runtime behavior changes, has become increasingly important for both performance and energy savings. However, most runtime optimizations often suffer from the lack of a global picture of a program's execution, and cannot afford sophisticated program analysis. On the other hand, offline profiling techniques overcome both obstacles but are oblivious to the effects of program inputs.
An approach in the between is to offline find the connections between program inputs and runtime behavior, and then apply the knowledge to runtime optimizations. Although it potentially gets the best of both worlds, it faces a fundamental challenge: How to discover and model the relations between inputs and runtime behavior for general programs.
Input-centric optimization concentrates on the exploration and exploitation of the relation between program inputs and runtime behavior.
It differs from previous program optimizations in that it covers all the factors---code, inputs, runtime environments---that determine program behavior, and makes runtime optimizations both proactive and adaptive to the changes in program inputs and dynamic behaviors [EuroPar'08, LCPC'07, PACT'07]
Whole-program data access patterns and offline data reorganization
Program Locality
Locality or data
reuse determines the effect of caching, which increasingly affects
system performance, cost, and energy usage as new cache designs are
adding more cache levels and allowing dynamic reconfiguration. However,
the high cost of measuring it urges a breakthrough before its practical
uses in performance debugging, locality analysis and optimizations of
long-running applications. My research improves the efficiency by exploring the connection
between time and locality. We propose a statistical model converting cheap time
distance to costly reuse distance. Compared to the state-of-the-art technique,
this approach reduces measuring time by 17 times, and approximates cache line
reuses with over 99% accuracy. This work, for
the first time, reveals the strong correlations between time and locality. It
makes precise locality as easy to obtain as data access frequency, removes the
obstacles to reuse distance's practical uses, and opens new opportunities for
program optimizations [POPL07].
My earlier
research studies the connection between program input and
whole-program locality, represented by the distribution of data reuse
distance in program executions. I have developed a mixture model
with regression analysis to predict whole-program locality and the
cache miss rate of a program for all inputs and cache sizes. The new
approach relaxes an
assumption used in previous work by allowing mixed data reuse
patterns, which makes the analysis sensitive enough to use very
small inputs and consequently reduces the majority of the training
overhead. Secondly through regression analysis it makes locality prediction benefit
from more than two training inputs. Compared to previous methods, the
new locality prediction reduces about half of the prediction error,
removes 95% of space cost, and uses much smaller inputs and faster data
collection [TOPLAS, LACSI'03]. Based on the locality analysis, we
generate a parameterized model of program cache behavior. Given a
cache size and associativity, the model predicts the miss rate for an
arbitrary data input. It also identifies critical data input
sizes where cache behavior exhibits marked changes. Experiments
show this technique is within 2% of the hit rates for set associative
caches on a set of floating-point and integer programs using array- and
pointer-based data structures. (The miss rate error can be larger
especially for low miss rates.) The model enables better understanding
of program cache behavior, helps machine and benchmark design, and
assists reconfigurable memory systems [TC'07].
Memory Reference Affinity
While the memory of
most machines is organized as a hierarchy, program data are laid out in
a uniform address space. Data reorganization is critical for
improving cache spatial locality and thus memory bandwidth. Array
regrouping, for example, combines arrays that are often accessed
together into a structure array so that a single cache load operation
can load the elements from multiple arrays residing on the same cache
line after regrouping. We have proposed reference affinity, the first
trace-based model of hierarchical data locality, and a distance-based
affinity analysis, which finds array and structure field organization
(among an exponential number of choices) that is consistently better
than the layout given by the programmer, compiler, or statistical
clustering [TOPLAS, PLDI'04, LCPC'03]. To make the affinity model
practical for general-purpose compilers, my research proposes a
lightweight affinity model based on static interprocedure analysis. The
prototype has been implemented in the IBM® C/C++ and Fortran
production compiler. Both techniques lead to significant performance
improvements, up to 96% (on average 30%) speedup of a set of SPEC 2000
floating-point benchmarks [ICS'05, PATENT'05].
Behavior phase patterns and run-time program adaptation
While
whole-program models give the average behavior, my research in program
phases goes one step further to model and predict dynamic behavior
changes at run time. Ideally, during its execution, a program should
adapt its data, code, and execution environment (e.g. cache size)
to different run-time behavior. My research explores
profiling-based techniques to identify large-scale behavior patterns,
which we call behavior phases,
and exploits phases to guide
online program adaptation to improve cache performance, better memory
management, and increase parallelism.
Many programs, e.g. scientific simulation programs, have long
continuous phases of execution that have dynamic but predictable
locality. To support phased-based memory adaptation (e.g. reorganizing
data for different phases), I apply signal processing technique,
wavelet transform, to identify phases from billions of data accesses.
Then frequency-based phase marking inserts code markers that mark
phases in all executions of the program. Phase hierarchy construction
identifies the structure of all phases through grammar
compression. With phase markers inserted, the run-time
system uses the first few executions of a phase to predict all its
later executions. I utilize the prediction to guide cache resizing in
reconfigurable systems, where the cache size can be
adjusted for energy and performance, and memory remapping, where data
can be reorganized at phase boundaries. The results demonstrate the
effectiveness of our technique for identifying large, recurring phases
in complex programs [JPDC'07, ASPLOS'04, LCPC'04].
Outside the realm of scientific computing, many programs, such as
programming tools, server applications, databases, and interpreters,
produce (nearly) identical service to a sequence of requests. Those
programs typically use dynamic data and control structures and reveal
different behavior for different requests, which can easily hide those
aspects of behavior that are uniform across inputs and make it
difficult or impossible for current analysis to predict run-time
behavior. We refer to such programs as utilities.
The repetitive
behavior of these programs, while often clear to users, has been
difficult to capture automatically. We present an active
profiling technique, which uses controlled inputs to trigger regular behavior and then recognizes and inserts
common phase markers through profiling runs on normal inputs. Because
users control the selection of regular inputs, active profiling can
also be used to build specialized versions of utility programs for
different purposes, breaking away from the traditional
``one-binary-fits-all'' program model. Experiments with five utilities
from SPEC benchmark suites show that phase behavior is surprisingly
predictable in many (though not all) cases [ExpCS'07]. This predictability
can in turn be used for better memory management as preventive garbage
collection (to invoke garbage collection at phase boundaries which
usually correspond to memory-usage boundaries), memory-usage monitoring
(to better predict memory usage trend through monitoring at phase
boundaries), and memory leak detection (to detect potential memory
leaks by identifying phase-local objects), leading in several cases to
multi-fold improvements in performance [ISMM'06, MSP'05].
Other Research
Dynamic
Data
Partitioning for Parallel Sorting
Many computing
problems benefit from dynamic data partitioning--- a
large amount of data into smaller chunks with better
locality.
When data can be sorted, two methods are commonly used in partitioning.
The first selects pivots, which enable balanced partitioning but cause
a large overhead of up to half of the sorting time. The second method
uses simple functions, which is fast but requires that the input data
confirm to a uniform distribution. In this work, a new
method is proposed to partition data using the cumulative distribution
function. It partitions data of any distribution in linear time,
independent of the number of sublists to be partitioned into.
Experiments show 10-30% improvement in partitioning balance and
20-70% reduction in partitioning overhead. The new method is more
scalable than existing methods. It yields greater benefit when the
data set and the number of sub-lists grow larger. By applying this
method, our sequential sorting beats Quick-sorting by 20% and parallel
sorting exceeds the previous sorting algorithm by 33-50% [ICPP'04].
Scene Image Classification
Semantic scene
classification, categorizing images into semantic
classes such as beaches, sun-sets or parties is a useful endeavor. It
finds application in many areas, including content-based image analysis
and organization (CBIR) and digital photo-finishing. This work
is
based on support vector machine (SVM) using color features [EI'04,
PR'04].
Dialogue System
From 2001 to 2002,
I worked in TRIPS (The Rochester Interactive Planning System) group with
James Allen, George Ferguson et al. The goal of the project is an
intelligent planning assistant that interacts with its human manager
using a combination of natural language and graphical displays (maps,
charts, windows, and the like).
Speech Recognition and Synthesis
From 1999 to 2001,
I worked on speech recognition and
synthesis in National Lab of Pattern
Recognition, Chinese Academy
of Sciences.
Publications
Journal Publications
[TOPLAS] ``Distance-Based
Locality Analysis and Prediction'', Chen Ding, Yutao Zhong, and
Xipeng
Shen, in ACM Transactions on Programming Languages and Systems
(TOPLAS), accepted in May 2007.
[JPDC'07] ``Predicting Locality Phases for Dynamic Memory Optimization'', X. Shen, Y. Zhong, C. Ding, the Journal of Parallel and Distributed Computing, Volume 67, Number 7, July 2007, pages 783-796.
[TC'07] ``Miss
Rate
Prediction across Program Inputs and Cache Configurations'', Yutao
Zhong, Steven G. Dropsho, Xipeng Shen, Ahren Studer, and Chen Ding, in
IEEE Transaction on Computers (TC), Vol. 56,
No. 3, March, 2007.
[PR'04] ``Learning
multi-label scene
classification'', Matthew R. Boutell, Jiebo Luo, Xipeng Shen and
Christopher M. Brown, in Pattern
Recognition, Volume 37, Issue 9, 2004,
pages 1757-1771.
Refereed Conference and Workshop Papers
[PACT'08] ``Analysis and Approximation of Optimal Co-scheduling on CMP'', Yunlian Jiang, Xipeng Shen, Jie Chen, and Rahul Tripathi, the International Conference on Parallel Architecture and Compilation Techniques, Toronto, Canada, October, 2008. [PDF]
[ICPP'08] ``Adaptive Software Speculation for Enhancing the Efficiency of Behavior-Oriented Parallelization'', Yunlian Jiang, and Xipeng Shen, the 37th International Conference on Parallel Processing, Portland, Oregon, September, 2008.[PDF]
[EuroPar'08] ``Exploration of the Influence of Program Inputs on CMP Co-Scheduling'', Yunlian Jiang, and Xipeng Shen, the Euro-Par Conference 2008, Canary Island, Spain, August, 2008. (DOI:http://dx.doi.org/10.1007/978-3-540-85451-7_29) [PDF]
[LCPC'08] ``Scalable Implementation of Efficient Locality Approximation'', Xipeng Shen, and Jonathan Shaw, The 21st International Workshop on Languages and Compilers for Parallel Computing, Edmonton, Canada, July, 2008. [PDF][NSFNGS'08] ``Adaptive Speculation in Behavior-Oriented Parallelization", Yunlian Jiang, and Xipeng Shen, the NSF Next Generation Software Workshop (Colocated with IPDPS'08), Miami, Florida, April, 2008. (invited paper)
[LCPC'07] ``Modeling Relations Between Inputs and Dynamic Behavior for General Programs'', Xipeng Shen, and Feng Mao, in Proceedings of the 20th International Workshop on Languages and Compilers for
Parallel Computing (LCPC 2007), Urbana, IL, USA, October, 2007.
[PDF]
[PACT'07] ``Bridging Inputs and Program Dynamic Behavior'', X. Shen, and F. Mao, in Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques, Brasov, Romania, September, 2007. [PDF] (1-page abstract)
[PLDI'07] ``Behavior-oriented Parallelization'', C. Ding, X. Shen, K. Kelsey, C. Tice, R. Huang, C. Zhang, the Conference on Programming Language Design and Implementation, San Diego, CA, June 2007.[PDF]
[ExpCS'07] ``Analysis of Input-Dependent Program Behavior Using Active Profiling, X. Shen, C. Zhang, C. Ding, M. Scott, S. Dwarkadas, M.
Ogihara, the Workshop on Experimental Computer Science, held at FCRC, San Diego, CA, June 2007. [PDF]
[POPL'07] ``Locality Approximation Using Time, X. Shen, J. Shaw, B. Meeker, C. Ding, the Symposium on Principles of Programming Languages, Nice, France, January 2007.[PDF](7-page short paper)
[ISMM'06] `` Program-level Adaptive
Memory Management'', Chengliang Zhang, Kirk Kelsey, Xipeng Shen, Chen
Ding, Matthew Hertz, and Mitsu Ogihara, in Proceedings of the 2006
International Symposium on Memory Management (ISMM 2006), Ottawa,
Canada, June 2006. [PDF]
[LCPC'05] ``Parallelization
of Utility Programs Based on Behavior Phase Analysis'', Xipeng Shen, and Chen Ding, in Proceedings of the
Eighteenth International Workshop on Languages and Compilers for
Parallel Computing (LCPC 2005), Hawthorne, NY, USA, October 2005.
(short paper) [PDF]
[ICS'05] ``Lightweight
Reference Affinity Analysis'', Xipeng Shen,
Yaoqing Gao, Chen Ding, and Roch Archambault, in Proceedings of the
Ninteenth ACM International Conference on Supercomputing (ICS 2005),
Cambridge, MA, USA, June 2005, pages 131--140.[PDF]
[MSP'05] ``Gated
Memory Control for
Memory Monitoring, Leak Detection
and Garbage Collection'', Chen Ding, Chengliang Zhang, Xipeng Shen,
and
Mitsunori Ogihara, in Proceedings of the Third
Annual ACM SIGPLAN
Workshop on Memory Systems Performance (MSP 2005), Chicago, NY,
USA,
June 2005.[PDF]
[ASPLOS'04] ``Locality
Phase
Prediction'', Xipeng Shen, Yutao Zhong,
and Chen Ding, in Proceedings of the
Eleventh
International Conference
on Architectural Support for Programming Languages and Operating
Systems (ASPLOS XI), Boston, MA, USA, October 2004, pages 165--176.[PDF]
[LCPC'04] ``Phase-Based
Miss Rate
Prediction Across Program Inputs'',
Xipeng Shen, Yutao Zhong, and Chen Ding, in Proceedings of the
Seventeenth International Workshop on Languages and Compilers for
Parallel Computing (LCPC 2004), West Lafayette, Indiana, USA,
September
2004.[PDF]
[PLDI'04] ``Array
Regrouping and
Structure Splitting Using
Whole-Program Reference Affinity'', Yutao Zhong, Maksim Orlovich,
Xipeng Shen, Chen Ding, in Proceedings of ACM
SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2004), Washington DC, USA,
June 2004, pages
255--266.[PDF]
[ICPP'04] ``Adaptive
Data Partition
for Sorting using Probability
Distribution'', Xipeng Shen, and Chen Ding, in Proceedings of the
International Conference on Parallel Processing (ICPP 2004),
Montreal,
Canada, August 2004, pages 250--258. [PDF]
[EI'04] ``Multi-label
Machine Learning and Its Application to Semantic
Scene Classification'', Xipeng Shen, Matthew Boutell, Jiebo Luo,
and
Christopher Brown, In Proceedings of IS&T/SPIE's
Sixteenth Anaual
Symposium on Electronic Imaging: Science and Technology (EI 2004),
San
Jose, California, USA, January 2004, pages 188--199.[PDF]
[LACSI'03] ``Regression-Based
Multi-Model Prediction of Data Reuse
Signature'', Xipeng Shen, Yutao Zhong, and Chen Ding, in
Proceedings of the Fourth
Annual Symposium
of the Los Alamos Computer Science
Institute (LACSI 2003), Sante Fe, New Mexico, USA, October 2003. [PDF ]
[LCPC'03] ``A
Hierarchical Model of Reference Affinity'' ", Yutao Zhong,
Xipeng
Shen, and Chen Ding, in Proceedings of the 16th International
Workshop on Languages and Compilers for
Parallel Computing (LCPC 2003), College
Station, Texas, USA, October 2003.
[ PDF ]
[EuroSpeech'01b] ``The Study Of The
Effect Of Training Set On
Statistical Language Modeling'', Xipeng Shen, and Bo Xu, in Proceedings
of Seventh European Conference on Speech Communication and Technology
(Eurospeech 2001), Aalborg, Denmark, September 2001, pages 721--724.
[EuroSpeech'01a] ``Study and
Auto-Detection of Stress Based on Tonal
Pitch Range in Mandarin'', Xipeng Shen, and Bo Xu, in Proceedings of
Seventh European Conference on Speech Communication and Technology
(Eurospeech 2001), Aalborg, Denmark, September 2001, pages
123--126.
[ISCSLP'00] ``A CART-Based
Hierarchical Stochastic Model for Prosodic
Phrasing in Chinese'', Xipeng Shen, and Bo Xu, in Proceedings of
International Symposium on Chinese Spoken Language Processing 2000
(ISCSLP 2000), Beijing, China, October 2000.
Thesis
``Large Scale Program Behavior Analysis for Adaptation and Parallelization", Xipeng Shen, Ph.D. Thesis, University of Rochester, Rochester, NY, August, 2006.
Patents
[PATENT'07] ``Parallel Programming Using Possible Parallel Regions and its Language Profiling Complier, Run-Time System and
Debugging System", Chen Ding, Xipeng Shen, and Ruke Huang, US Patent pending, 2007.
[PATENT'05] ``Lightweight Reference Affinity Analysis'', Xipeng Shen, Yaoqing Gao, Chen Ding and Roch Archambault, US Patent filed by IBM, March 2005.
Software
Active Profiling
Locality Approximation from Time