November 2011 Newsletter 
Placing you one click away from the best new CAD research!
Plain-text version at http://www.umn.edu/~tcad/newsletter/2011-11.txt 

REGULAR PAPERS

EMBEDDED SYSTEMS

Kim, Y. Park, S. Cho, Y. Chang, N. System-Level Online Power Estimation Using
an On-Chip Bus Performance Monitoring Unit
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046165

Quality power estimation is a basis of efficient power management of electronic
systems. Indirect power measurement, such as power estimation using a CPU
performance monitoring unit (PMU), is widely used for its low cost and area
overheads. However, existing CPU PMUs monitor only core and cache activity,
which significantly limits the accuracy of system-wide power estimation that
includes off-chip memory devices. In this paper,
we propose an on-chip bus (OCB) PMU that directly captures on-chip and off-chip
component activities by snooping the OCB. The OCB PMU stores the activity
information in separate counters, and online software converts counter values
into actual power values with simple first-order linear power models. We also
introduce an optimization algorithm that minimizes the energy model to reduce
the number of counters in the OCB PMU. We compare the accuracy of the power
estimation using the proposed OCB PMU with real hardware measurement and
cycle-accurate system-level power estimation, and demonstrate high estimation
accuracy compared with the CPU PMU-based estimation method.
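
As a rough illustration of the first-order linear model described above, the
sketch below converts per-interval counter values into a power figure. The
event names, per-event energy coefficients, and static power are hypothetical
placeholders, not values from the paper.

```python
def estimate_power_mw(counts, energy_nj_per_event, static_mw, interval_s):
    """First-order linear model: static power plus a weighted sum of event counts.

    All coefficients used with this function are illustrative assumptions.
    """
    dynamic_nj = sum(energy_nj_per_event[name] * n for name, n in counts.items())
    # nJ -> mJ is a factor of 1e-6; mJ per second is mW.
    return static_mw + dynamic_nj * 1e-6 / interval_s


# Hypothetical counter snapshot for a 10 ms sampling interval.
counts = {"dram_read": 2_000_000, "dram_write": 1_000_000}
coeffs = {"dram_read": 10.0, "dram_write": 12.0}  # nJ per event (assumed)
power = estimate_power_mw(counts, coeffs, static_mw=50.0, interval_s=0.01)
print(f"{power:.1f} mW")
```

In a model of this kind, fewer counters means fewer terms in the sum, which is
the quantity the abstract's counter-reduction step trades against accuracy.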

FPGAs AND RECONFIGURABLE COMPUTING

Kim, Y. Lee, J. Shrivastava, A. Yoon, J. W. Cho, D. Paek, Y. High Throughput
Data Mapping for Coarse-Grained Reconfigurable Architectures
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046176

Coarse-grained reconfigurable arrays (CGRAs) are a very promising platform,
providing both up to 10-100 MOps/mW of power efficiency and software
programmability. However, this promise of CGRAs critically hinges on the
effectiveness of application mapping onto CGRA platforms. While previous
solutions have greatly improved the computation speed, they have largely
ignored the impact of the local memory architecture on the achievable power and
performance. This paper motivates the need for memory-aware application mapping
for CGRAs, and proposes an effective solution for application mapping that
considers the effects of various memory architecture parameters including the
number of banks, local memory size, and the communication bandwidth between the
local memory and the external main memory. Further, we propose efficient
methods to handle dependent data on a double-buffering local memory, which is
necessary for recurrent loops. Our proposed solution achieves a 59% reduction
in the energy-delay product, which factors into about 47% and 22% reductions in
energy consumption and runtime, respectively, compared to memory-unaware
mapping for realistic local memory architectures. We also show that our scheme
scales across a range of applications and memory parameters, and the runtime
overhead of handling recurrent loops by our proposed methods can be less than
1%.
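
The quoted figures are mutually consistent: since the energy-delay product is
energy multiplied by runtime, the remaining fractions multiply. A short check,
using only the rounded percentages from the abstract (the function name is
ours):

```python
def edp_reduction(energy_reduction, runtime_reduction):
    """Reduction in energy-delay product given fractional reductions in each factor."""
    remaining = (1.0 - energy_reduction) * (1.0 - runtime_reduction)
    return 1.0 - remaining


# 0.53 * 0.78 = 0.4134 of the original EDP remains.
print(f"{edp_reduction(0.47, 0.22):.1%}")  # prints 58.7%
```

The result rounds to the 59% stated above.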

MODELING AND SIMULATION

Zhao, X. Guo, Y. Chen, X. Feng, Z. Hu, S. Hierarchical Cross-Entropy
Optimization for Fast On-Chip Decap Budgeting
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046169

Decoupling capacitors (decaps) are widely used to reduce dynamic power supply
noise. Traditional decap budgeting algorithms usually rely on sensitivity-based
nonlinear optimization or conjugate gradient (CG) methods,
which can be prohibitively expensive for large-scale decap budgeting problems
and cannot be easily parallelized. In this paper, we propose a hierarchical
cross-entropy based optimization technique which is more efficient and
parallel-friendly.  Cross-entropy (CE) is an advanced optimization framework
which explores the power of rare event probability theory and importance
sampling. To achieve the high efficiency, a sensitivity-guided cross-entropy
(SCE) algorithm is introduced which integrates CE with a partitioning-based
sampling strategy to effectively reduce the solution space in solving the
large-scale decap budgeting problems. Compared to the improved CG method and
the conventional CE method, SCE with Latin hypercube sampling (SCE-LHS) can
provide 2X speedups while achieving up to 25% improvement in power
supply noise. To further improve decap optimization solution quality, SCE with
sequential importance sampling (SCE-SIS) method is also studied and
implemented. Compared to SCE-LHS, in similar runtime, SCE-SIS can lead to 16.8%
further reduction on the total power supply noise.
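
For readers unfamiliar with cross-entropy optimization, the sketch below shows
the generic method on a toy cost function: sample candidates, keep the best
few, and refit the sampling distribution to them. It is not the paper's SCE
algorithm (no sensitivity guidance, partitioning, Latin hypercube, or
sequential importance sampling), and the cost function, target, and parameter
values are made up for illustration.

```python
import random
import statistics


def cross_entropy_minimize(cost, dim, iters=30, samples=100, elite=10, seed=0):
    """Generic cross-entropy minimization with an independent Gaussian per dimension."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [5.0] * dim
    for _ in range(iters):
        # Draw candidates from the current sampling distribution.
        pop = [[rng.gauss(mean[d], std[d]) for d in range(dim)]
               for _ in range(samples)]
        # Keep the lowest-cost candidates (the "elite" set).
        pop.sort(key=cost)
        best = pop[:elite]
        # Refit the distribution to the elite set; the small floor on std
        # keeps the sampler from collapsing to a single point.
        for d in range(dim):
            column = [x[d] for x in best]
            mean[d] = statistics.mean(column)
            std[d] = statistics.pstdev(column) + 1e-6
    return mean


# Toy stand-in for a noise metric: squared distance to a known optimum.
target = [3.0, -2.0]
result = cross_entropy_minimize(
    lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, target)), dim=2)
print([round(v, 3) for v in result])
```

Each candidate's cost is evaluated independently, which is why sampling-based
methods of this kind parallelize more easily than sequential gradient steps.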

PHYSICAL DESIGN

Ding, D. Torres, J. A. Pan, D. Z. High Performance Lithography Hotspot
Detection With Successively Refined Pattern Identifications and Machine
Learning
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046168

Under real and evolving manufacturing conditions, lithography hotspot
detection faces many challenges. First, real hotspots become hard to identify
at early design stages and hard to fix at post-layout stages. Second, false
alarms must be kept low to avoid excessive and expensive post-processing
hotspot removal. Third, full chip physical verification and optimization
require very fast turn-around time. Last but not least, rapid technology
advancement favors generic hotspot detection methodologies to avoid exhaustive
pattern enumeration and excessive development/update as technology evolves. To
address the above issues, we propose a high performance hotspot detection
methodology consisting of: 1) a fast layout analyzer; 2) powerful hotspot
pattern identifiers; and 3) a generic and efficient flow with successive
performance refinements. We implement our algorithms with an industry-strength
engine under real manufacturing conditions and show that our methodology
significantly outperforms state-of-the-art algorithms in false alarms (2.4X to
2300X reduction) and runtime (5X to 237X reduction), while achieving similar or
better hotspot accuracy. Compared with pattern matching, our method achieves
higher prediction accuracy for hotspots that were not previously characterized,
and thus greater detection generality when exhaustive pattern enumeration is too
expensive to perform a priori. Such high performance hotspot detection is
especially suitable for lithography-friendly physical design.

Lee, Y.-J. Lim, S. K. Co-Optimization and Analysis of Signal, Power, and
Thermal Interconnects in 3-D ICs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046179

Heat removal and power delivery have become two major reliability concerns in
3-D integrated circuit (IC) technology. To alleviate the thermal problem, two
possible solutions have been proposed: thermal-through-silicon-vias (T-TSVs)
and microfluidic channel (MFC)-based cooling. For power delivery, a
complex power distribution network is required to deliver currents reliably to
all parts of the 3-D IC while suppressing the power supply noise to an
acceptable level.  However, these thermal and power networks pose major
challenges in signal routability and congestion. This is because signal, power,
and thermal interconnects are all competing for routing space, and the related
TSVs interfere with gates and wires in each die. We present a co-optimization
methodology for signal, power, and thermal interconnects in 3-D ICs based on
design of experiments (DOE) and response surface methodology (RSM). The goal of
our holistic approach is to improve signal, thermal, and power noise metrics
and to provide fast and accurate design space exploration at early design
stages. We also provide an in-depth comparison of T-TSV and MFC-based
cooling methods and discuss how to employ DOE and RSM techniques to co-optimize
the interconnects. Our DOE-based optimization found the optimal design point
with less effort than a gradient search-based optimization.

Chuang, Y.-L. Lee, P.-W. Chang, Y.-W. Voltage-Drop Aware Analytical Placement
by Global Power Spreading for Mixed-Size Circuit Designs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046166

Excessive supply voltage drops in a circuit may lead to significant circuit
performance degradation and even malfunction.  To handle this problem, existing
power delivery aware placement algorithms model voltage drops as an
optimization objective. We observe that directly minimizing the voltage drops
in the objective function might not resolve voltage-drop violations effectively
and might cause problems in power-integrity convergence. To remedy this
deficiency, in this paper, we propose new techniques to incorporate device
power spreading forces into a mixed-size analytical placement framework. Unlike
the state-of-the-art previous work that handles the worst voltage-drop spots
one by one, our approach simultaneously and globally spreads all the blocks
with voltage-drop violations to desired locations directly to minimize the
violations. To apply the power force, we model macro current density and power
rails for our placement framework to derive desired macro/cell locations. To
further improve the solution quality, we propose an efficient mathematical
transformation to adjust the power force direction and magnitude. Experimental
results show that our approach can substantially improve the voltage drops,
wirelength, and runtime over the previous work.

SYSTEM-LEVEL DESIGN

Jain, T. N. K. Ramakrishna, M. Gratz, P. V. Sprintson, A. Choi, G. Asynchronous
Bypass Channels for Multi-Synchronous NoCs: A Router Microarchitecture,
Topology, and Routing Algorithm
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046181

Network-on-chip (NoC) designs have emerged as a replacement for traditional
shared-bus designs for on-chip communication. As with all current very large
scale integration designs, however, reducing power consumption in NoCs is a
critical challenge. One approach to reduce power consumption is to dynamically
scale the voltage and frequency of each network node or groups of nodes (DVFS).
Another approach is to replace the balanced clock tree with a
globally-asynchronous, locally-synchronous (GALS) clocking scheme. In both
DVFS and GALS
designs, the chip as a whole is multi-synchronous. As the NoCs interconnecting
those nodes must communicate across these clock domain boundaries, they tend to
have high latencies as packets must be synchronized at the intermediate nodes.
In this paper, we propose a novel router microarchitecture which offers
superior performance with respect to typical synchronizing router designs for
multi-synchronous networks. Our approach features asynchronous bypass channels
which allow flit traversal of intermediate nodes within the network without the
latching or synchronization overheads of typical designs. We also propose a new
network topology and routing algorithm that leverage the advantages of the
bypass channel offered by our router design. We present a detailed analysis of
design decisions which affect the performance of the asynchronous bypass
channel network. Our experiments show that our design improves the performance
of a conventional synchronizing design with similar resources by up to 26% at
low loads and increases saturation throughput by up to 50% for uniform random
traffic.

Hanumaiah, V. Vrudhula, S. Chatha, K. S. Performance Optimal Online DVFS and
Task Migration Techniques for Thermally Constrained Multi-Core Processors
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046183

Extracting high performance from multi-core processors requires increased use
of thermal management techniques. In contrast to offline thermal management
techniques, online techniques are capable of sensing changes in the workload
distribution and setting the processor controls accordingly. Hence, online
solutions are more accurate and are able to extract higher performance than the
offline techniques. This paper presents performance optimal online thermal
management techniques for multicore processors. The techniques include dynamic
voltage and frequency scaling and task-to-core allocation or task migration.
The problem formulation includes accurate power and thermal models, as well as
leakage dependence on temperature. This paper provides a theoretical basis for
deriving the optimal policies and computationally efficient implementations.
The effectiveness of our DVFS and task-to-core allocation techniques is
demonstrated by numerical simulations. The proposed task-to-core allocation
method showed a 20.2% improvement in performance over a power-based thread
migration approach. The techniques have been incorporated in a thermal-aware
architectural-level simulator called MAGMA that allows for design space
exploration, offline, and online dynamic thermal management. The simulator is
capable of handling simulations of hundreds of cores within reasonable time.

Boland, D. Constantinides, G. A. Bounding Variable Values and Round-Off Effects
Using Handelman Representations
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046164

The precision used in an algorithm affects the error and performance of
individual computations, the memory usage, and the potential parallelism for a
fixed hardware budget. This paper describes a new method to determine the
minimum precision required to meet a given error specification for an algorithm
consisting of the basic algebraic operations. Using this approach, it is
possible to significantly reduce the computational word-length in comparison to
existing methods, and this can lead to superior hardware designs. We
demonstrate the proposed procedure on an iteration of the conjugate gradient
algorithm, achieving proofs of bounds that can translate to global word-length
savings ranging from a few bits to proving the existence of ranges that must
otherwise be assumed to be unbounded when using competing approaches. We also
achieve comparable bounds to recent literature in a small fraction of the
execution time, with greater scalability.

TEST

Noia, B. Chakrabarty, K. Goel, S. K. Marinissen, E. J. Verbree, J.
Test-Architecture Optimization and Test Scheduling for TSV-Based 3-D Stacked
ICs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046180

Through-silicon via (TSV)-based 3-D stacked ICs (SICs) are becoming
increasingly important in the semiconductor industry. In this paper, we address
test architecture optimization for 3-D stacked ICs implemented using TSVs. We
consider two cases, namely 3-D SICs with die-level test architectures that are
either fixed or still need to be designed. We next present mathematical
programming techniques to derive optimal solutions for the architecture
optimization problem for both cases. Experimental results for three handcrafted
3-D SICs comprising various systems-on-a-chip (SoCs) from the ITC'02 SoC
test benchmarks show that compared to the baseline method of sequentially
testing all dies, the proposed solutions can achieve significant reduction in
test length. This is achieved through optimal test schedules enabled by the
test architecture. We also show that increasing the number of test pins
typically provides a greater reduction in test length compared to an increase
in the number of test TSVs. Furthermore, we show that shorter test lengths are
generally achieved with the larger, more complex dies lower in the stack. This
is because test data must pass through every die lower in a stack in order to
reach its target die, and with the larger dies lower in the stack, more test
bandwidth may be provided to these dies using fewer routing resources.

Zhong, S. Khursheed, S. Al-Hashimi, B. M. A Fast and Accurate Process
Variation-Aware Modeling Technique for Resistive Bridge Defects
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046171

Recent research has shown that tests generated without taking process variation
into account may lead to loss of test quality. At present, there is no
efficient device-level modeling technique that models the effect of process
variation on resistive bridge defects. This paper presents a fast and accurate
technique to achieve this, including modeling the effect of voltage and
temperature variation using the BSIM4 transistor model. To reduce
computation time without compromising simulation accuracy (achieved through
BSIM4), two efficient voltage approximation algorithms are proposed for
calculating logic threshold of driven gates and voltages on bridged lines of a
fault-site to calculate bridge critical resistance. Experiments are conducted
on a 65 nm gate library (for illustration purposes), and results show that on
average the proposed modeling technique is more than 53 times faster and in the
worst case, error in bridge critical resistance is 2.64% when compared with
HSPICE.

Hou, C.-S. Li, J.-F. Tseng, T.-W. Memory Built-in Self-Repair Planning
Framework for RAMs in SoCs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046177

Built-in self-repair (BISR) techniques are widely used to enhance the yield of
random access memories (RAMs) in a system-on-chip (SoC) which typically
consists of hundreds of RAMs. Hence, many BISR circuits may be needed in such an
SoC. Effective techniques for planning these BISR circuits thus are imperative.
In this paper, we propose a memory BISR planning (MBiP) framework for the RAMs
in SoCs. The MBiP framework consists of a memory grouping algorithm for
selecting RAMs which can share a BISR circuit. Then, a test scheduling
algorithm is used to determine the test sequence of RAMs in a SoC under the
constraint of test power. Finally, a BISR scheme allocation algorithm is
proposed to allocate different BISR schemes for the RAMs under the constraints
of the results of memory grouping and test scheduling.  Simulation results show
that the proposed MBiP can effectively plan the BISR schemes for the RAMs in a
SoC. For example, about 22% area reduction can be achieved by the BISR schemes
planned by the proposed MBiP framework for 50 RAMs under 1.5 mm distance
constraint and 350 mW test power constraint in comparison with a dedicated BISR
scheme (i.e., each RAM has a self-contained BISR circuit).

Sinanoglu, O. Almukhaizim, S. Unified 2-D X-Alignment for Improving the
Observability of Response Compactors
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046182

Despite the advantages of performing response compaction in integrated-circuit
testing, unknown response bits (x's) inevitably translate into a loss of test
quality. The distribution of these x's within the captured response, which
varies for each test pattern, directly impacts the number of scan cells
observed through the response compactor. In this paper, we propose a unified
2-D x-alignment technique in order to judiciously manipulate the distribution
of x's in the test response prior to its compaction. The controlled response
manipulation is performed on a per pattern basis, in the form of scan chain
delay and intra-slice rotate operations, and with the objective that x's are
aligned within as few scan slices and chains as possible.  Consequently, a
larger number of scan cells are observed after compaction for any test pattern.
In an effort to tackle the unified 2-D x-alignment problem and to achieve
maximum overall observability, we first decipher the interaction between 1-D
x-alignment operations, and formulate 1-D and 2-D x-alignment operations all as
maximum satisfiability (MAX-SAT) problems; a weighted MAX-SAT formulation is
necessitated in the 2-D case to identify the best possible 2-D x-alignment,
which may differ from back-to-back application of the individual best possible
1-D alignments in two dimensions. The proposed technique is test set
independent, leading to a generic, simple, and cost-effective hardware
implementation.  While we show in this paper that x-alignment improves
horizontal and vertical compactors, covering a wide spectrum of compactors, it
is expected to improve other types of compactors as well by manipulating the
x-distribution properly.

SHORT PAPERS

Foreman, E. A. Habitz, P. A. Cheng, M.-C. Tamon, C. Inclusion of
Chemical-Mechanical Polishing Variation in Statistical Static Timing Analysis
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046170

Technology trends show the importance of modeling process variation in static
timing analysis. With the advent of statistical static timing analysis (SSTA),
multiple independent sources of variation can be modeled. This paper proposes a
methodology for modeling metal interconnect process variation in SSTA. The
developed methodology is applied in this study to investigate metal variation
in SSTA resulting from chemical-mechanical polishing (CMP). Using our
statistical methodology, we show that CMP variation has a smaller impact on
chip performance as compared to other factors impacting metal process
variation.

Chen, Z. Chakrabarty, K. Xiang, D. MVP: Minimum-Violations Partitioning for
Reducing Capture Power in At-Speed Delay-Fault Testing
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046167

Scan shift power can be reduced by activating only a subset of scan cells in
each shift cycle. In contrast to shift power reduction, the use of only a
subset of scan cells to capture responses in a cycle may cause capture
violations, thereby leading to fault coverage loss. In order to restore the
original fault coverage, new test patterns must be generated, leading to higher
test-data volume. In this paper, we propose minimum-violations partitioning, a
scan-cell clustering method that can support multiple capture cycles in delay
testing without increasing test-data volume. This method is based on an integer
linear programming model and it can cluster the scan flip-flops into balanced
parts with minimum capture violations. Based on this approach, hierarchical
partitioning is proposed to make the partitioning method routing-aware.
Experimental results on ISCAS'89 and IWLS'05 benchmark circuits demonstrate the
effectiveness of our method.

Liao, K.-Y. Chang, C.-Y. Li, J. C.-M. A Parallel Test Pattern Generation
Algorithm to Meet Multiple Quality Objectives
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046178

This paper proposes a bit-level parallel ATPG algorithm (SWK) that generates
multiple test patterns at a time. This algorithm converts decisions into
bitwise logic operations so that W (CPU word size) test patterns are searched
independently. Multiple objectives for different quality metrics can therefore
be achieved in a single test generation process. Experimental results on
ISCAS'89 and IWLS'05 benchmark circuits show that SWK test sets are better in
many quality metrics than traditional 50-detect test sets, while the length of
the former is shorter. Also, patterns selected from a large N-detect pattern
pool cannot achieve the same or higher quality than patterns generated by SWK.
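
The bit-level parallelism mentioned above rests on a simple idea: pack one
pattern per bit of a machine word, so that a single bitwise operation
evaluates a gate for all W patterns at once. The sketch below illustrates only
that idea on a made-up three-input circuit, with W = 8 for readability; it is
not the SWK algorithm, which applies the same packing to ATPG decisions.

```python
def simulate_word(a, b, c, width):
    """Evaluate y = (a AND b) OR (NOT c) for `width` patterns in one pass.

    Bit i of each argument holds that input's value in pattern i.
    """
    mask = (1 << width) - 1
    # Mask after NOT because Python integers are unbounded.
    return ((a & b) | (~c & mask)) & mask


def simulate_one(a, b, c):
    """Scalar reference: the same circuit for a single pattern (inputs are 0 or 1)."""
    return (a & b) | (1 - c)


# All eight input combinations packed into three 8-bit words.
A, B, C = 0b11110000, 0b11001100, 0b10101010
Y = simulate_word(A, B, C, width=8)
print(f"{Y:08b}")  # prints 11010101
```

Each bit of Y matches what simulate_one returns for the corresponding pattern,
but the packed version needs three bitwise operations instead of eight calls.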