April 2012 Newsletter
Placing you one click away from the best new CAD research!

REGULAR PAPERS

KEYNOTE PAPER

Zhang, J., Lin, A., Patil, N., Wei, H., Wei, L., Wong, H.-P., and Mitra, S.
Carbon Nanotube Robust Digital VLSI
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171056

Carbon nanotube field-effect transistors (CNFETs) are excellent candidates for building highly energy-efficient electronic systems of the future. However, fundamental limitations inherent to carbon nanotubes (CNTs) pose major obstacles to the realization of robust CNFET digital very large-scale integration (VLSI): 1) it is nearly impossible to guarantee perfect alignment and positioning of all CNTs, despite the near-perfect CNT alignment achieved in recent years; 2) CNTs can be metallic or semiconducting depending on chirality; and 3) CNFET circuits can suffer from large performance variations, reduced yield, and increased susceptibility to noise. Today's CNT process improvements alone are inadequate to overcome these challenges. This paper presents an overview of: 1) imperfections and variations inherent to CNTs; 2) design and processing techniques, together with a probabilistic analysis framework, for robust CNFET digital VLSI circuits immune to inherent CNT imperfections and variations; and 3) a recent experimental demonstration of CNFET digital circuits that are immune to CNT imperfections. Significant advances in design tools can enable robust and scalable CNFET circuits that overcome the challenges of CNFET technology while retaining its energy-efficiency benefits.

MODELING AND SIMULATION

Heloue, K. R., Onaissi, S., and Najm, F. N.
Efficient Block-Based Parameterized Timing Analysis Covering All Potentially Critical Paths
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171052

For the results of timing analysis to be useful, they must provide insight and guidance on how the circuit may be improved so as to fix any reported timing problems. A limitation of many recent variability-aware timing analysis techniques is that, while they report delay distributions or verify multiple corners, they do not provide the guidance required for redesign. We propose an efficient block-based parameterized timing analysis technique that can accurately capture circuit delay at every point in the parameter space by reporting all paths that can become critical. Using an efficient pruning algorithm, only the potentially critical paths are carried forward, while all other paths are discarded during propagation. This allows one to examine local robustness to parameters in different regions of the parameter space, not by considering differential sensitivity at a point (which would be of little use in this context) but through knowledge of the paths that can become critical at nearby points in the parameter space. We give a formal definition of this problem and propose a technique for solving it that improves on the state of the art, both in theoretical computational complexity and in runtime on various test circuits.
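
For readers who want to experiment with the pruning idea, here is a minimal Python sketch. It assumes each path delay is an affine model d(p) = a0 + sum_i a_i*p_i over a parameter box p_i in [-1, 1], and discards a path whenever some other path dominates it at every point of the box. The affine model and the all-pairs dominance test are illustrative assumptions, not the authors' exact algorithm.

    # Each path is a tuple (a0, a1, ..., ak): nominal delay plus one
    # sensitivity per parameter, with parameters ranging over [-1, 1].
    def dominates(b, a):
        """True if path b's delay >= path a's delay everywhere in the box."""
        # max over the box of d_a(p) - d_b(p) equals the constant-term
        # difference plus the sum of absolute sensitivity differences.
        worst = (a[0] - b[0]) + sum(abs(x - y) for x, y in zip(a[1:], b[1:]))
        return worst <= 0

    def prune(paths):
        """Keep only paths that can still be critical somewhere in the box."""
        return [a for a in paths
                if not any(dominates(b, a) for b in paths if b is not a)]

    # Three paths over two parameters: the second can never be critical.
    paths = [(10.0, 2.0, 0.5), (9.0, 0.1, 0.1), (11.0, -0.5, 0.2)]
    print(prune(paths))   # -> [(10.0, 2.0, 0.5), (11.0, -0.5, 0.2)]
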
Chung, J. and Abraham, J. A.
Refactoring of Timing Graphs and Its Use in Capturing Topological Correlation in SSTA
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171045

Reconvergent paths in circuits have long been a nuisance in various computer-aided design (CAD) algorithms, and no elegant solution for dealing with them has yet been found. In statistical static timing analysis (SSTA), they make it difficult to capture topological correlation. This paper presents a technique for arbitrary block-based SSTA that reduces the error caused by ignoring topological correlation. We interpret a timing graph as an algebraic expression made up of addition and maximum operators. We define a division operation on the expression and propose algorithms that modify factors in the expression without expansion. As a result, the algorithms produce an expression that derives the latest arrival time with better accuracy in SSTA. Existing techniques for handling reconvergent fanouts usually rely on dependency lists, which require quadratic space; the proposed technique instead achieves linear space complexity by using a new directed acyclic graph search algorithm. Our results show that it outperforms an existing technique in speed and memory usage with comparable accuracy. More importantly, the proposed technique is not limited to SSTA and is potentially applicable to various problems caused by reconvergent paths in timing-related CAD algorithms.

Chung, J., Xiong, J., Zolotov, V., and Abraham, J. A.
Path Criticality Computation in Parameterized Statistical Timing Analysis Using a Novel Operator
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171046

This paper presents a method to compute criticality probabilities of paths in parameterized statistical static timing analysis. We partition the set of all paths into several groups and formulate path criticality as a joint probability of inequalities. Before evaluating the joint probability directly, we simplify the inequalities through algebraic elimination, handling topological correlation. Our proposed method uses conditional probabilities to obtain the joint probability, and the statistics of the random variables representing process parameters are updated to take the conditions into account. To calculate the conditional statistics of the random variables, we derive analytic formulas by extending Clark's work. This allows us to obtain the conditional probability density function of a path delay, given that the path is critical, as well as to compute criticality probabilities of paths. Our experimental results show that the proposed method provides 4.2X better accuracy on average compared to the state-of-the-art method.
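
Both papers above manipulate max operations over correlated Gaussian arrival times. As background, the following minimal sketch implements Clark's classical moment matching for max(X, Y), which the criticality paper extends to conditional statistics; the numeric example is arbitrary. Note that cdf(alpha) below is also the tightness probability P(X >= Y), the simplest two-path instance of a criticality probability.

    from math import sqrt, exp, pi, erf

    def pdf(x):                       # standard normal density
        return exp(-0.5 * x * x) / sqrt(2.0 * pi)

    def cdf(x):                       # standard normal distribution
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def clark_max(mu1, s1, mu2, s2, rho):
        """First two moments of max(X, Y) for X~N(mu1,s1^2), Y~N(mu2,s2^2)
        with correlation rho (Clark, 1961)."""
        theta = sqrt(max(s1*s1 + s2*s2 - 2.0*rho*s1*s2, 1e-12))
        alpha = (mu1 - mu2) / theta   # cdf(alpha) = P(X >= Y), "tightness"
        m1 = mu1*cdf(alpha) + mu2*cdf(-alpha) + theta*pdf(alpha)
        m2 = ((mu1*mu1 + s1*s1)*cdf(alpha) + (mu2*mu2 + s2*s2)*cdf(-alpha)
              + (mu1 + mu2)*theta*pdf(alpha))
        return m1, sqrt(max(m2 - m1*m1, 0.0))   # mean and std of the max

    print(clark_max(10.0, 1.0, 9.5, 1.5, 0.3))
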
Ibrahim, W., Beiu, V., and Beg, A.
GREDA: A Fast and More Accurate Gate Reliability EDA Tool
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171048

Generic as well as customized reliability electronic design automation (EDA) tools have been proposed in the literature and used to estimate the reliability of both present and future (nano)circuits. However, the accuracy of many of these EDA tools is questionable, as they either: 1) assume that all gates have the same constant probability of failure (PF_GATE = const.); or 2) use very simple approaches to estimate the reliability of the elementary gates. In this paper, we introduce a gate reliability EDA tool (GREDA) that estimates the reliability of CMOS gates more accurately by considering: 1) the gate's topology; 2) the variable probability of failure of the individual devices, PF_DEV; 3) the applied input vector; 4) the reliability of the input signals; and 5) input voltage variations (which can be linked to the allowed noise margins). GREDA can be used to calculate PF_GATE due to different types of faults and/or defects, and to estimate the effect of improving PF_DEV on PF_GATE. Simulation results show that GREDA improves the accuracy of reliability calculations at the gate level.

Kazmierski, T. J., Wang, L., Al-Hashimi, B. M., and Merrett, G. V.
An Explicit Linearized State-Space Technique for Accelerated Simulation of Electromagnetic Vibration Energy Harvesters
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171049

Vibration energy harvesting systems pose significant modeling and design challenges due to their mixed-technology nature, the extremely low levels of available energy, and the disparate time scales between different parts of a complete harvester. An energy harvester is a complex system of tightly coupled components modeled in the mechanical, magnetic, and electrical analog and digital domains. Currently available design tools are inadequate for simulating such systems due to prohibitive CPU times. This paper proposes a new technique that accelerates simulations of complete vibration energy harvesters by approximately two orders of magnitude. The idea is to linearize the state equations of the system's analog components to obtain a fast estimate of the maximum step size that guarantees the numerical stability of explicit integration based on the Adams-Bashforth formula. We show that the energy harvester's analog electronics can be efficiently and reliably simulated in this way, with CPU times two orders of magnitude lower than those obtained from two state-of-the-art tools, VHDL-AMS and SystemC-A. As a case study, a practical, complex microgenerator with magnetic tuning and two types of power-processing circuits has been simulated using the proposed technique and verified experimentally.
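
The core trick can be sketched in a few lines under simplifying assumptions: numerically linearize the state equations, then bound the step size by the eigenvalues of the resulting Jacobian. The toy system, the finite-difference Jacobian, and the stability test h*|lambda| <= 1 (exact for second-order Adams-Bashforth only on the negative real axis) are illustrative choices, not the paper's exact procedure.

    import numpy as np

    def jacobian(f, x0, eps=1e-6):
        """Finite-difference Jacobian of f at x0."""
        f0, n = f(x0), len(x0)
        J = np.empty((n, n))
        for j in range(n):
            dx = np.zeros(n); dx[j] = eps
            J[:, j] = (f(x0 + dx) - f0) / eps
        return J

    def max_stable_step(f, x0, safety=0.8):
        """Largest step passing the AB2 real-axis test h*|lambda| <= 1."""
        lam = np.linalg.eigvals(jacobian(f, x0))
        return safety / np.abs(lam).max()

    def ab2(f, x, h, steps):
        """Fixed-step AB2: x_{n+1} = x_n + h*(1.5*f_n - 0.5*f_{n-1})."""
        fprev = f(x)
        x = x + h * fprev                  # bootstrap with one Euler step
        for _ in range(steps - 1):
            fn = f(x)
            x, fprev = x + h * (1.5 * fn - 0.5 * fprev), fn
        return x

    # Toy damped resonator standing in for a linearized analog block.
    A = np.array([[0.0, 1.0], [-1.0e6, -2.0e3]])
    f = lambda x: A @ x
    x0 = np.array([1.0, 0.0])
    h = max_stable_step(f, x0)
    print(h, ab2(f, x0, h, 1000))          # stays bounded and decays
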
Wang, Y., Zhang, Z., Koh, C.-K., Shi, G., Pang, G. K. H., and Wong, N.
Passivity Enforcement for Descriptor Systems Via Matrix Pencil Perturbation
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171054

Passivity is an important property of circuits and systems for guaranteeing stable global simulation. Nonetheless, nonpassive models may result from passive underlying structures due to numerical or measurement error/inaccuracy. A postprocessing passivity enforcement algorithm is therefore desirable to perturb the model into passivity under a controlled error. However, the previous literature reports such passivity enforcement algorithms only for pole-residue models and regular systems (RSs). In this paper, passivity enforcement algorithms are proposed for descriptor systems (DSs, a superset of RSs) with a possibly singular direct term (specifically, D + D^T or I - DD^T). The proposed algorithms cover all kinds of state-space models (RSs or DSs, with singular or nonsingular direct terms, in the immittance or scattering representation) and thus have a much wider application scope than existing algorithms. The passivity enforcement is reduced to two standard optimization problems that can be solved efficiently. The objective functions in both optimization problems are the error functions, so perturbed models with adequate accuracy can be obtained. Numerical examples verify the efficiency and robustness of the proposed algorithms.

SYSTEM-LEVEL DESIGN

Cho, H., Leem, L., and Mitra, S.
ERSA: Error Resilient System Architecture for Probabilistic Applications
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171059

There is growing concern about the increasing vulnerability of future computing systems to errors in the underlying hardware. Traditional redundancy techniques are too expensive for designing energy-efficient systems that are resilient to high error rates. We present the Error Resilient System Architecture (ERSA), a robust system architecture that targets emerging killer applications such as recognition, mining, and synthesis (RMS), which have inherent error resilience, and ensures high degrees of resilience at low cost. Using the concept of configurable reliability, ERSA may also be adapted for general-purpose applications that are less resilient to errors (though at higher cost). While the resilience of RMS applications to errors in low-order bits of data is well known, execution of such applications on error-prone hardware significantly degrades output quality (due to high-order bit errors and crashes). ERSA achieves high resilience to high-order bit errors and control-flow errors (in addition to low-order bit errors) using a judicious combination of three key ideas: 1) asymmetric reliability in many-core architectures; 2) error-resilient algorithms at the core of probabilistic applications; and 3) intelligent software optimizations. Error injection experiments on a multicore ERSA hardware prototype demonstrate that, even at very high error rates of 20 errors/flip-flop/10^8 cycles (equivalent to 25000 errors/core/s), ERSA maintains 90% or better accuracy of output results, with minimal impact on execution time, for probabilistic applications such as K-Means clustering, LDPC decoding, and Bayesian network inference. In addition, we demonstrate the effectiveness of ERSA in tolerating high rates of static memory errors that are characteristic of emerging challenges related to SRAM V_ccmin problems and erratic bit errors.

Park, H., Yoo, S., and Lee, S.
A Multistep Tag Comparison Method for a Low-Power L2 Cache
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171055

Tag comparison in a highly associative cache consumes a significant portion of the cache energy. Existing methods for reducing tag comparisons are based on predicting either cache hits or cache misses. In this paper, we present novel ideas for both cache hit and cache miss prediction: a partial-tag-enhanced Bloom filter that improves the accuracy of cache miss prediction, and hot/cold checks that track data liveness to reduce the tag comparisons of cache hit prediction. We also combine both methods so that their order of application can be dynamically adjusted to adapt to changing cache access behavior, which further reduces tag comparisons. To overcome the common limitation of multistep tag comparison methods, we propose a method that reduces tag comparisons while meeting a given performance bound. Experimental results show that the proposed method reduces the energy consumption of tag comparison by an average of 88.40%, which translates to an average reduction of 35.34% (40.19% with low-power data access) in the total energy consumption of the L2 cache, and a further reduction of 8.86% (10.07% with low-power data access) compared with existing methods.
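
The miss-prediction half of the scheme lends itself to a small sketch: a counting Bloom filter indexed by partial tags can declare some accesses definite misses, so the expensive full tag comparison is skipped. Filter sizing, the hash functions, and the partial-tag width below are placeholder choices, and the hot/cold liveness checks are not modeled.

    class PartialTagBloom:
        """Counting Bloom filter over partial tags (illustrative sizes)."""
        def __init__(self, slots=1024, hashes=2, partial_bits=8):
            self.count = [0] * slots
            self.hashes = hashes
            self.mask = (1 << partial_bits) - 1

        def _slots(self, tag):
            p = tag & self.mask              # keep only the partial tag
            return [hash((i, p)) % len(self.count)
                    for i in range(self.hashes)]

        def insert(self, tag):               # on cache line fill
            for s in self._slots(tag):
                self.count[s] += 1

        def remove(self, tag):               # on eviction (hence counting)
            for s in self._slots(tag):
                self.count[s] -= 1

        def maybe_present(self, tag):
            """False -> definite miss: skip the tag comparison entirely.
            True  -> possible hit: fall back to full tag comparison."""
            return all(self.count[s] > 0 for s in self._slots(tag))

    bf = PartialTagBloom()
    bf.insert(0xABCD12)
    print(bf.maybe_present(0xABCD12))   # True: do the full comparison
    print(bf.maybe_present(0x123456))   # almost surely False: skip it
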
Passas, G., Katevenis, M., and Pnevmatikatos, D.
Crossbar NoCs Are Scalable Beyond 100 Nodes
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171053

We describe the design and layout of a radix-128 crossbar in 90 nm CMOS. The data path is 32 bits wide and runs at 750 MHz using a three-stage pipeline, while fitting in a silicon area as small as 6.6 mm^2 by filling it at the 90% level. The control path occupies 7 mm^2 next to the data path, filled at the 35% level, and reconfigures the data path once every three clock cycles. Next, we arrange 128 1 mm^2 "user tiles" around the crossbar, forming a 150 mm^2 die, and connect all tiles to the crossbar via global links running on top of the tiles. Including the overhead of repeaters and flip-flops on the global links, the area cost of the crossbar is 11% of the die. Thus, we show that crossbar networks-on-chip (NoCs) are small enough for radices far exceeding the few tens of ports previously believed to be the practical limit, reaching above 100 ports. We also attempt a first-order comparison between our crossbar and a model of a popular mesh NoC, and find that our crossbar NoC increases performance when traffic is global and stressed, at the cost of worse performance when traffic is local and benign. Finally, we present an experimental cost analysis showing that crossbar area practically grows as O(N^2 W), since all wiring of the crossbar fits over its standard cells, while crossbar delay grows as O(N sqrt(W)), since wire length increases with the perimeter of the crossbar.

Chouhan, S., Balakrishnan, M., and Bose, R.
System-Level Design Space Exploration Methodology for Energy-Efficient Sensor Node Configurations: An Experimental Validation
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171044

For sensor nodes deployed at small distances, computation energy along with radio energy determines battery life. We have previously proposed a system-level design space exploration methodology for finding an energy-efficient error-correcting code (ECC); that methodology takes the computation energy and the radio energy into account in an integrated manner. In this paper, we validate the methodology by deploying Imote2 nodes and measuring energy values under different operating modes, e.g., with and without ECC. In the process, we propose a validation framework and a node energy model. Experimental results validate the methodology and show that, with ECC, we can save up to 14% of transmitter energy under a certain set of conditions. The main contribution of this paper is to establish experimentally that our methodology is effective in exploring various node configurations and finding an energy-efficient solution.
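
The tradeoff being explored can be captured in a few lines. The sketch below uses a deliberately simplified energy model and entirely hypothetical numbers (the coding gains, code rates, and per-bit energies are made up for illustration): an ECC pays off only when the transmit-energy saving from its coding gain outweighs the extra redundant bits and the CPU energy spent encoding and decoding.

    def energy_per_info_bit(e_radio, gain_db, rate, e_cpu):
        """Hypothetical node energy model, per information bit.
        e_radio -- radio energy per channel bit at the uncoded power level
        gain_db -- transmit-power reduction allowed by the code's gain
        rate    -- code rate (information bits / channel bits)
        e_cpu   -- encode + decode CPU energy per information bit"""
        radio = e_radio * 10.0 ** (-gain_db / 10.0) / rate
        return radio + e_cpu

    # (name, coding gain dB, code rate, CPU nJ/bit) -- all made-up values.
    configs = [("no ECC",    0.0, 1.0,   0.0),
               ("Hamming",   2.0, 4/7,   5.0),
               ("conv code", 4.0, 1/2,  30.0)]
    e_radio = 500.0                    # nJ per channel bit, hypothetical
    for name, gain, rate, cpu in configs:
        print(f"{name:9s} {energy_per_info_bit(e_radio, gain, rate, cpu):7.1f} nJ/bit")

With these placeholder numbers the rate penalty makes the weak code a net loss while the stronger code wins; shrink e_radio (shorter transmission distance) and even that advantage disappears, which is exactly the kind of crossover such a design space exploration has to locate.
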
TEST

Hong, H.-C.
A Static Linear Behavior Analog Fault Model for Switched-Capacitor Circuits
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171047

This paper proposes a static linear behavior (SLB) analog fault model for switched-capacitor (SC) circuits. The SC circuit under test (CUT) is divided into functional macros, including the operational amplifiers, the capacitors, and the switches. Each macro has specified design parameters from a design perspective; these design parameters constitute a parameter set that determines the practical transfer function of the CUT. The SLB fault model defines a CUT as faulty if its parameter set results in transfer functions whose frequency responses fall outside the design specification. We analyze the fault effects of the macros and derive their faulty signal-flow graph models, from which the faulty transfer function templates of the CUT can be automatically generated. Based on the templates, we propose a test procedure that estimates all the parameters in the parameter set, so as to test the CUT for multiple faults. Unlike the conventional single-fault assumption, the proposed SLB fault model covers concurrent multiple parametric faults as well as catastrophic faults. In addition, it does not require fault simulations before test, as conventional analog fault models do; as a result, it addresses the problem of impractically long fault simulation times. A fully differential low-pass SC biquad filter is adopted as an example to demonstrate how to design and use efficient multitone tests to test for the parameter set. The multitone test results acquired during the test procedure also reveal the distortion and noise performance of the CUT, even though the SLB fault model does not include them.

Lee, D. and Roy, K.
Viterbi-Based Efficient Test Data Compression
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171050

This paper presents a Viterbi-based test compression algorithm and architecture that provides high encoding efficiency and scalability with respect to the number of test channels. The proposed scheme finds a set of compressed test vectors using the Viterbi algorithm instead of solving linear equations. By assigning a cost function to the branch metric of the Viterbi algorithm, an optimal compressed vector is selected from the set of possible solutions. This feature gives high flexibility to combine various test requirements, such as low-power compression and/or an improved capability to repeat test patterns. The proposed on-chip decompressor follows the structure of Viterbi encoders, which require only one input channel. Experimental results on test volume show improvement on all ISCAS89 benchmark circuits (a 19.32% reduction on average) compared to previous test data compression architectures. The proposed scheme also yields an efficient power-dissipation/volume tradeoff.
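
To give a feel for the search (not the authors' architecture), here is a toy: the decompressor is modeled as a single-input shift register whose XOR taps emit one scan bit per cycle, and a Viterbi-style dynamic program finds, among all input sequences whose outputs match the care bits of a test cube, one minimizing a branch-metric cost (here simply the weight of the compressed stream, standing in for a low-power metric). The register length and taps are arbitrary, and a real decompressor drives many scan chains per input bit, which is where the actual compression comes from.

    K, TAPS = 3, (0, 2)      # register length; output = XOR of these bits
    MASK = (1 << K) - 1

    def viterbi_compress(cube):
        """cube: string over '0', '1', 'x'. Returns a minimum-cost input
        bit list whose decompressor output matches every care bit."""
        survivors = {0: (0, [])}             # state -> (cost, inputs)
        for want in cube:
            nxt = {}
            for state, (cost, bits) in survivors.items():
                for b in (0, 1):             # try both branch inputs
                    s = ((state << 1) | b) & MASK
                    out = 0
                    for t in TAPS:
                        out ^= (s >> t) & 1
                    if want != "x" and out != int(want):
                        continue             # branch violates a care bit
                    c = cost + b             # branch metric: stream weight
                    if s not in nxt or c < nxt[s][0]:
                        nxt[s] = (c, bits + [b])
            survivors = nxt
            if not survivors:
                return None                  # cube not encodable
        return min(survivors.values())[1]

    print(viterbi_compress("1x0x10xx"))
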
Lu, S.-K., Wang, Z.-Y., Tsai, Y.-M., and Chen, J.-L.
Efficient Built-In Self-Repair Techniques for Multiple Repairable Embedded RAMs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6170996

In this paper, efficient built-in self-repair (BISR) techniques are proposed for multiple repairable memory cores with different sizes and divided redundancy mechanisms. Embedded memory cores are first partitioned into memory groups. For each memory group, a redundant memory module is added and divided into row blocks and column blocks. Moreover, the memory cores within a memory group are partitioned into divided arrays (consisting of row/column blocks) of the same size. The redundant memory can be shared among all memory cores within the same memory group; therefore, unlike traditional redundancy architectures, a row (column) block is used as the basic replacement element. Based on the proposed redundancy architecture, a heuristic heterogeneous extended spare pivoting redundancy analysis algorithm suitable for built-in implementation is also proposed. Experimental results show that the repair rate and manufacturing yield can be improved significantly due to the efficient usage of redundancy. Moreover, the area overhead of the BISR circuitry for an example memory group consisting of four memory instances of size 9.25 Mbits is only 1.12%.

Fang, H., Chakrabarty, K., Wang, Z., and Gu, X.
Reproduction and Detection of Board-Level Functional Failure
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171058

No trouble found (NTF) due to functional failures is a common scenario today in board-level testing at system companies. A component on a board fails during the board-level functional test, but passes the automatic test equipment (ATE) test when it is returned to the supplier for warranty replacement or service repair. To find the root cause of NTF, we propose an innovative functional test approach and design-for-testability (DFT) methods for the detection of board-level functional failures. These DFT and test methods allow us to reproduce and detect functional failures in a controlled, deterministic environment, and to provide ATE tests to the supplier for early screening of defective parts. Experiments on an industrial design show that the proposed functional scan test with appropriate functional constraints can adequately mimic the functional state space, as measured by appropriate coverage metrics. Experiments also show that most functional failures due to dominant bridging faults, crosstalk, and delay faults caused by power supply noise can be reproduced and detected by the functional scan test. We also describe two approaches that enhance the mimicking of the functional state space. The first selects a given number of initial states in linear time; functional scan tests derived from the selected states are then used to mimic the functional state space. The second is based on controlled state injection.

SHORT PAPERS

Lee, L.-J., Tseng, W.-D., Lin, R.-B., and Chang, C.-H.
2^n Pattern Run-Length for Test Data Compression
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171051

This paper presents a new pattern run-length compression method whose decompressor is simple and easy to implement. It encodes 2^n runs of compatible or inversely compatible patterns, either inside a single test data segment or across multiple test data segments. Experimental results show that it achieves an average compression ratio of 67.64% and considerable savings in test application time.
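
The encoder side of the idea fits in a short sketch. It greedily merges consecutive equal-width test segments into one codeword whenever each segment is compatible with the reference pattern (bitwise equal wherever both specify a value, with 'x' matching anything) or inversely compatible (matching the complement under the same rule), capping run lengths at 2^n. The codeword layout and the greedy policy are illustrative simplifications, not the paper's exact encoding.

    def compatible(ref, seg, invert=False):
        """Segments match wherever both specify a value ('x' is free)."""
        return all(r == "x" or s == "x" or (r != s) == invert
                   for r, s in zip(ref, seg))

    def encode(segments, n=3):
        """Greedy 2^n pattern run-length encoding of equal-width segments.
        Emits (run_length, inverted?, reference_pattern) codewords."""
        out, i, cap = [], 0, 2 ** n
        while i < len(segments):
            ref, best_run, best_inv = segments[i], 1, False
            for inv in (False, True):        # try plain and inverted runs
                run = 1
                while (i + run < len(segments) and run < cap
                       and compatible(ref, segments[i + run], inv)):
                    run += 1
                if run > best_run:
                    best_run, best_inv = run, inv
            out.append((best_run, best_inv, ref))
            i += best_run
        return out

    segs = ["10x1", "1001", "0110", "0xx0", "1111"]
    print(encode(segs))   # -> [(2, False, '10x1'), (2, False, '0110'),
                          #     (1, False, '1111')]
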