TCAD Newsletter - June 2011 Issue

Placing you one click away from the best new CAD research!


ANNOUNCING THE 2011 DONALD O. PEDERSON BEST PAPER AWARD...  

The Donald O. Pederson Award recognizes the best paper published in the IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems in the
two calendar years preceding the award. Based on an open nomination process,
the TCAD editorial board voted to select a paper by Amith Singhee and Rob
Rutenbar as this year's winner:

Statistical Blockade: Very Fast Statistical Simulation and Modeling of Rare
Circuit Events and Its Application to Memory Design

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5166555

Circuit reliability under random parametric variation is an area of growing
concern. For highly replicated circuits, e.g., static random access memories
(SRAMs), a rare statistical event for one circuit may induce a not-so-rare
system failure. Existing techniques perform poorly when tasked to generate both
efficient sampling and sound statistics for these rare events. Statistical
blockade is a novel Monte Carlo technique that efficiently filters (that is,
blocks) unwanted samples that are insufficiently rare in the tail distributions
being sought. The method synthesizes ideas from data mining and extreme
value theory and, for the challenging application of SRAM yield analysis, shows
speedups of 10-100 times over standard Monte Carlo.
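
As a rough editorial illustration of why replication turns a rare cell event
into a common array failure (this is not taken from the paper): assuming N
identical cells fail independently with probability p, the chance that at least
one fails is 1 - (1 - p)^N. The function name and the numbers below are
hypothetical.

```python
import math


def array_failure_probability(p_cell: float, n_cells: int) -> float:
    """Probability that at least one of n_cells independent cells fails.

    Computes 1 - (1 - p_cell)**n_cells via log1p/expm1 so that very small
    per-cell probabilities do not lose precision.
    """
    return -math.expm1(n_cells * math.log1p(-p_cell))


# A one-in-a-million cell failure, replicated across a million cells.
p_array = array_failure_probability(1e-6, 1_000_000)
print(f"{p_array:.3f}")  # 0.632
```

Under this independence assumption, a per-cell failure rate far below what
standard Monte Carlo can resolve with a modest sample count still dominates
array yield, which is why the tail of the distribution has to be sampled
efficiently.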

Congratulations to the winners!  The award will be presented at the opening
session of the ACM/EDAC/IEEE Design Automation Conference in San Diego, CA next
week.


JUNE 2011 ISSUE

REGULAR PAPERS

ANALOG, MIXED-SIGNAL, AND RF CIRCUITS

Liu, B.  Fernandez, F. V.  Gielen, G. G. E. Efficient and Accurate Statistical
Analog Yield Optimization and Variation-Aware Circuit Sizing Based on
Computational Intelligence Techniques

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768139

In nanometer CMOS technologies, worst-case design methods and
response-surface-based yield optimization methods face challenges in accuracy.
Monte-Carlo (MC) simulation is general and accurate for yield estimation, but
its efficiency is not high enough to make MC-based analog yield optimization,
which requires many yield estimations, practical. In this paper, techniques
inspired by computational intelligence are used to speed up yield optimization
without sacrificing accuracy. A new sampling-based yield optimization approach,
which determines the device sizes to optimize yield, is presented, called the
ordinal optimization (OO)-based random-scale differential evolution (ORDE)
algorithm. By proposing a two-stage estimation flow and introducing the OO
technique in the first stage, sufficient samples are allocated to promising
solutions, and repeated MC simulations of non-critical solutions are avoided.
The proposed evolutionary algorithm, which uses differential evolution for
global search and a random-scale mutation operator for fine-tuning,
significantly enhances the convergence speed of the yield optimization. With
the same accuracy, the resulting ORDE algorithm can achieve approximately a
tenfold improvement in computational effort compared to an improved MC-based
yield optimization algorithm integrating the infeasible sampling and
Latin-hypercube sampling techniques. Furthermore, ORDE is extended from plain
yield optimization to process-variation-aware single-objective circuit sizing.

EMERGING TECHNOLOGIES

Maslov, D.  Saeedi, M. Reversible Circuit Optimization Via Leaving the Boolean
Domain

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768140

For years, the quantum/reversible circuit community has been convinced that: 1)
the addition of auxiliary quantum bits (qubits) is instrumental in constructing
a smaller quantum circuit, and 2) the introduction of quantum gates inside
reversible circuits may result in more efficient designs. This paper presents a
systematic approach to optimizing reversible (and quantum) circuits via the
introduction of auxiliary qubits and quantum gates inside circuit designs. This
advances our understanding of what may be achieved with 1) and 2).

Lin, C. C.-Y.  Chang, Y.-W. Cross-Contamination Aware Design Methodology for
Pin-Constrained Digital Microfluidic Biochips

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768126

Digital microfluidic biochips have emerged as a popular alternative for
laboratory experiments. Pin-count reduction and cross-contamination avoidance
are key design considerations for practical applications with different
droplets being transported and manipulated on highly integrated biochips. This
paper presents the first design automation flow that considers the
cross-contamination problems on pin-constrained biochips. The factors that make
the problems harder on pin-constrained biochips are explored. To cope with
these cross contaminations, this paper proposes: 1) early crossing minimization
algorithms during placement, and 2) systematic wash droplet scheduling and
routing that require only one extra control pin and zero assay completion time
overhead for practical bioassays. Experimental results show the effectiveness
and scalability of our algorithms for practical bioassays.

FPGAs AND RECONFIGURABLE COMPUTING

Golshan, S.  Kooti, H.  Bozorgzadeh, E. SEU-Aware High-Level Data Path
Synthesis and Layout Generation on SRAM-Based FPGAs

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768130

Although triple modular redundancy (TMR) has been widely used to mitigate
single event upsets (SEUs) in static random access memory-based
field-programmable gate arrays (FPGAs), SEU-caused bridging faults between the
triplicated modules mean that TMR designs are not guaranteed to be correct
under all SEUs. In this paper, we present a novel computer-aided design flow for
redundancy-based applications on FPGAs in order to mitigate the impact of SEUs
in the configuration bitstreams. We introduce the notions of modular redundancy
conflicts and vulnerability-gap conflicts which maintain the fundamental
assumption underlying the integrity of redundancy-based designs (i.e.,
self-containment of SEU-induced faults within a single replica of redundant
resources). When the impact of SEU-induced bridging faults is considered in
high-level synthesis as well as physical synthesis, on average more than 30%
improvement in the number of potential SEU-induced bridging faults can be
reached as well as improvements in the performance and area utilization,
compared to post-synthesis TMR, in which the voters are applied to the feedback
structures in the circuit. Compared to the extreme case of post-synthesis TMR
in which the voters are applied at the end of every configurable logic block,
we reach 38% (26%) improvement in performance (area) of the implemented
circuits.
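
For readers less familiar with TMR, the sketch below is an editorial
illustration (not the authors' flow) of a bitwise majority voter. It shows a
single upset confined to one replica being masked, and the same bit corrupted
in two replicas, as a bridging fault between modules might cause, defeating the
vote. The function name and bit patterns are hypothetical.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over the outputs of three replicas."""
    return (a & b) | (a & c) | (b & c)


golden = 0b1011      # fault-free output of every replica
upset = 0b0100       # bit flipped by the fault

# Fault confined to one replica: the voter masks it.
single = tmr_vote(golden ^ upset, golden, golden)

# Same bit corrupted in two replicas: the voter passes the wrong value.
double = tmr_vote(golden ^ upset, golden ^ upset, golden)

print(single == golden, double == golden)  # True False
```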

MODELING AND SIMULATION

Garcia-Loureiro, A. J.  Seoane, N.  Aldegunde, M.  Valin, R.  Asenov, A.
Martinez, A.  Kalna, K. Implementation of the Density Gradient Quantum
Corrections for 3-D Simulations of Multigate Nanoscaled Transistors

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768129

An efficient implementation of the density-gradient (DG) approach for the
finite element and finite difference methods and its application in
drift-diffusion (D-D) simulations is described in detail. The new, second-order
differential (SOD) scheme is compatible with relatively coarse grids even for
large density variations, and is thus applicable to device simulations with
complex 3-D geometries. Test simulations of a 1-D metal-oxide semiconductor
diode demonstrate that the DG approach discretized using our SOD scheme can be
accurately calibrated against Schrödinger-Poisson calculations, exhibiting
lower discretization error than the previous schemes when using coarse grids
and the same results for very fine meshes. 3-D test D-D simulations using the finite
element method are performed on two devices: a 10 nm gate length double gate
metal-oxide-semiconductor field-effect transistor (MOSFET) and a 40 nm gate
length Tri-Gate fin field-effect transistor (FinFET). In 3-D D-D simulations,
the SOD scheme is able to converge to physical solutions at high voltages even
if the previous schemes fail when using the same mesh and equivalent
conditions. The quantum corrected D-D simulations using the SOD scheme also
converge with an atomistic mesh used for the 10 nm double gate MOSFET, saving
computational resources, and can be accurately calibrated against the results
from the non-equilibrium Green's function approach. Finally, the simulated
ID-VG characteristics for the 40 nm gate length Tri-Gate are in excellent
agreement with experimental data.

Veetil, V.  Chopra, K.  Blaauw, D.  Sylvester, D. Fast Statistical Static
Timing Analysis Using Smart Monte Carlo Techniques

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768142

In this paper, we propose a stratification+hybrid quasi Monte Carlo (SH-QMC)
approach to improve the efficiency of Monte Carlo-based statistical static
timing analysis (SSTA) using sample size reduction. Sample size reduction
techniques proposed in the literature exhibit a tradeoff between accuracy of
the Monte Carlo estimate with fewer samples and their ability to handle a large
number of variables in multidimensional space. This paper proposes to target
several such techniques to different sets of process variation variables by
using information about the importance of these variables to the circuit delay,
and the capability of the techniques to handle multiple dimensions. Simulations
on benchmark circuits up to 90K gates show that the proposed method requires up
to 224 samples for varying levels of process variation to achieve accurate
timing estimates. Results also show that when SH-QMC is performed with multiple
parallel threads on a quad-core processor, the approach is faster than
traditional SSTA with comparable accuracy. When the proposed SH-QMC technique
is supplemented with a graph pruning method the runtime is further reduced by
46-48% on average. The technique is also extended to include an incremental
approach to recompute a percentile delay metric after an engineering change
order.

Chen, Q.  Schoenmaker, W.  Meuris, P.  Wong, N. An Effective Formulation of
Coupled Electromagnetic-TCAD Simulation for Extremely High Frequency Onward

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768127

This paper presents an effective formulation tailored for
electromagnetic-technology computer-aided design coupled simulations for
extremely high-frequency ranges and beyond (>50 GHz). A transformation of variables
is exploited from the starting A-V formulation to the E-V formulation, combined
with adopting the gauge condition as the equation for scalar potential. The
transformation significantly reduces the cross-coupling between electric and
magnetic systems at high frequencies, and therefore provides much better
convergence for iterative solution. The validation of such transformations is
ensured through a careful analysis of redundancy in the coupled system and
material properties. Employment of the advanced matrix permutation technique
further alleviates the extra computational cost introduced by the variable
transformation. Numerical experiments confirm the accuracy and efficiency of
the proposed E-V formulation.

PHYSICAL DESIGN

Rajaram, A.  Pan, D. Z. Robust Chip-Level Clock Tree Synthesis

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768141

Chip-level clock tree synthesis (CCTS) is a key problem that arises in complex
system-on-a-chip designs. A key requirement of CCTS is to balance the
clock-trees belonging to different IPs such that the entire tree has a small
skew across all process corners. Achieving this is difficult because the clock
trees in different IPs might be vastly different in terms of their clock
structures and cell/interconnect delays. The chip-level clock tree is expected
to compensate for these differences and achieve good skews across all corners.
Also, CCTS is expected to reduce clock divergence between IPs that have
critical timing paths between them. Reducing clock divergence reduces the
maximum possible clock skew in the critical paths between the IPs and thus
improves yield. This paper proposes effective CCTS algorithms to simultaneously
reduce multicorner skew and clock divergence. Experimental results on several
test-cases indicate that our methods achieve a 30% reduction in the clock
divergence with significantly improved multicorner skew variance, at the cost
of a 2% increase in buffer area and a 1% increase in wirelength.

SYSTEM-LEVEL DESIGN

Drego, N.  Chandrakasan, A.  Boning, D.  Shah, D. Reduction of
Variation-Induced Energy Overhead in Multi-Core Processors

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768128

Core-to-core variability in future many-core chip multi-processors (CMPs)
negatively impacts energy. Under-performing cores necessitate increasing the
system voltage to maintain homogeneous core performance, introducing an energy
overhead. Multiple supply voltages can be used to mitigate the impact of delay
variation in CMPs. In this paper, we carefully analyze the use of a local
search algorithm to pick near-optimal supply voltages while meeting a fixed
performance target. With two system voltages, we prove our algorithm selects
the global optimum and in the more general multiple voltage case we develop
quantitative bounds. Using a custom simulation methodology on a real processor
core, we show that two system voltages provide the most incremental benefit,
reducing the energy overhead relative to a single voltage by 59-75% and total
energy by 6-16%. Additionally, the worst 5-15% of cores in such systems
necessitate increasingly larger amounts of incremental energy for a constant
incremental performance gain. Therefore, turning off or disabling these cores
is beneficial to a joint performance-energy metric.

Kang, K.  Kim, J.  Yoo, S.  Kyung, C.-M. Runtime Power Management of 3-D
Multi-Core Architectures Under Peak Power and Temperature Constraints

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768133

3-D integration is a new technology that overcomes the limitations of 2-D
integrated circuits, e.g., power and delay induced from long interconnect
wires, by stacking multiple dies to increase logic integration density.
However, chip-level power and peak temperature are the major performance
limiters in 3-D multi-core architectures. In this paper, we propose a runtime
power management method for both peak power and temperature-constrained 3-D
multi-core systems in order to maximize the instruction throughput. The
proposed method exploits dynamic temperature slack (defined as peak temperature
constraint minus current temperature) and workload characteristics (e.g.,
instructions per cycle and memory-boundness) as well as thermal characteristics
of 3-D stacking architectures. Compared with existing thermal-aware power
management solutions for 3-D multi-core systems, our method yields up to 34.2%
(average 18.5%) performance improvement in terms of instructions per second
without significant additional energy consumption.

TEST

Lee, M.  Denq, L.-M.  Wu, C.-W. A Memory Built-In Self-Repair Scheme Based on
Configurable Spares

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768135

There is a growing need for embedded memory built-in self-repair (MBISR) due to
the introduction of more and more system-on-chip (SoC) and other highly
integrated products, for which the chip yield is being dominated by the yield
of on-chip memories, and repairing embedded memories by conventional off-chip
schemes is expensive. Therefore, we propose an MBISR generator called BRAINS+,
which automatically generates register transfer level MBISR circuits for SoC
designers. The MBISR circuit is based on a redundancy analysis (RA) algorithm
that enhances the essential spare pivoting algorithm, with a more flexible
spare architecture, which can configure the same spare to a row, a column, or a
rectangle to fit failure patterns more efficiently. The proposed MBISR circuit
is small, and it supports at-speed test without timing penalty during normal
operation. For example, with a typical 0.13 µm CMOS technology, it can run at
333 MHz for a 512 Kb memory with four spare elements (rows and/or columns), and
the MBISR area overhead is only 0.36%. With its low area overhead and zero
test-time penalty, the MBISR can easily be applied to multiple memories with a
distributed RA scheme. Compared with recent studies, the proposed scheme is
better in both test time and area overhead.

SHORT PAPERS

Hsieh, T.-Y.  Lee, K.-J.  Breuer, M. A. An Error-Tolerance-Based Test
Methodology to Support Product Grading for Yield Enhancement

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768131

This paper presents a novel error-tolerance-based test methodology to grade
defective chips according to their degree of acceptability so as to improve the
effective yield of chips. We employ error rate as the attribute of
error-tolerance to determine acceptability. We show that the number of test
patterns that need to be applied to a circuit under test in estimating the
circuit's error rate is highly dependent on how close the circuit's actual
error rate is to the given grading thresholds. An iterative and adaptive error
rate estimation technique is developed by which an appropriate number of test
patterns can be efficiently determined and the circuit can be immediately
classified into appropriate grades to fit various application requirements.
Experimental results show that: 1) only a few iterations are required to
classify a circuit, and 2) the total number of test patterns used is in general
independent of the circuit size. Both of these observations imply that these
techniques are applicable to large circuits.

Tannir, D.  Khazaka, R. Adjoint Sensitivity Analysis of Nonlinear Distortion in
Radio Frequency Circuits

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768134

Measuring the effects of nonlinear intermodulation distortion is one of the
main requirements in the design of radio frequency circuits. The third-order
intercept point (IP3) is one of the main figures of merit that is used to
characterize this distortion and is expensive to compute due to the presence of
multi-tone inputs. Recently, an efficient method for computing the third order
intercept point, using moments analysis, was presented. However, this approach
does not provide any sensitivity information. In this paper, we propose an
efficient and robust method for computing the sensitivity of IP3 using adjoint
moments with minimal additional computational cost.

Cong, J.  Huang, H.  Jiang, W. Pattern-Mining for Behavioral Synthesis

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768132

Pattern-based synthesis has drawn wide interest from researchers who have tried
to utilize the regularity in applications for design optimizations. In this
letter, we present a general pattern-based behavioral synthesis framework which
can efficiently extract similar structures in programs. Our approach is very
scalable thanks to advanced pruning techniques. The similarity of
structures is captured by a mismatch-tolerant metric: the graph edit distance.
The graph edit distance can naturally capture different program variations such
as bit-width, structure, and port variations. In addition, we extend our
approach to handle control-intensive applications, and this leads to more
opportunities for optimization. Our algorithm uses a feature-based filtering
approach for fast pruning, and a graph similarity metric called the generalized
edit distance for measuring variations in control-data flow graphs.
Furthermore, we apply our pattern-based synthesis system to the resource
optimization problem in behavioral synthesis. Considering knowledge of
discovered patterns, the resource binding step can intelligently generate the
data-path to reduce interconnect costs. Experiments show that our approach can,
on average, reduce the total area by about 20% with 7% latency overhead with
our pattern techniques on the Xilinx Virtex-4 field-programmable gate arrays,
compared to the traditional behavioral synthesis flow.