May 2013 Newsletter 
Placing you one click away from the best new CAD research!
Plain-text version at 
http://www.umn.edu/~tcad/newsletter/2013-05.txt


Announcing the 2013 Donald O. Pederson Best Paper Award...

The Donald O. Pederson Award recognizes the best paper published in the IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems in the
two calendar years preceding the award. This year's award, which will be
presented at the Design Automation Conference in San Francisco, goes to:

Wangyang Zhang; Xin Li; T. Liu; Emrah Acar; Rob A. Rutenbar; R. D. (Shawn)
Blanton
Virtual Probe: A Statistical Framework for Low-Cost Silicon Characterization of
Nanoscale Integrated Circuits
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6071091&isnumber=6071079

The paper proposes a novel statistical methodology for modeling spatial
variations of silicon wafers and dies from a small set of measurement data. It
is the first paper to present a solid theoretical framework derived from
compressive sensing theory and fundamentally changes the way that spatial
variations are modeled and interpreted. Unlike other conventional techniques
that require a lot of measurement data to capture the spatial variation pattern
and, hence, are slow and expensive, Virtual Probe is both accurate and
cost-efficient. It can be applied to a broad range of practical problems
related to integrated circuit design and manufacturing, such as yield learning
and test cost reduction. This paper thus makes a unique and outstanding
contribution to the VLSI/CAD community with significant long-term impact.

REGULAR PAPERS

EMERGING TECHNOLOGIES

Lee, D. ; Lee, W.S. ; Chen, C. ; Fallah, F. ; Provine, J. ; Chong, S. ;
Watkins, J. ; Howe, R.T. ; Wong, H.-S.P. ; Mitra, S. 
Combinational Logic Design Using Six-Terminal NEM Relays
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504552

This paper presents techniques for designing nanoelectromechanical relay-based
logic circuits using six-terminal relays that behave as universal logic gates.
With proper biasing, a compact 2-to-1 multiplexer can be implemented using a
single six-terminal relay. Arbitrary combinational logic functions can then be
implemented using well-known binary decision diagram (BDD) techniques.
Compared to a CMOS-style implementation using four-terminal relays, the
BDD-based implementation can result in lower area without major impact on
performance metrics such as delay and energy (when the relays are scaled to
small dimensions). Although it is possible to implement any combinational
circuit with a single mechanical delay, the relay count can be significantly
reduced for complex logic functions by allowing multiple mechanical delays.
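
As a rough illustration of the BDD idea above (not the authors' synthesis
flow), the sketch below Shannon-expands a truth table into a reduced network
of 2-to-1 multiplexers, where each multiplexer node would correspond to one
six-terminal relay. The function names and the truth-table encoding are
assumptions made for this example only.

```python
from itertools import product


def build_bdd(table, var=0):
    """Shannon-expand a truth table into a reduced mux network.

    table: tuple of 0/1 outputs, with variable 0 as the most significant bit.
    Returns 0, 1, or a node (var_index, low_child, high_child), where the
    node behaves as a 2-to-1 mux selected by input `var_index`.
    """
    if all(bit == table[0] for bit in table):
        return table[0]                      # constant terminal
    half = len(table) // 2
    low = build_bdd(table[:half], var + 1)   # cofactor with var = 0
    high = build_bdd(table[half:], var + 1)  # cofactor with var = 1
    if low == high:
        return low                           # this variable is redundant here
    return (var, low, high)


def count_muxes(node, seen=None):
    """Count distinct mux nodes; structurally equal nodes are shared."""
    if seen is None:
        seen = set()
    if node in (0, 1) or node in seen:
        return 0
    seen.add(node)
    return 1 + count_muxes(node[1], seen) + count_muxes(node[2], seen)


def evaluate(node, inputs):
    """Follow the mux selections from the root down to a constant."""
    while node not in (0, 1):
        var, low, high = node
        node = high if inputs[var] else low
    return node


# Hypothetical example: 3-input parity (XOR of a, b, c).
parity_table = tuple(a ^ b ^ c for a, b, c in product((0, 1), repeat=3))
parity_bdd = build_bdd(parity_table)
parity_mux_count = count_muxes(parity_bdd)
```

Here the mux count stands in for relay count only under the one-relay-per-mux
assumption stated in the abstract; delay, biasing, and the single versus
multiple mechanical delay trade-off are not modeled.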

FPGAS AND RECONFIGURABLE COMPUTING

Teng, B. ; Anderson, J.H.
Latch-Based Performance Optimization for Field-Programmable Gate Arrays
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504531

We explore using pulsed latches for timing optimization in field-programmable
gate arrays (FPGAs). Pulsed latches are transparent latches driven by a clock
with a nonstandard (i.e., not 50%) duty cycle. As latches are already present
on commercial FPGAs, their use for timing optimization can avoid the power or
area drawbacks associated with other techniques such as clock skew and
retiming. We propose algorithms that automatically replace certain flip-flops
with latches for performance gains. Under conservative short path or minimum
delay assumptions, our latch-based optimization, operating on already routed
designs, provides all the benefit of clock skew in most cases and increases
performance by 9%, on average, without area penalties or significant netlist
changes. We show that short paths greatly hinder the use of pulsed
latches, and that further improvements in performance are possible by
increasing the delay of certain short paths.

Stojilovic, M. ; Novo, D. ; Saranovac, L. ; Brisk, P. ; Ienne, P.
Selective Flexibility: Creating Domain-Specific Reconfigurable Arrays
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504556

Historically, hardware acceleration technologies have either been
application-specific, therefore lacking in flexibility, or fully programmable,
thereby suffering from notable inefficiencies on an application-by-application
basis. To address the growing need for domain-specific acceleration
technologies, this paper describes a design methodology (i) to automatically
generate a domain-specific coarse-grained array from a set of representative
applications and (ii) to introduce limited forms of architectural generality to
increase the likelihood that additional applications can be successfully mapped
onto it. In particular, coarse-grained arrays generated using our approach are
intended to be integrated into customizable processors that use
application-specific instruction set extensions to accelerate performance and
reduce energy; rather than implementing these extensions using
application-specific integrated circuit (ASIC) logic, which lacks flexibility,
they can be synthesized onto our reconfigurable array instead, allowing the
processor to be used for a variety of applications in related domains. Results
show that our array is around 2x slower and 15x larger than an ultimately
efficient ASIC implementation, and thus far more efficient than
field-programmable gate arrays (FPGAs), which are known to be 3-4x slower and
20-40x larger. Additionally, we estimate that our array is usually around 2x
larger and 2x slower than an accelerator synthesized using traditional datapath
merging, which has very limited flexibility, if any, beyond the design set of
data-flow graphs (DFGs).

MODELING AND SIMULATION

Park, S. ; Park, J. ; Shin, D. ; Wang, Y. ; Xie, Q. ; Pedram, M. ; Chang, N.
Accurate Modeling of the Delay and Energy Overhead of Dynamic Voltage and
Frequency Scaling in Modern Microprocessors
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504549

Dynamic voltage and frequency scaling (DVFS) has been studied for well over a
decade. Nevertheless, existing DVFS transition overhead models suffer from
significant inaccuracies; for example, they incorrectly account for the effect
of DC-DC converters, frequency synthesizers, and voltage and frequency change
policies on energy losses incurred during mode transitions. Incorrect and/or
inaccurate DVFS transition overhead models prevent one from determining the
precise break-even time and thus forfeit some of the energy saving that is
ideally achievable. This paper introduces accurate DVFS transition overhead
models for both energy consumption and delay. In particular, we redefine the
DVFS transition overhead to include the underclocking-related losses in a
DVFS-enabled microprocessor, additional inductor IR losses, and power losses
due to discontinuous-mode DC-DC conversion. We report the transition overheads
for representative desktop, mobile, and low-power processors. We also present
a DVFS transition overhead macromodel for use by high-level DVFS schedulers.
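
To show why the transition overhead matters for the break-even time, here is
a deliberately simplified sketch. It is not the paper's model: it assumes a
single lumped transition energy and a fixed power difference between two
operating points, and every name and number in it is hypothetical.

```python
def break_even_time(p_high_w, p_low_w, e_transition_j, t_transition_s=0.0):
    """Shortest stay in the low-power mode that pays back the transition cost.

    Simplifying assumptions (not from the paper): staying in the low-power
    mode for t seconds saves (p_high_w - p_low_w) * t joules, and one round
    trip between the modes costs e_transition_j joules and t_transition_s
    seconds in total.
    """
    if p_low_w >= p_high_w:
        raise ValueError("low-power mode must draw less power than high mode")
    energy_bound = e_transition_j / (p_high_w - p_low_w)
    # The stay can never be shorter than the transition itself.
    return max(t_transition_s, energy_bound)


# Hypothetical numbers: 2.0 W vs 1.2 W, 400 uJ lost per round trip.
t_be = break_even_time(p_high_w=2.0, p_low_w=1.2, e_transition_j=400e-6)
```

Under these assumptions, underestimating the transition energy shortens the
computed break-even time, so a scheduler would switch modes in intervals that
are too short to recover the cost.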

PHYSICAL DESIGN

Liu, W.-H. ; Kao, W.-C. ; Li, Y.-L. ; Chao, K.-Y.
NCTU-GR 2.0: Multithreaded Collision-Aware Global Routing With Bounded-Length
Maze Routing
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504553

Modern global routers employ various routing methods to improve routing speed
and quality. Maze routing is the most time-consuming process for existing
global routing algorithms. This paper presents two bounded-length maze routing
(BLMR) algorithms (optimal-BLMR and heuristic-BLMR) that perform much faster
routing than traditional maze routing algorithms. In addition, a rectilinear
Steiner minimum tree aware routing scheme is proposed to guide heuristic-BLMR
and monotonic routing to build a routing tree with shorter wirelength. This
paper also proposes a parallel multithreaded collision-aware global router
based on a previous sequential global router (SGR). Unlike the
partitioning-based strategy, the proposed parallel router uses a task-based
concurrency strategy. Finally, a 3-D wirelength optimization technique is
proposed to further refine the 3-D routing results. Experimental results reveal
that the proposed SGR uses less wirelength and runs faster than most other
state-of-the-art global routers with a different set of parameters. Compared to
the proposed SGR, the proposed parallel router yields almost the same routing
quality with average speedups of 2.71x and 3.12x on overflow-free and
hard-to-route cases, respectively, when running on a 4-core system.

TEST

Ye, F. ; Zhang, Z. ; Chakrabarty, K. ; Gu, X.
Board-Level Functional Fault Diagnosis Using Artificial Neural Networks,
Support-Vector Machines, and Weighted-Majority Voting
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504533

Increasing integration densities and high operating speeds lead to subtle
manifestation of defects at the board level.  Functional fault diagnosis is,
therefore, necessary for board-level product qualification. However, ambiguous
diagnosis results lead to long debug times and even wrong repair actions, which
significantly increase repair cost and adversely impact yield. Advanced
machine-learning (ML) techniques offer an unprecedented opportunity to increase
the accuracy of board-level functional diagnosis and reduce high-volume
manufacturing cost through successful repair. We propose a smart diagnosis
method based on two ML classification models, namely, artificial neural
networks (ANNs) and support-vector machines (SVMs), that can learn from repair
history and accurately localize the root cause of a failure. Fine-grained fault
syndromes extracted from failure logs and corresponding repair actions are used
to train the classification models. We also propose a decision machine based on
weighted-majority voting, which combines the benefits of ANNs and SVMs. Three
complex boards from the industry, currently in volume production, and
additional synthetic data, are used to validate the proposed methods in terms
of diagnostic accuracy, resolution, and quantifiable improvement over current
diagnostic software.
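
The abstract does not spell out the voting scheme, so the sketch below uses
the classic weighted-majority rule (each classifier's weight is multiplied by
a penalty factor whenever it is wrong) as a stand-in. The class name, the
penalty factor `beta`, and the component labels are assumptions made for
illustration.

```python
from collections import defaultdict


class WeightedMajorityVoter:
    """Combine label predictions from several classifiers by weighted vote."""

    def __init__(self, expert_names, beta=0.5):
        # Every classifier starts with the same trust.
        self.weights = {name: 1.0 for name in expert_names}
        self.beta = beta  # multiplicative penalty for a wrong prediction

    def predict(self, votes):
        """votes maps classifier name -> predicted root-cause label."""
        totals = defaultdict(float)
        for name, label in votes.items():
            totals[label] += self.weights[name]
        # On a tie, the label that was voted for first wins.
        return max(totals, key=totals.get)

    def update(self, votes, true_label):
        """Penalize every classifier whose prediction was wrong."""
        for name, label in votes.items():
            if label != true_label:
                self.weights[name] *= self.beta


# Hypothetical repair history: the ANN blamed U12, but C7 was the real cause.
voter = WeightedMajorityVoter(["ann", "svm"])
voter.update({"ann": "U12", "svm": "C7"}, true_label="C7")
decision = voter.predict({"ann": "U3", "svm": "R41"})
```

After the update, the SVM carries more weight than the ANN, so a later
disagreement is resolved in the SVM's favor.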

Lin, Y.-H. ; Huang, S.-Y. ; Tsai, K.-H. ; Cheng, W.-T. ; Sunter, S. ; Chou,
Y.-F. ; Kwai, D.-M.
Parametric Delay Test of Post-Bond Through-Silicon Vias in 3-D ICs via Variable
Output Thresholding Analysis
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504551

A parametric delay fault could arise in a through-silicon via (TSV) of a 3-D IC
due to a manufacturing defect. Identification of such a fault is essential for
fault diagnosis, yield learning, and/or reliability screening. In this paper,
we present an innovative design-for-testability technique called variable
output thresholding. We discovered that by dynamically switching the output of
a TSV from a normal inverter to a Schmitt-trigger inverter, the parametric
delay fault on the TSV can be characterized and detected. SPICE simulation
reveals that this technique remains effective even when there is significant
process variation. A scalable test infrastructure indicates that the test time
is modest at only 17.2 ms for 1024 TSVs and 648.8 ms for 32768 TSVs when the
test clock is running at 10 MHz.
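
The reported test times can be converted into test-clock cycles per TSV with
simple arithmetic. The helper below is only a back-of-the-envelope check on
the figures quoted above; the abstract itself gives no per-TSV breakdown, and
the function name is hypothetical.

```python
def cycles_per_tsv(test_time_ms, num_tsvs, clock_hz=10e6):
    """Average number of test-clock cycles spent per TSV."""
    total_cycles = test_time_ms * 1e-3 * clock_hz
    return total_cycles / num_tsvs


small = cycles_per_tsv(17.2, 1024)     # roughly 168 cycles per TSV
large = cycles_per_tsv(648.8, 32768)   # roughly 198 cycles per TSV
```

The per-TSV cost grows only slightly between the two configurations, which is
consistent with the abstract's description of the infrastructure as scalable.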

Liu, X. ; Xu, Q.
On Multiplexed Signal Tracing for Post-Silicon Validation
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504557

Trace-based debug techniques have been widely used in the industry to
eliminate design errors that escape pre-silicon verification. Existing
solutions typically trace the same set of signals throughout each debug run,
which is not very effective for catching design errors. In this paper, we
propose a multiplexed signal tracing strategy that is able to significantly
increase debuggability of the circuit. That is, we divide the tracing procedure
in each debug run into a few periods and trace different sets of signals in
each period. We present a trace signal grouping algorithm to maximize the
probability of catching the propagated evidence from design errors,
considering the trace interconnection fabric design constraints. Moreover, we
propose a trace signal selection solution to enhance the error detection
capability. Experimental results on benchmark circuits demonstrate the
effectiveness of the proposed solution.

SYSTEM-LEVEL DESIGN

Naeem, A. ; Jantsch, A. ; Lu, Z.
Scalability Analysis of Memory Consistency Models in NoC-Based Distributed Shared Memory SoCs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504554

We analyze the scalability of six memory consistency models in network-on-chip
(NoC)-based distributed shared memory multicore systems: 1) protected release
consistency (PRC); 2) release consistency (RC); 3) weak consistency (WC); 4)
partial store ordering (PSO); 5) total store ordering (TSO); and 6) sequential
consistency (SC).  Their realizations are based on a transaction counter and an
address-stack-based approach. The scalability analysis is based on different
workloads mapped on various sizes of networks using different problem sizes.
For the experiments, we use the Nostrum NoC-based configurable multicore
platform with a 2-D mesh topology and a deflection routing algorithm. Under the
synthetic workloads, the average execution time for the PRC, RC, WC, PSO, and
TSO models in the 8x8 network (64 cores) is reduced by 32.3%, 28.3%, 20.1%,
13.8%, and 9.9% relative to the SC model, respectively. For the application workloads,
as the network size grows, the average execution time under these relaxed
memory models decreases with respect to the SC model depending on the
application and its match to the architecture. The performance improvement of
the PRC and RC models over the SC model tends to be higher than 50% as observed
in the experiments, when the system is further scaled up. The area cost in the
network interface for the relaxed memory models is increased by less than 4%
over the SC model.

VERIFICATION

Cimatti, A. ; Narasamdya, I. ; Roveri, M.
Software Model Checking SystemC
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504555

SystemC is an increasingly used language for writing executable specifications
of systems-on-chip. The verification of SystemC, however, is a very difficult
challenge. Simulation features great scalability, but can miss important
defects. On the other hand, formal verification of SystemC is extremely hard
because of the presence of threads, and the intricacies of the communication
and scheduling mechanisms. In this paper, we explore formal verification for
SystemC by means of software model checking techniques, which have demonstrated
substantial progress in recent years. We propose an accurate model of SystemC
and three complementary encodings of SystemC into finite-state processes and
into sequential and threaded programming models. We implement the proposed
approaches in a tool chain and carry out a thorough experimental evaluation
using several benchmarks taken from the literature on SystemC verification, and
experimenting with different state-of-the-art software model checkers. The
results clearly show the applicability and efficiency of the proposed
approaches. In particular, the results show the effectiveness of the threaded
and of the finite-model encodings to prove and disprove properties,
respectively.

Kumar, J.A. ; Vasudevan, S.
Formal Probabilistic Timing Verification in RTL
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504532

Variations in timing can occur due to multiple sources on a chip such as
process variations and variations in input patterns. It is desirable to have
variation awareness at the register transfer level (RTL), and estimate block
level delay distributions early in the design cycle, to evaluate design choices
quickly and minimize postsynthesis simulation costs. In previous work, we
introduced statistical high-level analysis and rigorous performance estimation
(SHARPE), a rigorous, systematic methodology to verify design correctness in
RTL in the presence of variations.  We described SHARPE in the context of
computing statistical delay invariants with respect to input variations. We
treated the RTL source code as a program and used static program analysis
techniques to compute probabilities. We modeled the probabilistic RTL modules
as discrete time Markov chains that are then checked formally for probabilistic
invariants using PRISM, a probabilistic model checker. In this paper, we extend
SHARPE to perform timing verification in RTL in the context of process
variations. We achieved this by obtaining a set of process variation-aware RTL
delay models and correspondingly modifying the existing steps in SHARPE. We
illustrate SHARPE on the RTL description of the datapath of OR1200, an open
source embedded processor. We also apply SHARPE to other data-intensive RTL
designs such as nontrivial components of communication systems and a few
benchmark designs.

SHORT PAPER

Chen, Q. ; Schoenmaker, W. ; Chen, G. ; Jiang, L. ; Wong, N.
A Numerically Efficient Formulation for Time-Domain
Electromagnetic-Semiconductor Cosimulation for Fast-Transient Systems
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504550

We report recent progress in developing a numerically efficient formulation for
electromagnetic-technology computer-aided design cosimulation for
fast-transient computations. The difficulties underlying the currently existing
transient formulation stemming from the vector potential-scalar potential (A-V)
framework are analyzed. A time-domain electric field-scalar potential (E-V)
framework is then developed via equation and variable transformations. This
results in better-conditioned systems that are friendly to iterative solutions
at fast switching times. Numerical examples show that the proposed E-V solver
is a useful tool for addressing multidomain simulation.