January 2012 Newsletter

Placing you one click away from the best new CAD research!
Plain-text version at http://www.umn.edu/~tcad/newsletter/2012-01.txt

WE WISH YOU A HAPPY NEW YEAR AND ALL THE BEST FOR 2012!

Editorial
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6106728

List of Reviewers for 2011
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6106727

SPECIAL SECTION ON PAR-CAD: PARALLEL CAD ALGORITHMS AND CAD FOR PARALLEL ARCHITECTURES/SYSTEMS

Marculescu, D.; Li, P.
Guest Editorial
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6106736

Kapre, N.; DeHon, A.
SPICE2: Spatial Processors Interconnected for Concurrent Execution for Accelerating the SPICE Circuit Simulator Using an FPGA
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106733

Spatial processing of sparse, irregular, double-precision floating-point computation using a single field-programmable gate array (FPGA) enables up to an order-of-magnitude speedup (mean of 2.8x) over a conventional microprocessor for the SPICE circuit simulator. We develop a parallel, FPGA-based, heterogeneous architecture customized for accelerating the SPICE simulator to deliver this speedup. To properly parallelize the complete simulator, we decompose SPICE into its three constituent phases (model evaluation, sparse matrix solve, and iteration control) and customize a spatial architecture for each phase independently. Our heterogeneous FPGA organization mixes very-long-instruction-word, dataflow, and streaming architectures into a cohesive, unified design to match the parallel patterns exposed by our programming framework. This FPGA architecture is able to outperform conventional processors due to a combination of factors, including high utilization of statically scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and streaming, overlapped processing of the control algorithms.
We demonstrate that we can independently accelerate model evaluation by a mean factor of 6.x (1.4-23x) across a range of nonlinear device models, and matrix solve by 2.4x (0.6-13x) across various benchmark matrices, while delivering a mean combined speedup of 2.8x (0.2-11x) for the composite design when comparing a Xilinx Virtex-6 LX760 (40 nm) with an Intel Core i7 965 (45 nm). We also estimate mean energy savings of 8.9x (up to 40.9x) for the same pair of devices.

Sridhar, A.; Vincenzi, A.; Ruggiero, M.; Atienza, D.
Neural Network-Based Thermal Simulation of Integrated Circuits on GPUs
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106739

With the rising challenges in heat removal in integrated circuits (ICs), the development of thermal-aware computing architectures and run-time management systems has become indispensable to the continuation of IC design scaling. These thermal-aware design technologies of the future strongly depend on the availability of efficient and accurate means of thermal modeling and analysis. These thermal models must have not only sufficient accuracy to capture the complex mechanisms that regulate thermal diffusion in ICs, but also a level of abstraction that allows their fast execution for design space exploration. In this paper, we propose an innovative full-chip thermal modeling approach that can handle the scalability problem of transient heat flow simulation in large 2-D/3-D multiprocessor ICs. This is achieved by parallelizing the computation-intensive task of transient temperature tracking using neural networks and exploiting the computational power of massively parallel graphics processing units. Our results show up to 35x run-time speedup compared to state-of-the-art IC thermal simulation tools while keeping the error below 1 °C. The speedups scale with the size of the 3-D multiprocessor ICs, and our proposed method serves as a valuable design space exploration tool.

Villena, J. F.; Silveira, L.
Exploiting Parallelism for Improved Automation of Multidimensional Model Order Reduction
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106740

This paper addresses the issue of automatically generating reduced-order models of very large multidimensional systems. To tackle this problem, we introduce an efficient parallel projection-based model order reduction framework for parameterized linear systems. The underlying methodology is based on an automated multidimensional sample selection procedure that maximizes effectiveness in the generation of the projection basis. The parallel nature of the algorithm is efficiently exploited on both shared- and distributed-memory architectures. This leads to a highly scalable, automatic, and reliable parallel reduction scheme, able to handle very large systems depending on multiple parameters. In addition, the framework is general enough to provide a good approximation regardless of the model's representation or underlying nature, as demonstrated on a variety of benchmark examples. The method provides the potential to tackle, in an automatic fashion, extremely challenging models that would otherwise be difficult to address with existing sequential approaches.

Kim, M.-C.; Lee, D.-J.; Markov, I. L.
SimPL: An Effective Placement Algorithm
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106735

We propose a self-contained, flat, quadratic global placer that is simpler than existing placers and easier to integrate into timing-closure flows. It maintains lower-bound and upper-bound placements that converge to a final solution. The upper-bound placement is produced by a novel look-ahead legalization algorithm. Our placer, SimPL, outperforms mPL6, FastPlace3, NTUPlace3, APlace2, and Capo simultaneously in runtime and solution quality, running 7.10 times faster than mPL6 (when using a single thread) and reducing wirelength by 3% on the ISPD 2005 benchmark suite.
More significant improvements are achieved on larger benchmarks. The new algorithm is amenable to parallelism, and we report empirical studies with SSE2 instructions and up to eight parallel threads.

Gort, M.; Anderson, J. H.
Accelerating FPGA Routing Through Parallelization and Engineering Enhancements
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106729

We present parallelization and heuristic techniques to reduce the run-time of field-programmable gate array (FPGA) negotiated congestion routing. Two heuristic optimizations provide over 3x speedup versus a sequential baseline. In our parallel approach, sets of design signals are assigned to different processor cores and routed concurrently. Communication between cores uses the message passing interface (MPI) protocol. We propose a geographic partitioning of signals into independent sets to help minimize the communication overhead. Our parallel implementation provides approximately 2.3x speedup using four cores and produces deterministic, repeatable results. When combined, the parallel and heuristic techniques provide over 7x speedup with four cores versus the router in the widely used Versatile Place and Route (VPR) FPGA placement/routing framework, with no significant impact on circuit speed or wirelength.

REGULAR PAPERS

LOGIC SYNTHESIS

Agyekum, M. Y.; Nowick, S. M.
Error-Correcting Unordered Codes and Hardware Support for Robust Asynchronous Global Communication
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106737

This paper introduces a new family of error-correcting unordered (ECU) codes for global communication, called Zero-Sum. They combine the timing-robustness of delay-insensitive (i.e., unordered) codes with the fault-tolerance of error-correcting codes (providing 1-bit error correction or 2-bit detection).
Two key features of the codes are that they are systematic, allowing direct extraction of data, and weighted, where the check field is computed as the sum of data index weights. A wide variety of weight assignments is shown to be feasible. Two practical enhancements are also proposed. The Zero-Sum+ code extends error detection to 3-bit errors, or alternatively handles 2-bit detection and 1-bit correction. The Zero-Sum* code supports heuristic 2-bit correction, while still guaranteeing 2-bit detection, under different strategies of weight assignment. Detailed hardware implementations of the supporting components (encoder, completion detection, error corrector) are given, as well as an outline of the system microarchitecture. In comparison to the best alternative systematic ECU code, the basic Zero-Sum code provides better or comparable coding efficiency, with a 5.74%-18.18% reduction in the average number of wire transitions for most field sizes. Several Zero-Sum* codes were also evaluated for their 2-bit error correction coverage; initial results are promising, with the best strategy correcting 52.92%-71.16% of all 2-bit errors for most field sizes, at only a moderate decrease in coding efficiency and increase in wire transitions. Technology-mapped pre-layout implementations of the supporting Zero-Sum code hardware were synthesized with the UC Berkeley ABC tool using a 90 nm industrial standard cell library. Results indicate that they have moderate area and delay overheads. In comparison, supporting hardware for the best nonsystematic ECU codes has 3.82-10.44x greater area for larger field sizes.

MODELING AND SIMULATION

Hajj, I. N.
Extended Nodal Analysis
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106732

This paper presents an extension to the popular nodal and modified nodal formulation methods that allows elements whose characteristic functions include, in addition to voltages and currents, other controlling variables, such as charge, flux, and other physical parameters, to be included in the circuit equation formulation in a straightforward manner. Stamps, similar to nodal and modified nodal circuit element stamps, are developed to include these elements in the circuit matrix equation without the need to derive equivalent circuit models consisting of interconnections of elements characterized only by currents and voltages, as in current practice. The method is applied to derive circuit stamps of memristive, memcapacitive, meminductive, and other complex device models. The method reduces the size of the overall circuit matrix and allows easy model evaluation and linearization during the iterative circuit solution process.

Zhou, Y.; Gad, E.; Nakhla, M. S.; Achar, R.
Structural Characterization and Efficient Implementation Techniques for A-Stable High-Order Integration Methods
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106731

This paper presents structural characterization and performance enhancement strategies for the recently proposed A-stable and L-stable high-order integration methods based on the Obreshkov formula. It is demonstrated that although the Jacobian matrix in these methods is larger than the Jacobian matrix in classical low-order methods, it enjoys a special structure that can be used to develop efficient factorization techniques. In addition, the paper proposes a method to reduce the number of Newton-Raphson iterations needed to converge in the large Jacobian domain.

Wang, Y.; Hu, X.; Cheng, C.-K.; Pang, G. K. H.; Wong, N.
A Realistic Early-Stage Power Grid Verification Algorithm Based on Hierarchical Constraints
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106741

Power grid verification has become an indispensable step in guaranteeing a functional and robust chip design. Vectorless power grid verification methods, which solve linear programming (LP) problems under current constraints, enable worst-case voltage drop predictions at an early stage of design, when the specific waveforms of the current drains are unknown. In this paper, a novel power grid verification algorithm based on hierarchical constraints is proposed. By introducing novel power constraints, the proposed algorithm generates more realistic current patterns and provides less pessimistic voltage drop predictions. A model order reduction-based coefficient computation algorithm reduces the complexity of formulating the LP problems from being proportional to the number of steps to being independent of it. Utilizing the special hierarchical constraint structure, a submodular polyhedron greedy algorithm dramatically reduces the complexity of solving the LP problems from over O(km^3) to roughly O(km log km), where km is the number of variables. Numerical results show that the proposed algorithm provides less pessimistic voltage drop predictions while at the same time achieving dramatic speedups.

SYSTEM-LEVEL DESIGN

Yun, D.; Kim, S.; Ha, S.
A Parallel Simulation Technique for Multicore Embedded Systems and Its Performance Analysis
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106734

A virtual prototyping system is constructed by replacing real processing components with component simulators running concurrently. The performance of such a distributed simulation decreases drastically as the number of component simulators increases. Thus, we propose a novel parallel simulation technique to boost the simulation speed.
In the proposed technique, a simulator wrapper performs time synchronization with the simulation backplane on behalf of the associated component simulator itself. Component simulators send null messages periodically to the backplane to enable parallel simulation without any causality problems. Since excessive communication may degrade the simulation performance, we also propose a novel performance analysis technique to determine an optimal period of null message transfer, considering both the characteristics of the target application and the configuration of the simulation host. Through extensive experiments, we show that the proposed parallel simulation achieves almost linear speedup in the number of processor cores if the frequency of null message transfer is chosen optimally. The proposed analysis technique predicted the simulation performance with more than 90% accuracy in the worst case across the various target applications and simulation environments used in our experiments.

Yang, C.; Orailoglu, A.
Tackling Resource Variations Through Adaptive Multicore Execution Frameworks
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106742

Multicore architectures have been widely adopted to accommodate the rising performance demand in various application domains, ranging from high-end supercomputing to low-end consumer electronics. Yet due to the ever-growing integration density and application complexity, such architectures suffer from an increased level of core availability variation. At runtime, issues such as device failures, heat buildup, and resource competition and preemption can make computational resources unavailable, necessitating execution schedules capable of delivering diverse performance levels to match the varying resource allocations.
The adaptive execution framework introduced in this paper delivers high-quality schedules capable of predictably reconfiguring execution and gracefully degrading performance in the face of resource unavailability. By adhering to a novel band structure, a set of possible execution schedules is compactly generated at compile time, thus delivering predictable responses to runtime resource variations. More importantly, through the exploitation of an extra degree of freedom in the scheduling process, the scheduler can perform task assignments in such a way that adaptivity is embedded within the preoptimized schedules at almost no cost. The efficacy of the proposed technique is confirmed by incorporating it into a conventional, widely adopted scheduling heuristic and verifying it experimentally in the context of single-core degradations.

Daneshtalab, M.; Ebrahimi, M.; Liljeberg, P.; Plosila, J.; Tenhunen, H.
Memory-Efficient On-Chip Network With Adaptive Interfaces
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106730

To achieve higher memory bandwidth in network-based multiprocessor architectures, multiple dynamic random access memories can be accessed simultaneously. In such architectures, not only are resource utilization and latency critical issues, but a reordering mechanism is also required to deliver the response transactions of concurrent memory accesses in order. In this paper, we present a memory-efficient on-chip network architecture that copes with these issues efficiently. Each node of the network is equipped with a novel network interface (NI) to deal with out-of-order delivery, and a priority-based router to decrease the network latency. The proposed NI exploits a streamlined reordering mechanism to handle in-order delivery and utilizes the Advanced eXtensible Interface (AXI) transaction-based protocol to maintain compatibility with existing intellectual property cores.
To improve memory utilization and reduce memory latency, an optimized memory controller is integrated into the presented NI. Experimental results with synthetic test cases demonstrate that the proposed on-chip network architecture provides significant improvements in average network latency (16%), average memory access latency (19%), and average memory utilization (22%).

SHORT PAPERS

Shi, B.; Zhang, Y.; Srivastava, A.
Accelerating Gate Sizing Using Graphics Processing Units
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106738

In this paper, we investigate the gate sizing problem and develop techniques for improving its runtime by effectively exploiting graphics processing unit (GPU) resources. Theoretically, we investigate a randomized cutting plane-based convex optimization technique that is highly parallelizable and can effectively exploit the single-instruction multiple-data structure imposed by GPUs. To further improve the runtime, we also develop GPU-oriented implementation guidelines that exploit the specific structure that convex gate sizing formulations impose. We implemented our method on an NVIDIA Tesla 10 GPU and obtain 21x to 400x speedups compared to the MOSEK optimization tool running on a conventional CPU. The solution quality of our method is very close to that achieved by MOSEK, since both solve the convex formulation to optimality.
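To illustrate the kind of method the last abstract refers to, here is a generic Kelley-style cutting-plane sketch, not the authors' GPU algorithm: subgradient cuts build a piecewise-linear lower model of a convex objective, and the model is evaluated at many random candidate points in one vectorized operation, which is the data-parallel step a GPU's single-instruction multiple-data hardware would accelerate. The function name and the toy two-variable quadratic "sizing" objective are invented for illustration.

```python
import numpy as np

def cutting_plane_minimize(f, grad, lo, hi, iters=60, samples=4096, seed=0):
    """Minimize a convex f over a box with randomized cutting planes.

    Each iteration adds one subgradient cut; the piecewise-linear lower
    model is then evaluated at `samples` random points in a single
    vectorized step (the SIMD/GPU-friendly part of such schemes).
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    x = (lo + hi) / 2.0                      # start at the box center
    pts, vals, grads = [], [], []
    best_x, best_f = x, f(x)
    for _ in range(iters):
        fx, gx = f(x), grad(x)
        if fx < best_f:
            best_x, best_f = x, fx
        pts.append(x); vals.append(fx); grads.append(gx)
        P, V, G = np.array(pts), np.array(vals), np.array(grads)
        cand = rng.uniform(lo, hi, size=(samples, len(lo)))
        # model(y) = max_i [ f(x_i) + g_i . (y - x_i) ], a lower bound on f;
        # one einsum evaluates every cut at every candidate point at once
        cuts = V[None, :] + np.einsum('nkd,kd->nk',
                                      cand[:, None, :] - P[None, :, :], G)
        x = cand[np.argmin(cuts.max(axis=1))]  # next iterate: model minimizer
    return best_x, best_f

# Toy convex objective standing in for a convex delay/area cost over
# two gate sizes (purely illustrative, not the paper's formulation).
target = np.array([1.5, 2.5])
f = lambda x: float(np.sum((x - target) ** 2))
grad = lambda x: 2.0 * (x - target)
best_x, best_f = cutting_plane_minimize(f, grad, lo=[0, 0], hi=[4, 4])
```

Because the per-candidate model evaluations are independent, the same loop maps naturally onto a GPU kernel; the NumPy version above merely mimics that structure on a CPU.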