TCAD Newsletter - Apr 2011 Issue
Placing you one click away from the best new CAD research!

KEYNOTE PAPER

Cong, J.  Liu, B.  Neuendorffer, S.  Noguera, J.  Vissers, K.  Zhang, Z.
High-Level Synthesis for FPGAs: From Prototyping to Deployment

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737854

Escalating system-on-chip design complexity is pushing the design community to
raise the level of abstraction beyond register transfer level. Despite the
unsuccessful adoptions of early generations of commercial high-level synthesis
(HLS) systems, we believe that the tipping point for transitioning to HLS
methodology is happening now, especially for field-programmable gate array
(FPGA) designs. The latest generation of HLS tools has made significant
progress in providing wide language coverage and robust compilation technology,
platform-based modeling, advancement in core HLS algorithms, and a
domain-specific approach. In this paper, we use AutoESL's AutoPilot HLS tool
coupled with domain-specific system-level implementation platforms developed by
Xilinx as an example to demonstrate the effectiveness of state-of-the-art C-to-FPGA
synthesis solutions targeting multiple application domains. Complex industrial
designs targeting Xilinx FPGAs are also presented as case studies, including
comparison of HLS solutions versus optimized manual designs. In particular, the
experiment on a sphere decoder shows that the HLS solution can achieve an
11-31% reduction in FPGA resource usage with improved design productivity
compared to hand-coded design.

SPECIAL ISSUE ON THE ACM/IEEE INTERNATIONAL SYMPOSIUM ON NETWORKS ON CHIP 2010

GUEST EDITORIAL

Benini, L.  Carloni, L. P.
Guest Editorial: Special Section on the ACM/IEEE Symposium on
Networks-on-Chip 2010
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737845

SPECIAL SECTION PAPERS
Horak, M. N.  Nowick, S. M.  Carlberg, M.  Vishkin, U.
A Low-Overhead Asynchronous Interconnection Network for GALS Chip
Multiprocessors
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737866

A new asynchronous interconnection network is introduced for
globally-asynchronous locally-synchronous (GALS) chip multiprocessors. The
network eliminates the need for global clock distribution, and can interface
multiple synchronous timing domains operating at unrelated clock rates. In
particular, two new highly-concurrent asynchronous components are introduced
which provide simple routing and arbitration/merge functions. Post-layout
simulations in identical commercial 90 nm technology indicate that comparable
recent synchronous router nodes have 5.6-10.7x more energy per packet
and 2.8-6.4x greater area than the new asynchronous nodes. Under random
traffic, the network provides significantly lower latency and identical
throughput over the entire operating range of the 800 MHz network and through
mid-range traffic rates for the 1.36 GHz network, but with degradation at
higher traffic rates. Preliminary evaluations are also presented for a
mixed-timing (GALS) network in a shared-memory parallel architecture, running
both random traffic and parallel benchmark kernels, as well as directions for
further improvement.


Bogdan, P.  Marculescu, R.
Non-Stationary Traffic Analysis and Its Implications on Multicore Platform
Design
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737846

Networks-on-chip (NoCs) have been proposed as a viable solution to the
communication problem in multicore systems. In this new setup, mapping multiple
applications on available computational resources leads to interaction and
contention at various network resources. Consequently, taking into account the
traffic characteristics becomes of crucial importance for performance analysis
and optimization of the communication infrastructure, as well as proper
resource management. Although queuing-based approaches have been traditionally
used for performance analysis purposes, they cannot properly account for many
of the traffic characteristics (e.g., non-stationarity, self-similarity) that
are crucial for multicore platform design. To overcome these limitations, we
propose a statistical physics inspired approach to capture the traffic dynamics
in multicore systems. As shown later in this paper, this is of fundamental
significance for re-thinking the very basis of multicore systems design; it
also opens up new research directions into NoC optimization which require
accurate models of time-dependent and space-dependent traffic behavior.


Matsutani, H.  Koibuchi, M.  Ikebuchi, D.  Usami, K.  Nakamura, H.  Amano, H.
Performance, Area, and Power Evaluations of Ultrafine-Grained Run-Time
Power-Gating Routers for CMPs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737865

This paper proposes the ultrafine-grained run-time power gating of on-chip
routers, in which the power supply to each router component (e.g.,
virtual-channel buffer, virtual-channel multiplexer, and crossbar multiplexer
and output latch) can be individually controlled based on the applied workload.
Since only the router components that are transferring a packet are activated,
the leakage power of the on-chip network can be reduced to a near-optimal
level. However, such techniques inherently increase the communication latency
and degrade the application performance, since a certain amount of wakeup
latency is required to activate the sleeping components. To mitigate this
wakeup latency, an early wakeup method that can preliminarily detect the next
packet arrival and activate the corresponding components is essential. We
designed and implemented an ultrafine-grained power-gating router using a
commercial 65 nm process. We propose four early wakeup methods and combine them
with the power-gating router. The proposed router with the early wakeup methods
is evaluated in terms of its application performance, area overhead, and
leakage power reduction taking into account the on/off energy overhead. The
simulation results showed that it reduces the leakage power by 54.4-59.9% on
average even when the application programs are fully running, at the expense of
a 4.6% area overhead and a 0.7-3.7% performance overhead, assuming 1 GHz
operation.

Rodrigo, S.  Flich, J.  Roca, A.  Medardoni, S.  Bertozzi, D.  Camacho, J.
Silla, F.  Duato, J.
Cost-Efficient On-Chip Routing Implementations for CMP and MPSoC Systems
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737867

The high-performance computing domain is being enriched by the inclusion of
networks-on-chip (NoCs) as a key component of many-core (CMPs or MPSoCs)
architectures. NoCs face the communication scalability challenge while meeting
tight power, area, and latency constraints. Designers must address new
challenges that were not present before. Defective components, the enhancement
of application-level parallelism, or power-aware techniques may break topology
regularity; thus, efficient routing becomes a challenge. This paper presents
universal logic-based distributed routing (uLBDR), an efficient logic-based
mechanism that adapts to any irregular topology derived from 2-D meshes,
instead of using routing tables. uLBDR requires a small set of configuration
bits, thus being more practical than large routing tables implemented in
memories. Several implementations of uLBDR are presented highlighting the
tradeoff between routing cost and coverage. The alternatives span from the
previously proposed LBDR approach (with 30% coverage) to the uLBDR mechanism
achieving full coverage. This comes with a small performance cost, thus
exhibiting the tradeoff between fault tolerance and performance. Power
consumption, area, and delay estimates are also provided highlighting the
efficiency of the mechanism. To do this, different router models (one for CMPs
and one for MPSoCs) have been designed as a proof of concept.

Ramanujam, R. S.  Soteriou, V.  Lin, B.  Peh, L.-S.
Extending the Effective Throughput of NoCs With Distributed Shared-Buffer
Routers
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737868

Router microarchitecture plays a central role in the performance of
networks-on-chip (NoCs). Buffers are needed in routers to house incoming flits
that cannot be immediately forwarded due to contention. This buffering can be
done at the inputs or the outputs of a router, corresponding to an
input-buffered router (IBR) or an output-buffered router (OBR). OBRs are
attractive because they can sustain higher throughputs and have lower queuing
delays under high loads than IBRs. However, a direct implementation of an OBR
requires a router speedup equal to the number of ports, making such a design
prohibitive under the aggressive clocking needs and limited power budgets of most
NoC applications. In this paper, a new router design based on a distributed
shared-buffer (DSB) architecture is proposed that aims to practically emulate
an OBR. The proposed architecture introduces innovations to address the unique
constraints of NoCs, including efficient pipelining and novel flow control.
Practical DSB configurations are also presented with reduced power overheads
while exhibiting negligible performance degradation. Compared to a
state-of-the-art pipelined IBR, the proposed DSB router achieves up to 19%
higher throughput on synthetic traffic and reduces packet latency on average by
61% when running SPLASH-2 benchmarks with high contention. On average, the
saturation throughput of DSB routers is within 7% of the theoretically ideal
saturation throughput under the synthetic workloads evaluated.


REGULAR PAPERS

MODELING AND SIMULATION

Feng, Z.  Zhao, X.  Zeng, Z.
Robust Parallel Preconditioned Power Grid Simulation on GPU With Adaptive
Runtime Performance Modeling and Optimization
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737849

Leveraging the power of current-day graphics processing units (GPUs) for robust
power grid simulation remains a challenging task. Existing preconditioned
iterative methods that require incomplete matrix factorizations cannot be
effectively accelerated on a GPU due to its limited hardware resources and
data-parallel computing model. This paper presents an efficient
GPU-based multigrid preconditioning algorithm for robust power grid analysis.
By combining the fast geometric multigrid solver with the robust
Krylov-subspace iterative solver, power grid DC and transient analysis can be
performed efficiently on GPU without loss of accuracy (largest errors < 0.5 mV).
Unlike previous GPU-based algorithms that rely on good
power grid regularities, the proposed algorithm can be applied to more general
power grid structures. Additionally, we propose an accuracy-aware GPU
performance modeling and optimization framework to automatically obtain the
best power grid simulation configurations. Experimental results show that the
DC and transient analysis on GPU can achieve more than 25x speedups over the
best available CPU-based solvers.


Li, X.-C.  Mao, J.-F.  Swaminathan, M.
Transient Analysis of CMOS-Gate-Driven RLGC Interconnects Based on FDTD
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737852

As the feature size of integrated circuits shrinks in deep submicron
technologies, the time delay and crosstalk noise of complementary
metal-oxide-semiconductor (CMOS)-gate-driven interconnects become critical
issues. Traditionally, the CMOS driver is simplified as a linear circuit in which a
constant resistance is used to approximate the nonlinear and time-varying MOS
resistance, which is inaccurate for signal integrity analysis in high-speed
interconnect systems. This paper proposes a finite-difference time-domain
(FDTD)-based method for transient analysis of lossy transmission lines in the
presence of the nonlinear behavior of CMOS gates. The conventional FDTD with
second-order accuracy is used for interconnect analysis and the parameters with
frequency-dependent losses are also included. The nonlinear behavior of CMOS
gates is represented by the alpha-power law model, with the drain current
described by a piecewise linear function of the drain voltage and discretized in
the time domain for the FDTD implementation. Explicit forms of the boundary
conditions are derived from the implicit interface equations and hence the
stability is strictly constrained by the Courant condition. Experimental results
show that the proposed method has good accuracy and high efficiency with respect
to HSPICE. Therefore, it is useful for accurate prediction of time delay and
crosstalk noise in high-speed interconnect systems.
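
As a rough illustration of the Courant condition mentioned in the abstract
above, the following sketch computes the largest stable time step for a 1-D
leapfrog FDTD discretization of a lossless transmission line. It assumes the
textbook limit dt <= dx / v with v = 1 / sqrt(L * C); the function name and
the per-unit-length values are hypothetical and are not taken from the paper.

```python
import math


def courant_max_dt(dx, l_per_m, c_per_m):
    """Largest stable time step for 1-D leapfrog FDTD on a lossless line.

    Assumes the textbook Courant limit dt <= dx / v, where the propagation
    velocity is v = 1 / sqrt(L * C) for per-unit-length inductance L (H/m)
    and capacitance C (F/m).
    """
    if dx <= 0 or l_per_m <= 0 or c_per_m <= 0:
        raise ValueError("dx, L and C must all be positive")
    velocity = 1.0 / math.sqrt(l_per_m * c_per_m)
    return dx / velocity


# Hypothetical line parameters, chosen only for illustration.
dt_max = courant_max_dt(dx=1e-4, l_per_m=2.5e-7, c_per_m=1e-10)
print(f"largest stable time step: {dt_max:.3e} s")
```

Any time step above this bound would make the explicit update unstable, which
is why explicit boundary conditions tie the stability of the whole scheme to
the same limit.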


SYSTEM-LEVEL DESIGN


Hu, J.  Tseng, W.-C.  Xue, C. J.  Zhuge, Q.  Zhao, Y.  Sha, E. H.-M.
Write Activity Minimization for Nonvolatile Main Memory Via Scheduling and
Recomputation
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737850

Nonvolatile memories such as Flash memory, phase change memory (PCM), and
magnetic random access memory (MRAM) have many desirable characteristics for
embedded systems to employ them as main memory. However, there are two common
challenges we need to answer before we can apply nonvolatile memory as main
memory practically. First, nonvolatile memory has limited write/erase cycles
compared to DRAM. Second, a write operation is slower than a read operation on
nonvolatile memory. These two challenges can be answered by reducing the number
of write activities on nonvolatile main memory. In this paper, we propose two
optimization techniques, write-aware scheduling and recomputation, to minimize
write activities on nonvolatile memory. With the proposed techniques, we can
both speed up the completion time of programs and extend nonvolatile memory's
lifetime. The experimental results show that the proposed techniques can reduce
the number of write activities on nonvolatile memory by 55.71% on average.
Thus, the lifetime of nonvolatile memory is extended to 2.5 times as long as
before on average. The completion time of programs can be reduced by 56.67% on
systems with NOR Flash memory and by 47.63% on systems with NAND Flash memory
on average.
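
The link between fewer writes and a longer lifetime can be sketched with simple
arithmetic. The snippet below assumes, as a simplification not stated in the
abstract, that lifetime is inversely proportional to the number of writes;
`lifetime_factor` is a hypothetical helper. The reported average of 2.5 times
need not match the value this gives for the average reduction, since an
average of per-benchmark ratios generally differs from the ratio of averages.

```python
def lifetime_factor(write_reduction):
    """Lifetime multiplier for a given fractional reduction in writes.

    Simplifying assumption: lifetime is inversely proportional to the
    number of writes, so removing a fraction r of writes scales the
    lifetime by 1 / (1 - r).
    """
    if not 0.0 <= write_reduction < 1.0:
        raise ValueError("write_reduction must be in [0, 1)")
    return 1.0 / (1.0 - write_reduction)


# Average write reduction quoted in the abstract.
factor = lifetime_factor(0.5571)
print(f"lifetime factor under the proportionality assumption: {factor:.2f}")
```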

Chiang, M.-C.  Yeh, T.-C.  Tseng, G.-F.
A QEMU and SystemC-Based Cycle-Accurate ISS for Performance Estimation on SoC
Development
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737847

In this paper, we present a fast cycle-accurate instruction set simulator
(CA-ISS) for system-on-chip development based on QEMU and SystemC. Even though
most state-of-the-art commercial tools have tried very hard to provide all the
levels of details to satisfy the different requirements of the software
designer, the hardware designer, and even the system architect, the
hardware/software co-simulation is extremely slow when co-simulating
the hardware models at the register-transfer level (RTL) with a full-fledged
operating system (OS). Our experimental results show that the combination of
QEMU and SystemC can make the co-simulation at the CA level much faster than
the conventional RTL simulation, even with a full-fledged operating system up
and running. Furthermore, the statistics indicate that with every instruction
executed and every memory accessed since power-on traced at the CA level, it
takes 28m15.804s on average to boot up a full-fledged Linux kernel, even on a
personal computer. Compared to the kernel boot time reported by Xilinx and
SiCortex, the proposed CA-ISS is about 6.09 times faster compared to "SystemC
without trace" of Xilinx and about 30.32 times faster compared to "SystemC
models converted from RTL" of SiCortex. The main contributions of this paper
are threefold: 1) a hardware/software co-simulation environment capable of
running a full-fledged OS at the early stage of the electronic system level
design flow at an acceptable simulation speed is proposed; 2) a virtual
platform constructed using the proposed CA-ISS as the processor model can be
used to estimate the performance of a target system from a system perspective,
which previous works, such as QEMU-SystemC, do not provide; and 3) such
a virtual platform also provides the modeling capability from the transaction
level down to the CA level or the other way around.
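
To put the quoted speedups in absolute terms, the sketch below converts the
boot time from the abstract into seconds and derives the reference boot times
it implies. It assumes "N times faster" means the reference boot time is N
times the CA-ISS boot time; the derived values are not reported in the
abstract, and the helper name is hypothetical.

```python
def parse_minutes_seconds(text):
    """Convert a string such as '28m15.804s' into seconds."""
    minutes, rest = text.split("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))


# Boot time and speedup factors quoted in the abstract.
boot_s = parse_minutes_seconds("28m15.804s")
implied_xilinx_s = boot_s * 6.09     # "SystemC without trace"
implied_sicortex_s = boot_s * 30.32  # "SystemC models converted from RTL"

for label, seconds in [
    ("CA-ISS", boot_s),
    ("implied Xilinx reference", implied_xilinx_s),
    ("implied SiCortex reference", implied_sicortex_s),
]:
    print(f"{label}: {seconds:.1f} s ({seconds / 3600:.2f} h)")
```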

Lee, J.  Shrivastava, A.
Static Analysis of Register File Vulnerability
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737851

With continuous technology scaling, soft errors are becoming an increasingly
important design concern even for earth-bound applications. While compiler
approaches have the potential to mitigate the effect of soft errors with
minimal runtime overheads, static vulnerability estimation - an essential part
of compiler approaches - is lacking due to its inherent complexity. This paper
presents a static analysis approach for register file (RF) vulnerability
estimation. We decompose the vulnerability of a register into intrinsic and
conditional basic-block vulnerabilities. This decomposition allows us to
develop a fast, yet reasonably accurate RF vulnerability estimation mechanism.
We validate and compare a linear equation based method and an iterative method.
We also demonstrate a practical application of RF vulnerability estimation to
compiler optimizations. Our experimental results on benchmarks from the MiBench
suite indicate not only that our static RF vulnerability estimation is fast and
accurate, but also that compiler optimizations enabled by our static estimation
can achieve very cost-effective protection of register files against soft
errors.


VERIFICATION

Little, S.  Walter, D.  Myers, C.  Thacker, R.  Batchu, S.  Yoneda, T.
Verification of Analog/Mixed-Signal Circuits Using Labeled Hybrid Petri Nets
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737853

Mixed-signal designs integrate digital and analog circuits, which complicates
the already difficult verification problem. This paper presents a model,
labeled hybrid Petri nets (LHPNs), that is developed to model this
heterogeneous set of components. To support formal verification, this paper
presents an efficient zone-based state space exploration algorithm for LHPNs.
This algorithm uses a process known as warping, which allows zones to describe
continuous variables changing at variable rates. Finally, this paper describes
the application of this algorithm to analog/mixed-signal circuit examples.


SHORT PAPER

Mele, S.  Favalli, M.
A SAT Based Test Generation Method for Delay Fault Testing of Macro Based
Circuits
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737848

This letter addresses the problem of delay fault test generation in circuits
using macros whose implementation is not known. The proposed approach uses a
new signal representation that allows us to evaluate any kind of sensitization
conditions (robust, non-robust, and functional) by means of Boolean
differential calculus. Such an approach makes use of binary decision diagrams
to support the computation of sensitization conditions for each macro along a
path and of Boolean satisfiability to justify such conditions at primary
inputs. Results are shown for a set of benchmarks.
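
The idea of a sensitization condition expressed through Boolean differential
calculus can be illustrated with a small sketch. The paper uses binary
decision diagrams and Boolean satisfiability; the snippet below instead
enumerates a truth table by brute force, purely to show the Boolean difference
df/dx_i = f(x_i=0) XOR f(x_i=1) for a single static input. The function names
and the example macro are hypothetical.

```python
from itertools import product


def sensitizing_assignments(f, n_inputs, i):
    """Side-input assignments under which toggling input i changes f.

    These are the assignments where the Boolean difference of f with
    respect to input i evaluates to 1. Brute-force enumeration is used
    here for illustration only.
    """
    result = []
    for bits in product((0, 1), repeat=n_inputs - 1):
        low = bits[:i] + (0,) + bits[i:]
        high = bits[:i] + (1,) + bits[i:]
        if f(*low) != f(*high):
            result.append(bits)
    return result


def and_or(a, b, c):
    """Hypothetical macro: (a AND b) OR c."""
    return (a & b) | c


# Values of (b, c) that let a change on input a propagate to the output.
side_inputs = sensitizing_assignments(and_or, 3, 0)
print(side_inputs)
```

For a macro whose implementation is unknown, only its function is available,
which is why a function-level condition such as the Boolean difference is a
natural fit.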