TCAD Newsletter - October 2010 Issue
Placing you one click away from the best new CAD research!


SPECIAL SECTION ON THE ACM-IEEE INTERNATIONAL CONFERENCE ON FORMAL METHODS
AND MODELS FOR CODESIGN (MEMOCODE) 2009

Chung, E. S.; Hoe, J. C., "High-Level Design and Validation of the BlueSPARC Multithreaded Processor"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580230&isn...

Abstract: This paper presents our experiences in using high-level methods to
design and validate a 16-way multithreaded microprocessor called BlueSPARC.
BlueSPARC is an in-order, high-throughput processor supporting complex features
such as privileged-mode operations, memory management, and a nonblocking cache
subsystem. Using a high-level design language called Bluespec System Verilog
(BSV), our final implementation achieves comparable synthesis quality to a
similar commercial microprocessor developed using conventional register
transfer level flows, and is capable of running unmodified commercial
applications while hosted on a Xilinx XCV2P70 field-programmable gate array
(FPGA) at 90 MHz. To validate our implementation, an FPGA-accelerated approach
was developed to efficiently check the correct execution of real,
nondeterministic multithreaded programs running on the BlueSPARC processor.
Together, the high-level language features of BSV along with our validation
approach enabled us to achieve a working FPGA-based implementation in less than
one man-year.


Vasudevan, N.; Edwards, S. A., "Buffer Sharing in Rendezvous Programs"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580213&isn...

Abstract: Most compilers focus on optimizing performance, often at the expense
of memory, but efficient memory use can be just as important in constrained
environments such as embedded systems. This paper presents a memory reduction
technique for rendezvous communication, which is applied to the deterministic
concurrent programming language SHIM. It focuses on reducing memory consumption
by sharing communication buffers among tasks. It determines pairs of buffers
that can never be in use simultaneously and use a shared region of memory for
each pair. The technique produces a static abstraction of a SHIM program's
dynamic behavior, which is then analyzed to find buffers that are never
occupied simultaneously. Experiments show the technique runs quickly on
modest-sized programs and can sometimes reduce memory requirements by half.


Briand, X.; Jeannet, B., "Combining Control and Data Abstraction in the Verification of Hybrid Systems"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580228&isn...

Abstract: This paper addresses the verification of hybrid systems built as the
composition of a discrete software controller interacting with a physical
environment exhibiting a continuous behavior. The goal is to attack the problem
of the combinatorial explosion of discrete states that may happen if a complex
software controller is considered. It proposes as a solution to extend an
existing abstract interpretation technique, namely dynamic partitioning, to
hybrid systems described in a symbolic formalism. Dynamic partitioning allows
us finely tune the tradeoff between precision and efficiency in a reachability
analysis. It shows the effectiveness of the approach by a case study that
combines a nontrivial controller specified in the synchronous dataflow
programming language Lustre with its physical environment.


Bohm, P., "Incremental and Verified Modeling of the PCI Express Protocol"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580232&isn...

Abstract: PCI Express is a modern, high-performance communication protocol
implementing sophisticated features to meet today's performance demands.
Although an off-chip protocol, PCI Express implements many principles of future
on-chip communication architectures. It is a highly complex protocol that is
hard to verify formally. We present the application of a new approach to the
PCI Express transaction and data link layers. The methodology is based on a
series of model transformation steps and revises the traditional modeling and
verification workflow for designing on-chip protocols. Major parts of PCI
Express, including performance-related optimizations and fault-tolerance
features, are modeled incrementally to control the complexity and composed to a
single model. The work has been accomplished in the Isabelle/HOL theorem
prover. By restricting the models to an executable subset of the specification
language, we have been able to combine the advantages of specifying in a
theorem prover with the advantages of executable models in a functional
programming language.


FPGAS AND RECONFIGURABLE COMPUTING

Lin, M.; Wawrzynek, J.; Gamal, A. E., "Exploring FPGA Routing Architecture Stochastically"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580223&isn...

Abstract: This paper proposes a systematic strategy to efficiently explore the
design space of field-programmable gate array (FPGA) routing architectures. The
key idea is to use stochastic methods to quickly locate near-optimal solutions
in designing FPGA routing architectures without exhaustively enumerating all
design points. The main objective of this paper is not as much about the
specific numerical results obtained, as it is to show the applicability and
effectiveness of the proposed optimization approach. To demonstrate the utility
of the proposed stochastic approach, we developed the tool for optimizing
routing architecture (TORCH) software based on the versatile place and route
tool. Given FPGA architecture parameters and a set of benchmark designs, TORCH
simultaneously optimizes the routing channel segmentation and switch box
patterns using the performance metric of average interconnect power-delay
product estimated from placed and routed benchmark designs. Special
techniques--such as incremental routing, infrequent placement, multi-modal move
selection, and parallelized metric evaluation - are developed to reduce the
overall run time and improve the quality of results. Our experimental results
have shown that the stochastic design strategy is quite effective in
co-optimizing both routing channel segmentation and switch patterns. With the
optimized routing architecture, relative to the performance of our chosen
architecture baseline, TORCH can achieve average improvements of 24% and 15% in
delay and power consumption for the 20 largest Microelectronics Center of North
Carolina benchmark designs, and 27% and 21% for the eight benchmark designs
synthesized with the Altera Quartus II University Interface Program tool.
Additionally, we found that the average segment length in an FPGA routing
channel should decrease with technology scaling. Finally, we demonstrate the
versatility of TORCH by illustrating how TORCH can be used to optimize other
aspects of the routing architecture in an FPGA.


Regular Papers
==============

MODELING AND SIMULATION

Silva, J. M. S.; Phillips, J. R.; Silveira, L. M., "Efficient Simulation of Power Grids"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580215&isn...

Abstract: Modern deep sub-micron ultra-large scale integration designs with
hundreds of millions of devices require huge grids for power distribution. Such
grids, operating with decreasing power supply voltages, are a design limiting
factor and accurate analysis of their behavior is of paramount importance as
any voltage drops can seriously impact performance or functionality. As power
grid models have millions of unknowns, highly optimized special-purpose
simulation tools are required to handle the time and memory complexity of
solving for their dynamic behavior. In this paper, we propose a hierarchical
matrix representation of the power grid model that is both space and time
efficient. With this representation, reduced storage matrix factors are
efficiently computed and applied in the analysis at every time-step of the
simulation. Results show an almost linear complexity growth, namely O(n
log^{a}(n)), for some small constant "a", in both space and time, when using
this matrix representation. Comparisons of our academic implementation with
production-quality code prove this method to be very efficient when dealing
with the simulation of large power grid models.


Chakraborty, A.; Shi, S. X.; Pan, D. Z., "Stress Aware Layout Optimization Leveraging Active Area Dependent Mobility Enhancement"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580229&isn...

Abstract: Starting from the 90 nm technology node, process induced stress has
played a key role in the design of high-performance devices. The emergence of
source/drain silicon germanium (S/D SiGe) technique as the most important
stressing mechanism for p-channel metal-oxide-semiconductor field-effect
transistor devices has opened up various optimization possibilities at circuit
and physical design stage. In this paper, we exploit the active area dependence
of the performance improvement achievable using S/D SiGe technology for late
stage engineering change order (ECO) timing optimization. An active area sizing
aware cell-level delay model is derived which forms the basis of linear program
based optimization of a design for achieving maximum performance or target
performance under a timing budget. To control the magnitude of layout
perturbation and ensure predictable timing improvement, a set of physical
constraints for active area sizing is proposed. Further, an efficient minimum
movement legalization algorithm is proposed to remove the overlaps caused by
active area sizing of timing critical cells. Results on a wide variety of
benchmarks show consistent reduction in the cycle time by up to 6.3%.
Predictability of the performance improvement achievable as well as resultant
minuscule layout changes make our technique very attractive for late stage ECO
optimization and design closure.

PHYSICAL DESIGN

Lu, Y.; Zhou, H.; Shang, L.; Zeng, X., "Multicore Parallelization of Min-Cost Flow for CAD Applications"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580222&isn...

Abstract: Computational complexity has been the primary challenge of many very
large scale integration computer-aided design (CAD) applications. The emerging
multicore and many-core microprocessors have the potential to offer scalable
performance improvements. How to explore the multicore resources to speed up
CAD applications is thus a natural question but also a huge challenge for CAD
researchers. This paper proposes a methodology to explore concurrency via
nondeterministic transactional models, and to program them on multicore
processors for CAD applications. Various run-time scheduling implementations on
multicore shared-memory machines are discussed and the most efficient one is
identified. The proposed methodology is applied to the min-cost flow problem
which has been identified as the key problem in many design optimizations, from
wire-length optimization in detailed placement to timing-constrained voltage
assignment. A concurrent algorithm for min-cost flow has been developed based
on the methodology Experiments on voltage island generation in floorplanning
have demonstrated its efficiency and scalable speedup over different numbers of
cores.


SYSTEM-LEVEL DESIGN

Arjomand, M.; Sarbazi-Azad, H., "Power-Performance Analysis of Networks-on-Chip With Arbitrary Buffer Allocation Schemes"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580217&isn...

Abstract: End-to-end delay, throughput, energy consumption, and silicon area
are the most important design metrics of networks-on-chip (NoCs). Although
several analytical models have been previously proposed for predicting such
metrics in NoCs, very few of them consider the effect of message waiting time
in the buffers of network routers for predicting overall power consumptions and
none of them consider structural heterogeneity of network routers. This paper
introduces two inter-related analytical models to compute message latency and
power consumption of NoCs with arbitrary topology, buffering structure, and
routing algorithm. Buffer allocation scheme defines the buffering space for
each individual channel of the NoC that can be homogenous (all channels having
similar buffer structures) or heterogeneous (each channel having its own buffer
structure). Here, the buffer allocation scheme can be either homogenous or
heterogeneous. We assume no bandwidth sharing of virtual channels for a
physical channel, and IP cores generate messages following a Poisson
distribution. The results obtained from simulation experiments confirm that the
proposed models exhibit acceptable accuracy for different network
configurations operating under various working conditions. We have shown that
basing our analysis on a Poisson traffic model is still useful for scenarios
with real application workloads.


Jang, W.; Pan, D. Z., "An SDRAM-Aware Router for Networks-on-Chip"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580227&isn...

Abstract: Networks-on-chip (NoCs) may interface with lots of synchronous
dynamic random access memories (SDRAM) to provide enough memory bandwidth and
guaranteed quality-of-service for future systems-on-chip (SoCs). SDRAM is
commonly controlled by a memory subsystem that schedules memory requests to
improve memory efficiency and latency. However, a memory subsystem is still a
performance bottleneck in the entire NoC. Therefore, memory-aware NoC
optimization has attracted considerable attention. This paper presents a NoC
router with an explicit SDRAM-aware flow control. Based on priority-based
arbitration, our SDRAM-aware flow controller schedules memory requests to
prevent bank conflict, data contention, and short turn-around bank
interleaving. Moreover, our multi-stage scheduling scheme further improves
memory performance and saves NoC hardware costs. Experimental results show that
our cost-efficient SDRAM-aware NoC design significantly improves memory latency
and utilization compared to the conventional NoC design with no SDRAM-aware
router.


Sharifi, S.; Rosing, T. S., "Accurate Direct and Indirect On-Chip Temperature Sensing for Efficient Dynamic Thermal Management"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580218&isn...

Abstract: Dynamic thermal management techniques require accurate runtime
temperature information in order to operate effectively and efficiently. In
this paper, we propose two novel solutions for accurate sensing of on-chip
temperature. Our first technique is used at design time for sensor allocation
and placement to minimize the number of sensors while maintaining the desired
accuracy. The experimental results show that this technique can improve the
efficiency and accuracy of sensor allocation and placement compared to previous
work and can reduce the number of required thermal sensors by about 16% on
average. Secondly, we propose indirect temperature sensing to accurately
estimate the temperature at arbitrary locations on the die based on the noisy
temperature readings from a limited number of sensors which are located further
away from the locations of interest. Our runtime technique for temperature
estimation reduces the standard deviation and maximum value of temperature
estimation errors by an order of magnitude.


Yang, H.; Kim, S.; Ha, S., "An MILP-Based Performance Analysis Technique for Non-Preemptive Multitasking MPSoC"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580225&isn...

Abstract: For real-time applications, it is necessary to estimate the
worst-case performance early in the design process without actual hardware
implementation. While the non-preemptive task scheduling is pertinent to
multi-core platforms because of easy implementation and high performance, its
scheduling anomaly behavior makes the worst-case performance estimation
extremely difficult. In this paper, we propose an analysis technique based on
mixed integer linear programming (MILP) to estimate the worst-case performance
of each task in a non-preemptive multitask application on multi-processor
system-on-chip architecture. MILP provides a systematic way to describe the
complex interaction among task scheduling, communication architecture, and task
execution, which affects the worst-case behavior dynamically. The proposed
analysis technique overcomes several limitations that previous work usually
has; it allows multiple tasks with different periods and models contention on
the communication architecture. We show that the proposed analysis takes
affordable computation time to make it of practical value even though it has
exponential complexity in theory. The proposed technique estimates a safe bound
on task latency statistically, which is demonstrated by extensive random
simulations.


TEST

Miskov-Zivanov, N.; Marculescu, D., "Multiple Transient Faults in Combinational and Sequential Circuits: A Systematic Approach"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580219&isn...

Abstract: Transient faults in logic circuits are becoming an important
reliability concern for future technology nodes. Radiation-induced faults have
received significant attention in recent years, while multiple transients
originating from a single radiation hit are predicted to occur more often.
Furthermore, some effects, like reconvergent fanout-induced glitches, are more
pronounced in the case of multiple faults. Therefore, to guide the design
process and the choice of circuit optimization techniques, it is important to
model multiple faults and their propagation through logic circuits, while
evaluating the changes in error rates resulting from multiple simultaneous
faults. In this paper, we show how output error probabilities change with
increasing number of simultaneous faults and we also analyze the impact of
multiple errors in state flip-flops, during the cycles following the cycle when
fault(s) occurred. The results obtained using the proposed framework show that
output error probability resulting from multiple-event transient or
multiple-bit upsets can vary across different outputs and different circuits by
several orders of magnitude. The results also show that the impact of different
masking factors also varies across circuits and this information can be
valuable for customizing protection techniques.


Tseng, T.W.; Huang, Y.J.; Li, J.F., "DABISR: A Defect-Aware Built-In Self-Repair Scheme for Single/Multi-Port RAMs in SoCs"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580226&isn...

Abstract: Built-in self-repair (BISR) techniques are widely used to enhance the
yield of embedded random access memories (RAMs). Fault-location ability of test
algorithms executed by a BISR circuit has heavy impact on the repair efficiency
of the BISR circuit. This paper proposes a defect-aware BISR (DABISR) scheme
for single-port RAMs (SPRAMs) and multi-port RAMs (MPRAMs) in system chips.
Multiple RAMs can share a DABISR such that the area cost of DABISR is
drastically reduced. We also present two defect-location algorithms (DLAs) for
identification of bridge defects between word-lines and bit-lines of MPRAMs.
The DABISR can perform DLAs to locate bridge defects such that it can provide
high repair efficiency. For example, simulation results show that if a faulty
two-port RAM has 20% inter-port faults, the DLAs can help to gain 8.4-14.4%
increase of repair rate for different redundancy configurations. In comparison
with an existing shared BISR scheme, however, the DABISR only incurs about
0.34% additional area overhead to support the function of DLAs.


Short Papers
============

Tenentes, V.; Kavousianos, X.; Kalligeros, E., "Single and Variable-State-Skip LFSRs: Bridging the Gap Between Test Data Compression and Test Set Embedding for IP Cores"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580216&isn...

Abstract: Even though test set embedding (TSE) methods offer very high
compression efficiency, their excessively long test application times prohibit
their use for testing systems-on-chip (SoC). To alleviate this problem we
present two new types of linear feedback shift registers (LFSRs), the
Single-State-Skip and the Variable-State-Skip LFSRs. Both are normal LFSRs with
the addition of the State-Skip circuit, which is used instead of the
characteristic-polynomial feedback structure for performing successive jumps of
constant and variable length in their state sequence. By using
Single-State-Skip LFSRs for testing single or multiple identical cores and
Variable-State-Skip LFSRs for testing multiple non-identical cores we get the
well-known high compression efficiency of TSE with substantially reduced test
sequences, thus bridging the gap between test data compression and TSE methods.


Lo, C.Y.; Hsing, Y.T.; Denq, L.M.; Wu, C.W., "SOC Test Architecture and Method for 3-D ICs"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580221&isn...

Abstract: 3-D integration provides another way to put more devices in a smaller
footprint. However, it also introduces new challenges in testing. Flexible test
architecture named test access control system for 3-D integrated circuits
(TACS-3D) is proposed for 3-D integrated circuits (IC) testing. Integration of
heterogeneous design-for-testability methods for logic, memory, and
through-silicon via (TSV) testing further reduces the usage of test pins and
TSVs. To highly reuse pre-bond test circuits in post-bond test, an innovative
linking mechanism shares TSVs and test pins of the 3-D IC. No matter how many
layers are there in the 3-D IC, a large portion of TSVs and test pins is
reserved for data application. Therefore, smaller post-bond test time is
expected. A test chip composed of a network security processor platform is
taken as an example. Less than 0.4% test overhead increases in area and time
between 2-D and 3-D cases. Compared with the instinctively direct access,
TACS-3D reveals up to 54% test time improvement under the same TSV usage.


Moiseev, K.; Kolodny, A.; Wimer, S., "Interconnect Bundle Sizing Under Discrete Design Rules"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580220&isn...

Abstract: The lithography used for 32 nm and smaller very large scale
integrated process technologies restricts the admissible interconnect widths
and spaces to a small set of discrete values with some interdependencies, so
that traditional interconnect sizing by continuous-variable optimization
techniques becomes impossible. We present a dynamic programming (DP) algorithm
for simultaneous sizing and spacing of all wires in interconnect bundles,
yielding the optimal power-delay tradeoff curve. DP algorithm sets the width
and spacing of all interconnects simultaneously, thus finding the global
optimum. The DP algorithm is generic and can handle a variety of power-delay
objectives, such as total power or delay, or weighted sum of both, power-delay
product, max delay, and alike. The algorithm consistently yields 6% dynamic
power and 5% delay reduction for interconnect channels in industrial
microprocessor blocks designed in 32 nm process technology, when applied as a
post-layout optimization step to redistribute wires within interconnect
channels of fixed width, without changing the area of the original layout.


Liu, F.; Song, X.; Tan, Q.; Chen, G., "Formal Analysis of End-Around-Carry Adder in Floating-Point Unit"

URL:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580224&isn...

Abstract: End-around-carry (EAC) adder is extensively used in microprocessor's
floating-point units. This paper presents an algebraic characterization of an
EAC adder's algorithm. A novel symbolic analysis approach is presented to prove
the EAC adder's correctness. Algebraic structures and first-order recursive
equations are harnessed in proof derivations. A hybrid prefix/EAC architecture
is considered.