TCAD Newsletter - March 2010 Issue
Placing you one click away from the best new CAD research!

Regular Papers
==============

Zheng, H.; "Compositional Reachability Analysis for Efficient Modular Verification of Asynchronous Designs"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419238&isnumber=5419222
Abstract: Compositional verification is essential to address state explosion in model checking. Traditionally, an over-approximate context is needed for each individual component in a system for sound verification. This may cause state explosion for the intermediate results as well as inefficiency for abstraction refinement. This paper presents an opposite approach, a compositional reachability method, which constructs the state space of each component from an under-approximate context gradually until a counter-example is found or a fixpoint in state space is reached. This method has an additional advantage in that counter-examples, if there are any, can be found much earlier, thus leading to faster verification. Furthermore, this modular verification framework does not require complex compositional reasoning rules. The experimental results indicate that this method is promising.

Verma, A. K.; Brisk, P.; Ienne, P.; "Fast, Nearly Optimal ISE Identification With I/O Serialization Through Maximal Clique Enumeration"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419242&isnumber=5419222
Abstract: The last decade has witnessed the emergence of the application-specific instruction-set processor (ASIP) as a viable platform for embedded systems. Extensible ASIPs allow the user to augment a base processor with instruction set extensions (ISEs) that execute on dedicated hardware application-specific functional units (AFUs). Due to the limited number of read and write ports in the register file of the base processor, the size and complexity of AFUs are generally limited.
Recent papers have focused on overcoming these constraints by serializing access to the register file. Exhaustive ISE enumeration methods are not scalable and generally fail for larger applications and register files with a large number of read and write ports. To address this concern, a new approach to ISE identification is proposed. The approach presented in this paper significantly prunes the list of the best possible ISE candidates compared to previous approaches. Experimentally, we observe that the new approach produces optimal results on larger applications where prior approaches either fail or produce inferior results.

Lee, H.; Paik, S.; Shin, Y.; "Pulse Width Allocation and Clock Skew Scheduling: Optimizing Sequential Circuits Based on Pulsed Latches"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419229&isnumber=5419222
Abstract: Pulsed latches, latches driven by a brief clock pulse, offer the same convenience of timing verification and optimization as flip-flop-based circuits, while retaining the advantages of latches over flip-flops. But a pulsed latch that uses a single pulse width has a lower bound on its clock period, limiting its capacity to deal with higher frequencies or operate at lower $V_{dd}$. The limitation still exists even when clock skew scheduling is employed, since the amount of skew that can be assigned and realized is practically limited due to process variation. For the first time, we formulate the problem of allocating pulse widths, from a small set of predefined widths, and scheduling clock skews, within a predefined upper bound on skew, for optimizing pulsed latch-based sequential circuits. We then present an algorithm called PWCS_Optimize (pulse width allocation and clock skew scheduling, PWCS) to solve the problem. The allocated skews are realized through synthesis of local clock trees between pulse generators and latches, and a global clock tree between a clock source and pulse generators.
Experiments with 65-nm technology demonstrate that combining a small number of different pulse widths with clock skews of up to 10% of the clock period yields the minimum achievable clock period for many benchmark circuits. The results have an average figure of merit of 0.86, where 1.0 indicates the minimum clock period, and an average area reduction of 11%. The design flow including PWCS_Optimize, placement and routing, and synthesis of local and global clock trees is presented and assessed with example circuits.

Yan, J. Z.; Chu, C.; "DeFer: Deferred Decision Making Enabled Fixed-Outline Floorplanning Algorithm"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419234&isnumber=5419222
Abstract: In this paper, we present DeFer, a fast, high-quality, scalable, and nonstochastic fixed-outline floorplanning algorithm. DeFer generates a nonslicing floorplan by compacting a slicing floorplan. To find a good slicing floorplan, instead of searching through numerous slicing trees by simulated annealing as in traditional approaches, DeFer considers only one single slicing tree. However, we generalize the notion of slicing tree based on the principle of deferred decision making (DDM). When two subfloorplans are combined at each node of the generalized slicing tree, DeFer does not specify their orientations, the left-right/top-bottom order between them, or the slice line direction. DeFer does not even specify the slicing tree structure for small subfloorplans. In other words, we defer the decisions on these factors, which are specified arbitrarily at an early step in traditional approaches. Because of DDM, one slicing tree actually corresponds to a large number of slicing floorplan solutions, all of which are efficiently maintained in a single shape curve. With the final shape curve, it is straightforward to choose a good floorplan fitting into the fixed outline. Several techniques are also proposed to further optimize the wirelength.
For both fixed-outline and classical floorplanning problems, experimental results show that DeFer achieves the best success rate, the best wirelength, the best runtime, and the best area on average compared with all other state-of-the-art floorplanners.

Cabodi, G.; Garcia, L. A.; Murciano, M.; Nocco, S.; Quer, S.; "Partitioning Interpolant-Based Verification for Effective Unbounded Model Checking"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419228&isnumber=5419222
Abstract: Interpolant-based model checking has been shown to be effective on large verification instances, as it efficiently combines automated abstraction and reachability fixed-point checks. On the other hand, methods based on variable quantification have proved their ability to remove free inputs, thus projecting the search space over state variables. In this paper, we propose an integrated approach which combines the abstraction power of interpolation with techniques that rely on and-inverter graph (AIG) and/or binary decision diagram (BDD) representations of states, directly supporting variable quantification and fixed-point checks. The underlying idea of this combination is to adopt AIG- or BDD-based quantifications to limit and restrict the search space and the complexity of the interpolant-based approach. The exploited strategies, most of which are individually well known, are integrated with a new flavor, specifically designed to improve their effectiveness on difficult verification instances. Experimental results, specifically oriented to hard-to-solve verification problems, show the robustness of our approach.
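The reachability fixed-point check that both this paper and the Zheng paper above build on can be illustrated with a minimal sketch. This is not the authors' AIG/BDD machinery: plain Python sets stand in for symbolic state sets, the one-step `image` plays the role of relational product plus existential quantification, and the `reachable_fixpoint` name is invented for illustration.

```python
# Illustrative sketch only: explicit-state stand-in for the symbolic
# reachability fixed-point computation described in the abstracts above.

def image(states, transitions):
    """One-step image: all successors of `states`.
    Symbolically this is the relational product followed by existential
    quantification of current-state variables; sets make it explicit."""
    return {t for s in states for t in transitions.get(s, ())}

def reachable_fixpoint(init, transitions, bad):
    """Iterate images until no new states appear (fixpoint) or a bad
    state is reached (counterexample exists)."""
    reached = set(init)
    frontier = set(init)
    while frontier:
        if frontier & bad:
            return "counterexample", reached
        frontier = image(frontier, transitions) - reached
        reached |= frontier
    return "fixpoint", reached  # reached set is closed: property holds
```

For example, on a three-state ring `{0: [1], 1: [2], 2: [0]}` starting from state 0, the loop closes at the fixpoint `{0, 1, 2}` when the bad states are unreachable, and reports a counterexample as soon as the frontier touches a bad state.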
Chen, M.; Mishra, P.; "Functional Test Generation Using Efficient Property Clustering and Learning Techniques"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419232&isnumber=5419222
Abstract: Functional verification is one of the major bottlenecks in system-on-chip design due to the combined effects of increasing complexity and the lack of automated techniques for generating efficient tests. Several promising ideas using bounded model checking have been proposed over the years to efficiently generate counterexamples (tests). Existing research has used incremental satisfiability to improve counterexample generation for a single property by sharing knowledge across instances of the same property with incremental bounds. In this paper, we present a framework that can efficiently reduce the overall test generation time by exploiting the similarity among different properties. This paper makes two primary contributions: 1) it proposes novel methods to cluster similar properties; and 2) it develops efficient learning techniques that can significantly reduce the overall test generation time for the properties in a cluster by sharing knowledge across similar test generation instances. Our experimental results using both software and hardware benchmarks demonstrate that our approach can drastically reduce (on average three to five times) the overall test generation time compared to existing methods.

Kinsman, A. B.; Nicolici, N.; "Bit-Width Allocation for Hardware Accelerators for Scientific Computing Using SAT-Modulo Theory"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419231&isnumber=5419222
Abstract: This paper investigates the application of computational methods via Satisfiability Modulo Theory (SMT) to the bit-width allocation problem for finite-precision implementation of numerical calculations, specifically in the context of scientific computing where division frequently occurs.
In contrast to the large body of work targeted at the precision aspect of the problem, this paper addresses the range problem, where employing SMT leads to more accurate bound estimation than other analytical methods provide, in turn yielding smaller bit-widths, and hence a reduction in hardware cost and/or increased parallelism, while maintaining the robustness necessary for scientific applications.

Hashemi, M.; Ghiasi, S.; "Versatile Task Assignment for Heterogeneous Soft Dual-Processor Platforms"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419233&isnumber=5419222
Abstract: Heterogeneous soft multiprocessor systems are likely to find a larger share of the application-specific computing market due to increasing cost and defect rates in foreseeable manufacturing technologies. We study the problem of mapping streaming applications onto heterogeneous soft dual-processor systems, in which the processors' limited memory resources and application throughput form the outstanding constraint and objective, respectively. A key step in the compilation process is task assignment, where tasks are assigned to the processors. We develop a provably-effective algorithm for task assignment. Our algorithm is versatile, in that its formal properties hold for, and hence it is applicable to, a variety of platforms. Measurements of generated code size and throughput of emulated systems validate the effectiveness of our approach. We advance the state of the art by considerably outperforming two recent competitors in terms of both versatility and application throughput.

Palesi, M.; Kumar, S.; Catania, V.; "Leveraging Partially Faulty Links Usage for Enhancing Yield and Performance in Networks-on-Chip"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419230&isnumber=5419222
Abstract: The communication infrastructure of a complex multicore system-on-a-chip accounts for an increasing fraction of the overall chip area.
According to the International Technology Roadmap for Semiconductors, killer defect density does not decrease over successive technology generations. For this reason, the probability that a manufacturing defect affects the communication system is predicted to increase. In this paper, we deal with manufacturing defects which affect the links in a network-on-chip-based interconnection system. The goal of this paper is to show that, by using effective routing functions, supported by appropriate selection policies and with a limited amount of extra logic in the router, it is easy to exploit partially faulty links to improve the performance of the system. We show that partially faulty links, instead of being discarded, can be used at reduced capacity to improve the distribution of traffic over the network, yielding performance and power improvements. We couple an application-specific routing function with a set of selection policies which are aware of the link fault distribution, and evaluate them on both synthetic traffic and a real complex multimedia application. We also present an implementation of the router, augmented with the extra logic, to support both the proposed selection functions and the transmission of messages over partially faulty links. We analyze the router in terms of silicon area, timing, and power dissipation.

Mukherjee, N.; Pogiel, A.; Rajski, J.; Tyszer, J.; "High Volume Diagnosis in Memory BIST Based on Compressed Failure Data"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419237&isnumber=5419222
Abstract: Embedded memories are increasingly identified as having the potential for introducing new yield loss mechanisms at a rate, magnitude, and complexity large enough to demand major changes in fault diagnosis techniques. In particular, time-related or complex read faults that originate in the highest-density areas of semiconductor designs require new methods to diagnose more complex faults affecting large groups of memory cells.
This paper presents a built-in self-test (BIST)-based fault diagnosis scheme that can be used to identify a variety of failures in embedded random-access memory arrays. The proposed solution employs flexible test logic to record test responses at system speed with no interruption of a BIST session. It offers a simple test flow and enables detection of time-related faults. Furthermore, the way test responses are processed allows accurate and time-efficient reconstruction of error bitmaps. The proposed diagnostic algorithms use a number of techniques, including discrete logarithm-based counting, with ring generators acting as very fast event counters and signature analyzers. Experimental results confirm the high diagnostic accuracy of the proposed scheme and its time efficiency.

Pomeranz, I.; Reddy, S. M.; "TOV: Sequential Test Generation by Ordering of Test Vectors"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419236&isnumber=5419222
Abstract: We describe a new approach to test generation for stuck-at faults in synchronous sequential circuits. Under this approach, the input vectors comprising the test sequence are fixed in advance. The process of generating the test sequence consists of ordering the precomputed input vectors such that the resulting test sequence has as high a fault coverage as possible. The advantage of this approach is that its computational complexity is limited by restricting the search space to a given set of input vectors and a given test sequence length. We describe a specific implementation of this approach. Experimental results demonstrate that restricting the search space to a fixed number of precomputed input vectors is sufficient for achieving the highest known fault coverage, or a fault coverage close to it, for benchmark circuits.
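The ordering idea in TOV can be sketched with a toy, stateless fault model. This is not the paper's implementation: in a sequential circuit, which faults a vector detects depends on the circuit state and hence on the order itself, so the real method re-simulates at each step. Here the `order_vectors` and `detects` names are hypothetical and detection is a fixed set per vector, which reduces the ordering step to a greedy set cover.

```python
# Illustrative sketch only (hypothetical, stateless fault model): greedily
# order a fixed pool of precomputed input vectors so each position detects
# as many yet-undetected faults as possible.

def order_vectors(vectors, detects):
    """vectors: list of vector ids; detects: id -> set of faults detected.
    Returns the chosen order and any faults left undetected."""
    remaining = list(vectors)
    undetected = set().union(*(detects[v] for v in vectors))
    order = []
    while remaining and undetected:
        # pick the vector covering the most yet-undetected faults
        best = max(remaining, key=lambda v: len(detects[v] & undetected))
        if not detects[best] & undetected:
            break  # no remaining vector improves coverage
        order.append(best)
        remaining.remove(best)
        undetected -= detects[best]
    return order, undetected
```

Because the search space is a fixed pool of vectors and a bounded sequence length, the cost of each step is bounded by the pool size, which mirrors the complexity argument made in the abstract.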
Das, D.; Killpack, K.; Kashyap, C.; Jas, A.; Zhou, H.; "Pessimism Reduction in Coupling-Aware Static Timing Analysis Using Timing and Logic Filtering"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419240&isnumber=5419222
Abstract: With the continued scaling of technology into nanometer regimes, the impact of coupling-induced delay variations is significant. While several coupling-aware static timers have been proposed, the results are often pessimistic, with many false failures. We present an integrated, iterative timing-filtering and logic-filtering based approach to reduce pessimism. We use a realistic coupling model based on arrival times and slews, and show that non-iterative pessimism reduction algorithms proposed in previous research may give potentially non-conservative timing results. On a functional block from an industrial 65 nm microprocessor, our algorithm produced a maximum pessimism reduction of 11.18% of cycle time over converged timing-filtering analysis that does not consider logic constraints.

Karfa, C.; Sarkar, D.; Mandal, C.; "Verification of Datapath and Controller Generation Phase in High-Level Synthesis of Digital Circuits"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419243&isnumber=5419222
Abstract: A formal verification method for the datapath and controller generation phase of a high-level synthesis (HLS) process is described in this paper. The goal is achieved in two steps. In the first step, the datapath interconnection and the controller finite state machine description generated by a high-level synthesis process are analyzed to obtain the register-transfer operations executed in the datapath for a given control assertion pattern in each control step. In the second step, an equivalence checking method is deployed to establish equivalence between the input and output behaviors of this phase. A rewriting method has been developed for the first step.
Unlike many other reported techniques, our method is capable of validating pipelined and multicycle operations, if any, spanning several states. The correctness and complexity of the presented method have been treated formally. The method is implemented and integrated with an existing HLS tool, called the structured architecture synthesis tool. The experimental results on several HLS benchmarks indicate the effectiveness of the presented method.

Short Papers
============

Musoll, E.; "Hardware-Based Load Balancing for Massive Multicore Architectures Implementing Power Gating"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419241&isnumber=5419222
Abstract: Many-core architectures provide a computation platform with high execution throughput, enabling them to efficiently execute workloads with a significant degree of thread-level parallelism. The burst-like nature of these workloads allows large power savings by power gating the idle cores. In addition, the load balancing of threads to cores also impacts the power and thermal behavior of the processor. Processor implementations of many-core architectures may choose to group several cores into clusters sharing the area overhead, so that the whole cluster is power gated rather than the individual cores. However, the potential for power savings is reduced due to the coarser level of power gating. In this paper, several hardware-based stateless load-balancing schemes are evaluated for these clustered homogeneous multicore architectures in terms of their power and thermal behavior. All these methods can be unified into a parameterized technique that dynamically adjusts to obtain the desired goal (lower power, higher performance, or lower hotspot temperature).

Choi, S. H.; Kang, K.; Dartu, F.; Roy, K.; "Timed Input Pattern Generation for an Accurate Delay Calculation Under Multiple Input Switching"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419235&isnumber=5419222
Abstract: In multiple input switching (MIS) analysis, input signal alignment is one of the key factors that determines the quality and accuracy of the approach. In this paper, we propose a new signal alignment methodology for MIS analysis based on a transistor-level simulator at the core of the static timing analysis. Our proposed methodology searches through the possible input vectors in an efficient order to reduce the number of simulations, and finds a true worst-case signal alignment for both the MIN and the MAX analysis. In our 180 nm simulation setup, the worst-case delay is predicted within 0.5% error for more than 97% of test cases, performing an average of fewer than two simulations per logic gate.

Pomeranz, I.; Reddy, S. M.; "On Test Generation With Test Vector Improvement"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419239&isnumber=5419222
Abstract: We investigate the introduction of a new step, referred to as test vector improvement, into test generation processes. After a fully specified test vector or a partially specified test cube $t$ is generated at an arbitrary iteration of the test generation process, the test vector improvement step modifies $t$ so as to increase the number of yet-undetected target faults that $t$ detects. This is done in this paper using a simulation-based process. We show that even if $t$ was generated using dynamic test compaction heuristics, it is possible to improve $t$ further. When $t$ is partially specified to accommodate test data compression, the test vector improvement step does not change the number of unspecified bits of $t$. The final result is a smaller test set and/or a higher fault coverage (if the test generation process does not detect all the detectable faults).
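A simulation-based improvement step of this flavor can be sketched as a greedy bit-flipping loop. This is only an illustration under an assumed toy detection function: `detected_by` is hypothetical (a real flow would call a fault simulator), and per the abstract, unspecified bits of the test cube are left untouched so the count of unspecified bits, and hence test data compression, is unaffected.

```python
# Illustrative sketch only (toy detection function, assumed): flip one
# specified bit at a time and keep the flip whenever the vector then
# detects more yet-undetected faults.

def improve_vector(t, detected_by, undetected):
    """t: list of bits 0/1/None, where None marks an unspecified bit that
    must stay unspecified; detected_by(vector) -> set of detected faults."""
    best = list(t)
    best_gain = len(detected_by(best) & undetected)
    improved = True
    while improved:
        improved = False
        for i, bit in enumerate(best):
            if bit is None:
                continue  # never specify an unspecified bit
            trial = list(best)
            trial[i] = 1 - bit  # flip one specified bit
            gain = len(detected_by(trial) & undetected)
            if gain > best_gain:
                best, best_gain, improved = trial, gain, True
    return best
```

The loop terminates because the count of newly detected faults strictly increases with each accepted flip and is bounded by the size of the undetected-fault set.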