TCAD Newsletter - June 2010 Issue
Placing you one click away from the best new CAD research!

Editorial
==============


Special Section on the 2009 ACM/IEEE Symposium on Networks-on-Chip
==============


Modarressi, M.; Tavakkol, A.; Sarbazi-Azad, H., "Virtual Point-to-Point
Connections for NoCs"

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467328&isn...

Abstract: In this paper, we aim to improve the performance and power metrics of
packet-switched network-on-chips (NoCs) and benefits from the scalability and
resource utilization advantages of NoCs and superior communication performance
of point-to-point dedicated links. The proposed method sets up the virtual
point-to-point (VIP) connections over one virtual channel (which bypasses the
entire router pipeline) at each physical channel of the NoC. We present two
schemes for constructing such VIP circuits. In the first scheme, the circuits
are constructed for an application based on its task-graph at design time. The
second scheme addresses constructing the connections at run-time using a
light-weight setup network. It involves monitoring the NoC traffic in order to
detect heavy communication flows and setting up a VIP connection for them using
a run-time circuit construction mechanism. The proposed mechanism is compared
to a traditional packet-switched NoC and some modern switching mechanisms and
the results show a significant reduction in the network power and latency over
the other considered NoCs.


Concer, N.; Bononi, L.; Soulie, M.; Locatelli, R.; Carloni, L. P., "The
Connection-Then-Credit Flow Control Protocol for Heterogeneous Multicore
Systems-on-Chip"

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467323&isn...

Abstract: Connection-then-credits (CTC) is a novel end-to-end flow control
protocol to handle message-dependent deadlocks in best-effort networks-on-chip
(NoC) for embedded multicore systems-on-chip (SoCs). CTC is based on the
classic end-to-end credit-based flow control protocol but differs from it
because it uses a network interface microarchitecture where a single credit
counter and a single input data queue are shared among all possible
communications. This architectural simplification reduces the area occupation
of the network interfaces and increases their design reuse; for instance, the
same network interface can be used to connect a core independently of the
number of incoming and outgoing communications. CTC, however, requires a
handshake preamble to initialize the credit counter in the sender network
interface based on the buffering capacity of the receiver network interface.
While this necessarily introduces a latency overhead in the transfer of a
message, simulation-based experimental results show that the penalty in
performance is limited when large messages need to be transferred, thus, making
CTC a valid solution for particular classes of applications such as video
stream processing.


Kohler, A.; Schley, G.; Radetzki, M., "Fault Tolerant Network on Chip
Switching With Graceful Performance Degradation"

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467330&isn...

Abstract: The structural redundancy inherent to on-chip interconnection
networks [networks on chip (NoC)] can be exploited by adaptive routing
algorithms in order to provide connectivity even if network components are out
of service due to faults, which will appear at an increasing rate with future
chip technology nodes. This paper is based on a new, fine-grained functional
fault model and a corresponding distributed fault diagnosis method that
facilitate determining the fault status of individual NoC switches and their
adjacent communication links. Whereas previous work on network fault-tolerance
assume switches to be either available or fully out of service, we present a
novel adaptive routing algorithm that employs the remaining functionality of
partly defective switches. Using diagnostic information, transient faults are
handled with a retransmission scheme that avoids the latency penalty of
end-to-end repeat requests. Thereby, graceful degradation of NoC communication
performance can be achieved even under high failure rates.


Tran, A. T.; Truong, D. N.; Baas, B., "A Reconfigurable Source-Synchronous
On-Chip Network for GALS Many-Core Platforms"

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467331&isn...

Abstract: This paper presents a globally-asynchronous locally-synchronous
(GALS)-compatible circuit-switched on-chip network that is well suited for use
in many-core platforms targeting streaming digital signal processing and
embedded applications which typically have a high degree of task-level
parallelism among computational kernels. Inter-processor communication is
achieved through a simple yet effective reconfigurable source-synchronous
network. Interconnect paths between processors can sustain a peak throughput of
one word per cycle. A theoretical model is developed for analyzing the
performance of the network. A 65 nm complementary metal-oxide-semiconductor
GALS chip utilizing this network was fabricated which contains 164 programmable
processors, three accelerators and three shared memory modules. For evaluating
the efficiency of this platform, a complete 802.11a wireless local area network
baseband receiver was implemented. It has a real-time throughput of 54 Mb/s
with all processors running at 594 MHz and 0.95-V, and consumes an average of
174.8 mW with 12.2 mW (or 7.0%) dissipated by its interconnect links and
switches. With the chip's dual supply voltages set at 0.95-V and 0.75-V, and
individual processors' oscillators operating at workload-based optimal
frequencies, the receiver consumes 123.2 mW, which is a 29.5% reduction in
power. Measured power consumption values from the chip are within 2-5% of the
estimated values.

Regular Issue Papers
==============

Embedded Systems

Ferrandi, F.; Lanzi, P. L.; Pilato, C.; Sciuto, D.; Tumeo, A., "Ant Colony
Heuristic for Mapping and Scheduling Tasks and Communications on Heterogeneous Embedded Systems"

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467335&isn...

Abstract: To exploit the power of modern heterogeneous multiprocessor embedded
platforms on partitioned applications, the designer usually needs to
efficiently map and schedule all the tasks and the communications of the
application, respecting the constraints imposed by the target architecture.
Since the problem is heavily constrained, common methods used to explore such
design space usually fail, obtaining low-quality solutions. In this paper, we
propose an ant colony optimization (ACO) heuristic that, given a model of the
target architecture and the application, efficiently executes both scheduling
and mapping to optimize the application performance. We compare our approach
with several other heuristics, including simulated annealing, tabu search, and
genetic algorithms, on the performance to reach the optimum value and on the
potential to explore the design space. We show that our approach obtains better
results than other heuristics by at least 16% on average, despite an overhead
in execution time. Finally, we validate the approach by scheduling and mapping
a JPEG encoder on a realistic target architecture.


Modeling and Simulation

Villena, J. F.; Silveira, L. M., "SPARE - A Scalable Algorithm for Passive,
Structure Preserving, Parameter-Aware Model Order Reduction"

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467325&isn...


Abstract: This paper describes a flexible and efficient new algorithm for model
order reduction of parameterized systems. The method is based on the
reformulation of the parameterized system as a perturbation-like parallel
interconnection of the nominal transfer function and the nonparameterized
transfer function sensitivities with respect to the parameter variations. Such
a formulation reveals an explicit dependence on each parameter which is
exploited by reducing each component system independently via a standard
nonparameterized structure preserving algorithm. Therefore, the resulting
smaller size interconnected system retains the structure of the original system
with respect to parameter dependence. This allows for better accuracy control,
enabling independent adaptive order determination with respect to each
parameter and adding flexibility in simulation environments. It is shown that
the method is efficiently scalable and preserves relevant system properties
such as passivity. The new technique can handle fairly large parameter
variations on systems whose outputs exhibit smooth dependence on the
parameters, also allowing design space exploration to some degree. Several
examples show that besides the added flexibility and control, when compared
with competing algorithms, the proposed technique can, in some cases, produce
smaller reduced models with potential accuracy gains.


Physical Design

Kahng, A. B.; Park, C.-H.; Xu, X.; Yao, H., "Layout Decomposition Approaches for Double Patterning Lithography"

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467327&isn...

Abstract: In double patterning lithography (DPL) layout decomposition for 45 nm
and below process nodes, two features must be assigned opposite colors
(corresponding to different exposures) if their spacing is less than the
minimum coloring spacing. However, there exist pattern configurations for which
pattern features separated by less than the minimum coloring spacing cannot be
assigned different colors. In such cases, DPL requires that a layout feature be
split into two parts. We address this problem using two layout decomposition
approaches based on a conflict graph. First, node splitting is performed at all
feasible dividing points. Then, one approach detects conflict cycles in the
graph which are unresolvable for DPL coloring, and determines the coloring
solution for the remaining nodes using integer linear programming (ILP). The
other approach, based on a different ILP problem formulation, deletes some
edges in the graph to make it two-colorable, then finds the coloring solution
in the new graph. We evaluate our methods on both real and artificial 45 nm
testcases. Experimental results show that our proposed layout decomposition
approaches effectively decompose given layouts to satisfy the key goals of
minimized line-ends and maximized overlap margin. There are no design rule
violations in the final decomposed layout.


System-Level Design

Wu, C.-H.; Lin, H.-H.; Kuo, T.-W., "An Adaptive Flash Translation Layer for
High-Performance Storage Systems"

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467340&isn...

Abstract: While the capacity of flash-memory storage systems keeps increasing
significantly, an effective and efficient management of flash-memory space has
become a critical design issue. Different granularities in space management
impose different management costs and mapping efficiency. In this paper, we
will explore an address translation mechanism (AddrTM) that can dynamically and
adaptively switch between different granularities in the mapping of logical
block addresses into physical block addresses in flash-memory management. The
objective is to provide high performance in address mapping and space
utilization and, at the same time, to have the main memory requirements, the
garbage collection overhead, and the system initialization time under proper
management. The experimental results show that the proposed adaptive mechanism
can provide better performance improvement and practicability than other
well-known coarse-grained management mechanisms over realistic workloads.


Test

Chen, Z.; Xiang, D., "A Novel Test Application Scheme for High Transition
Fault Coverage and Low Test Cost"

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467329&isn...


Abstract: This paper presents a new method for improving transition fault
coverage in hybrid scan testing. It is based on a novel test application
scheme, in order to break the functional dependence of broadside testing. The
new technique analyzes the automatic test pattern generation conflicts in
broadside test generation and skewed-load test generation, and tries to control
the flip-flops with the most influence on fault coverage. The conflict-driven
selection method selects some flip-flops that work in the enhanced scan mode or
skewed-load scan mode. And the conflict-driven reordering method distributes
the selected flip-flops into different chains. In the multiple scan chain
architecture, to avoid too many scan-in pins, some chains are driven by the
same scan-in pin to construct a tree-based architecture. Based on the
architecture, the new test application scheme allows some flip-flops to work in
enhanced scan or skewed-load mode, while most of others to work in the
traditional broadside scan mode. With the efficient conflict-driven selection
and reordering schemes, fault coverage is improved greatly, which can also
reduce test application time and test data volume. Experimental results show
that fault coverage based on the proposed method is comparable that of enhanced
scan.


Yu, X.; Blanton, R. D., "Diagnosis of Integrated Circuits With Multiple
Defects of Arbitrary Characteristics",

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467336&isn...


Abstract: This paper describes a multiple-defect diagnosis methodology that is
flexible in handling various defect behaviors and arbitrary failing pattern
characteristics. Unlike some other approaches, the search space of the
diagnosis method does not grow exponentially with the number of defects.
Results from extensive simulation experiments and real failing integrated
circuits show that this method can effectively diagnose circuits that are
affected by a large ( > 20 ) or small number of defects of various types.
Moreover, this method is capable of accurately estimating the number of
defective sites in the failing circuit.


Short Papers
============


Kumar, A.; Anis, M., "IR-Drop Management in FPGAs"

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467333&isn...


Abstract: This paper presents novel computer-aided design (CAD) techniques for
mitigating IR-drops in field-programmable gate arrays (FPGAs). The proposed
placement and routing relies on reducing the switching activities in local
regions in the FPGA fabric to improve the profile of the supply voltage
distribution. The proposed techniques reduce IR-drops and the variance of the
supply voltage distribution across all the nodes in the power grid network. The
proposed CAD techniques are efficient as they do not require solving the power
grid model at every placement and routing iteration. A reduction of up to 53%
in maximum IR-drop and up to 66% reduction in standard deviation of V_{dd} is
obtained from the design techniques proposed in this paper with an average
impact of 3% on circuit delay.


Xu, X.; Lim, C.-C., "Modeling Interrupts for Software-Based System-on-Chip
Verification"

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467338&isn...

Abstract: The interrupt mechanism in a system-on-chip (SoC) joins the SoCs
hardware and software behaviors. We model interrupts as logical rather than
physical events and accordingly provides guidelines to compose software
components including interrupt-service-routines. As a benefit, classical
indeterministic behaviors (due to the parallelism) in the software domain, such
as preemption and nesting, can be constructed as early as raw hardware
components are being integrated. In effect, while the interrupt mechanism
itself is under rigorous stress, it is simultaneously driving the exercise of
the entire SoC. This effect can be observed through software profiling at the
hardware integration stage.