TCAD Newsletter - June 2010 Issue Placing you one click away from the best new CAD research! Editorial ============== Special Section on the 2009 ACM/IEEE Symposium on Networks-on-Chip ============== Modarressi, M.; Tavakkol, A.; Sarbazi-Azad, H., "Virtual Point-to-Point Connections for NoCs" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467328&isn... Abstract: In this paper, we aim to improve the performance and power metrics of packet-switched network-on-chips (NoCs) and benefits from the scalability and resource utilization advantages of NoCs and superior communication performance of point-to-point dedicated links. The proposed method sets up the virtual point-to-point (VIP) connections over one virtual channel (which bypasses the entire router pipeline) at each physical channel of the NoC. We present two schemes for constructing such VIP circuits. In the first scheme, the circuits are constructed for an application based on its task-graph at design time. The second scheme addresses constructing the connections at run-time using a light-weight setup network. It involves monitoring the NoC traffic in order to detect heavy communication flows and setting up a VIP connection for them using a run-time circuit construction mechanism. The proposed mechanism is compared to a traditional packet-switched NoC and some modern switching mechanisms and the results show a significant reduction in the network power and latency over the other considered NoCs. Concer, N.; Bononi, L.; Soulie, M.; Locatelli, R.; Carloni, L. P., "The Connection-Then-Credit Flow Control Protocol for Heterogeneous Multicore Systems-on-Chip" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467323&isn... Abstract: Connection-then-credits (CTC) is a novel end-to-end flow control protocol to handle message-dependent deadlocks in best-effort networks-on-chip (NoC) for embedded multicore systems-on-chip (SoCs). CTC is based on the classic end-to-end credit-based flow control protocol but differs from it because it uses a network interface microarchitecture where a single credit counter and a single input data queue are shared among all possible communications. This architectural simplification reduces the area occupation of the network interfaces and increases their design reuse; for instance, the same network interface can be used to connect a core independently of the number of incoming and outgoing communications. CTC, however, requires a handshake preamble to initialize the credit counter in the sender network interface based on the buffering capacity of the receiver network interface. While this necessarily introduces a latency overhead in the transfer of a message, simulation-based experimental results show that the penalty in performance is limited when large messages need to be transferred, thus, making CTC a valid solution for particular classes of applications such as video stream processing. Kohler, A.; Schley, G.; Radetzki, M., "Fault Tolerant Network on Chip Switching With Graceful Performance Degradation" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467330&isn... Abstract: The structural redundancy inherent to on-chip interconnection networks [networks on chip (NoC)] can be exploited by adaptive routing algorithms in order to provide connectivity even if network components are out of service due to faults, which will appear at an increasing rate with future chip technology nodes. This paper is based on a new, fine-grained functional fault model and a corresponding distributed fault diagnosis method that facilitate determining the fault status of individual NoC switches and their adjacent communication links. Whereas previous work on network fault-tolerance assume switches to be either available or fully out of service, we present a novel adaptive routing algorithm that employs the remaining functionality of partly defective switches. Using diagnostic information, transient faults are handled with a retransmission scheme that avoids the latency penalty of end-to-end repeat requests. Thereby, graceful degradation of NoC communication performance can be achieved even under high failure rates. Tran, A. T.; Truong, D. N.; Baas, B., "A Reconfigurable Source-Synchronous On-Chip Network for GALS Many-Core Platforms" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467331&isn... Abstract: This paper presents a globally-asynchronous locally-synchronous (GALS)-compatible circuit-switched on-chip network that is well suited for use in many-core platforms targeting streaming digital signal processing and embedded applications which typically have a high degree of task-level parallelism among computational kernels. Inter-processor communication is achieved through a simple yet effective reconfigurable source-synchronous network. Interconnect paths between processors can sustain a peak throughput of one word per cycle. A theoretical model is developed for analyzing the performance of the network. A 65 nm complementary metal-oxide-semiconductor GALS chip utilizing this network was fabricated which contains 164 programmable processors, three accelerators and three shared memory modules. For evaluating the efficiency of this platform, a complete 802.11a wireless local area network baseband receiver was implemented. It has a real-time throughput of 54 Mb/s with all processors running at 594 MHz and 0.95-V, and consumes an average of 174.8 mW with 12.2 mW (or 7.0%) dissipated by its interconnect links and switches. With the chip's dual supply voltages set at 0.95-V and 0.75-V, and individual processors' oscillators operating at workload-based optimal frequencies, the receiver consumes 123.2 mW, which is a 29.5% reduction in power. Measured power consumption values from the chip are within 2-5% of the estimated values. Regular Issue Papers ============== Embedded Systems Ferrandi, F.; Lanzi, P. L.; Pilato, C.; Sciuto, D.; Tumeo, A., "Ant Colony Heuristic for Mapping and Scheduling Tasks and Communications on Heterogeneous Embedded Systems" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467335&isn... Abstract: To exploit the power of modern heterogeneous multiprocessor embedded platforms on partitioned applications, the designer usually needs to efficiently map and schedule all the tasks and the communications of the application, respecting the constraints imposed by the target architecture. Since the problem is heavily constrained, common methods used to explore such design space usually fail, obtaining low-quality solutions. In this paper, we propose an ant colony optimization (ACO) heuristic that, given a model of the target architecture and the application, efficiently executes both scheduling and mapping to optimize the application performance. We compare our approach with several other heuristics, including simulated annealing, tabu search, and genetic algorithms, on the performance to reach the optimum value and on the potential to explore the design space. We show that our approach obtains better results than other heuristics by at least 16% on average, despite an overhead in execution time. Finally, we validate the approach by scheduling and mapping a JPEG encoder on a realistic target architecture. Modeling and Simulation Villena, J. F.; Silveira, L. M., "SPARE - A Scalable Algorithm for Passive, Structure Preserving, Parameter-Aware Model Order Reduction" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467325&isn... Abstract: This paper describes a flexible and efficient new algorithm for model order reduction of parameterized systems. The method is based on the reformulation of the parameterized system as a perturbation-like parallel interconnection of the nominal transfer function and the nonparameterized transfer function sensitivities with respect to the parameter variations. Such a formulation reveals an explicit dependence on each parameter which is exploited by reducing each component system independently via a standard nonparameterized structure preserving algorithm. Therefore, the resulting smaller size interconnected system retains the structure of the original system with respect to parameter dependence. This allows for better accuracy control, enabling independent adaptive order determination with respect to each parameter and adding flexibility in simulation environments. It is shown that the method is efficiently scalable and preserves relevant system properties such as passivity. The new technique can handle fairly large parameter variations on systems whose outputs exhibit smooth dependence on the parameters, also allowing design space exploration to some degree. Several examples show that besides the added flexibility and control, when compared with competing algorithms, the proposed technique can, in some cases, produce smaller reduced models with potential accuracy gains. Physical Design Kahng, A. B.; Park, C.-H.; Xu, X.; Yao, H., "Layout Decomposition Approaches for Double Patterning Lithography" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467327&isn... Abstract: In double patterning lithography (DPL) layout decomposition for 45 nm and below process nodes, two features must be assigned opposite colors (corresponding to different exposures) if their spacing is less than the minimum coloring spacing. However, there exist pattern configurations for which pattern features separated by less than the minimum coloring spacing cannot be assigned different colors. In such cases, DPL requires that a layout feature be split into two parts. We address this problem using two layout decomposition approaches based on a conflict graph. First, node splitting is performed at all feasible dividing points. Then, one approach detects conflict cycles in the graph which are unresolvable for DPL coloring, and determines the coloring solution for the remaining nodes using integer linear programming (ILP). The other approach, based on a different ILP problem formulation, deletes some edges in the graph to make it two-colorable, then finds the coloring solution in the new graph. We evaluate our methods on both real and artificial 45 nm testcases. Experimental results show that our proposed layout decomposition approaches effectively decompose given layouts to satisfy the key goals of minimized line-ends and maximized overlap margin. There are no design rule violations in the final decomposed layout. System-Level Design Wu, C.-H.; Lin, H.-H.; Kuo, T.-W., "An Adaptive Flash Translation Layer for High-Performance Storage Systems" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467340&isn... Abstract: While the capacity of flash-memory storage systems keeps increasing significantly, an effective and efficient management of flash-memory space has become a critical design issue. Different granularities in space management impose different management costs and mapping efficiency. In this paper, we will explore an address translation mechanism (AddrTM) that can dynamically and adaptively switch between different granularities in the mapping of logical block addresses into physical block addresses in flash-memory management. The objective is to provide high performance in address mapping and space utilization and, at the same time, to have the main memory requirements, the garbage collection overhead, and the system initialization time under proper management. The experimental results show that the proposed adaptive mechanism can provide better performance improvement and practicability than other well-known coarse-grained management mechanisms over realistic workloads. Test Chen, Z.; Xiang, D., "A Novel Test Application Scheme for High Transition Fault Coverage and Low Test Cost" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467329&isn... Abstract: This paper presents a new method for improving transition fault coverage in hybrid scan testing. It is based on a novel test application scheme, in order to break the functional dependence of broadside testing. The new technique analyzes the automatic test pattern generation conflicts in broadside test generation and skewed-load test generation, and tries to control the flip-flops with the most influence on fault coverage. The conflict-driven selection method selects some flip-flops that work in the enhanced scan mode or skewed-load scan mode. And the conflict-driven reordering method distributes the selected flip-flops into different chains. In the multiple scan chain architecture, to avoid too many scan-in pins, some chains are driven by the same scan-in pin to construct a tree-based architecture. Based on the architecture, the new test application scheme allows some flip-flops to work in enhanced scan or skewed-load mode, while most of others to work in the traditional broadside scan mode. With the efficient conflict-driven selection and reordering schemes, fault coverage is improved greatly, which can also reduce test application time and test data volume. Experimental results show that fault coverage based on the proposed method is comparable that of enhanced scan. Yu, X.; Blanton, R. D., "Diagnosis of Integrated Circuits With Multiple Defects of Arbitrary Characteristics", URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467336&isn... Abstract: This paper describes a multiple-defect diagnosis methodology that is flexible in handling various defect behaviors and arbitrary failing pattern characteristics. Unlike some other approaches, the search space of the diagnosis method does not grow exponentially with the number of defects. Results from extensive simulation experiments and real failing integrated circuits show that this method can effectively diagnose circuits that are affected by a large ( > 20 ) or small number of defects of various types. Moreover, this method is capable of accurately estimating the number of defective sites in the failing circuit. Short Papers ============ Kumar, A.; Anis, M., "IR-Drop Management in FPGAs" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467333&isn... Abstract: This paper presents novel computer-aided design (CAD) techniques for mitigating IR-drops in field-programmable gate arrays (FPGAs). The proposed placement and routing relies on reducing the switching activities in local regions in the FPGA fabric to improve the profile of the supply voltage distribution. The proposed techniques reduce IR-drops and the variance of the supply voltage distribution across all the nodes in the power grid network. The proposed CAD techniques are efficient as they do not require solving the power grid model at every placement and routing iteration. A reduction of up to 53% in maximum IR-drop and up to 66% reduction in standard deviation of V_{dd} is obtained from the design techniques proposed in this paper with an average impact of 3% on circuit delay. Xu, X.; Lim, C.-C., "Modeling Interrupts for Software-Based System-on-Chip Verification" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5467338&isn... Abstract: The interrupt mechanism in a system-on-chip (SoC) joins the SoCs hardware and software behaviors. We model interrupts as logical rather than physical events and accordingly provides guidelines to compose software components including interrupt-service-routines. As a benefit, classical indeterministic behaviors (due to the parallelism) in the software domain, such as preemption and nesting, can be constructed as early as raw hardware components are being integrated. In effect, while the interrupt mechanism itself is under rigorous stress, it is simultaneously driving the exercise of the entire SoC. This effect can be observed through software profiling at the hardware integration stage.