TCAD Newsletter - Apr 2011 Issue
Placing you one click away from the best new CAD research!

KEYNOTE PAPER

Cong, J.  Liu, B.  Neuendorffer, S.  Noguera, J.  Vissers, K.  Zhang, Z.
High-Level Synthesis for FPGAs: From Prototyping to Deployment
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737854
Escalating system-on-chip design complexity is pushing the design community to raise the level of abstraction beyond register transfer level. Despite the unsuccessful adoption of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS methodology is happening now, especially for field-programmable gate array (FPGA) designs. The latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, and a domain-specific approach. In this paper, we use AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx as an example to demonstrate the effectiveness of state-of-the-art C-to-FPGA synthesis solutions targeting multiple application domains. Complex industrial designs targeting Xilinx FPGAs are also presented as case studies, including comparison of HLS solutions versus optimized manual designs. In particular, the experiment on a sphere decoder shows that the HLS solution can achieve an 11-31% reduction in FPGA resource usage with improved design productivity compared to hand-coded design.

SPECIAL ISSUE ON THE ACM/IEEE INTERNATIONAL SYMPOSIUM ON NETWORKS ON CHIP 2010

GUEST EDITORIAL

Benini, L.  Carloni, L. P.
Guest Editorial: Special Section on the ACM/IEEE Symposium on Networks-on-Chip 2010
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737845

SPECIAL SECTION PAPERS

Horak, M. N.  Nowick, S. M.  Carlberg, M.  Vishkin, U.
A Low-Overhead Asynchronous Interconnection Network for GALS Chip Multiprocessors
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737866
A new asynchronous interconnection network is introduced for globally-asynchronous locally-synchronous (GALS) chip multiprocessors. The network eliminates the need for global clock distribution, and can interface multiple synchronous timing domains operating at unrelated clock rates. In particular, two new highly-concurrent asynchronous components are introduced which provide simple routing and arbitration/merge functions. Post-layout simulations in identical commercial 90 nm technology indicate that comparable recent synchronous router nodes have 5.6-10.7× more energy per packet and 2.8-6.4× greater area than the new asynchronous nodes. Under random traffic, the network provides significantly lower latency and identical throughput over the entire operating range of the 800 MHz network and through mid-range traffic rates for the 1.36 GHz network, but with degradation at higher traffic rates. Preliminary evaluations are also presented for a mixed-timing (GALS) network in a shared-memory parallel architecture, running both random traffic and parallel benchmark kernels, as well as directions for further improvement.

Bogdan, P.  Marculescu, R.
Non-Stationary Traffic Analysis and Its Implications on Multicore Platform Design
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737846
Networks-on-chip (NoCs) have been proposed as a viable solution to the communication problem in multicore systems. In this new setup, mapping multiple applications on available computational resources leads to interaction and contention at various network resources. Consequently, taking into account the traffic characteristics becomes of crucial importance for performance analysis and optimization of the communication infrastructure, as well as proper resource management.
Although queuing-based approaches have been traditionally used for performance analysis purposes, they cannot properly account for many of the traffic characteristics (e.g., non-stationarity, self-similarity) that are crucial for multicore platform design. To overcome these limitations, we propose a statistical physics inspired approach to capture the traffic dynamics in multicore systems. As shown later in this paper, this is of fundamental significance for re-thinking the very basis of multicore systems design; it also opens up new research directions into NoC optimization which require accurate models of time-dependent and space-dependent traffic behavior.

Matsutani, H.  Koibuchi, M.  Ikebuchi, D.  Usami, K.  Nakamura, H.  Amano, H.
Performance, Area, and Power Evaluations of Ultrafine-Grained Run-Time Power-Gating Routers for CMPs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737865
This paper proposes the ultrafine-grained run-time power gating of on-chip routers, in which the power supply to each router component (e.g., virtual-channel buffer, virtual-channel multiplexer, and crossbar multiplexer and output latch) can be individually controlled based on the applied workload. Since only the router components that are transferring a packet are activated, the leakage power of the on-chip network can be reduced to a near-optimal level. However, such techniques inherently increase the communication latency and degrade the application performance, since a certain amount of wakeup latency is required to activate the sleeping components. To mitigate this wakeup latency, an early wakeup method that can preliminarily detect the next packet arrival and activate the corresponding components is essential. We designed and implemented an ultrafine-grained power-gating router using a commercial 65 nm process. We propose four early wakeup methods and combine them with the power-gating router.
The proposed router with the early wakeup methods is evaluated in terms of its application performance, area overhead, and leakage power reduction taking into account the on/off energy overhead. The simulation results showed that it reduces the leakage power by 54.4-59.9% on average even when the application programs are fully running, at the expense of 4.6% of the area and 0.7-3.7% of the performance overheads when we assume a 1 GHz operation.

Rodrigo, S.  Flich, J.  Roca, A.  Medardoni, S.  Bertozzi, D.  Camacho, J.  Silla, F.  Duato, J.
Cost-Efficient On-Chip Routing Implementations for CMP and MPSoC Systems
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737867
The high-performance computing domain is being enriched by the inclusion of networks-on-chip (NoCs) as a key component of many-core (CMPs or MPSoCs) architectures. NoCs face the communication scalability challenge while meeting tight power, area, and latency constraints. Designers must address new challenges that were not present before. Defective components, the enhancement of application-level parallelism, or power-aware techniques may break topology regularity, so efficient routing becomes a challenge. This paper presents universal logic-based distributed routing (uLBDR), an efficient logic-based mechanism that adapts to any irregular topology derived from 2-D meshes, instead of using routing tables. uLBDR requires a small set of configuration bits, thus being more practical than large routing tables implemented in memories. Several implementations of uLBDR are presented highlighting the tradeoff between routing cost and coverage. The alternatives span from the previously proposed LBDR approach (with 30% coverage) to the uLBDR mechanism achieving full coverage. This comes with a small performance cost, thus exhibiting the tradeoff between fault tolerance and performance. Power consumption, area, and delay estimates are also provided highlighting the efficiency of the mechanism.
To do this, different router models (one for CMPs and one for MPSoCs) have been designed as a proof of concept.

Ramanujam, R. S.  Soteriou, V.  Lin, B.  Peh, L.-S.
Extending the Effective Throughput of NoCs With Distributed Shared-Buffer Routers
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737868
Router microarchitecture plays a central role in the performance of networks-on-chip (NoCs). Buffers are needed in routers to house incoming flits that cannot be immediately forwarded due to contention. This buffering can be done at the inputs or the outputs of a router, corresponding to an input-buffered router (IBR) or an output-buffered router (OBR). OBRs are attractive because they can sustain higher throughputs and have lower queuing delays under high loads than IBRs. However, a direct implementation of an OBR requires a router speedup equal to the number of ports, making such a design prohibitive under aggressive clocking needs and limited power budgets of most NoC applications. In this paper, a new router design based on a distributed shared-buffer (DSB) architecture is proposed that aims to practically emulate an OBR. The proposed architecture introduces innovations to address the unique constraints of NoCs, including efficient pipelining and novel flow control. Practical DSB configurations are also presented with reduced power overheads while exhibiting negligible performance degradation. Compared to a state-of-the-art pipelined IBR, the proposed DSB router achieves up to 19% higher throughput on synthetic traffic and reduces packet latency on average by 61% when running SPLASH-2 benchmarks with high contention. On average, the saturation throughput of DSB routers is within 7% of the theoretically ideal saturation throughput under the synthetic workloads evaluated.

REGULAR PAPERS

MODELING AND SIMULATION

Feng, Z.  Zhao, X.  Zeng, Z.
Robust Parallel Preconditioned Power Grid Simulation on GPU With Adaptive Runtime Performance Modeling and Optimization
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737849
Leveraging the power of current-day graphics processing units for robust power grid simulation remains a challenging task. Existing preconditioned iterative methods that require incomplete matrix factorizations cannot be effectively accelerated on a graphics processing unit (GPU) due to its limited hardware resources and its data-parallel computing model. This paper presents an efficient GPU-based multigrid preconditioning algorithm for robust power grid analysis. By combining the fast geometric multigrid solver with the robust Krylov-subspace iterative solver, power grid DC and transient analysis can be performed efficiently on GPU without loss of accuracy (largest errors < 0.5 mV). Unlike previous GPU-based algorithms that rely on good power grid regularities, the proposed algorithm can be applied for more general power grid structures. We also propose an accuracy-aware GPU performance modeling and optimization framework to automatically obtain the best power grid simulation configurations. Experimental results show that the DC and transient analysis on GPU can achieve more than 25X speedups over the best available CPU-based solvers.

Li, X.-C.  Mao, J.-F.  Swaminathan, M.
Transient Analysis of CMOS-Gate-Driven RLGC Interconnects Based on FDTD
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737852
As the feature size of integrated circuits shrinks in deep submicron technologies, the time delay and crosstalk noise of complementary metal-oxide-semiconductor (CMOS)-gate-driven interconnects become critical issues. Traditionally, the CMOS driver is simplified as a linear circuit in which a constant resistance is used to approximate the nonlinear and time-varying MOS resistance, which is inaccurate for signal integrity analysis in high-speed interconnect systems.
This paper proposes a finite-difference time-domain (FDTD)-based method for transient analysis of lossy transmission lines in the presence of the nonlinear behavior of CMOS gates. The conventional FDTD with second-order accuracy is used for interconnect analysis and the parameters with frequency-dependent losses are also included. The nonlinear behavior of CMOS gates is represented by the alpha-power law model, with the drain current described by a piecewise linear function of the drain voltage and discretized in time domain for the FDTD implementation. Explicit forms of the boundary conditions are derived from the implicit interface equations and hence the stability is strictly constrained by the Courant condition. Experimental results show that the proposed method has good accuracy and high efficiency with respect to HSPICE. Therefore, it is useful for accurate prediction of time delay and crosstalk noise in high-speed interconnect systems.

SYSTEM-LEVEL DESIGN

Hu, J.  Tseng, W.-C.  Xue, C. J.  Zhuge, Q.  Zhao, Y.  Sha, E. H.-M.
Write Activity Minimization for Nonvolatile Main Memory Via Scheduling and Recomputation
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737850
Nonvolatile memories such as Flash memory, phase change memory (PCM), and magnetic random access memory (MRAM) have many desirable characteristics for embedded systems to employ them as main memory. However, there are two common challenges we need to answer before we can apply nonvolatile memory as main memory practically. First, nonvolatile memory has limited write/erase cycles compared to DRAM. Second, a write operation is slower than a read operation on nonvolatile memory. These two challenges can be answered by reducing the number of write activities on nonvolatile main memory. In this paper, we propose two optimization techniques, write-aware scheduling and recomputation, to minimize write activities on nonvolatile memory.
With the proposed techniques, we can both speed up the completion time of programs and extend nonvolatile memory's lifetime. The experimental results show that the proposed techniques can reduce the number of write activities on nonvolatile memory by 55.71% on average. Thus, the lifetime of nonvolatile memory is extended to 2.5 times as long as before on average. The completion time of programs can be reduced by 56.67% on systems with NOR Flash memory and by 47.63% on systems with NAND Flash memory on average.

Chiang, M.-C.  Yeh, T.-C.  Tseng, G.-F.
A QEMU and SystemC-Based Cycle-Accurate ISS for Performance Estimation on SoC Development
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737847
In this paper, we present a fast cycle-accurate instruction set simulator (CA-ISS) for system-on-chip development based on QEMU and SystemC. Even though most state-of-the-art commercial tools have tried very hard to provide all the levels of details to satisfy the different requirements of the software designer, the hardware designer, and even the system architect, the hardware/software co-simulation speed is dramatically slow when co-simulating the hardware models at the register-transfer level (RTL) with a full-fledged operating system (OS). Our experimental results show that the combination of QEMU and SystemC can make the co-simulation at the CA level much faster than the conventional RTL simulation, even with a full-fledged operating system up and running. Furthermore, the statistics indicate that with every instruction executed and every memory accessed since power-on traced at the CA level, it takes 28m15.804s on average to boot up a full-fledged Linux kernel, even on a personal computer. Compared to the kernel boot time reported by Xilinx and SiCortex, the proposed CA-ISS is about 6.09 times faster than "SystemC without trace" of Xilinx and about 30.32 times faster than "SystemC models converted from RTL" of SiCortex.
The main contributions of this paper are threefold: 1) a hardware/software co-simulation environment capable of running a full-fledged OS at the early stage of the electronic system level design flow at an acceptable simulation speed is proposed; 2) a virtual platform constructed using the proposed CA-ISS as the processor model can be used to estimate the performance of a target system from a system perspective, which previous works, such as QEMU-SystemC, do not provide; and 3) such a virtual platform also provides the modeling capability from the transaction level down to the CA level or the other way around.

Lee, J.  Shrivastava, A.
Static Analysis of Register File Vulnerability
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737851
With continuous technology scaling, soft errors are becoming an increasingly important design concern even for earth-bound applications. While compiler approaches have the potential to mitigate the effect of soft errors with minimal runtime overheads, static vulnerability estimation - an essential part of compiler approaches - is lacking due to its inherent complexity. This paper presents a static analysis approach for register file (RF) vulnerability estimation. We decompose the vulnerability of a register into intrinsic and conditional basic-block vulnerabilities. This decomposition allows us to develop a fast, yet reasonably accurate RF vulnerability estimation mechanism. We validate and compare a linear equation based method and an iterative method. We also demonstrate a practical application of RF vulnerability estimation to compiler optimizations. Our experimental results on benchmarks from the MiBench suite indicate that not only is our static RF vulnerability estimation fast and accurate, but compiler optimizations enabled by our static estimation can also achieve very cost-effective protection of register files against soft errors.

VERIFICATION

Little, S.  Walter, D.  Myers, C.  Thacker, R.  Batchu, S.  Yoneda, T.
Verification of Analog/Mixed-Signal Circuits Using Labeled Hybrid Petri Nets
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737853
Mixed-signal designs integrate digital and analog circuits, which complicates the already difficult verification problem. This paper presents a model, labeled hybrid Petri nets (LHPNs), that is developed to model this heterogeneous set of components. To support formal verification, this paper presents an efficient zone-based state space exploration algorithm for LHPNs. This algorithm uses a process known as warping, which allows zones to describe continuous variables changing at variable rates. Finally, this paper describes the application of this algorithm to analog/mixed-signal circuit examples.

SHORT PAPER

Mele, S.  Favalli, M.
A SAT Based Test Generation Method for Delay Fault Testing of Macro Based Circuits
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5737848
This letter addresses the problem of delay fault test generation in circuits using macros whose implementation is not known. The proposed approach uses a new signal representation that allows us to evaluate any kind of sensitization conditions (robust, non-robust, and functional) by means of Boolean differential calculus. Such an approach makes use of binary decision diagrams to support the computation of sensitization conditions for each macro along a path and of Boolean satisfiability to justify such conditions at primary inputs. Results are shown for a set of benchmarks.
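
Editor's illustration (not taken from the letter above): the central object of Boolean differential calculus is the Boolean difference df/dx = f(x=0) XOR f(x=1), which equals 1 exactly for those side-input values under which a change on x is visible at the output. The sketch below computes it by plain truth-table enumeration for a hypothetical three-input macro; the function names and the example macro are assumptions made for illustration, and the letter itself relies on binary decision diagrams and a SAT solver, and on a richer signal representation for robust and non-robust delay conditions, none of which is modeled here.

```python
from itertools import product


def boolean_difference(f, n_inputs, i):
    """Truth table of df/dx_i, keyed by the values of the other inputs.

    f is treated as a black box that takes n_inputs values in {0, 1}.
    An entry equal to 1 means a change on input i flips the output.
    """
    diff = {}
    for rest in product((0, 1), repeat=n_inputs - 1):
        lo = rest[:i] + (0,) + rest[i:]  # input i forced to 0
        hi = rest[:i] + (1,) + rest[i:]  # input i forced to 1
        diff[rest] = f(*lo) ^ f(*hi)
    return diff


def and_or(a, b, c):
    # Hypothetical macro used only for this example: (a AND b) OR c
    return (a & b) | c


# Side-input assignments (b, c) that sensitize input a to the output
sens = boolean_difference(and_or, 3, 0)
sensitizing = [rest for rest, d in sens.items() if d == 1]
print(sensitizing)  # [(1, 0)]
```

For this macro the only sensitizing assignment is b = 1, c = 0, matching the algebraic result d(ab + c)/da = b AND (NOT c). Enumeration grows exponentially with the number of inputs, which is one reason a symbolic representation is attractive for macros of realistic size.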