TCAD Newsletter - June 2011 Issue
Placing you one click away from the best new CAD research!

ANNOUNCING THE 2011 DONALD O. PEDERSON BEST PAPER AWARD...

The Donald O. Pederson Award recognizes the best paper published in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in the two calendar years preceding the award. Based on an open nomination process, the TCAD editorial board voted to select a paper by Amith Singhee and Rob Rutenbar as this year's winner:

Statistical Blockade: Very Fast Statistical Simulation and Modeling of Rare Circuit Events and Its Application to Memory Design
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5166555
Circuit reliability under random parametric variation is an area of growing concern. For highly replicated circuits, e.g., static random access memories (SRAMs), a rare statistical event for one circuit may induce a not-so-rare system failure. Existing techniques perform poorly when tasked to generate both efficient sampling and sound statistics for these rare events. Statistical blockade is a novel Monte Carlo technique that makes it possible to efficiently filter out (block) unwanted samples that are insufficiently rare in the tail distributions that are sought. The method synthesizes ideas from data mining and extreme value theory and, for the challenging application of SRAM yield analysis, shows speedups of 10-100 times over standard Monte Carlo.

Congratulations to the winners! The award will be presented at the opening session of the ACM/EDAC/IEEE Design Automation Conference in San Diego, CA next week.

JUNE 2011 ISSUE

REGULAR PAPERS

ANALOG, MIXED-SIGNAL, AND RF CIRCUITS

Liu, B. Fernandez, F. V. Gielen, G. G. E.
Efficient and Accurate Statistical Analog Yield Optimization and Variation-Aware Circuit Sizing Based on Computational Intelligence Techniques
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768139
In nanometer CMOS technologies, worst-case design methods and response-surface-based yield optimization methods face challenges in accuracy. Monte-Carlo (MC) simulation is general and accurate for yield estimation, but its efficiency is not high enough to make MC-based analog yield optimization, which requires many yield estimations, practical. In this paper, techniques inspired by computational intelligence are used to speed up yield optimization without sacrificing accuracy. A new sampling-based yield optimization approach, which determines the device sizes to optimize yield, is presented, called the ordinal optimization (OO)-based random-scale differential evolution (ORDE) algorithm. By proposing a two-stage estimation flow and introducing the OO technique in the first stage, sufficient samples are allocated to promising solutions, and repeated MC simulations of non-critical solutions are avoided. By the proposed evolutionary algorithm that uses differential evolution for global search and a random-scale mutation operator for fine tunings, the convergence speed of the yield optimization can be enhanced significantly. With the same accuracy, the resulting ORDE algorithm can achieve approximately a tenfold improvement in computational effort compared to an improved MC-based yield optimization algorithm integrating the infeasible sampling and Latin-hypercube sampling techniques. Furthermore, ORDE is extended from plain yield optimization to process-variation-aware single-objective circuit sizing.

EMERGING TECHNOLOGIES

Maslov, D. Saeedi, M.
Reversible Circuit Optimization Via Leaving the Boolean Domain
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768140
For years, the quantum/reversible circuit community has been convinced that: 1) the addition of auxiliary quantum bits (qubits) is instrumental in constructing a smaller quantum circuit, and 2) the introduction of quantum gates inside reversible circuits may result in more efficient designs. This paper presents a systematic approach to optimizing reversible (and quantum) circuits via the introduction of auxiliary qubits and quantum gates inside circuit designs. This advances our understanding of what may be achieved with 1) and 2).

Lin, C. C.-Y. Chang, Y.-W.
Cross-Contamination Aware Design Methodology for Pin-Constrained Digital Microfluidic Biochips
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768126
Digital microfluidic biochips have emerged as a popular alternative for laboratory experiments. Pin-count reduction and cross-contamination avoidance are key design considerations for practical applications with different droplets being transported and manipulated on highly integrated biochips. This paper presents the first design automation flow that considers the cross-contamination problems on pin-constrained biochips. The factors that make the problems harder on pin-constrained biochips are explored. To cope with these cross contaminations, this paper proposes: 1) early crossing minimization algorithms during placement, and 2) systematic wash droplet scheduling and routing that require only one extra control pin and zero assay completion time overhead for practical bioassays. Experimental results show the effectiveness and scalability of our algorithms for practical bioassays.

FPGAs AND RECONFIGURABLE COMPUTING

Golshan, S. Kooti, H. Bozorgzadeh, E.
SEU-Aware High-Level Data Path Synthesis and Layout Generation on SRAM-Based FPGAs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768130
Although triple modular redundancy (TMR) has been widely used to mitigate single event upsets (SEUs) in static random access memory-based field-programmable gate arrays (FPGAs), SEU-caused bridging faults between the triplicated modules do not guarantee the correctness of TMR designs under all SEUs. In this paper, we present a novel computer-aided design flow for redundancy-based applications on FPGAs in order to mitigate the impact of SEUs in the configuration bitstreams. We introduce the notions of modular redundancy conflicts and vulnerability-gap conflicts which maintain the fundamental assumption underlying the integrity of redundancy-based designs (i.e., self-containment of SEU-induced faults within a single replica of redundant resources). When the impact of SEU-induced bridging faults is considered in high-level synthesis as well as physical synthesis, on average more than 30% improvement in the number of potential SEU-induced bridging faults can be reached as well as improvements in the performance and area utilization, compared to post-synthesis TMR, in which the voters are applied to the feedback structures in the circuit. Compared to the extreme case of post-synthesis TMR in which the voters are applied at the end of every configurable logic block, we reach 38% (26%) improvement in performance (area) of the implemented circuits.

MODELING AND SIMULATION

Garcia-Loureiro, A. J. Seoane, N. Aldegunde, M. Valin, R. Asenov, A. Martinez, A. Kalna, K.
Implementation of the Density Gradient Quantum Corrections for 3-D Simulations of Multigate Nanoscaled Transistors
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768129
An efficient implementation of the density-gradient (DG) approach for the finite element and finite difference methods and its application in drift-diffusion (D-D) simulations is described in detail. The new, second-order differential (SOD) scheme is compatible with relatively coarse grids even for large density variations thus applicable to device simulations with complex 3-D geometries. Test simulations of a 1-D metal-oxide semiconductor diode demonstrate that the DG approach discretized using our SOD scheme can be accurately calibrated against Schrödinger-Poisson calculations exhibiting lower discretization error than the previous schemes when using coarse grids and the same results for very fine meshes. 3-D test D-D simulations using the finite element method are performed on two devices: a 10 nm gate length double gate metal-oxide-semiconductor field-effect transistor (MOSFET) and a 40 nm gate length Tri-Gate fin field-effect transistor (FinFET). In 3-D D-D simulations, the SOD scheme is able to converge to physical solutions at high voltages even if the previous schemes fail when using the same mesh and equivalent conditions. The quantum corrected D-D simulations using the SOD scheme also converge with an atomistic mesh used for the 10 nm double gate MOSFET saving computational resources and can be accurately calibrated against the results from non-equilibrium Green's functions approach. Finally, the simulated ID-VG characteristics for the 40 nm gate length Tri-Gate are in an excellent agreement with experimental data.

Veetil, V. Chopra, K. Blaauw, D. Sylvester, D.
Fast Statistical Static Timing Analysis Using Smart Monte Carlo Techniques
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768142
In this paper, we propose a stratification+hybrid quasi Monte Carlo (SH-QMC) approach to improve the efficiency of Monte Carlo-based statistical static timing analysis (SSTA) using sample size reduction. Sample size reduction techniques proposed in the literature exhibit a tradeoff between accuracy of the Monte Carlo estimate with fewer samples and their ability to handle a large number of variables in multidimensional space. This paper proposes to target several such techniques to different sets of process variation variables by using information about the importance of these variables to the circuit delay, and the capability of the techniques to handle multiple dimensions. Simulations on benchmark circuits up to 90K gates show that the proposed method requires up to 224 samples for varying levels of process variation to achieve accurate timing estimates. Results also show that when SH-QMC is performed with multiple parallel threads on a quad-core processor, the approach is faster than traditional SSTA with comparable accuracy. When the proposed SH-QMC technique is supplemented with a graph pruning method, the runtime is further reduced by 46-48% on average. The technique is also extended to include an incremental approach to recompute a percentile delay metric after engineering change order.

Chen, Q. Schoenmaker, W. Meuris, P. Wong, N.
An Effective Formulation of Coupled Electromagnetic-TCAD Simulation for Extremely High Frequency Onward
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768127
This paper presents an effective formulation tailored for electromagnetic-technology computer-aided design coupled simulations for extremely high-frequency ranges and beyond 50 GHz.
A transformation of variables is exploited from the starting A-V formulation to the E-V formulation, combined with adopting the gauge condition as the equation for scalar potential. The transformation significantly reduces the cross-coupling between electric and magnetic systems at high frequencies, providing therefore much better convergence for iterative solution. The validation of such transformations is ensured through a careful analysis of redundancy in the coupled system and material properties. Employment of the advanced matrix permutation technique further alleviates the extra computational cost introduced by the variable transformation. Numerical experiments confirm the accuracy and efficiency of the proposed E-V formulation.

PHYSICAL DESIGN

Rajaram, A. Pan, D. Z.
Robust Chip-Level Clock Tree Synthesis
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768141
Chip-level clock tree synthesis (CCTS) is a key problem that arises in complex system-on-a-chip designs. A key requirement of CCTS is to balance the clock-trees belonging to different IPs such that the entire tree has a small skew across all process corners. Achieving this is difficult because the clock trees in different IPs might be vastly different in terms of their clock structures and cell/interconnect delays. The chip-level clock tree is expected to compensate for these differences and achieve good skews across all corners. Also, CCTS is expected to reduce clock divergence between IPs that have critical timing paths between them. Reducing clock divergence reduces the maximum possible clock skew in the critical paths between the IPs and thus improves yield. This paper proposes effective CCTS algorithms to simultaneously reduce multicorner skew and clock divergence.
Experimental results on several test-cases indicate that our methods achieve 30% reduction in the clock divergence with significantly improved multicorner skew variance, at the cost of 2% increase in buffer area and 1% increase in wirelength.

SYSTEM-LEVEL DESIGN

Drego, N. Chandrakasan, A. Boning, D. Shah, D.
Reduction of Variation-Induced Energy Overhead in Multi-Core Processors
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768128
Core-to-core variability in future many-core chip multi-processors (CMPs) negatively impacts energy. Under-performing cores necessitate increasing the system voltage to maintain homogeneous core performance, introducing an energy overhead. Multiple supply voltages can be used to mitigate the impact of delay variation in CMPs. In this paper, we carefully analyze the use of a local search algorithm to pick near-optimal supply voltages while meeting a fixed performance target. With two system voltages, we prove our algorithm selects the global optimum and in the more general multiple voltage case we develop quantitative bounds. Using a custom simulation methodology on a real processor core, we show that two system voltages provide the most incremental benefit, reducing the energy overhead relative to a single voltage by 59-75% and total energy by 6-16%. Additionally, the worst 5-15% of cores in such systems necessitate increasingly larger amounts of incremental energy for a constant incremental performance gain. Therefore, turning off or disabling these cores is beneficial to a joint performance-energy metric.

Kang, K. Kim, J. Yoo, S. Kyung, C.-M.
Runtime Power Management of 3-D Multi-Core Architectures Under Peak Power and Temperature Constraints
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768133
3-D integration is a new technology that overcomes the limitations of 2-D integrated circuits, e.g., power and delay induced from long interconnect wires, by stacking multiple dies to increase logic integration density.
However, chip-level power and peak temperature are the major performance limiters in 3-D multi-core architectures. In this paper, we propose a runtime power management method for both peak power and temperature-constrained 3-D multi-core systems in order to maximize the instruction throughput. The proposed method exploits dynamic temperature slack (defined as peak temperature constraint minus current temperature) and workload characteristics (e.g., instructions per cycle and memory-boundness) as well as thermal characteristics of 3-D stacking architectures. Compared with existing thermal-aware power management solutions for 3-D multi-core systems, our method yields up to 34.2% (average 18.5%) performance improvement in terms of instructions per second without significant additional energy consumption.

TEST

Lee, M. Denq, L.-M. Wu, C.-W.
A Memory Built-In Self-Repair Scheme Based on Configurable Spares
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768135
There is growing need for embedded memory built-in self-repair (MBISR) due to the introduction of more and more system-on-chip (SoC) and other highly integrated products, for which the chip yield is being dominated by the yield of on-chip memories, and repairing embedded memories by conventional off-chip schemes is expensive. Therefore, we propose an MBISR generator called BRAINS+, which automatically generates register transfer level MBISR circuits for SoC designers. The MBISR circuit is based on a redundancy analysis (RA) algorithm that enhances the essential spare pivoting algorithm, with a more flexible spare architecture, which can configure the same spare to a row, a column, or a rectangle to fit failure patterns more efficiently.
The proposed MBISR circuit is small, and it supports at-speed test without timing-penalty during normal operation, e.g., with a typical 0.13 µm CMOS technology, it can run at 333 MHz for a 512 Kb memory with four spare elements (rows and/or columns), and the MBISR area overhead is only 0.36%. With its low area overhead and zero test-time penalty, the MBISR can easily be applied to multiple memories with a distributed RA scheme. Compared with recent studies, the proposed scheme is better in not only test-time but also area overhead.

SHORT PAPERS

Hsieh, T.-Y. Lee, K.-J. Breuer, M. A.
An Error-Tolerance-Based Test Methodology to Support Product Grading for Yield Enhancement
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768131
This paper presents a novel error-tolerance-based test methodology to grade defective chips according to their degree of acceptability so as to improve the effective yield of chips. We employ error rate as the attribute of error-tolerance to determine acceptability. We show that the number of test patterns that need to be applied to a circuit under test in estimating the circuit's error rate is highly dependent on how close the circuit's actual error rate is to the given grading thresholds. An iterative and adaptive error rate estimation technique is developed by which an appropriate number of test patterns can be efficiently determined and the circuit can be immediately classified into appropriate grades to fit various application requirements. Experimental results show that: 1) only a few iterations are required to classify a circuit, and 2) the total number of test patterns used is in general independent of the circuit size. Both of these observations imply that these techniques are applicable to large circuits.

Tannir, D. Khazaka, R.
Adjoint Sensitivity Analysis of Nonlinear Distortion in Radio Frequency Circuits
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768134
Measuring the effects of nonlinear intermodulation distortion is one of the main requirements in the design of radio frequency circuits. The third-order intercept point (IP3) is one of the main figures of merit that is used to characterize this distortion and is expensive to compute due to the presence of multi-tone inputs. Recently, an efficient method for computing the third-order intercept point, using moments analysis, was presented. However, this approach does not provide any sensitivity information. In this paper, we propose an efficient and robust method for computing the sensitivity of IP3 using adjoint moments with minimal additional computational cost.

Cong, J. Huang, H. Jiang, W.
Pattern-Mining for Behavioral Synthesis
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5768132
Pattern-based synthesis has drawn wide interest from researchers who tried to utilize the regularity in applications for design optimizations. In this letter, we present a general pattern-based behavior synthesis framework which can efficiently extract similar structures in programs. Our approach is very scalable thanks to advanced pruning techniques. The similarity of structures is captured by a mismatch-tolerant metric: the graph edit distance. The graph edit distance can naturally capture different program variations such as bit-width, structure, and port variations. In addition, we extend our approach to handle control-intensive applications, and this leads to more opportunities for optimization. Our algorithm uses a feature-based filtering approach for fast pruning, and a graph similarity metric called the generalized edit distance for measuring variations in control-data flow graphs. Furthermore, we apply our pattern-based synthesis system to the resource optimization problem in behavioral synthesis.
Considering knowledge of discovered patterns, the resource binding step can intelligently generate the data-path to reduce interconnect costs. Experiments show that our approach can, on average, reduce the total area by about 20% with 7% latency overhead with our pattern techniques on the Xilinx Virtex-4 field-programmable gate arrays, compared to the traditional behavioral synthesis flow.
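
For readers unfamiliar with the idea of feature-based filtering ahead of an edit-distance comparison, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes unit-cost edits (node insert, delete, relabel; edge insert, delete) and uses a cheap lower bound on graph edit distance computed from node-label counts and edge counts. The function names, the graph encoding, and the example patterns ("mac", "mac_sub", "fir_tap") are hypothetical; the features and the generalized edit distance used in the paper are not described in this newsletter.

```python
from collections import Counter


def ged_lower_bound(nodes_a, edges_a, nodes_b, edges_b):
    """Cheap lower bound on unit-cost graph edit distance.

    nodes_*: list of node labels (e.g. operation types such as "add").
    edges_*: list of (src_index, dst_index) pairs.
    Assumption: every node or edge edit costs 1.
    """
    # Labels shared by both graphs need no node edit; every other node
    # of the larger graph needs at least one insert, delete, or relabel.
    common = sum((Counter(nodes_a) & Counter(nodes_b)).values())
    node_cost = max(len(nodes_a), len(nodes_b)) - common
    # At least this many edge insertions or deletions are unavoidable.
    edge_cost = abs(len(edges_a) - len(edges_b))
    return node_cost + edge_cost


def filter_candidates(pattern, candidates, threshold):
    """Keep candidates whose lower bound does not exceed the threshold.

    Pruned candidates cannot be within `threshold` edits of the pattern,
    so an expensive exact comparison is only needed for the survivors.
    """
    p_nodes, p_edges = pattern
    return [
        name
        for name, (nodes, edges) in candidates.items()
        if ged_lower_bound(p_nodes, p_edges, nodes, edges) <= threshold
    ]


# Hypothetical data-flow fragments: a multiply feeding an add, and variants.
pattern = (["mul", "add"], [(0, 1)])
candidates = {
    "mac": (["mul", "add"], [(0, 1)]),
    "mac_sub": (["mul", "sub"], [(0, 1)]),
    "fir_tap": (["mul", "mul", "add", "add"], [(0, 2), (1, 2), (2, 3)]),
}
survivors = filter_candidates(pattern, candidates, threshold=1)
print(survivors)
```

Because the bound never overestimates the true distance under the stated cost model, the filter can discard a candidate safely; it may still let through candidates that an exact comparison later rejects.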