July 2012 Newsletter Placing you one click away from the best new CAD research! Plain-text version at http://www.umn.edu/~tcad/newsletter/2012-07.txt REGULAR PAPERS ANALOG, MIXED-SIGNAL, AND RF CIRCUITS Liu, B.; Deferm, N.; Zhao, D.; Reynaert, P.; Gielen, G. G. E. An Efficient High-Frequency Linear RF Amplifier Synthesis Method Based on Evolutionary Computation and Machine Learning Techniques http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218230 Existing radio frequency (RF) integrated circuit (IC) design automation methods focus on the synthesis of circuits at a few GHz, typically less than 10 GHz. That framework is difficult to apply to RF IC synthesis at mm-wave frequencies (e.g., 60Ð100 GHz). In this paper, a new method, called efficient machine learning-based differential evolution, is presented for mm-wave frequency linear RF amplifier synthesis. By using electromagnetic (EM) simulations to evaluate the key passive components, the evaluation of circuit performances is accurate and solves the limitations of parasitic-included equivalent circuit models and predefined layout templates used in the existing synthesis framework. A decomposition method separates the design variables that require expensive EM simulations and the variables that only need cheap circuit simulations. Hence, a low-dimensional expensive optimization problem is generated. By the newly proposed core algorithm integrating adaptive population generation, naive Bayes classification, Gaussian process and differential evolution, the generated low-dimensional expensive optimization problem can be solved efficiently (by the online surrogate model), and global search (by evolutionary computation) can be achieved. A 100 GHz three-stage differential amplifier is synthesized in a 90 nm CMOS technology. The power gain reaches 10 dB with more than 20 GHz bandwidth. The synthesis costs only 25 h, having a comparable result and a nine times speed enhancement compared with directly using the EM simulator and global optimization algorithms. EMERGING TECHNOLOGIES Dong, X.; Xu, C.; Xie, Y.; Jouppi, N. P. NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218223 Various new nonvolatile memory (NVM) technologies have emerged recently. Among all the investigated new NVM candidate technologies, spin-torque-transfer memory (STT-RAM, or MRAM), phase-change random-access memory (PCRAM), and resistive random-access memory (ReRAM) are regarded as the most promising candidates. As the ultimate goal of this NVM research is to deploy them into multiple levels in the memory hierarchy, it is necessary to explore the wide NVM design space and find the proper implementation at different memory hierarchy levels from highly latency-optimized caches to highly density-optimized secondary storage. While abundant tools are available as SRAM/DRAM design assistants, similar tools for NVM designs are currently missing. Thus, in this paper, we develop NVSim, a circuit-level model for NVM performance, energy, and area estimation, which supports various NVM technologies, including STT-RAM, PCRAM, ReRAM, and legacy NAND Flash. NVSim is successfully validated against industrial NVM prototypes, and it is expected to help boost architecture-level NVM-related studies. MODELING AND SIMULATION Xie, L.; Davoodi, A. Post-Silicon Failing-Path Isolation Incorporating the Effects of Process Variations http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218222 This paper introduces a novel approach for isolating the failing paths for a fabricated chip. Failing paths are defined as the paths violating the timing constraint at the post-silicon stage, in the presence of process variations. To achieve this goal, a framework is suggested in which, first, a large number of statistically critical paths are extracted at the pre-silicon stage. These paths are those with the highest probability to violate the timing and therefore can form a ÒgoodÓ candidate set for failing paths. Next, a small set of representative paths are selected from those in the candidate set. These selected paths have their delays highly correlated with the delays of the remaining statistically critical paths. By directly measuring the delays of these selected representative paths at the post-silicon stage, the post-silicon delays of the remaining candidate failing paths can be predicted accurately that allows further isolating the failing paths for each fabricated chip. Simulation results show that up to a few thousand candidate failing paths can be accurately predicted using the post-silicon delays of less than 150 representative paths in the presence of more than 1000 independent process parameter variations for the ISCAS89 benchmark circuits. For each fabricated chip, the isolated failing paths are guaranteed to include all the ÒactualÓ failing paths in the candidate set. The proportion of the ÒactualÓ nonfailing paths from the isolated failing paths is shown to be very small. Paik, S.; Han, I.; Kim, S.; Shin, Y. Clock Gating Synthesis of Pulsed-Latch Circuits http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218237 Pulsed-latch circuits, in which latches are triggered by a short pulse, can reduce power consumption as well as increasing performance; and they can largely be designed using conventional computer-aided design tools. We explore the automatic synthesis of clock-gating logic for pulsed-latch circuits in which gating is implemented by enabling and disabling several pulse generators. The key problem is to arrange that each group of latches contains physically close latches, so that a short pulse from a pulse generator is delivered safely, and to ensure that the latches in a group have similar Boolean gating conditions because their clock is gated and ungated together. The resulting gating conditions should be implemented using as little extra logic as possible; for this purpose we rely on Boolean division, with an internal node of existing logic being used as the divisor. The proposed clock gating synthesis is assessed in 45-nm technology. Chen, Q.; Weng, S.-H.; Cheng, C.-K. A Practical Regularization Technique for Modified Nodal Analysis in Large-Scale Time-Domain Circuit Simulation http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218220 Fast full-chip time-domain simulation calls for advanced numerical integration techniques with capability to handle the systems with (tens of) millions of variables resulting from the modified nodal analysis (MNA). General MNA formulation, however, leads to a differential algebraic equation (DAE) system with singular coefficient matrix, for which most of explicit methods, which usually offer better scalability than implicit methods, are not readily available. In this paper, we develop a practical two-stage strategy to remove the singularity in MNA equations of large-scale circuit networks. A topological index reduction is first applied to reduce the DAE index of the MNA equation to one. The index-1 system is then fed into a systematic process to eliminate excess variables in one run, which leads to a nonsingular system. The whole regularization process is devised with emphasis on exact equivalence, low complexity, and sparsity preservation, and is thus well suited to handle extremely large circuits. PHYSICAL DESIGN Lee, M.-C.; Shi, Y.; Chang, S.-C. Efficient Wakeup Scheduling Considering Both Resource Usage and Timing Budget for Power Gating Designs http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218224 Power gating has been a very effective way to reduce power leakage. Normally, wakeup scheduling is required to control the turn-on times of sleep transistors to limit the surge current during the wakeup process. In this paper, a voltage sensor is adopted to compare the virtual ground voltage with the predesigned reference voltages, and use the result to determine the turn-on times of sleep transistors. We then propose a novel wakeup scheduling formulation that considers the tradeoff between wakeup times and hardware resources incurred by the voltage sensor. To address this problem, on the one hand, an efficient algorithm is proposed to find a wakeup scheduling with minimum wakeup time under the resource constraint. On the other hand, the algorithm to find a wakeup scheduling with minimum resource usage under the timing budget is presented. Experimental results show that with little increase in wakeup times, our algorithm can achieve significant hardware resource reduction for power gating designs. Liu, C.-H.; Kuo, S.-Y.; Lee, D. T.; Lin, C.-S.; Weng, J.-H.; Yuan, S.-Y. Obstacle-Avoiding Rectilinear Steiner Tree Construction: A Steiner-Point-Based Algorithm http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218228 For the obstacle-avoiding rectilinear Steiner minimal tree (OARSMT) problem, we present a Steiner-point-based algorithm that achieves the best practical performance among existing heuristics. We first propose a new concept of Steiner point locations, creating a linear-space routing graph with satisfactory Steiner point candidates to resolve the bottleneck of most existing heuristics. Then, we propose a Steiner-point-based framework to yield a solution, which is close to the key to the handling of the OARSMT problem. Experimental results show that this algorithm achieves excellent solution quality and speed performance at the same time. We also extend the Steiner-point-based framework to the obstacle-avoiding preferred direction Steiner tree problem with a good performance. SYSTEM-LEVEL DESIGN Majumder, T.; Borgens, M. E.; Pande, P. P.; Kalyanaraman, A. On-Chip Network-Enabled Multicore Platforms Targeting Maximum Likelihood Phylogeny Reconstruction http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218233 In phylogenetic inference, which aims at finding a phylogenetic tree that best explains the evolutionary relationship among a given set of species, statistical estimation approaches such as maximum likelihood (ML) and Bayesian inference provide more accurate estimates than other nonstatistical approaches. However, the improved quality comes at a higher computational cost, as these approaches, even though heuristic driven, involve optimization over multidimensional real continuous space. The number of possible search trees in ML is at least exponential, thereby making runtimes on even modest-sized datasets to clock up to several million CPU hours. Evaluation of these trees, involving node-level likelihood vector computation and branch-length optimization, can be partitioned into tasks (or kernels), providing the application with the potential to benefit from hardware acceleration. The range of hardware acceleration architectures tried so far offer limited degree of fine-grain parallelism. Network-on-chip (NoC) is an emerging paradigm that can efficiently support integration of massive number of cores on a chip. In this paper, we explore the design and performance evaluation of 2-D and 3-D NoC architectures for RAxML, which is one of the most widely used ML software suites. Specifically, we implement the computation kernels of the top three functions consuming more than 85% of the total software runtime. Simulations show that through appropriate choice of NoC architecture, and novel core design, allocation and placement strategies, our NoC-based implementation can achieve individual function-level speedups of 390x to 847x, speed up the targeted kernels in excess of 6500x, and provide end-to-end runtime reductions up to 5x over state-of-the-art multithreaded software. Shen, H.; Hamayun, M.-M.; Petrot, F. Native Simulation of MPSoC Using Hardware-Assisted Virtualization http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218234 Integration of multiple heterogeneous processors into a single system-on-a-chip is a clear trend in embedded devices. Designing and verifying these devices requires high-speed and easy-to-build simulation platforms. Among the software simulation approaches, native simulation is a good candidate since the embedded software is executed natively on the host machine, and no instruction set simulator development effort is necessary. However, existing native simulation approaches are such that the simulated software shares the memory space of the modeled hardware modules and the host operating system, making impractical the support of legacy code running on the target platform. To overcome this issue seldom mentioned in the literature, we propose the addition of a transparent address space translation layer to separate the target address space from the host simulator one. For this, we exploit the hardware-assisted virtualization technology now available on most general-purpose processors. Experiments show that this solution does not degrade the native simulation speed, while keeping the ability to accomplish software performance evaluation. TEST Natarajan, V.; Choi, H. W.; Banerjee, A.; Sen, S.; Chatterjee, A.; Srinivasan, G.; Taenzler, F.; Bhattacharya, S. Low Cost EVM Testing of Wireless RF SoC Front-Ends Using Multitones http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218236 Error-vector-magnitude (EVM) is a system level specification that determines the overall modulation quality and exhibits strong correlation to the inherent nonidealities of a radio frequency (RF) system. In production testing, EVM tests incur significant cost due to the large number of symbols required to ensure test quality. In our approach, EVM is decomposed into its deterministic (due to static impairments: IQ mismatch, gain, AM-AM and AM-PM) and random (due to dynamic impairments: VCO phase noise, thermal noise) components. The static impairments are computed from the device under test (DUT) response to an optimized multitone test input. The dynamic impairments are computed using signal processing algorithms from the DUT test response to the same test input. The EVM of the RF system is then derived from the computed static and dynamic impairments, respectively. Experimental results show that significant reduction in test time is possible without compromising EVM test quality. Chen, M.; Orailoglu, A. On Diagnosis of Timing Failures in Scan Architecture http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218221 Excessive test mode power-ground noise in nanometer scale chips causes large delay uncertainties in scan chains, resulting in a highly elevated rate of timing failures. The hybrid timing violation types in scan chains, compounded by their possibly intermittent manifestations, invalidate the conventional assumptions in scan chain fault behavior, significantly increasing the ambiguity and difficulty in diagnosis. In this paper, we propose a novel methodology to identify the root cause of scan chain timing failures. The proposed work addresses the challenge of diagnosing multiple permanent or intermittent timing faults in scan chains and the associated clock trees, which closely approximate the realistic failure mechanisms observed in silicon. Instead of relying on fault simulation that is incapable of approximating the intermittent fault manifestation, the proposed technique characterizes the impact of timing faults by analyzing the phase movement of scan patterns. Extracting fault-sensitive statistical features of phase movement information provides strong signals for the precise identification of fault locations and types. The manifestation probability of each fault is furthermore computed through a mathematical transformation framework which accurately models the behavior of multiple faults as a Markov chain. The identification of failing scan cells enables a further examination of the possible delay defects in the scan clock buffers, which ascertains the possible root causes of the observed scan chain failures. The proposed scheme characterizes the timing impact of the defective clock buffers by extracting the change in the delay distribution of the clock paths, enabling the effective pruning of unrealistic fault hypotheses that would result in highly deviant timing behavior. Simulation results have confirmed that the proposed methodology can yield highly accurate diagnosis results for complex fault manifestations. Stratigopoulos, H.-G. Test Metrics Model for Analog Test Development http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218239 The trend nowadays is to integrate more and more functionalities into a single chip. This, however, has serious implications in the testing cost. Especially for the analog circuits, the testing cost tends to be very high, despite the fact they occupy a small fraction of the area of the chip. Therefore, to reduce this cost, there is a high interest to replace the most demanding tests by alternative measurements. However, such replacement may inadvertently result in accepting faulty chips or rejecting functional chips. In this paper, we present a method for estimating such test metrics in the general scenario where a single test is replaced by a single measurement. The method is based on the extreme value theory and the statistical blockade algorithm. It can be readily applied during the test development phase to obtain estimates of the test metrics and corresponding confidence intervals with parts-per-million precision. For this purpose, the method requires a small number of selective simulations that we can afford to run in practice. SHORT PAPERS Poikonen, J. H.; Lehtonen, E.; Laiho, M. On Synthesis of Boolean Expressions for Memristive Devices Using Sequential Implication Logic http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218235 We determine an explicit procedure for representing any Boolean expression in a recursive form which can be realized using memristive devices, and demonstrate how the truth value of any Boolean expression can be determined using no more than two computing memristive devices. We present an algorithm which can be used to significantly reduce the number of implications required for representing a specific Boolean expression, and consider device-specific benefits in terms of reducing the lengths of computational sequences. Liang, Y.-Y.; Kuo, T.-Y.; Wang, S.-H.; Mak, W.-K. ALMmap: Technology Mapping for FPGAs With Adaptive Logic Modules http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218225 Modern field programmable gate arrays like Altera's Stratix Series have adopted the adaptive logic module (ALM) structure due to its potential performance and area advantages. An ALM can implement a single logic function or can be fractured into two smaller lookup tables (LUTs). In this paper, we propose an ALM mapping algorithm, ALMmap, for area minimization with bounded depth. We revamp the traditional iterative cut-based mapping flow and introduce a procedure for bounded depth mapping generation with dynamic area recovery that effectively combines cut selection, mapping, and area recovery together. In addition, we introduce a new procedure for computing cut set for ALM minimization under a depth constraint. The notion of area flow which has been used successfully for cut selection to reduce LUT count is revised for cut selection to reduce ALM count. ALMmap obtains depth optimal solutions that are 25.6% and 11.6% smaller, on average, than those produced by a classical mapper and WireMap, respectively. Smy, T.; Gunupudi, P. Robust Simulation of Opto-Electronic Systems by Alternating Complex Envelope Representations http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6218238 The increased use of optics in information processing and transmission systems motivates the development of self-consistent opto-electronic transient simulators. This letter presents a technique to significantly improve the DC and transient convergence behavior of a modified nodal analysis based optoelectronic simulation framework by allowing optical device models to be ÒswitchedÓ between a linear formulation using real and imaginary fields to a non-linear magnitude and phase representation. Configurable multilayer filters and optical ring modulators are used as examples to demonstrate the effectiveness of this approach in obtaining robust simulation convergence.