May 2013 Newsletter

Placing you one click away from the best new CAD research!

Plain-text version at http://www.umn.edu/~tcad/newsletter/2013-05.txt

Announcing the 2013 Donald O. Pederson Best Paper Award...

The Donald O. Pederson Award recognizes the best paper published in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in the two calendar years preceding the award. This year's award, which will be presented at the Design Automation Conference in San Francisco, goes to:

Wangyang Zhang; Xin Li; T. Liu; Emrah Acar; Rob A. Rutenbar; R. D. (Shawn) Blanton
Virtual Probe: A Statistical Framework for Low-Cost Silicon Characterization of Nanoscale Integrated Circuits
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6071091&isnumber=6071079
The paper proposes a novel statistical methodology for modeling spatial variations of silicon wafers and dies from a small set of measurement data. It is the first paper to present a solid theoretical framework derived from compressive sensing theory and fundamentally changes the way that spatial variations are modeled and interpreted. Unlike other conventional techniques that require a lot of measurement data to capture the spatial variation pattern and, hence, are slow and expensive, Virtual Probe is both accurate and cost-efficient. It can be applied to a broad range of practical problems related to integrated circuit design and manufacturing, such as yield learning and test cost reduction. This paper thus makes a unique and outstanding contribution to the VLSI/CAD community with significant long-term impact.

REGULAR PAPERS

EMERGING TECHNOLOGIES

Lee, D. ; Lee, W.S. ; Chen, C. ; Fallah, F. ; Provine, J. ; Chong, S. ; Watkins, J. ; Howe, R.T. ; Wong, H.-S.P. ; Mitra, S.
Combinational Logic Design Using Six-Terminal NEM Relays
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504552
This paper presents techniques for designing nanoelectromechanical relay-based logic circuits using six-terminal relays that behave as universal logic gates. With proper biasing, a compact 2-to-1 multiplexer can be implemented using a single six-terminal relay. Arbitrary combinational logic functions can then be implemented using well-known binary decision diagram (BDD) techniques. Compared to a CMOS-style implementation using four-terminal relays, the BDD-based implementation can result in lower area without major impact on performance metrics such as delay, and energy (when the relays are scaled to small dimensions). Although it is possible to implement any combinational circuit with a single mechanical delay, the relay count can be significantly reduced for complex logic functions by allowing multiple mechanical delays.

FPGAS AND RECONFIGURABLE COMPUTING

Teng, B. ; Anderson, J.H.
Latch-Based Performance Optimization for Field-Programmable Gate Arrays
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504531
We explore using pulsed latches for timing optimization in field-programmable gate arrays (FPGAs). Pulsed latches are transparent latches driven by a clock with a nonstandard (i.e., not 50%) duty cycle. As latches are already present on commercial FPGAs, their use for timing optimization can avoid the power or area drawbacks associated with other techniques such as clock skew and retiming. We propose algorithms that automatically replace certain flip-flops with latches for performance gains. Under conservative short path or minimum delay assumptions, our latch-based optimization, operating on already routed designs, provides all the benefit of clock skew in most cases and increases performance by 9%, on average, without area penalties or significant netlist changes.
We show that short paths greatly hinder the ability of using pulsed latches, and that further improvements in performance are possible by increasing the delay of certain short paths.

Stojilovic, M. ; Novo, D. ; Saranovac, L. ; Brisk, P. ; Ienne, P.
Selective Flexibility: Creating Domain-Specific Reconfigurable Arrays
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504556
Historically, hardware acceleration technologies have either been application-specific, therefore lacking in flexibility, or fully programmable, thereby suffering from notable inefficiencies on an application-by-application basis. To address the growing need for domain-specific acceleration technologies, this paper describes a design methodology (i) to automatically generate a domain-specific coarse-grained array from a set of representative applications and (ii) to introduce limited forms of architectural generality to increase the likelihood that additional applications can be successfully mapped onto it. In particular, coarse-grained arrays generated using our approach are intended to be integrated into customizable processors that use application-specific instruction set extensions to accelerate performance and reduce energy; rather than implementing these extensions using application-specific integrated circuit (ASIC) logic, which lacks flexibility, they can be synthesized onto our reconfigurable array instead, allowing the processor to be used for a variety of applications in related domains. Results show that our array is around 2x slower and 15x larger than an ultimately efficient ASIC implementation, and thus far more efficient than field-programmable gate arrays (FPGAs), which are known to be 3-4x slower and 20-40x larger. Additionally, we estimate that our array is usually around 2x larger and 2x slower than an accelerator synthesized using traditional datapath merging, which has, if any, very limited flexibility beyond the design set of DFGs.

MODELING AND SIMULATION

Park, S. ; Park, J. ; Shin, D. ; Wang, Y. ; Xie, Q. ; Pedram, M. ; Chang, N.
Accurate Modeling of the Delay and Energy Overhead of Dynamic Voltage and Frequency Scaling in Modern Microprocessors
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504549
Dynamic voltage and frequency scaling (DVFS) has been studied for well over a decade. Nevertheless, existing DVFS transition overhead models suffer from significant inaccuracies; for example, by incorrectly accounting for the effect of DC-DC converters, frequency synthesizers, voltage, and frequency change policies on energy losses incurred during mode transitions. Incorrect and/or inaccurate DVFS transition overhead models prevent one from determining the precise break-even time and thus forfeit some of the energy saving that is ideally achievable. This paper introduces accurate DVFS transition overhead models for both energy consumption and delay. In particular, we redefine the DVFS transition overhead including the underclocking-related losses in a DVFS-enabled microprocessor, additional inductor IR losses, and power losses due to discontinuous-mode DC-DC conversion. We report the transition overheads for a desktop, a mobile and a low-power representative processor. We also present a DVFS transition overhead macromodel for use by high-level DVFS schedulers.

PHYSICAL DESIGN

Liu, W.-H. ; Kao, W.-C. ; Li, Y.-L. ; Chao, K.-Y.
NCTU-GR 2.0: Multithreaded Collision-Aware Global Routing With Bounded-Length Maze Routing
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504553
Modern global routers employ various routing methods to improve routing speed and quality. Maze routing is the most time-consuming process for existing global routing algorithms. This paper presents two bounded-length maze routing (BLMR) algorithms (optimal-BLMR and heuristic-BLMR) that perform much faster routing than traditional maze routing algorithms.
In addition, a rectilinear Steiner minimum tree aware routing scheme is proposed to guide heuristic-BLMR and monotonic routing to build a routing tree with shorter wirelength. This paper also proposes a parallel multithreaded collision-aware global router based on a previous sequential global router (SGR). Unlike the partitioning-based strategy, the proposed parallel router uses a task-based concurrency strategy. Finally, a 3-D wirelength optimization technique is proposed to further refine the 3-D routing results. Experimental results reveal that the proposed SGR uses less wirelength and runs faster than most other state-of-the-art global routers with a different set of parameters. Compared to the proposed SGR, the proposed parallel router yields almost the same routing quality with average speedups of 2.71-fold and 3.12-fold on overflow-free and hard-to-route cases, respectively, when running on a 4-core system.

TEST

Ye, F. ; Zhang, Z. ; Chakrabarty, K. ; Gu, X.
Board-Level Functional Fault Diagnosis Using Artificial Neural Networks, Support-Vector Machines, and Weighted-Majority Voting
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504533
Increasing integration densities and high operating speeds lead to subtle manifestation of defects at the board level. Functional fault diagnosis is, therefore, necessary for board-level product qualification. However, ambiguous diagnosis results lead to long debug times and even wrong repair actions, which significantly increase repair cost and adversely impact yield. Advanced machine-learning (ML) techniques offer an unprecedented opportunity to increase the accuracy of board-level functional diagnosis and reduce high-volume manufacturing cost through successful repair. We propose a smart diagnosis method based on two ML classification models, namely, artificial neural networks (ANNs) and support-vector machines (SVMs) that can learn from repair history and accurately localize the root cause of a failure.
Fine-grained fault syndromes extracted from failure logs and corresponding repair actions are used to train the classification models. We also propose a decision machine based on weighted-majority voting, which combines the benefits of ANNs and SVMs. Three complex boards from the industry, currently in volume production, and additional synthetic data, are used to validate the proposed methods in terms of diagnostic accuracy, resolution, and quantifiable improvement over current diagnostic software.

Lin, Y.-H. ; Huang, S.-Y. ; Tsai, K.-H. ; Cheng, W.-T. ; Sunter, S. ; Chou, Y.-F. ; Kwai, D.-M.
Parametric Delay Test of Post-Bond Through-Silicon Vias in 3-D ICs via Variable Output Thresholding Analysis
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504551
A parametric delay fault could arise in a through-silicon via (TSV) of a 3-D IC due to a manufacturing defect. Identification of such a fault is essential for fault diagnosis, yield-learning, and/or reliability screening. In this paper, we present an innovative design-for-testability technique called variable output thresholding. We discovered that by dynamically switching the output of a TSV from a normal inverter to a Schmitt-Trigger inverter, the parametric delay fault on the TSV can be characterized and detected. SPICE simulation reveals that this technique remains effective even when there is significant process variation. A scalable test infrastructure indicates that the test time is modest at only 17.2 ms for 1024 TSVs and 648.8 ms for 32768 TSVs when the test clock is running at 10 MHz.

Liu, X. ; Xu, Q.
On Multiplexed Signal Tracing for Post-Silicon Validation
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504557
Trace-based debug techniques have widely been utilized in the industry to eliminate design errors that escaped pre-silicon verification. Existing solutions typically trace the same set of signals throughout each debug run, which is not quite effective for catching design errors.
In this paper, we propose a multiplexed signal tracing strategy that is able to significantly increase debuggability of the circuit. That is, we divide the tracing procedure in each debug run into a few periods and trace different sets of signals in each period. We present a trace signal grouping algorithm to maximize the probability of catching the propagated evidence from design errors, considering the trace interconnection fabric design constraints. Moreover, we propose a trace signal selection solution to enhance the error detection capability. Experimental results on benchmark circuits demonstrate the effectiveness of the proposed solution.

SYSTEM-LEVEL DESIGN

Naeem, A. ; Jantsch, A. ; Lu, Z.
Scalability Analysis of Memory Consistency Models in NoC-Based Distributed Shared Memory SoCs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504554
We analyze the scalability of six memory consistency models in network-on-chip (NoC)-based distributed shared memory multicore systems: 1) protected release consistency (PRC); 2) release consistency (RC); 3) weak consistency (WC); 4) partial store ordering (PSO); 5) total store ordering (TSO); and 6) sequential consistency (SC). Their realizations are based on a transaction counter and an address-stack-based approach. The scalability analysis is based on different workloads mapped on various sizes of networks using different problem sizes. For the experiments, we use the Nostrum NoC-based configurable multicore platform with a 2-D mesh topology and a deflection routing algorithm. Under the synthetic workloads, the average execution time for the PRC, RC, WC, PSO, and TSO models in the 8x8 network (64-cores) is reduced by 32.3%, 28.3%, 20.1%, 13.8%, and 9.9% over the SC model, respectively. For the application workloads, as the network size grows, the average execution time under these relaxed memory models decreases with respect to the SC model depending on the application and its match to the architecture.
The performance improvement of the PRC and RC models over the SC model tends to be higher than 50% as observed in the experiments, when the system is further scaled up. The area cost in the network interface for the relaxed memory models is increased by less than 4% over the SC model.

VERIFICATION

Cimatti, A. ; Narasamdya, I. ; Roveri, M.
Software Model Checking SystemC
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504555
SystemC is an increasingly used language for writing executable specifications of systems-on-chip. The verification of SystemC, however, is a very difficult challenge. Simulation features great scalability, but can miss important defects. On the other hand, formal verification of SystemC is extremely hard because of the presence of threads, and the intricacies of the communication and scheduling mechanisms. In this paper, we explore formal verification for SystemC by means of software model checking techniques, which have demonstrated substantial progress in recent years. We propose an accurate model of SystemC and three complementary encodings of SystemC to finite-state processes, sequential and threaded programming models. We implement the proposed approaches in a tool chain and carry out a thorough experimental evaluation using several benchmarks taken from the literature on SystemC verification, and experimenting with different state-of-the-art software model checkers. The results clearly show the applicability and efficiency of the proposed approaches. In particular, the results show the effectiveness of the threaded and of the finite-model encodings to prove and disprove properties, respectively.

Kumar, J.A. ; Vasudevan, S.
Formal Probabilistic Timing Verification in RTL
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504532
Variations in timing can occur due to multiple sources on a chip such as process variations and variations in input patterns.
It is desirable to have variation awareness at the register transfer level (RTL), and estimate block level delay distributions early in the design cycle, to evaluate design choices quickly and minimize postsynthesis simulation costs. In previous work, we introduced statistical high-level analysis and rigorous performance estimation (SHARPE), a rigorous, systematic methodology to verify design correctness in RTL in the presence of variations. We described SHARPE in the context of computing statistical delay invariants with respect to input variations. We treated the RTL source code as a program and used static program analysis techniques to compute probabilities. We modeled the probabilistic RTL modules as discrete time Markov chains that are then checked formally for probabilistic invariants using PRISM, a probabilistic model checker. In this paper, we extend SHARPE to perform timing verification in RTL in the context of process variations. We achieved this by obtaining a set of process variation-aware RTL delay models and correspondingly modifying the existing steps in SHARPE. We illustrate SHARPE on the RTL description of the datapath of OR1200, an open source embedded processor. We also apply SHARPE to other data-intensive RTL designs such as nontrivial components of communication systems and a few benchmark designs.

SHORT PAPER

Chen, Q. ; Schoenmaker, W. ; Chen, G. ; Jiang, L. ; Wong, N.
A Numerically Efficient Formulation for Time-Domain Electromagnetic-Semiconductor Cosimulation for Fast-Transient Systems
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6504550
We report recent progress in developing a numerically efficient formulation for electromagnetic-technology computer-aided design cosimulation for fast-transient computations. The difficulties underlying the currently existing transient formulation stemming from the vector potential-scalar potential (A-V) framework are analyzed.
A time-domain electric field-scalar potential (E-V) framework is then developed via equation and variable transformations. This results in better-conditioned systems that are friendly to iterative solutions at fast switching times. Numerical examples show that the proposed E-V solver renders a useful tool for addressing multidomain simulation.
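
Editor's illustration: the abstract above notes that better-conditioned systems are friendlier to iterative solution. The sketch below is a generic demonstration of that general point only. It is not the paper's E-V formulation. It runs a textbook conjugate gradient loop on two synthetic diagonal systems that differ only in condition number and reports how many iterations each needs. The function names, matrix size, spectra, and tolerance are all assumptions chosen for illustration.

```python
import math

def cg_iterations(diag, b, tol=1e-8, max_iter=10_000):
    """Textbook conjugate gradient on A = diag(diag); returns iterations used.

    Hypothetical helper for illustration; stops when the residual norm
    drops to tol * ||b||, or returns max_iter if that never happens.
    """
    x = [0.0] * len(diag)
    r = list(b)                      # residual b - A x for the initial x = 0
    p = list(r)                      # first search direction
    rs_old = sum(ri * ri for ri in r)
    b_norm = math.sqrt(rs_old)
    for k in range(1, max_iter + 1):
        Ap = [d * pi for d, pi in zip(diag, p)]
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if math.sqrt(rs_new) <= tol * b_norm:
            return k
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return max_iter

def spectrum(n, cond):
    """n eigenvalues evenly spaced between 1 and cond (assumed test spectrum)."""
    return [1.0 + (cond - 1.0) * i / (n - 1) for i in range(n)]

n = 200                              # assumed problem size
b = [1.0] * n
well = cg_iterations(spectrum(n, 1e1), b)   # condition number 10
ill = cg_iterations(spectrum(n, 1e4), b)    # condition number 10^4
print("iterations, well-conditioned:", well)
print("iterations, ill-conditioned: ", ill)
```

The well-conditioned system converges in fewer iterations than the ill-conditioned one, which is the general reason a reformulation that improves conditioning can pay off for iterative solvers.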