April 2012 Newsletter
Placing you one click away from the best new CAD research!

REGULAR PAPERS

KEYNOTE PAPER

Zhang, J., Lin, A., Patil, N., Wei, H., Wei, L., Wong, H.-P., and Mitra, S.
Carbon Nanotube Robust Digital VLSI
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171056

Carbon nanotube field-effect transistors (CNFETs) are excellent candidates for building highly energy-efficient electronic systems of the future. However, fundamental limitations inherent to carbon nanotubes (CNTs) pose major obstacles to the realization of robust CNFET digital very large-scale integration (VLSI): 1) it is nearly impossible to guarantee perfect alignment and positioning of all CNTs, despite the near-perfect CNT alignment achieved in recent years; 2) CNTs can be metallic or semiconducting depending on chirality; and 3) CNFET circuits can suffer from large performance variations, reduced yield, and increased susceptibility to noise. Today's CNT process improvements alone are inadequate to overcome these challenges. This paper presents an overview of: 1) imperfections and variations inherent to CNTs; 2) design and processing techniques, together with a probabilistic analysis framework, for robust CNFET digital VLSI circuits immune to inherent CNT imperfections and variations; and 3) a recent experimental demonstration of CNFET digital circuits that are immune to CNT imperfections. Significant advances in design tools can enable robust and scalable CNFET circuits that overcome the challenges of CNFET technology while retaining its energy-efficiency benefits.

MODELING AND SIMULATION

Heloue, K. R., Onaissi, S., and Najm, F. N.
Efficient Block-Based Parameterized Timing Analysis Covering All Potentially Critical Paths
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171052

For the results of timing analysis to be useful, they must provide insight and guidance on how the circuit may be improved so as to fix any reported timing problems. A limitation of many recent variability-aware timing analysis techniques is that, while they report delay distributions or verify multiple corners, they do not provide the guidance required for redesign. We propose an efficient block-based parameterized timing analysis technique that can accurately capture circuit delay at every point in the parameter space by reporting all paths that can become critical. Using an efficient pruning algorithm, only the potentially critical paths are carried forward, while all other paths are discarded during propagation. This allows one to examine local robustness to parameters in different regions of the parameter space, not by considering differential sensitivity at a point (which would be of little use in this context) but through knowledge of the paths that can become critical at nearby points in the parameter space. We give a formal definition of this problem and propose a technique for solving it that improves on the state of the art, both in theoretical computational complexity and in runtime on various test circuits.
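
For readers who want to experiment with the pruning idea, here is a minimal Python sketch. It assumes each path delay is an affine model d(p) = a0 + sum_i a_i*p_i over a parameter box p_i in [-1, 1], and discards a path whenever some other path dominates it at every point of the box. The affine model and the all-pairs dominance test are illustrative assumptions, not the authors' exact algorithm.

    # Each path is a tuple (a0, a1, ..., ak): nominal delay plus one
    # sensitivity per parameter, with parameters ranging over [-1, 1].
    def dominates(b, a):
        """True if path b's delay >= path a's delay everywhere in the box."""
        # max over the box of d_a(p) - d_b(p) equals the constant-term
        # difference plus the sum of absolute sensitivity differences.
        worst = (a[0] - b[0]) + sum(abs(x - y) for x, y in zip(a[1:], b[1:]))
        return worst <= 0

    def prune(paths):
        """Keep only paths that can still be critical somewhere in the box."""
        return [a for a in paths
                if not any(dominates(b, a) for b in paths if b is not a)]

    # Three paths over two parameters: the second can never be critical.
    paths = [(10.0, 2.0, 0.5), (9.0, 0.1, 0.1), (11.0, -0.5, 0.2)]
    print(prune(paths))   # -> [(10.0, 2.0, 0.5), (11.0, -0.5, 0.2)]
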
Chung, J. and Abraham, J. A.
Refactoring of Timing Graphs and Its Use in Capturing Topological Correlation in SSTA
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171045

Reconvergent paths in circuits have long been a nuisance in various computer-aided design (CAD) algorithms, and no elegant solution for dealing with them has yet been found. In statistical static timing analysis (SSTA), they make it difficult to capture topological correlation. This paper presents a technique for arbitrary block-based SSTA that reduces the error caused by ignoring topological correlation. We interpret a timing graph as an algebraic expression made up of addition and maximum operators. We define a division operation on the expression and propose algorithms that modify factors in the expression without expansion. As a result, the algorithms produce an expression that derives the latest arrival time with better accuracy in SSTA. Existing techniques for handling reconvergent fanouts usually rely on dependency lists, which require quadratic space; the proposed technique instead achieves linear space complexity by using a new directed acyclic graph search algorithm. Our results show that it outperforms an existing technique in speed and memory usage with comparable accuracy. More importantly, the proposed technique is not limited to SSTA and is potentially applicable to various problems caused by reconvergent paths in timing-related CAD algorithms.

Chung, J., Xiong, J., Zolotov, V., and Abraham, J. A.
Path Criticality Computation in Parameterized Statistical Timing Analysis Using a Novel Operator
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171046

This paper presents a method to compute criticality probabilities of paths in parameterized statistical static timing analysis. We partition the set of all paths into several groups and formulate path criticality as a joint probability of inequalities. Before evaluating the joint probability directly, we simplify the inequalities through algebraic elimination, handling topological correlation. Our proposed method uses conditional probabilities to obtain the joint probability, and the statistics of the random variables representing process parameters are updated to take the conditions into account. To calculate the conditional statistics of the random variables, we derive analytic formulas by extending Clark's work. This allows us to obtain the conditional probability density function of a path delay, given that the path is critical, as well as to compute criticality probabilities of paths. Our experimental results show that the proposed method provides 4.2X better accuracy on average compared to the state-of-the-art method.
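
Both papers above manipulate max operations over correlated Gaussian arrival times. As background, the following minimal sketch implements Clark's classical moment matching for max(X, Y), which the criticality paper extends to conditional statistics; the numeric example is arbitrary. Note that cdf(alpha) below is also the tightness probability P(X >= Y), the simplest two-path instance of a criticality probability.

    from math import sqrt, exp, pi, erf

    def pdf(x):                       # standard normal density
        return exp(-0.5 * x * x) / sqrt(2.0 * pi)

    def cdf(x):                       # standard normal distribution
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def clark_max(mu1, s1, mu2, s2, rho):
        """First two moments of max(X, Y) for X~N(mu1,s1^2), Y~N(mu2,s2^2)
        with correlation rho (Clark, 1961)."""
        theta = sqrt(max(s1*s1 + s2*s2 - 2.0*rho*s1*s2, 1e-12))
        alpha = (mu1 - mu2) / theta   # cdf(alpha) = P(X >= Y), "tightness"
        m1 = mu1*cdf(alpha) + mu2*cdf(-alpha) + theta*pdf(alpha)
        m2 = ((mu1*mu1 + s1*s1)*cdf(alpha) + (mu2*mu2 + s2*s2)*cdf(-alpha)
              + (mu1 + mu2)*theta*pdf(alpha))
        return m1, sqrt(max(m2 - m1*m1, 0.0))   # mean and std of the max

    print(clark_max(10.0, 1.0, 9.5, 1.5, 0.3))
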
Ibrahim, W., Beiu, V., and Beg, A.
GREDA: A Fast and More Accurate Gate Reliability EDA Tool
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171048

Generic as well as customized reliability electronic design automation (EDA) tools have been proposed in the literature and used to estimate the reliability of both present and future (nano)circuits. However, the accuracy of many of these EDA tools is questionable, as they either: 1) assume that all gates have the same constant probability of failure (PF_GATE = const.); or 2) use very simple approaches to estimate the reliability of the elementary gates. In this paper, we introduce a gate reliability EDA tool (GREDA) that estimates the reliability of CMOS gates more accurately by considering: 1) the gate's topology; 2) the variable probability of failure of the individual devices, PF_DEV; 3) the applied input vector; 4) the reliability of the input signals; and 5) input voltage variations (which can be linked to the allowed noise margins). GREDA can be used to calculate PF_GATE due to different types of faults and/or defects, and to estimate the effect of improving PF_DEV on PF_GATE. Simulation results show that GREDA improves the accuracy of reliability calculations at the gate level.

Kazmierski, T. J., Wang, L., Al-Hashimi, B. M., and Merrett, G. V.
An Explicit Linearized State-Space Technique for Accelerated Simulation of Electromagnetic Vibration Energy Harvesters
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171049

Vibration energy harvesting systems pose significant modeling and design challenges due to their mixed-technology nature, the extremely low levels of available energy, and the disparate time scales between different parts of a complete harvester. An energy harvester is a complex system of tightly coupled components modeled in the mechanical, magnetic, and electrical analog and digital domains. Currently available design tools are inadequate for simulating such systems due to prohibitive CPU times. This paper proposes a new technique that accelerates simulations of complete vibration energy harvesters by approximately two orders of magnitude. The idea is to linearize the state equations of the system's analog components to obtain a fast estimate of the maximum step size that guarantees the numerical stability of explicit integration based on the Adams-Bashforth formula. We show that the energy harvester's analog electronics can be efficiently and reliably simulated in this way, with CPU times two orders of magnitude lower than those obtained from two state-of-the-art tools, VHDL-AMS and SystemC-A. As a case study, a practical, complex microgenerator with magnetic tuning and two types of power-processing circuits has been simulated using the proposed technique and verified experimentally.
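
The core trick can be sketched in a few lines under simplifying assumptions: numerically linearize the state equations, then bound the step size by the eigenvalues of the resulting Jacobian. The toy system, the finite-difference Jacobian, and the stability test h*|lambda| <= 1 (exact for second-order Adams-Bashforth only on the negative real axis) are illustrative choices, not the paper's exact procedure.

    import numpy as np

    def jacobian(f, x0, eps=1e-6):
        """Finite-difference Jacobian of f at x0."""
        f0, n = f(x0), len(x0)
        J = np.empty((n, n))
        for j in range(n):
            dx = np.zeros(n); dx[j] = eps
            J[:, j] = (f(x0 + dx) - f0) / eps
        return J

    def max_stable_step(f, x0, safety=0.8):
        """Largest step passing the AB2 real-axis test h*|lambda| <= 1."""
        lam = np.linalg.eigvals(jacobian(f, x0))
        return safety / np.abs(lam).max()

    def ab2(f, x, h, steps):
        """Fixed-step AB2: x_{n+1} = x_n + h*(1.5*f_n - 0.5*f_{n-1})."""
        fprev = f(x)
        x = x + h * fprev                  # bootstrap with one Euler step
        for _ in range(steps - 1):
            fn = f(x)
            x, fprev = x + h * (1.5 * fn - 0.5 * fprev), fn
        return x

    # Toy damped resonator standing in for a linearized analog block.
    A = np.array([[0.0, 1.0], [-1.0e6, -2.0e3]])
    f = lambda x: A @ x
    x0 = np.array([1.0, 0.0])
    h = max_stable_step(f, x0)
    print(h, ab2(f, x0, h, 1000))          # stays bounded and decays
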
Wang, Y., Zhang, Z., Koh, C.-K., Shi, G., Pang, G. K. H., and Wong, N.
Passivity Enforcement for Descriptor Systems Via Matrix Pencil Perturbation
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171054

Passivity is an important property of circuits and systems for guaranteeing stable global simulation. Nonetheless, nonpassive models may result from passive underlying structures due to numerical or measurement error/inaccuracy. A postprocessing passivity enforcement algorithm is therefore desirable to perturb the model into passivity under a controlled error. However, the previous literature reports such passivity enforcement algorithms only for pole-residue models and regular systems (RSs). In this paper, passivity enforcement algorithms are proposed for descriptor systems (DSs, a superset of RSs) with a possibly singular direct term (specifically, D + D^T or I - DD^T). The proposed algorithms cover all kinds of state-space models (RSs or DSs, with singular or nonsingular direct terms, in the immittance or scattering representation) and thus have a much wider application scope than existing algorithms. The passivity enforcement is reduced to two standard optimization problems that can be solved efficiently. The objective functions in both optimization problems are the error functions, so perturbed models with adequate accuracy can be obtained. Numerical examples verify the efficiency and robustness of the proposed algorithms.

SYSTEM-LEVEL DESIGN

Cho, H., Leem, L., and Mitra, S.
ERSA: Error Resilient System Architecture for Probabilistic Applications
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171059

There is growing concern about the increasing vulnerability of future computing systems to errors in the underlying hardware. Traditional redundancy techniques are too expensive for designing energy-efficient systems that are resilient to high error rates. We present the Error Resilient System Architecture (ERSA), a robust system architecture that targets emerging killer applications such as recognition, mining, and synthesis (RMS), which have inherent error resilience, and ensures high degrees of resilience at low cost. Using the concept of configurable reliability, ERSA may also be adapted for general-purpose applications that are less resilient to errors (though at higher cost). While the resilience of RMS applications to errors in low-order bits of data is well known, execution of such applications on error-prone hardware significantly degrades output quality (due to high-order bit errors and crashes). ERSA achieves high resilience to high-order bit errors and control-flow errors (in addition to low-order bit errors) using a judicious combination of three key ideas: 1) asymmetric reliability in many-core architectures; 2) error-resilient algorithms at the core of probabilistic applications; and 3) intelligent software optimizations. Error injection experiments on a multicore ERSA hardware prototype demonstrate that, even at very high error rates of 20 errors/flip-flop/10^8 cycles (equivalent to 25000 errors/core/s), ERSA maintains 90% or better accuracy of output results, with minimal impact on execution time, for probabilistic applications such as K-Means clustering, LDPC decoding, and Bayesian network inference. In addition, we demonstrate the effectiveness of ERSA in tolerating high rates of static memory errors that are characteristic of emerging challenges related to SRAM V_ccmin problems and erratic bit errors.

Park, H., Yoo, S., and Lee, S.
A Multistep Tag Comparison Method for a Low-Power L2 Cache
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171055

Tag comparison in a highly associative cache consumes a significant portion of the cache energy. Existing methods for reducing tag comparisons are based on predicting either cache hits or cache misses. In this paper, we present novel ideas for both cache hit and cache miss prediction: a partial-tag-enhanced Bloom filter that improves the accuracy of cache miss prediction, and hot/cold checks that track data liveness to reduce the tag comparisons of cache hit prediction. We also combine both methods so that their order of application can be dynamically adjusted to adapt to changing cache access behavior, which further reduces tag comparisons. To overcome the common limitation of multistep tag comparison methods, we propose a method that reduces tag comparisons while meeting a given performance bound. Experimental results show that the proposed method reduces the energy consumption of tag comparison by an average of 88.40%, which translates to an average reduction of 35.34% (40.19% with low-power data access) in the total energy consumption of the L2 cache, and a further reduction of 8.86% (10.07% with low-power data access) compared with existing methods.
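
The miss-prediction half of the scheme lends itself to a small sketch: a counting Bloom filter indexed by partial tags can declare some accesses definite misses, so the expensive full tag comparison is skipped. Filter sizing, the hash functions, and the partial-tag width below are placeholder choices, and the hot/cold liveness checks are not modeled.

    class PartialTagBloom:
        """Counting Bloom filter over partial tags (illustrative sizes)."""
        def __init__(self, slots=1024, hashes=2, partial_bits=8):
            self.count = [0] * slots
            self.hashes = hashes
            self.mask = (1 << partial_bits) - 1

        def _slots(self, tag):
            p = tag & self.mask              # keep only the partial tag
            return [hash((i, p)) % len(self.count)
                    for i in range(self.hashes)]

        def insert(self, tag):               # on cache line fill
            for s in self._slots(tag):
                self.count[s] += 1

        def remove(self, tag):               # on eviction (hence counting)
            for s in self._slots(tag):
                self.count[s] -= 1

        def maybe_present(self, tag):
            """False -> definite miss: skip the tag comparison entirely.
            True  -> possible hit: fall back to full tag comparison."""
            return all(self.count[s] > 0 for s in self._slots(tag))

    bf = PartialTagBloom()
    bf.insert(0xABCD12)
    print(bf.maybe_present(0xABCD12))   # True: do the full comparison
    print(bf.maybe_present(0x123456))   # almost surely False: skip it
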
Passas, G., Katevenis, M., and Pnevmatikatos, D.
Crossbar NoCs Are Scalable Beyond 100 Nodes
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171053

We describe the design and layout of a radix-128 crossbar in 90 nm CMOS. The data path is 32 bits wide and runs at 750 MHz using a three-stage pipeline, while fitting in a silicon area as small as 6.6 mm^2 by filling it at the 90% level. The control path occupies 7 mm^2 next to the data path, filled at the 35% level, and reconfigures the data path once every three clock cycles. Next, we arrange 128 1 mm^2 "user tiles" around the crossbar, forming a 150 mm^2 die, and connect all tiles to the crossbar via global links running on top of the tiles. Including the overhead of repeaters and flip-flops on the global links, the area cost of the crossbar is 11% of the die. Thus, we show that crossbar networks-on-chip (NoCs) are small enough for radices far exceeding the few tens of ports previously believed to be the practical limit, reaching above 100 ports. We also attempt a first-order comparison between our crossbar and a model of a popular mesh NoC, and find that our crossbar NoC increases performance when traffic is global and stressed, at the cost of worse performance when traffic is local and benign. Finally, we present an experimental cost analysis showing that crossbar area practically grows as O(N^2 W), since all wiring of the crossbar fits over its standard cells, while crossbar delay grows as O(N sqrt(W)), since wire length increases with the perimeter of the crossbar.

Chouhan, S., Balakrishnan, M., and Bose, R.
System-Level Design Space Exploration Methodology for Energy-Efficient Sensor Node Configurations: An Experimental Validation
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171044

For sensor nodes deployed at small distances, computation energy along with radio energy determines battery life. We have previously proposed a system-level design space exploration methodology for finding an energy-efficient error-correcting code (ECC); that methodology takes the computation energy and the radio energy into account in an integrated manner. In this paper, we validate the methodology by deploying Imote2 nodes and measuring energy values under different operating modes, e.g., with and without ECC. In the process, we propose a validation framework and a node energy model. Experimental results validate the methodology and show that, with ECC, we can save up to 14% of transmitter energy under a certain set of conditions. The main contribution of this paper is to establish experimentally that our methodology is effective in exploring various node configurations and finding an energy-efficient solution.
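
The tradeoff being explored can be captured in a few lines. The sketch below uses a deliberately simplified energy model and entirely hypothetical numbers (the coding gains, code rates, and per-bit energies are made up for illustration): an ECC pays off only when the transmit-energy saving from its coding gain outweighs the extra redundant bits and the CPU energy spent encoding and decoding.

    def energy_per_info_bit(e_radio, gain_db, rate, e_cpu):
        """Hypothetical node energy model, per information bit.
        e_radio -- radio energy per channel bit at the uncoded power level
        gain_db -- transmit-power reduction allowed by the code's gain
        rate    -- code rate (information bits / channel bits)
        e_cpu   -- encode + decode CPU energy per information bit"""
        radio = e_radio * 10.0 ** (-gain_db / 10.0) / rate
        return radio + e_cpu

    # (name, coding gain dB, code rate, CPU nJ/bit) -- all made-up values.
    configs = [("no ECC",    0.0, 1.0,   0.0),
               ("Hamming",   2.0, 4/7,   5.0),
               ("conv code", 4.0, 1/2,  30.0)]
    e_radio = 500.0                    # nJ per channel bit, hypothetical
    for name, gain, rate, cpu in configs:
        print(f"{name:9s} {energy_per_info_bit(e_radio, gain, rate, cpu):7.1f} nJ/bit")

With these placeholder numbers the rate penalty makes the weak code a net loss while the stronger code wins; shrink e_radio (shorter transmission distance) and even that advantage disappears, which is exactly the kind of crossover such a design space exploration has to locate.
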
TEST

Hong, H.-C.
A Static Linear Behavior Analog Fault Model for Switched-Capacitor Circuits
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171047

This paper proposes a static linear behavior (SLB) analog fault model for switched-capacitor (SC) circuits. The SC circuit under test (CUT) is divided into functional macros, including the operational amplifiers, the capacitors, and the switches. Each macro has specified design parameters from a design perspective; these design parameters constitute a parameter set that determines the practical transfer function of the CUT. The SLB fault model defines a CUT as faulty if its parameter set results in transfer functions whose frequency responses fall outside the design specification. We analyze the fault effects of the macros and derive their faulty signal-flow graph models, from which the faulty transfer function templates of the CUT can be automatically generated. Based on the templates, we propose a test procedure that estimates all the parameters in the parameter set, so as to test the CUT for multiple faults. Unlike the conventional single-fault assumption, the proposed SLB fault model covers concurrent multiple parametric faults as well as catastrophic faults. In addition, it does not require fault simulations before test, as conventional analog fault models do; as a result, it addresses the problem of impractically long fault simulation times. A fully differential low-pass SC biquad filter is adopted as an example to demonstrate how to design and use efficient multitone tests to test for the parameter set. The multitone test results acquired during the test procedure also reveal the distortion and noise performance of the CUT, even though the SLB fault model does not include them.

Lee, D. and Roy, K.
Viterbi-Based Efficient Test Data Compression
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171050

This paper presents a Viterbi-based test compression algorithm and architecture that provides high encoding efficiency and scalability with respect to the number of test channels. The proposed scheme finds a set of compressed test vectors using the Viterbi algorithm instead of solving linear equations. By assigning a cost function to the branch metric of the Viterbi algorithm, an optimal compressed vector is selected from the set of possible solutions. This feature gives high flexibility to combine various test requirements, such as low-power compression and/or an improved capability to repeat test patterns. The proposed on-chip decompressor follows the structure of Viterbi encoders, which require only one input channel. Experimental results on test volume show improvement on all ISCAS89 benchmark circuits (a 19.32% reduction on average) compared to previous test data compression architectures. The proposed scheme also yields an efficient power-dissipation/volume tradeoff.
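
To give a feel for the search (not the authors' architecture), here is a toy: the decompressor is modeled as a single-input shift register whose XOR taps emit one scan bit per cycle, and a Viterbi-style dynamic program finds, among all input sequences whose outputs match the care bits of a test cube, one minimizing a branch-metric cost (here simply the weight of the compressed stream, standing in for a low-power metric). The register length and taps are arbitrary, and a real decompressor drives many scan chains per input bit, which is where the actual compression comes from.

    K, TAPS = 3, (0, 2)      # register length; output = XOR of these bits
    MASK = (1 << K) - 1

    def viterbi_compress(cube):
        """cube: string over '0', '1', 'x'. Returns a minimum-cost input
        bit list whose decompressor output matches every care bit."""
        survivors = {0: (0, [])}             # state -> (cost, inputs)
        for want in cube:
            nxt = {}
            for state, (cost, bits) in survivors.items():
                for b in (0, 1):             # try both branch inputs
                    s = ((state << 1) | b) & MASK
                    out = 0
                    for t in TAPS:
                        out ^= (s >> t) & 1
                    if want != "x" and out != int(want):
                        continue             # branch violates a care bit
                    c = cost + b             # branch metric: stream weight
                    if s not in nxt or c < nxt[s][0]:
                        nxt[s] = (c, bits + [b])
            survivors = nxt
            if not survivors:
                return None                  # cube not encodable
        return min(survivors.values())[1]

    print(viterbi_compress("1x0x10xx"))
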
Lu, S.-K., Wang, Z.-Y., Tsai, Y.-M., and Chen, J.-L.
Efficient Built-In Self-Repair Techniques for Multiple Repairable Embedded RAMs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6170996

In this paper, efficient built-in self-repair (BISR) techniques are proposed for multiple repairable memory cores with different sizes and divided redundancy mechanisms. Embedded memory cores are first partitioned into memory groups. For each memory group, a redundant memory module is added and divided into row blocks and column blocks. Moreover, the memory cores within a memory group are partitioned into divided arrays (consisting of row/column blocks) of the same size. The redundant memory can be shared among all memory cores within the same memory group; therefore, unlike traditional redundancy architectures, a row (column) block is used as the basic replacement element. Based on the proposed redundancy architecture, a heuristic heterogeneous extended spare pivoting redundancy analysis algorithm suitable for built-in implementation is also proposed. Experimental results show that the repair rate and manufacturing yield can be improved significantly due to the efficient usage of redundancy. Moreover, the area overhead of the BISR circuitry for an example memory group consisting of four memory instances of size 9.25 Mbits is only 1.12%.

Fang, H., Chakrabarty, K., Wang, Z., and Gu, X.
Reproduction and Detection of Board-Level Functional Failure
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171058

No trouble found (NTF) due to functional failures is a common scenario today in board-level testing at system companies. A component on a board fails during the board-level functional test, but passes the automatic test equipment (ATE) test when it is returned to the supplier for warranty replacement or service repair. To find the root cause of NTF, we propose an innovative functional test approach and design-for-testability (DFT) methods for the detection of board-level functional failures. These DFT and test methods allow us to reproduce and detect functional failures in a controlled, deterministic environment, and to provide ATE tests to the supplier for early screening of defective parts. Experiments on an industrial design show that the proposed functional scan test with appropriate functional constraints can adequately mimic the functional state space, as measured by appropriate coverage metrics. Experiments also show that most functional failures due to dominant bridging faults, crosstalk, and delay faults caused by power supply noise can be reproduced and detected by the functional scan test. We also describe two approaches that enhance the mimicking of the functional state space. The first selects a given number of initial states in linear time; functional scan tests derived from the selected states are then used to mimic the functional state space. The second is based on controlled state injection.

SHORT PAPERS

Lee, L.-J., Tseng, W.-D., Lin, R.-B., and Chang, C.-H.
2^n Pattern Run-Length for Test Data Compression
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6171051

This paper presents a new pattern run-length compression method whose decompressor is simple and easy to implement. It encodes 2^n runs of compatible or inversely compatible patterns, either inside a single test data segment or across multiple test data segments. Experimental results show that it achieves an average compression ratio of 67.64% and considerable savings in test application time.
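
The encoder side of the idea fits in a short sketch. It greedily merges consecutive equal-width test segments into one codeword whenever each segment is compatible with the reference pattern (bitwise equal wherever both specify a value, with 'x' matching anything) or inversely compatible (matching the complement under the same rule), capping run lengths at 2^n. The codeword layout and the greedy policy are illustrative simplifications, not the paper's exact encoding.

    def compatible(ref, seg, invert=False):
        """Segments match wherever both specify a value ('x' is free)."""
        return all(r == "x" or s == "x" or (r != s) == invert
                   for r, s in zip(ref, seg))

    def encode(segments, n=3):
        """Greedy 2^n pattern run-length encoding of equal-width segments.
        Emits (run_length, inverted?, reference_pattern) codewords."""
        out, i, cap = [], 0, 2 ** n
        while i < len(segments):
            ref, best_run, best_inv = segments[i], 1, False
            for inv in (False, True):        # try plain and inverted runs
                run = 1
                while (i + run < len(segments) and run < cap
                       and compatible(ref, segments[i + run], inv)):
                    run += 1
                if run > best_run:
                    best_run, best_inv = run, inv
            out.append((best_run, best_inv, ref))
            i += best_run
        return out

    segs = ["10x1", "1001", "0110", "0xx0", "1111"]
    print(encode(segs))   # -> [(2, False, '10x1'), (2, False, '0110'),
                          #     (1, False, '1111')]
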