November 2011 Newsletter
Placing you one click away from the best new CAD research!
Plain-text version at http://www.umn.edu/~tcad/newsletter/2011-11.txt

REGULAR PAPERS

EMBEDDED SYSTEMS

Kim, Y. Park, S. Cho, Y. Chang, N.
System-Level Online Power Estimation Using an On-Chip Bus Performance Monitoring Unit
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046165
Quality power estimation is a basis of efficient power management of electronic systems. Indirect power measurement, such as power estimation using a CPU performance monitoring unit (PMU), is widely used for its low cost and area overheads. However, the existing CPU PMUs only monitor the core and cache activities, which results in a significant accuracy limitation in the system-wide power estimation including off-chip memory devices. In this paper, we propose an on-chip bus (OCB) PMU that directly captures on-chip and off-chip component activities by snooping the OCB. The OCB PMU stores the activity information in separate counters, and online software converts counter values into actual power values with simple first-order linear power models. We also introduce an optimization algorithm that minimizes the energy model to reduce the number of counters in the OCB PMU. We compare the accuracy of the power estimation using the proposed OCB PMU with real hardware measurement and cycle-accurate system-level power estimation, and demonstrate high estimation accuracy compared with the CPU PMU-based estimation method.

FPGAs AND RECONFIGURABLE COMPUTING

Kim, Y. Lee, J. Shrivastava, A. Yoon, J. W. Cho, D. Paek, Y.
High Throughput Data Mapping for Coarse-Grained Reconfigurable Architectures
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046176
Coarse-grained reconfigurable arrays (CGRAs) are a very promising platform, providing both up to 10–100 MOps/mW of power efficiency and software programmability.
However, this promise of CGRAs critically hinges on the effectiveness of application mapping onto CGRA platforms. While previous solutions have greatly improved the computation speed, they have largely ignored the impact of the local memory architecture on the achievable power and performance. This paper motivates the need for memory-aware application mapping for CGRAs, and proposes an effective solution for application mapping that considers the effects of various memory architecture parameters including the number of banks, local memory size, and the communication bandwidth between the local memory and the external main memory. Further we propose efficient methods to handle dependent data on a double-buffering local memory, which is necessary for recurrent loops. Our proposed solution achieves 59% reduction in the energy-delay product, which factors into about 47% and 22% reduction in the energy consumption and runtime, respectively, as compared to memory-unaware mapping for realistic local memory architectures. We also show that our scheme scales across a range of applications and memory parameters, and the runtime overhead of handling recurrent loops by our proposed methods can be less than 1%.

MODELING AND SIMULATION

Zhao, X. Guo, Y. Chen, X. Feng, Z. Hu, S.
Hierarchical Cross-Entropy Optimization for Fast On-Chip Decap Budgeting
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046169
Decoupling capacitor (decap) has been widely used to effectively reduce dynamic power supply noise. Traditional decap budgeting algorithms usually explore the sensitivity-based nonlinear optimizations or conjugate gradient (CG) methods, which can be prohibitively expensive for large-scale decap budgeting problems and cannot be easily parallelized. In this paper, we propose a hierarchical cross-entropy based optimization technique which is more efficient and parallel-friendly.
Cross-entropy (CE) is an advanced optimization framework which explores the power of rare event probability theory and importance sampling. To achieve the high efficiency, a sensitivity-guided cross-entropy (SCE) algorithm is introduced which integrates CE with a partitioning-based sampling strategy to effectively reduce the solution space in solving the large-scale decap budgeting problems. Compared to improved CG method and conventional CE method, SCE with Latin hypercube sampling method (SCE-LHS) can provide $2\times$ speedups, while achieving up to 25% improvement on power supply noise. To further improve decap optimization solution quality, SCE with sequential importance sampling (SCE-SIS) method is also studied and implemented. Compared to SCE-LHS, in similar runtime, SCE-SIS can lead to 16.8% further reduction on the total power supply noise.

PHYSICAL DESIGN

Ding, D. Torres, J. A. Pan, D. Z.
High Performance Lithography Hotspot Detection With Successively Refined Pattern Identifications and Machine Learning
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046168
Under the real and evolving manufacturing conditions, lithography hotspot detection faces many challenges. First, real hotspots become hard to identify at early design stages and hard to fix at post-layout stages. Second, false alarms must be kept low to avoid excessive and expensive post-processing hotspot removal. Third, full chip physical verification and optimization require very fast turn-around time. Last but not least, rapid technology advancement favors generic hotspot detection methodologies to avoid exhaustive pattern enumeration and excessive development/update as technology evolves. To address the above issues, we propose a high performance hotspot detection methodology consisting of: 1) a fast layout analyzer; 2) powerful hotspot pattern identifiers; and 3) a generic and efficient flow with successive performance refinements.
We implement our algorithms with industry-strength engine under real manufacturing conditions and show that it significantly outperforms state-of-the-art algorithms in false alarms (2.4X to 2300X reduction) and runtime (5X to 237X reduction), meanwhile achieving similar or better hotspot accuracies. Compared with pattern matching, our method achieves higher prediction accuracy for hotspots that are not previously characterized, therefore, more detection generality when exhaustive pattern enumeration is too expensive to perform a priori. Such high performance hotspot detection is especially suitable for lithography-friendly physical design.

Lee, Y.-J. Lim, S. K.
Co-Optimization and Analysis of Signal, Power, and Thermal Interconnects in 3-D ICs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046179
Heat removal and power delivery have become two major reliability concerns in 3-D integrated circuit (IC) technology. To alleviate the thermal problem, two possible solutions have been proposed: thermal-through-silicon-vias (T-TSVs) and micro-fluidic channel (MFC)-based cooling. In case of power delivery, a complex power distribution network is required to deliver currents reliably to all parts of the 3-D IC while suppressing the power supply noise to an acceptable level. However, these thermal and power networks pose major challenges in signal routability and congestion. This is because signal, power, and thermal interconnects are all competing for routing space, and the related TSVs interfere with gates and wires in each die. We present a co-optimization methodology for signal, power, and thermal interconnects in 3-D ICs based on design of experiments (DOE) and response surface methodology (RSM). The goal of our holistic approach is to improve signal, thermal, and power noise metrics and to provide fast and accurate design space exploration for early design stage.
We also provide an in-depth comparison between T-TSV versus MFC-based cooling method and discuss how to employ DOE and RSM techniques to co-optimize the interconnects. Our DOE-based optimization found the optimal design point with less effort than a gradient search-based optimization.

Chuang, Y.-L. Lee, P.-W. Chang, Y.-W.
Voltage-Drop Aware Analytical Placement by Global Power Spreading for Mixed-Size Circuit Designs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046166
Excessive supply voltage drops in a circuit may lead to significant circuit performance degradation and even malfunction. To handle this problem, existing power delivery aware placement algorithms model voltage drops as an optimization objective. We observe that directly minimizing the voltage drops in the objective function might not resolve voltage-drop violations effectively and might cause problems in power-integrity convergence. To remedy this deficiency, in this paper, we propose new techniques to incorporate device power spreading forces into a mixed-size analytical placement framework. Unlike the state-of-the-art previous work that handles the worst voltage-drop spots one by one, our approach simultaneously and globally spreads all the blocks with voltage-drop violations to desired locations directly to minimize the violations. To apply the power force, we model macro current density and power rails for our placement framework to derive desired macro/cell locations. To further improve the solution quality, we propose an efficient mathematical transformation to adjust the power force direction and magnitude. Experimental results show that our approach can substantially improve the voltage drops, wirelength, and runtime over the previous work.

SYSTEM-LEVEL DESIGN

Jain, T. N. K. Ramakrishna, M. Gratz, P. V. Sprintson, A. Choi, G.
Asynchronous Bypass Channels for Multi-Synchronous NoCs: A Router Microarchitecture, Topology, and Routing Algorithm
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046181
Network-on-chip (NoC) designs have emerged as a replacement for traditional shared-bus designs for on-chip communication. As with all current very large scale integration designs, however, reducing power consumption in NoCs is a critical challenge. One approach to reduce power consumption is to dynamically scale the voltage and frequency of each network node or groups of nodes (DVFS). Another approach is to replace the balanced clock tree with a globally-asynchronous, locally-synchronous (GALS) clocking scheme. In both DVFS and GALS designs, the chip as a whole is multi-synchronous. As the NoCs interconnecting those nodes must communicate across these clock domain boundaries, they tend to have high latencies as packets must be synchronized at the intermediate nodes. In this paper, we propose a novel router microarchitecture which offers superior performance with respect to typical synchronizing router designs for multi-synchronous networks. Our approach features asynchronous bypass channels which allow flit traversal of intermediate nodes within the network without the latching or synchronization overheads of typical designs. We also propose a new network topology and routing algorithm that leverage the advantages of the bypass channel offered by our router design. We present a detailed analysis of design decisions which affect the performance of the asynchronous bypass channel network. Our experiments show that our design improves the performance of a conventional synchronizing design with similar resources by up to 26% at low loads and increases saturation throughput by up to 50% for a uniform random traffic.

Hanumaiah, V. Vrudhula, S. Chatha, K. S.
Performance Optimal Online DVFS and Task Migration Techniques for Thermally Constrained Multi-Core Processors
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046183
Extracting high performance from multi-core processors requires increased use of thermal management techniques. In contrast to offline thermal management techniques, online techniques are capable of sensing changes in the workload distribution and setting the processor controls accordingly. Hence, online solutions are more accurate and are able to extract higher performance than the offline techniques. This paper presents performance optimal online thermal management techniques for multicore processors. The techniques include dynamic voltage and frequency scaling and task-to-core allocation or task migration. The problem formulation includes accurate power and thermal models, as well as leakage dependence on temperature. This paper provides a theoretical basis for deriving the optimal policies and computationally efficient implementations. The effectiveness of our DVFS and task-to-core allocation techniques is demonstrated by numerical simulations. The proposed task-to-core allocation method showed a 20.2% improvement in performance over a power-based thread migration approach. The techniques have been incorporated in a thermal-aware architectural-level simulator called MAGMA that allows for design space exploration, offline, and online dynamic thermal management. The simulator is capable of handling simulations of hundreds of cores within reasonable time.

Boland, D. Constantinides, G. A.
Bounding Variable Values and Round-Off Effects Using Handelman Representations
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046164
The precision used in an algorithm affects the error and performance of individual computations, the memory usage, and the potential parallelism for a fixed hardware budget.
This paper describes a new method to determine the minimum precision required to meet a given error specification for an algorithm consisting of the basic algebraic operations. Using this approach, it is possible to significantly reduce the computational word-length in comparison to existing methods, and this can lead to superior hardware designs. We demonstrate the proposed procedure on an iteration of the conjugate gradient algorithm, achieving proofs of bounds that can translate to global word-length savings ranging from a few bits to proving the existence of ranges that must otherwise be assumed to be unbounded when using competing approaches. We also achieve comparable bounds to recent literature in a small fraction of the execution time, with greater scalability.

TEST

Noia, B. Chakrabarty, K. Goel, S. K. Marinissen, E. J. Verbree, J.
Test-Architecture Optimization and Test Scheduling for TSV-Based 3-D Stacked ICs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046180
Through-silicon via (TSV)-based 3-D stacked ICs (SICs) are becoming increasingly important in the semiconductor industry. In this paper, we address test architecture optimization for 3-D stacked ICs implemented using TSVs. We consider two cases, namely 3-D SICs with die-level test architectures that are either fixed or still need to be designed. We next present mathematical programming techniques to derive optimal solutions for the architecture optimization problem for both cases. Experimental results for three handcrafted 3-D SICs comprising various systems-on-a-chip (SoCs) from the ITC'02 SoC test benchmarks show that compared to the baseline method of sequentially testing all dies, the proposed solutions can achieve significant reduction in test length. This is achieved through optimal test schedules enabled by the test architecture.
We also show that increasing the number of test pins typically provides a greater reduction in test length compared to an increase in the number of test TSVs. Furthermore, we show that shorter test lengths are generally achieved with the larger, more complex dies lower in the stack. This is because test data must pass through every die lower in a stack in order to reach its target die, and with the larger dies lower in the stack, more test bandwidth may be provided to these dies using fewer routing resources.

Zhong, S. Khursheed, S. Al-Hashimi, B. M.
A Fast and Accurate Process Variation-Aware Modeling Technique for Resistive Bridge Defects
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046171
Recent research has shown that tests generated without taking process variation into account may lead to loss of test quality. At present, there is no efficient device-level modeling technique that models the effect of process variation on resistive bridge defects. This paper presents a fast and accurate technique to achieve this, including modeling the effect of voltage and temperature variation using the BSIM4 transistor model. To speed up the computation time and without compromising simulation accuracy (achieved through BSIM4), two efficient voltage approximation algorithms are proposed for calculating logic threshold of driven gates and voltages on bridged lines of a fault-site to calculate bridge critical resistance. Experiments are conducted on a 65 nm gate library (for illustration purposes), and results show that on average the proposed modeling technique is more than 53 times faster and in the worst case, error in bridge critical resistance is 2.64% when compared with HSPICE.

Hou, C.-S. Li, J.-F. Tseng, T.-W.
Memory Built-in Self-Repair Planning Framework for RAMs in SoCs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046177
Built-in self-repair (BISR) techniques are widely used to enhance the yield of random access memories (RAMs) in a system-on-chip (SoC) which typically consists of hundreds of RAMs. Hence, many BISR circuits may be needed in such an SoC. Effective techniques for planning these BISR circuits thus are imperative. In this paper, we propose a memory BISR planning (MBiP) framework for the RAMs in SoCs. The MBiP framework consists of a memory grouping algorithm for selecting RAMs which can share a BISR circuit. Then, a test scheduling algorithm is used to determine the test sequence of RAMs in a SoC under the constraint of test power. Finally, a BISR scheme allocation algorithm is proposed to allocate different BISR schemes for the RAMs under the constraints of the results of memory grouping and test scheduling. Simulation results show that the proposed MBiP can effectively plan the BISR schemes for the RAMs in a SoC. For example, about 22% area reduction can be achieved by the BISR schemes planned by the proposed MBiP framework for 50 RAMs under 1.5 mm distance constraint and 350 mW test power constraint in comparison with a dedicated BISR scheme (i.e., each RAM has a self-contained BISR circuit).

Sinanoglu, O. Almukhaizim, S.
Unified 2-D X-Alignment for Improving the Observability of Response Compactors
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046182
Despite the advantages of performing response compaction in integrated-circuit testing, unknown response bits (x's) inevitably reflect into loss in test quality. The distribution of these x's within the captured response, which varies for each test pattern, directly impacts the number of scan cells observed through the response compactor.
In this paper, we propose a unified 2-D x-alignment technique in order to judiciously manipulate the distribution of x's in the test response prior to its compaction. The controlled response manipulation is performed on a per pattern basis, in the form of scan chain delay and intra-slice rotate operations, and with the objective that x's are aligned within as few scan slices and chains as possible. Consequently, a larger number of scan cells are observed after compaction for any test pattern. In an effort to tackle the unified 2-D x-alignment problem and to achieve maximum overall observability, we first decipher the interaction between 1-D x-alignment operations, and formulate 1-D and 2-D x-alignment operations all as maximum satisfiability (MAX-SAT) problems; a weighted MAX-SAT formulation is necessitated in the 2-D case to identify the best possible 2-D x-alignment, which may differ from back to back application of the individual best possible 1-D alignments in two dimensions. The proposed technique is test set independent, leading to a generic, simple, and cost-effective hardware implementation. While we show in this paper that x-alignment improves horizontal and vertical compactors, covering a wide spectrum of compactors, it is expected to improve other types of compactors as well by manipulating the x-distribution properly.

SHORT PAPERS

Foreman, E. A. Habitz, P. A. Cheng, M.-C. Tamon, C.
Inclusion of Chemical-Mechanical Polishing Variation in Statistical Static Timing Analysis
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046170
Technology trends show the importance of modeling process variation in static timing analysis. With the advent of statistical static timing analysis (SSTA), multiple independent sources of variation can be modeled. This paper proposes a methodology for modeling metal interconnect process variation in SSTA.
The developed methodology is applied in this study to investigate metal variation in SSTA resulting from chemical-mechanical polishing (CMP). Using our statistical methodology, we show that CMP variation has a smaller impact on chip performance as compared to other factors impacting metal process variation.

Chen, Z. Chakrabarty, K. Xiang, D.
MVP: Minimum-Violations Partitioning for Reducing Capture Power in At-Speed Delay-Fault Testing
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046167
Scan shift power can be reduced by activating only a subset of scan cells in each shift cycle. In contrast to shift power reduction, the use of only a subset of scan cells to capture responses in a cycle may cause capture violations, thereby leading to fault coverage loss. In order to restore the original fault coverage, new test patterns must be generated, leading to higher test-data volume. In this paper, we propose minimum-violations partitioning, a scan-cell clustering method that can support multiple capture cycles in delay testing without increasing test-data volume. This method is based on an integer linear programming model and it can cluster the scan flip-flops into balanced parts with minimum capture violations. Based on this approach, hierarchical partitioning is proposed to make the partitioning method routing-aware. Experimental results on ISCAS'89 and IWLS'05 benchmark circuits demonstrate the effectiveness of our method.

Liao, K.-Y. Chang, C.-Y. Li, J. C.-M.
A Parallel Test Pattern Generation Algorithm to Meet Multiple Quality Objectives
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6046178
This paper proposes a bit-level parallel ATPG algorithm (SWK) that generates multiple test patterns at a time. This algorithm converts decisions into bitwise logic operation so that W (CPU word size) test patterns are searched independently. Multiple objectives for different quality metrics can therefore be achieved in a single test generation process.
Experimental results on ISCAS'89 and IWLS'05 benchmark circuits show that SWK test sets are better in many quality metrics than traditional 50-detect test sets, while the length of the former is shorter. Also, patterns selected from a large N-detect pattern pool cannot achieve the same or higher quality than patterns generated by SWK.
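
The bit-level parallelism mentioned in the last abstract can be illustrated with a small sketch. This is not the SWK algorithm itself, only the general idea of packing W test patterns into one machine word so that a single bitwise operation evaluates a gate for all patterns at once. The word size W = 8, the helper names, and the three-input example circuit are all assumptions made for illustration.

```python
# Illustrative sketch only (not SWK): bit i of each word carries the value
# of one signal under test pattern i, so one bitwise operation evaluates a
# gate for all W patterns simultaneously.

W = 8                  # hypothetical word size, kept small for readability
MASK = (1 << W) - 1    # keeps results within W bits after inversion


def pack(bits):
    """Pack a list of 0/1 values into an integer; pattern i goes to bit i."""
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word


def unpack(word, n=W):
    """Return the n low-order bits of word as a list, pattern 0 first."""
    return [(word >> i) & 1 for i in range(n)]


def circuit(a, b, c):
    """Hypothetical circuit y = (a AND b) XOR (NOT c), on packed words."""
    return ((a & b) ^ (~c & MASK)) & MASK


# All eight input combinations of (a, b, c), one combination per bit position.
a_bits = [0, 1, 0, 1, 0, 1, 0, 1]
b_bits = [0, 0, 1, 1, 0, 0, 1, 1]
c_bits = [0, 0, 0, 0, 1, 1, 1, 1]

y = unpack(circuit(pack(a_bits), pack(b_bits), pack(c_bits)))
print(y)
```

Each position of the printed list is the circuit output for the corresponding pattern, obtained from one pass over the gates rather than eight.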