May 2012 Newsletter

Placing you one click away from the best new CAD research!
Plain-text version at http://www.umn.edu/~tcad/newsletter/2012-05.txt

Announcing the 2012 Donald O. Pederson Best Paper Award...

The Donald O. Pederson Award recognizes the best paper published in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in the two calendar years preceding the award. This year's award, which will be presented at the Design Automation Conference in San Francisco, goes to:

Ogras, U. Y.; Bogdan, P.; Marculescu, R.
An Analytical Approach for Network-on-Chip Performance Analysis
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5621037
Networks-on-chip (NoCs) have recently emerged as a scalable alternative to classical bus and point-to-point architectures. To date, performance evaluation of NoC designs has been based largely on simulation which, besides being extremely slow, provides little insight into how different design parameters affect actual network performance. Therefore, it is practically impossible to use simulation for optimization purposes. In this paper, we present a mathematical model for on-chip routers and utilize this new model for NoC performance analysis. The proposed model can be used not only to obtain fast and accurate performance estimates, but also to guide the NoC design process within an optimization loop. The accuracy of our approach and its practical use are illustrated through extensive simulation results.

REGULAR PAPERS

EMBEDDED SYSTEMS

Jin, X.; Goto, S.
Hilbert Transform-Based Workload Prediction and Dynamic Frequency Scaling for Power-Efficient Video Encoding
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186868
With the growing popularity of mobile devices with embedded video cameras, real-time video encoding on hand-held devices is increasingly common. Reducing the power consumed during real-time video encoding, and thereby extending battery life at the same encoding performance, is very important for improving quality of service. Although workload estimation techniques have been developed for video decoding to reduce the power consumption of video playback applications, they transfer poorly to video encoding because the compressed information cannot be retrieved before encoding and the future input video content is often nondeterministic. In this paper, a workload estimation scheme targeting video encoding applications is proposed. Based on a definition of video encoding workload and an analysis of its features, a Hilbert transform-based workload estimation model is proposed to predict the overall variation trend of the encoding workload, overcoming workload fluctuations and nondeterministic content variations such as burst motion. The effectiveness of the proposed algorithm is demonstrated with two H.264/AVC encoders, on a PC and on an embedded platform, encoding different video contents at different bit-rates. The proposed algorithm yields a low deadline-miss ratio of around 4.8%, much lower than previous solutions, together with platform and content robustness. Compared with previous solutions, the proposed algorithm provides up to 61.69% power reduction under the same performance constraint.
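The trend-extraction idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' model: it assumes a per-frame cycle-count history and a hypothetical set of DVFS operating points, and uses the Hilbert-transform envelope as a smoothed predictor of the next frame's workload.

    import numpy as np
    from scipy.signal import hilbert

    def predict_next_workload(cycles, window=30):
        """Predict the next frame's workload (in cycles) from recent history."""
        recent = np.asarray(cycles[-window:], dtype=float)
        # The magnitude of the analytic signal is a smooth envelope of the
        # workload trend, damping frame-to-frame fluctuations and short bursts.
        envelope = np.abs(hilbert(recent - recent.mean())) + recent.mean()
        return envelope[-1]

    def pick_frequency(pred_cycles, frame_deadline_s, freqs_hz):
        """Choose the lowest clock frequency that still meets the frame deadline."""
        for f in sorted(freqs_hz):
            if pred_cycles / f <= frame_deadline_s:
                return f
        return max(freqs_hz)

    # Hypothetical usage: 30 fps encoding with three DVFS operating points.
    history = list(np.random.randint(8_000_000, 12_000_000, size=60))
    pred = predict_next_workload(history)
    freq = pick_frequency(pred, frame_deadline_s=1 / 30, freqs_hz=[200e6, 400e6, 800e6])
    print(pred, freq)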
LOGIC SYNTHESIS

Wan, L.; Chen, D.
Analysis of Digital Circuit Dynamic Behavior With Timed Ternary Decision Diagrams for Better-Than-Worst-Case Design
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186868
Modern logic optimization tools tend to optimize circuits in a balanced way so that all primary outputs (POs) have similar delays close to the cycle time. In reality, however, certain POs are exercised more frequently than the rest. Among the critical POs, some may be stabilized very quickly by input vectors, even if their topological delays from the primary inputs are very long. Knowing the dynamic behavior of a circuit can help optimize the most commonly activated paths and help engineers understand how resilient a PO is against dynamic environmental variations such as voltage fluctuations. In this paper, we describe a tool that analyzes the dynamic behavior of a digital circuit using probabilistic information. The technique uses timed ternary decision diagrams (tTDDs) to encode stabilization conditions for POs. To compute probabilities based on a tTDD, we propose false-assignment pruning and random-variable compaction to preserve the accuracy of the probability calculation. To address scalability, this paper proposes a new circuit partitioning heuristic that reduces the inaccuracy introduced by partitioning. Compared to timed simulation results, our tool has, on average, a mean absolute error of 1.7% and a root mean square error of 3.9% for MCNC benchmarks, and a mean absolute error of 2.2% and a root mean square error of 4.8% for ISCAS benchmarks. Compared to a state-of-the-art dynamic behavior analysis tool, our tool is 15x faster on average for MCNC benchmarks and 65x faster on average for ISCAS benchmarks, and can also handle circuits that the previous tool cannot. This dynamic behavior analyzer enables fast and effective circuit optimizations for better-than-worst-case design.
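For readers unfamiliar with the quantity being analyzed, the Python sketch below estimates per-output stabilization times for a tiny gate netlist by brute-force timed evaluation over random vectors, modeling early stabilization at controlling input values. It only illustrates what a stabilization-time distribution is; the paper computes such probabilities symbolically with tTDDs rather than by sampling, and the netlist and delays here are hypothetical.

    import random
    from statistics import mean

    # Hypothetical netlist: (output, gate_type, inputs, delay), in topological order.
    NETLIST = [
        ("n1", "AND", ("a", "b"), 2),
        ("n2", "OR",  ("c", "n1"), 3),
        ("out", "AND", ("n2", "d"), 2),
    ]
    PRIMARY_INPUTS = ("a", "b", "c", "d")

    def stabilization_times(vector):
        """Value and stabilization time of every signal for one input vector.

        A gate output stabilizes early when a controlling input arrives
        (0 for AND, 1 for OR), even if its other inputs settle much later.
        """
        val = dict(vector)
        t = {pi: 0 for pi in PRIMARY_INPUTS}
        for out, kind, ins, delay in NETLIST:
            ctrl = 0 if kind == "AND" else 1
            controlling = [t[i] for i in ins if val[i] == ctrl]
            if controlling:                        # earliest controlling input decides
                val[out] = ctrl
                t[out] = min(controlling) + delay
            else:                                  # must wait for the last input
                val[out] = 1 - ctrl
                t[out] = max(t[i] for i in ins) + delay
        return t

    # Monte Carlo estimate of the output's stabilization-time distribution.
    samples = []
    for _ in range(10000):
        vec = {pi: random.randint(0, 1) for pi in PRIMARY_INPUTS}
        samples.append(stabilization_times(vec)["out"])
    print("mean stabilization time:", mean(samples))
    print("P(stable by t=5):", sum(s <= 5 for s in samples) / len(samples))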
MODELING AND SIMULATION

Melamed, S.; Thorolfsson, T.; Harris, T. R.; Priyadarshi, S.; Franzon, P.; Steer, M. B.; Davis, W. R.
Junction-Level Thermal Analysis of 3-D Integrated Circuits Using High Definition Power Blurring
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186865
The degraded thermal path of 3-D integrated circuits (3DICs) makes chip-scale thermal analysis an essential part of the design process. Performing an appropriate thermal analysis on such circuits requires a model with junction-level fidelity; however, the computational burden imposed by such a model is tremendous. In this paper, we present enhancements to two thermal modeling techniques for integrated circuits to make them applicable to 3DICs. First, we present a resistive mesh-based approach that improves on the fidelity of prior approaches by constructing a thermal model of the full structure of 3DICs, including the interconnect. Second, we introduce a method for dividing the thermal response caused by a heat load into a high-fidelity "near response" and a lower-fidelity "far response" in order to implement Power Blurring high definition (HD), a hierarchical thermal simulation approach based on Power Blurring that incorporates the resistive mesh-based models and allows for junction-level accuracy at the full-chip scale. The Power Blurring HD technique yields approximately three orders of magnitude of improvement in memory usage and up to six orders of magnitude of improvement in runtime for a three-tier synthetic aperture radar circuit, compared to a full-chip junction-scale resistive mesh-based model. Finally, measurement results show that Power Blurring HD accurately determines the shape of the thermal profile of the 3DIC surface after a correction factor is added to adjust for a discrepancy in the absolute temperature values.

PHYSICAL DESIGN

Kagalwalla, A. A.; Gupta, P.; Progler, C. J.; McDonald, S.
Design-Aware Mask Inspection
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186861
Mask inspection has become a major bottleneck in the manufacturing flow, taking up as much as 40% of total mask manufacturing time. In this paper, we explore techniques to improve the reticle inspection flow by increasing its design awareness. We develop an algorithm to locate nonfunctional features in a post-optical proximity correction layout without using any design information. Using this, together with the timing information of the design (if available), the smallest defect size that could cause the design to fail is assigned to each reticle feature. The criticality of the reticle features is then used to partition the reticle such that each partition is inspected at a different pixel size and sensitivity, so that the false and nuisance defect count is reduced without missing any critical defect. We also develop an analytical model to estimate the false and nuisance defect count. Using those models, our simulation results show that this design-aware mask inspection can reduce the false and nuisance defect count for a critical polysilicon layer from 80 defects down to 49 defects, leading to a substantial reduction in defect review load. We also develop a model to estimate first pass yield (FPY) and show that our method can improve the FPY for a polysilicon layer from 11% to 30%. Beyond the polysilicon layer, the potential benefit of this approach is analyzed for the active, contact, and all metal/via layers.

Fang, S.-Y.; Chen, S.-Y.; Chang, Y.-W.
Native-Conflict and Stitch-Aware Wire Perturbation for Double Patterning Technology
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186851
Double patterning technology (DPT), in which a dense layout pattern is decomposed into two separate masks to relax its pitch, is the most popular lithography solution for the sub-22 nm node to enhance pattern printability. Previous work focused on stitch insertion to improve the decomposition success rate. However, there exist native conflicts (NCs) which cannot be resolved by any kind of stitch insertion. A design with NCs is not DPT-compliant and may fail the decomposition, resulting in design-for-manufacturability redesign and longer design cycles. In this paper, we give a sufficient condition for NC existence and propose a geometry-based method for NC prediction to develop an early-stage analyzer for DPT decomposability checking. Then, a wire perturbation algorithm is presented to fix as many NCs in the layout as possible. The algorithm is based on iterative 1-D compaction and can easily be embedded into existing industrial compaction systems. The algorithm is then applied to further reduce the number of stitches required for the decomposition process. Experimental results show that the proposed algorithm reduces the number of NCs by an average of 85% and the number of stitches by an average of 39%, which may effectively increase the decomposition success rate for the next stage.
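DPT decomposition can be viewed as 2-coloring a conflict graph whose nodes are layout features and whose edges connect features spaced closer than the same-mask minimum; a component that is not bipartite contains an odd cycle that no two-mask assignment can satisfy. The Python sketch below performs that bipartiteness check on a hypothetical conflict graph. It is only a simplified illustration: stitch insertion, which can break some odd cycles by splitting a feature across both masks, is ignored here but is central to the paper's notion of a native conflict.

    from collections import deque

    def check_dpt_decomposability(features, conflict_edges):
        """Try to 2-color a DPT conflict graph (one color per mask).

        features: iterable of feature names.
        conflict_edges: pairs of features spaced below the same-mask minimum.
        Returns (mask_assignment, has_odd_cycle).
        """
        adj = {f: set() for f in features}
        for a, b in conflict_edges:
            adj[a].add(b)
            adj[b].add(a)

        color = {}
        odd_cycle = False
        for start in features:
            if start in color:
                continue
            color[start] = 0
            queue = deque([start])
            while queue:
                u = queue.popleft()
                for v in adj[u]:
                    if v not in color:
                        color[v] = 1 - color[u]      # put it on the other mask
                        queue.append(v)
                    elif color[v] == color[u]:       # odd cycle: not 2-colorable
                        odd_cycle = True
        return color, odd_cycle

    # Hypothetical example: features A, B, C are mutually too close (a triangle),
    # so the component contains an odd cycle and cannot be split over two masks.
    masks, conflict = check_dpt_decomposability(
        ["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])
    print(masks, conflict)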
SYSTEM-LEVEL DESIGN

Salamy, H.; Ramanujam, J.
An Effective Solution to Task Scheduling and Memory Partitioning for Multiprocessor System-on-Chip
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186867
The growing trend in current complex embedded systems is to deploy a multiprocessor system-on-chip (MPSoC). An MPSoC consists of multiple heterogeneous processing elements, a memory hierarchy, and input/output components linked together by an on-chip interconnect structure. Such an architecture provides the flexibility to meet the performance requirements of multimedia applications while respecting constraints on memory, cost, size, time, and power. Many embedded systems employ software-managed memories known as scratch-pad memories (SPMs). Unlike caches, SPMs are software-controlled, so the execution time of applications on such systems can be accurately predicted. Scheduling the tasks of an embedded application on the processors and partitioning the available SPM budget among those processors are two critical issues in such systems. They are often considered separately, and such a decoupled approach may miss better-quality schedules. In this paper, we present an integrated approach to task scheduling and SPM partitioning to further reduce the execution time of embedded applications. Results on several real-life benchmarks show significant improvements from our proposed technique.

DeOrio, A.; Fick, D.; Bertacco, V.; Sylvester, D.; Blaauw, D.; Hu, J.; Chen, G.
A Reliable Routing Architecture and Algorithm for NoCs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186857
Aggressive transistor scaling continues to drive increasingly complex digital designs. The large number of transistors available today enables the development of chip multiprocessors that include many cores on one die communicating through an on-chip interconnect. As the number of cores increases, scalable communication platforms, such as networks-on-chip (NoCs), have become more popular. However, as the sole communication medium, these interconnects are a single point of failure: any permanent fault in the NoC can cause the entire system to fail. Compounding the problem, transistors have become increasingly susceptible to wear-out related failures as their critical dimensions shrink. As a result, the on-chip network has become a critically exposed unit that must be protected. To this end, we present Vicis, a fault-tolerant architecture and companion routing protocol that is robust to a large number of permanent failures, allowing communication to continue in the face of permanent transistor failures. Vicis uses a two-level approach. First, it attempts to work around errors within a router by leveraging reconfigurable architectural components. Second, when faults within a router disable a link's connectivity, or even an entire router, Vicis reroutes around the faulty node or link with a novel, distributed routing algorithm for meshes and tori. Tolerating permanent faults in both the router components and the reliability hardware itself, Vicis enables graceful performance degradation of networks-on-chip.
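As a point of reference for what fault-tolerant NoC routing must accomplish, the Python sketch below rebuilds shortest-path next-hop tables for a 2-D mesh after a set of links has failed, using a breadth-first search from each destination. It is a centralized toy, not Vicis's distributed algorithm, and the mesh size and failed-link list are hypothetical.

    from collections import deque

    def mesh_routing_tables(width, height, failed_links):
        """Next-hop tables for a width x height mesh NoC with failed links.

        failed_links: set of frozenset({u, v}) for adjacent nodes u, v = (x, y).
        Returns route[node][dest] = next hop toward dest, or None if unreachable.
        """
        nodes = [(x, y) for x in range(width) for y in range(height)]
        node_set = set(nodes)

        def neighbors(n):
            x, y = n
            for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nb in node_set and frozenset({n, nb}) not in failed_links:
                    yield nb

        route = {n: {} for n in nodes}
        for dest in nodes:
            # BFS outward from the destination: the parent of a node in this
            # search tree is that node's next hop toward dest.
            parent = {dest: dest}
            queue = deque([dest])
            while queue:
                u = queue.popleft()
                for v in neighbors(u):
                    if v not in parent:
                        parent[v] = u
                        queue.append(v)
            for n in nodes:
                route[n][dest] = dest if n == dest else parent.get(n)
        return route

    # Hypothetical 4x4 mesh with two broken links.
    broken = {frozenset({(1, 1), (2, 1)}), frozenset({(2, 2), (2, 3)})}
    tables = mesh_routing_tables(4, 4, broken)
    print(tables[(1, 1)][(3, 1)])   # next hop from (1,1) to (3,1), avoiding the fault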
Mariani, G.; Palermo, G.; Zaccaria, V.; Silvano, C.
OSCAR: An Optimization Methodology Exploiting Spatial Correlation in Multicore Design Spaces
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186864
This paper presents OSCAR, an optimization methodology exploiting the spatial correlation of multicore design spaces. It builds on the observation that the power consumption and performance metrics of spatially close design configurations (or points) are statistically correlated. We propose to exploit this correlation by using a response surface model (RSM), i.e., a closed-form expression suitable for predicting the quality of nonsimulated design points. This model is useful during the design space exploration (DSE) phase to quickly converge to the Pareto set of the multiobjective problem without executing lengthy simulations. To this end, we introduce a multiobjective optimization heuristic that iteratively updates and queries the RSM to identify the design points with the highest expected improvement. The RSM allows the Pareto set to be consolidated with fewer simulations, thus speeding up the exploration process. We compare the proposed heuristic with state-of-the-art approaches [conventional, RSM-based, and structured design of experiments (DoEs)]. Experimental results show that OSCAR is faster than state-of-the-art techniques such as response-surface Pareto iterative refinement (ReSPIR) and the nondominated-sorting genetic algorithm (NSGA-II). OSCAR needs fewer simulations to produce a similar solution: an average of 150 simulations instead of 320 (NSGA-II) and 178 (ReSPIR). When the number of design points is fixed at an average of 300, OSCAR achieves an average distance of less than 0.6% from the reference solution, while NSGA-II achieves 3.4%. Reported results also show that OSCAR can significantly improve structured DoE approaches by slightly increasing the number of experiments.

TEST

Lee, K.-J.; Hsieh, T.-Y.; Breuer, M. A.
Efficient Overdetection Elimination of Acceptable Faults for Yield Improvement
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186858
Acceptable faults in a circuit under test (CUT) are faults that have no impact, or only a minor impact, on the performance of the CUT. A circuit with an acceptable fault may be marketable for some specific applications; therefore, by carefully dealing with these faults during testing, significant yield improvement can be achieved. Previous studies have shown that the patterns generated by a conventional automatic test pattern generation procedure to detect all unacceptable faults also detect many acceptable ones, resulting in a severe loss of achievable yield improvement. In this paper, we present a novel test methodology called multiple test set detection (MTSD) to eliminate this overdetection problem entirely. A basic test set generation method is first presented, which provides a fundamental scheme for generating appropriate test sets for MTSD. We then describe an enhanced test generation method that significantly reduces the total number of test patterns. Solid theoretical derivations are provided to validate the effectiveness of the proposed methods. Experimental results show that, in general, an 80%-99% reduction in the number of test patterns can be achieved compared with previous work addressing this problem.
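The overdetection problem can be made concrete with a small selection exercise: given which patterns detect which faults, choose patterns that cover every unacceptable fault while detecting as few acceptable faults as possible. The greedy Python sketch below does exactly that on a made-up detection matrix; it illustrates the problem only and is not the paper's MTSD test generation method.

    def select_patterns(detects, unacceptable, acceptable):
        """Greedily cover all unacceptable faults while penalizing patterns
        that also detect (still-marketable) acceptable faults.

        detects: {pattern: set of faults it detects}.
        """
        uncovered = set(unacceptable)
        chosen = []
        while uncovered:
            candidates = [p for p in detects if p not in chosen]
            if not candidates:
                break                                 # some faults are undetectable

            def score(p):
                gain = len(detects[p] & uncovered)
                penalty = len(detects[p] & set(acceptable))
                return (gain - 1.5 * penalty, gain)   # penalty weight is arbitrary

            best = max(candidates, key=score)
            if not detects[best] & uncovered:
                break                                 # no candidate helps anymore
            chosen.append(best)
            uncovered -= detects[best]
        return chosen

    # Made-up example: t2 alone covers u1 and u2 but also hits acceptable fault
    # a1, so the greedy pass prefers t1 + t3, which overdetects nothing.
    detects = {"t1": {"u1"}, "t2": {"u1", "u2", "a1"}, "t3": {"u2"}}
    print(select_patterns(detects, unacceptable={"u1", "u2"}, acceptable={"a1"}))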
VERIFICATION

Chockler, H.; Kroening, D.; Purandare, M.
Computing Mutation Coverage in Interpolation-Based Model Checking
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186866
Coverage is a means to quantify the quality of a system specification, and is frequently applied to assess progress in system validation. Coverage is a standard measure in testing, but is very difficult to compute in the context of formal verification. We present efficient algorithms for identifying those parts of the system that are covered by a given property. Our algorithm is integrated into state-of-the-art Boolean satisfiability-based model checking using Craig interpolation. The key insight behind our algorithm is the reuse of previously computed inductive invariants and counterexamples. This reuse permits a rapid completion of the vast majority of tests and enables the computation of a coverage measure with 96% accuracy at only 5x the runtime of the model checker.

Welp, T.; Kitchen, N.; Kuehlmann, A.
Hardware Acceleration for Constraint Solving for Random Simulation
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186869
Constrained random simulation has been widely adopted in contemporary hardware verification flows. In this methodology, a set of user-specified declarative constraints describes the valid input stimuli for the design under test (DUT). A constraint solver produces the simulation input vectors, and their generation is interleaved with the actual simulation of the design on those vectors. Besides the distribution of the generated stimuli, the solver's performance is one of the most critical characteristics determining overall verification efficiency. There are no general approaches to hardware acceleration for solving declarative constraints. Current setups for hardware acceleration-based verification combine a software constraint solver running on a general-purpose processor with the hardware-accelerated DUT. This approach suffers from a major efficiency bottleneck caused by the significant performance mismatch between the solver executed in software and the DUT running on an accelerator. In this paper, we present a hardware constraint solver that uses a set of parallel solving units executing Markov chain Monte Carlo sampling. We propose to combine this solver and the DUT on the same device and run both entities hardware-accelerated in order to eliminate the performance mismatch. We discuss the details of the solver architecture and its implementation, and report comprehensive results on performance and distribution characteristics, as well as experience from a case study in which we used our solver to verify a real-world hardware design.
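To illustrate the sampling style the solver above accelerates, the Python sketch below draws random stimuli for a toy constraint set with a Metropolis-style Markov chain Monte Carlo walk over bit vectors, using a soft penalty for violated constraints. It is a software caricature of the idea under made-up constraints, not the paper's hardware solver.

    import math
    import random

    WIDTH = 8

    def violations(bits):
        """Count violated constraints for a toy 8-bit stimulus (bits[0] is the MSB).

        Made-up constraints: the value is even, lies in [32, 200], and
        the MSB being set implies the LSB is set.
        """
        v = int("".join(map(str, bits)), 2)
        count = 0
        count += v % 2 != 0
        count += not (32 <= v <= 200)
        count += bits[0] == 1 and bits[-1] != 1
        return count

    def mcmc_stimuli(n, beta=3.0):
        """Metropolis walk over bit vectors: flip one bit per step and accept
        uphill moves with probability exp(-beta * extra_violations)."""
        bits = [random.randint(0, 1) for _ in range(WIDTH)]
        out = []
        while len(out) < n:
            candidate = bits[:]
            candidate[random.randrange(WIDTH)] ^= 1
            delta = violations(candidate) - violations(bits)
            if delta <= 0 or random.random() < math.exp(-beta * delta):
                bits = candidate
            if violations(bits) == 0:          # emit only constraint-legal stimuli
                out.append(int("".join(map(str, bits)), 2))
        return out

    print(mcmc_stimuli(5))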
Liu, L.; Sheridan, D.; Tuohy, W.; Vasudevan, S.
A Technique for Test Coverage Closure Using GoldMine
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186862
We propose a methodology for generating input stimulus to achieve coverage closure using GoldMine, an automatic assertion generation engine that uses data mining and formal verification. GoldMine mines the simulation traces of a behavioral register transfer level (RTL) design with a decision tree based learning algorithm to produce candidate assertions. These candidate assertions are passed to a formal verification engine; if a candidate assertion is false, a counterexample trace is generated. In this paper, we feed these counterexample traces back to iteratively refine the original simulation trace data, and introduce an incremental decision tree to mine the new traces in each iteration. The algorithm converges when all the candidate assertions are true. We formally prove that our algorithm always converges and, on convergence, captures the complete functionality of each output of a sequential design. We show that our method always yields a monotonic increase in simulation coverage. We also present an output-centric notion of coverage and argue that coverage closure can be attained with respect to it. We walk through the technique step by step using a nontrivial arbiter design. Experimental results validating our arguments are presented on several designs from Rigel, OpenRisc, and SpaceWire. Some practical limitations to achieving 100% coverage, as well as the differences between the final decision tree and a binary decision diagram, are discussed.

SHORT PAPERS

Jiang, Y.-L.; Chen, H.-B.
Application of General Orthogonal Polynomials to Fast Simulation of Nonlinear Descriptor Systems Through Piecewise-Linear Approximation
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186859
In this letter, we report an approach that combines piecewise-linear (PWL) approximation with general orthogonal polynomials to efficiently simulate large nonlinear descriptor systems in the time domain. The main idea is first to approximate a nonlinear function by a piecewise-linear representation. Then, using the recursive formulae of general orthogonal polynomials, orthonormal bases are produced for fast simulation of the PWL model. The effectiveness of our approach is demonstrated on two nonlinear circuit models.

Alpaslan, E.; Kruseman, B.; Majhi, A. K.; Heuvalman, W. M.; Dworak, J.
NIM-X: A Noise Index Model-Based X-Filling Technique to Overcome the Power Supply Switching Noise Effects on Path Delay Test
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6186856
Power supply noise (PSN) has become a critical issue during high-quality at-speed testing. Discrepancies between the circuit's switching activity in functional mode and in test mode can cause overtesting and lead to yield loss. Conversely, reduced PSN effects around critical paths can result in undertesting the chip, causing test escapes. To achieve a high-quality at-speed test, both problems must be solved simultaneously. Our previous work introduced a noise index model (NIM), which can be used to predict the mismatch between expected and real path delays. This paper quantitatively investigates and compares NIM values for critical paths during functional and test mode. We then propose a test pattern modification method that harnesses the NIM: it fills a subset of the don't-care bits in partially specified test vectors such that the worst observed functional NIM for the targeted critical path is replicated during test mode.
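As a rough picture of what X-filling does, the Python sketch below fills the don't-care ('X') bits of a partially specified test vector so that the number of bit transitions relative to the previous vector approaches a target activity level. The toggle count is a deliberately crude proxy standing in for the paper's noise index model, and the vectors are hypothetical.

    def x_fill(prev_vector, pattern, target_toggles):
        """Fill 'X' bits in `pattern` to steer the toggle count (relative to
        `prev_vector`) toward `target_toggles`.

        prev_vector: string of '0'/'1'; pattern: string of '0'/'1'/'X'.
        """
        filled = list(pattern)
        toggles = sum(1 for p, c in zip(prev_vector, pattern)
                      if c in "01" and c != p)
        for i, c in enumerate(pattern):
            if c != "X":
                continue
            if toggles < target_toggles:
                filled[i] = "1" if prev_vector[i] == "0" else "0"   # force a toggle
                toggles += 1
            else:
                filled[i] = prev_vector[i]                          # keep it quiet
        return "".join(filled), toggles

    # Hypothetical vectors: replicate a functional activity level of 3 toggles.
    prev = "01100101"
    pat = "0X10XX01"
    print(x_fill(prev, pat, target_toggles=3))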