TCAD Newsletter - May 2010 Issue
Placing you one click away from the best new CAD research!

Regular Papers
==============

Paik, S.; Shin, I.; Kim, T.; Shin, Y., "HLS-l: A High-Level Synthesis Framework for Latch-Based Architectures"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452109&isn...
Abstract: Level-sensitive latches are widely used in high-performance custom designs while edge-triggered flip-flops are predominantly used in application-specific integrated circuits. We consider a latch as a basis for storage and address each step of high-level synthesis (HLS), including scheduling, allocation, and control synthesis. While the use of latches provides an opportunity to reduce the latency during the scheduling, the register allocation has to take extra conflicts caused by latch into account, and the control synthesis has to be tailored to support the latch-based data-path. Optimization potentials specific to this HLS are identified and solutions are proposed. Specifically, the register allocation can be improved by refining the operation schedule in a way to reduce the number of edges in a register conflict graph; the latency can be reduced by adjusting the clock duty cycle in a way to generate a tighter schedule. All the steps of HLS and optimization procedures were integrated into a framework called HLS-l. It was tested on benchmark designs implemented in 1.1-V, 45 nm complementary metal-oxide-semiconductor technology. Compared to the conventional HLS, HLS-l was able to reduce the latency by 18.2% on average with 9.2% less area and 16.0% less power consumption. The application of HLS-l to an industrial example is demonstrated through the design of a module extracted from H.264/advanced video coding.

Tong, Y.-S.; Chen, S.-J., "An Automatic Optical Simulation-Based Lithography Hotspot Fix Flow for Post-Route Optimization"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452094&isn...
Abstract: In this paper, an optical simulation-based lithography hotspot fix guidance generator and an automatic hotspot fix flow are proposed. We develop our aerial image simulation engine by enhancing the traditional sum of coherence system method. Subject to the shape changes, a strong correlation between the aerial image intensity difference maps of pre-optical proximity correction (OPC) and post-OPC schemes is found. We collect near a litho hotspot in a pre-OPC layout some fix actions that are local shape changes to optimize the optical intensity. Then, fix guidances will be selected from the collected fix actions by a heuristic algorithm and input to a router for fixing the hotspot. We integrate the fix guidance generation method with a commercial lithography hotspot detection tool to create an automatic post-route optical-simulation-embedded local fix (OSELF) flow and test with industry 65 nm designs. Compared with the commercial flow that uses only local fix, our method has a $1.4\times$–$1.9\times$ fix rate, similar run time, no new design rule check violation, and negligible circuit timing impacts. We also combine our OSELF algorithm with a rip-up and reroute engine, and test on the same designs. Compared to the commercial tool that uses a hybrid (local fix plus reroute) fix flow, our combined flow runs $1.7\times$–$2.9\times$ faster with 45–55% circuit timing impact. Both flows achieve a 100% hotspot fix rate.

Hsu, C.-H.; Chen, H.-Y.; Chang, Y.-W., "Multilayer Global Routing With Via and Wire Capacity Considerations"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452098&isn...
Abstract: Global routing for modern large-scale circuit designs has attracted much attention in the recent literature. Most of the state-of-the-art academic global routers just work on a simplified routing congestion model that ignores the essential via capacity for routing through multiple metal layers.
Such a simplified model would easily cause fatal routability problems in subsequent detailed routing. To remedy this deficiency, a more effective congestion metric that considers both the in-tile nets and the residual via capacity for global routing is presented. Experimental results show that our global router can achieve very high-quality routing solutions with more reasonable via usage.

Ho, K.-H.; Chen, Y.-P.; Fang, J.-W.; Chang, Y.-W., "ECO Timing Optimization Using Spare Cells and Technology Remapping"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452097&isn...
Abstract: We introduce in this paper a new problem of post-mask engineering change order (ECO) timing optimization using spare-cell rewiring and present a two-phase framework for this problem. Spare-cell rewiring is a popular technique for incremental timing optimization and/or functional change after the placement stage. The spare-cell rewiring problem is very challenging because of its dynamic wiring cost nature for selecting a spare cell, while the existing related problems consider only static wiring cost: once a standard cell is placed, its physical location is fixed and so is its wiring cost. For the spare-cell rewiring problem, each rewiring could make some spare cells become ordinary standard cells and some standard cells become new spare cells simultaneously. As a result, the wiring cost becomes dynamic and further complicates the optimization process. For the addressed problem, we present a two-phase framework of 1) buffer insertion and gate sizing followed by 2) technology remapping. For Phase 1, we present a dynamic programming algorithm considering the dynamic cost, called dynamic cost programming, for the ECO timing optimization with spare cells. Without loss of solution optimality, we further present an effective pruning method by selecting spare cells only inside an essential bounding polygon to reduce the solution space.
For those ECO timing paths that cannot be fixed during Phase 1, we apply technology remapping on the spare cells to restructure the circuit to fix the timing violations. The whole framework is integrated into a commercial design flow. Experimental results based on five industry benchmarks show that our method is very effective and efficient in fixing the timing violations of ECO paths.

Fang, J.-W.; Chang, Y.-W., "Area-I/O Flip-Chip Routing for Chip-Package Co-Design Considering Signal Skews"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452092&isn...
Abstract: The area-input/output (I/O) flip-chip package provides a high chip-density solution to the demand of more I/Os in very large scale integration designs; it can achieve smaller package size, shorter wirelength, and better signal and power integrity. In this paper, we introduce the routing problem for chip and package co-design and present the first work in the literature to handle the multiple re-distribution layer (RDL) routing problem (without RDL vias) for flip-chip designs, considering pin and layer assignment, signal integrity, signal-skew and total wirelength minimization, and chip-package co-design. Our router adopts a two-stage technique of global routing followed by RDL routing. The global routing assigns each block port to a unique bump pad via an I/O pad and decides the RDL routing among I/O pads and bump pads. Based on the minimum-cost maximum-flow algorithm, we can guarantee 100% RDL routing completion after the assignment and the optimal solution with the minimum wirelength. The RDL routing efficiently distributes the routing points between two adjacent bump pads and then generates a 100% routable sequence to complete the routing. Experimental results based on 12 industry designs demonstrate that our router can achieve 100% routability and the optimal routing wirelength under reasonable central processing unit times, while related works cannot.

Joshi, V.; Cline, B.; Sylvester, D.; Blaauw, D.; Agarwal, K., "Mechanical Stress Aware Optimization for Leakage Power Reduction"
Abstract: Process-induced mechanical stress is used to enhance carrier transport and achieve higher drive currents in current complementary metal-oxide-semiconductor technologies. This paper explores how to fully exploit the layout dependence of stress enhancement and proposes a circuit-level, block-based, stress-enhanced optimization algorithm that uses stress-optimized layouts in conjunction with dual-$V_{\rm th}$ assignment to achieve optimal power-performance tradeoffs. We begin by studying how channel stress and drive current depend on layout parameters such as active area length and contact placement, while considering all layout-dependent sources of mechanical stress in a 65 nm industrial process. We then investigate the three main layout properties that impact mechanical stress in this process and discuss how to improve stress-based performance enhancement in standard cell libraries. While varying the stress-altering layout properties of a number of standard cells in a 65 nm industrial library, we show that "dual-stress" standard cell layouts (analogous to "dual-$V_{\rm th}$") can be designed to achieve drive current differences up to ${\sim}14\%$ while incurring less than half the leakage penalty of dual-$V_{\rm th}$. Therefore, when the flexibility of "dual-stress" assignment is combined with dual-$V_{\rm th}$ assignment (within the proposed joint optimization framework), simulation results for a set of benchmark circuits show that leakage is reduced by ${\sim}24\%$ on average, for iso-delay, when compared to dual-$V_{\rm th}$ assignment. Since mobility enhancement does not incur the exponential leakage penalty associated with $V_{\rm th}$ assignment, our optimization technique is ideal for leakage power reduction.
However, our framework can also be used to achieve higher performance circuits for iso-leakage and our joint optimization framework can be used to reduce delay on average by ${\sim}5\%$. In both cases, the proposed method only incurs a small area penalty $({<}0.5\%)$.

Alizadeh, B.; Mirzaei, M.; Fujita, M., "Coverage Driven High-Level Test Generation Using a Polynomial Model of Sequential Circuits"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452110&isn...
Abstract: This paper proposes a high-level test generation method which considers the control part as well as data path of a register transfer level circuit as a set of polynomial functions to generate behavioral test patterns from faulty behavior instead of comparing the faulty and fault-free circuits based on a hybrid Boolean-word canonical representation called Horner expansion diagram. Since this set of polynomial functions express primary outputs and next states with respect to primary inputs and present states, it is not necessary to perform justification/propagation phase which leads to a minimum number of backtracks. It improves fault coverage and reduces test generation time over logic-level techniques. We then assess the effectiveness of high-level test generation with a simple gate-level automatic test pattern generation algorithm. Experimental results show robustness and reliability of our method compared to other contemporary approaches in terms of fault coverage and CPU time.

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452130&isn...

Zolotov, V.; Xiong, J.; Fatemi, H.; Visweswariah, C., "Statistical Path Selection for At-Speed Test"
Abstract: Process variations make at-speed testing significantly more difficult. They cause subtle delay changes that are distributed in contrast to the localized nature of a traditional fault model.
Due to parametric variations, different paths can be critical in different parts of the process space, and the union of such paths must be tested to obtain good process space coverage. This paper proposes an integrated at-speed structural testing methodology, and develops a novel branch-and-bound algorithm that elegantly and efficiently solves the hitherto open problem of statistical path tracing. The resulting paths are used for at-speed structural testing. A new test quality metric is proposed, and paths which maximize this metric are selected. After chip timing has been performed, the path selection procedure is extremely efficient. Path selection for a multimillion gate chip design can be completed in a matter of seconds.

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452120&isn...

Yilmaz, M.; Chakrabarty, K.; Tehranipoor, M., "Test-Pattern Selection for Screening Small-Delay Defects in Very-Deep Submicrometer Integrated Circuits"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452121&isn...
Abstract: Timing-related defects are major contributors to test escapes and in-field reliability problems for very-deep submicrometer integrated circuits. Small delay variations induced by crosstalk, process variations, power-supply noise, as well as resistive opens and shorts can potentially cause timing failures in a design, thereby leading to quality and reliability concerns. We present a test-grading technique that uses the method of output deviations for screening small-delay defects (SDDs). A new gate-delay defect probability measure is defined to model delay variations for nanometer technologies. The proposed technique intelligently selects the best set of patterns for SDD detection from an $n$-detect pattern set generated using timing-unaware automatic test-pattern generation (ATPG).
It offers significantly lower computational complexity and excites a larger number of long paths compared to a current generation commercial timing-aware ATPG tool. Our results also show that, for the same pattern count, the selected patterns provide more effective coverage ramp-up than timing-aware ATPG and a recent pattern-selection method for random SDDs potentially caused by resistive shorts, resistive opens, and process variations.

Ganeshpure, K.; Kundu, S., "On ATPG for Multiple Aggressor Crosstalk Faults"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452112&isn...
Abstract: Crosstalk faults have emerged as a significant mechanism of circuit failure due to decreasing process geometries and increasing operation frequencies. Long signal nets are highly susceptible to crosstalk faults because they tend to have a higher coupling capacitance to overall capacitance ratio. Moreover, a typical long net also has multiple aggressors. In generating patterns to create maximal crosstalk induced delay on a victim net, it may be impossible to activate all aggressors logically or simultaneously to constructively induce maximum noise at the victim. Therefore, pattern generation must focus on activating a maximal subset of aggressors, weighted by actual coupling capacitance value, in close temporal proximity of the victim net transition. This max-satisfiability problem is constrained by fault effect propagation condition which involves determining an input signal assignment so as to propagate the fault effect at the victim to the primary output. In this paper, we present Automatic Test Pattern Generation (ATPG) solutions for multiple aggressor crosstalk faults for zero and unit delay models and compare the magnitude of crosstalk induced delay at the victim net. Our solution involves a combination of 0–1 Integer Linear Programming (ILP), for maximal aggressor excitation.
Fault effect propagation is solved independently by using traditional stuck-at fault ATPG or by generating additional ILP constraints, thus forming an integrated ILP formulation with error propagation. The effect of gate delays is summed by circuit transformation. The proposed technique was applied to ISCAS85 benchmark circuits. Results indicate that the percentage of total capacitance that can be switched varies from 75–100% for zero delay and 30–80% for variable delay case while achieving propagation of the fault effect to primary output.

Alves, N.; Buben, A.; Nepal, K.; Dworak, J.; Bahar, R. I., "A Cost Effective Approach for Online Error Detection Using Invariant Relationships"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452115&isn...
Abstract: This paper investigates the use of logic implication checkers for the online detection of errors. A logic implication, or invariant relationship, must hold for all valid input conditions; therefore, any violation of this implication will indicate an error due to an intermittent fault. Techniques are presented to efficiently identify the most useful logic implications to include in checker hardware such that the probability of error detection is maximized while minimizing the additional hardware and delay overhead. Results show that significant error detection is possible, even with only a 10% area overhead, while minimizing impact on delay and power.

Qian, Y.; Lu, Z.; Dou, W., "Analysis of Worst-Case Delay Bounds for On-Chip Packet-Switching Networks"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452093&isn...
Abstract: In network-on-chip (NoC), computing worst-case delay bounds for packet delivery is crucial for designing predictable systems, yet an intractable problem. This paper presents an analysis technique to derive per-flow communication delay bound.
Based on a network contention model, this technique, which is topology independent, employs network calculus to first compute the equivalent service curve for an individual flow and then calculate its packet delay bound. To exemplify this method, this paper also presents the derivation of a closed-form formula to compute a flow's delay bound under all-to-one gather communication. Experimental results demonstrate that the theoretical bounds are correct and tight.

Majzoub, S. S.; Saleh, R. A.; Wilton, S. J. E.; Ward, R. K., "Energy Optimization for Many-Core Platforms: Communication and PVT Aware Voltage-Island Formation and Voltage Selection Algorithm"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452113&isn...
Abstract: In this paper, we propose a novel approach to voltage-island formation, for the energy optimization of many-core architectures, which mitigates the impact of process, voltage, and temperature (PVT) variations. The islands are created by balancing their shape constraints imposed by intra and inter-island communication with the desire to limit the spatial extent of each island to minimize PVT impact. In addition, to reduce the number of voltage levels in the design, we propose an efficient voltage selection approach that provides near optimal results, for a set of 33 examined cases, with more than a ten times speedup compared to the best-known previous methods. This run-time improvement is important, especially for large many-core platforms. Finally, we present an evaluation platform considering pre-fabrication and post-fabrication PVT scenarios where multiple applications with hundreds to thousands of tasks are mapped onto many-core platforms with hundreds to thousands of cores to evaluate the proposed techniques. Results show that the average energy savings for 33 test cases using the proposed methods are 37% compared to 16% obtained using previous methods.

Short Papers
============

Broussev, S. S.; Tchamov, N.
T., "Time-Varying Root-Locus of Large-Signal LC Oscillators"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452116&isn...
Abstract: Time-varying root-locus of a large-signal LC oscillator is obtained via semi-symbolic analysis, where the complex frequency $s$ is a symbol. The steady-state circuit behavior is modeled by its time-varying small-signal admittance matrix within one oscillation period. The roots of the characteristic equation are computed with the QZ method. The time-varying root-locus analysis capabilities are demonstrated on a GHz range 130 nm complementary metal-oxide-semiconductor cross-coupled LC oscillator, and they complement the results obtained with the traditional numerical computer-aided design methods.

Tzeng, C.-W.; Huang, S.-Y., "Split-Masking: An Output Masking Scheme for Effective Compound Defect Diagnosis in Scan Architecture With Test Compression"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452123&isn...
Abstract: In modern scan architecture, it is often desirable to compact the output response without jeopardizing the diagnostic resolution. In this paper, we propose an output masking scheme to meet such a stringent requirement. We consider a practical scenario in which an output compactor is in use. We aim to support the harshest condition called compound defect diagnosis, in which faults exist in both the scan chain and the core logic. To overcome the loss of the diagnostic resolution, we incorporate a split-masking scheme, by which one can easily separate the output responses of the faulty chains from those of the fault-free ones. The experimental results demonstrate that the proposed scheme can recover the diagnostic resolution loss induced by an output compactor almost completely without sacrificing the compaction ratio.

Suissa, A.; Romain, O.; Denoulet, J.; Hachicha, K.; Garda, P., "Empirical Method Based on Neural Networks for Analog Power Modeling"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452129&isn...
Abstract: We introduce an empirical method for power consumption modeling of analog components at system level. The principal step of this method uses neural networks to approximate the mathematical curve of the power consumption as a function of the inputs and parameters of the analog component. For a node of a wireless sensors network, we found an average error of 1.53% with a maximum error of 3.06% between our estimation and the measured power consumption. This novel method is suitable for Platform-Based Design and has three key features for architecture exploration purposes. Firstly, the method is generic as it can be applied to any analog component in any modeling and simulation environment. Secondly, the method is suitable for the total (analog and digital) power consumption estimation of a heterogeneous system. Thirdly, the method provides an online estimation of the instantaneous power consumption of analog blocks.

Chang, C.-H.; Faust, M., "On A New Common Subexpression Elimination Algorithm for Realizing Low-Complexity Higher Order Digital Filters"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5452122&isn...
Abstract: A thorough analysis of the paper above revealed several controversial arguments about the superiority of binary representation over canonical signed digits (CSD) for common subexpression elimination (CSE). It was improper to model the number of logic operators (LO) required after CSE as a linear sum of independently weighted numbers of nonzero bits, common subexpressions and unpaired bits. The logic depth (LD) penalty of binary CSE had been deemphasized by the errors in the reported LD.
This comment corrects the LD of contention resolution algorithm, and points out some contradictions with reference to the latest experimentation of binary, CSD and minimal signed digit number representations for CSE. Upon correcting the error in the reported filter lengths for different stopband attenuations of digital advanced mobile phone system specification, the LO and LD data of the CSE algorithms compared in the above paper are recalculated using the corrected filter coefficient sets.