TCAD Newsletter - January 2011 Issue Placing you one click away from the best new CAD research! Regular Papers FPGAS AND RECONFIGURABLE COMPUTING Banerjee, P.; Sangtani, M.; Sur-Kolay, S., "Floorplanning for Partially Reconfigurable FPGAs" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671549&isn... Abstract: Partial reconfiguration on heterogeneous field-programmable gate arrays with millions of gates yields better utilization of its different types of resources by swapping in and out the appropriate modules of one or more applications at any instant of time. Given a schedule of sub-task instances where each instance is specified as a netlist of active modules, reconfiguration overhead can be reduced by fixing the position and shapes of modules common across all instances. We propose a global floorplan generation method PartialHeteroFP to obtain same positions for the common modules across all instances such that the heterogeneous resource requirements of all modules in each instance are satisfied, and the total half-perimeter wirelength over all instances is minimal. Experimental results establish that the proposed PartialHeteroFP produces floorplans very fast, with 100% match of common modules and thereby minimizing the partial reconfiguration overhead. Ling, A. C.; Brown, S. D.; Safarpour, S.; Zhu, J., "Toward Automated ECOs in FPGAs" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671540&isn... Abstract: Engineering change orders (ECOs), which are used to apply late-stage specification changes and bug fixes, have become an important part of the field-programmable gate array design flow. ECOs are beneficial since they are applied directly to a placed-and-routed netlist which preserves most of the engineering effort invested previously. Unfortunately, designers often apply ECOs in a manual fashion which may have an unpredictable impact on the design's final correctness and end costs. As a solution, this paper introduces an automated method to tackle the ECO problem. This paper uses a novel resynthesis technique which can automatically update the functionality of a circuit by leveraging the existing logic within the design, thereby removing the inefficient manual effort required by a designer. The technique presented in this paper is robust enough to handle a wide range of changes. Furthermore, the technique can successfully make late-stage functional changes while minimally perturbing the placed-and-routed netlist: something that is necessary for ECOs. Also, this technique does this with a minimal impact on the circuit performance where on average over 90% of the placement and routing wires remain unchanged. LOGIC SYNTHESIS Keren, O.; Levin, I.; Stankovic, R. S., "Determining the Number of Paths in Decision Diagrams by Using Autocorrelation Coefficients" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671542&isn... Abstract: This paper deals with the number of paths in multiterminal binary decision diagrams (MTBDDs) and shared binary decision diagrams (SBDDs) representing a set of Boolean functions. It is shown that the number of paths in an MTBDD (SBDD) can be uniquely determined by values of specific weighted-autocorrelation coefficients. An analytical expression for the number of paths as a linear function of the values of the weighted-autocorrelation coefficients is presented. Based on this expression, a method of minimization of the number of paths is proposed. The method is based on replacing the initial set of input variables with their linear combinations. By using this method, a deterministic paths-reduction procedure, which provides MTBDDs and SBDDs with a reduced number of paths, is presented. The efficiency of the suggested approach is demonstrated on benchmark functions. MODELING AND SIMULATION Ye, X.; Dong, W.; Li, P.; Nassif, S., "Hierarchical Multialgorithm Parallel Circuit Simulation" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671539&isn... Abstract: The emergence of multicore and many-core processors has introduced new opportunities and challenges to electronic design automation research and development. While the availability of increasing parallel computing power holds new promise to address many challenges in computer-aided design (CAD), the leverage of hardware parallelism can only be possible with a new generation of parallel CAD applications. In this paper, we propose a novel hierarchical multialgorithm (MA) parallel circuit simulation approach and its multicore implementation to expedite one of the most fundamental CAD applications: transistor-level transient circuit simulation. In our parallel circuit simulation approach, we create two levels of parallelism. At the higher level of parallelism, we start multiple simulation algorithms in parallel for a given simulation task. Interalgorithm communication is established to enable simulation algorithms to exchange useful information so that they could advance faster than without doing so. At the lower level of parallelism, each algorithm within the MA framework utilizes fine-grained parallel techniques such as parallel device evaluation and parallel matrix solve to fully harness the available hardware resources. By combining the two levels of parallelism, the computing power of the multicore or many-core processor platforms can be fully utilized to achieve superlinear speedup in circuit simulation. Xie, L.; Davoodi, A., "Bound-Based Statistically-Critical Path Extraction Under Process Variations" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671544&isn... Abstract: This paper introduces a bound-based approach to extract a pre-specified number of statistically-critical paths under process variations. These are the paths with the highest "violation probability", which indicates the probability that a path would violate a given timing constraint. Our approach requires pre-computation of the violation probability of all the nodes and edges in the circuit timing graph, which can be done using two rounds of block-based statistical static timing analysis. Given these node/edge violation probabilities, we derive tight upper and lower bounds for any arbitrary segment of consecutive nodes and edges, which is the major contribution of this paper. We further utilize these bounds to extract the statistically-critical paths and show constant-time for incremental update of the bounds when extending a segment to a longer one. If our goal is to extract the single most statistically-critical path, we show a bound-based reduction that can prune a large portion of circuit without losing optimality. In our simulations, we verify the correctness and accuracy of our bounds for individual paths, and compare with exact path extraction using Monte-Carlo-based simulation, and an alternative which incorporates path-based statistical static timing analysis. PHYSICAL DESIGN Wu, T.-H.; Davoodi, A.; Linderoth, J. T., "GRIP: Global Routing via Integer Programming" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671547&isn... Abstract: This paper introduces GRIP, a global routing technique via integer programming. GRIP optimizes wirelength and via cost directly without going through a traditional layer assignment phase. Candidate routes spanning all the metal layers are generated using a linear programming pricing phase that formally accounts for the impact of existing candidate routes when generating new ones. To make an integer-programming-based approach applicable for today's large-scale global routing instances, the original problem is decomposed into smaller subproblems corresponding to rectangular subregions on the chip together with their net assignments. Route fragments of nets that fall in adjacent subproblems are connected in a flexible manner. In case of overflow, GRIP applies a second-phase optimization that explicitly minimizes overflow. By using integer programming in an effective manner, GRIP obtains high-quality solutions. Specifically, for the ISPD 2007 and 2008 benchmarks, GRIP obtains an average improvement in wirelength and via cost of 9.23% and 5.24%, respectively, when compared to the best result in the open literature. Ma, Q.; Xiao, L.; Tam, Y.-C.; Young, E. F. Y., "Simultaneous Handling of Symmetry, Common Centroid, and General Placement Constraints" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671537&isn... Abstract: In today's system-on-chip designs, both digital and analog parts of a circuit will be implemented on the same chip. Parasitic mismatch induced by layout will affect circuit performance significantly for analog designs. Consideration of symmetry and common centroid constraints during placement can help to reduce these errors. Besides these two specific types of placement constraints, other constraints, such as alignment, abutment, preplace, and maximum separation, are also essential in circuit placement. In this paper, we will present a placement methodology that can handle all these constraints at the same time. To the best of our knowledge, this is the first piece of work that can handle symmetry constraint, common centroid constraint, and other general placement constraints, simultaneously. Experimental results do confirm the effectiveness and scalability of our approach in solving this mixed constraint-driven placement problem. Jang, H.; Joo, D.; Kim, T., "Buffer Sizing and Polarity Assignment in Clock Tree Synthesis for Power/Ground Noise Minimization" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671543&isn... Abstract: In synchronous systems, clock tree causes high peak current at clock edges, increasing power/ground noise significantly, if the clock tree is not carefully designed. This paper addresses the problem of minimizing power/ground noise in the clock tree synthesis. Contrary to the previous approaches which only make use of assigning polarities to clock buffers to reduce power/ground noise, our approach solves a new problem of simultaneous consideration of assigning polarities to clock buffers and determining buffer sizes to fully exploit the effects of buffer sizing together with polarity assignment on the minimization of power/ground noise while satisfying the clock skew constraint. Specifically, the contributions of this paper are: 1) precisely estimating peak currents by clock buffers and reflecting them on the power/ground noise minimization; 2) proposing a pseudo-polynomial time optimal algorithm based on dynamic programming for solving the integrated problem, together with the proof of intractability of the problem; 3) devising a systematic design flow framework for reducing the power/ground noise over the entire chip; and 4)considering the effect of thermal variation on the clock skew bound and the noise minimization. SYSTEM-LEVEL DESIGN Kim, J.; Yoo, S.; Kyung, C.-M., "Program Phase-Aware Dynamic Voltage Scaling Under Variable Computational Workload and Memory Stall Environment" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671536&isn... Abstract: Most complex software programs are characterized by program phase behavior and runtime distribution. Dynamism of the two characteristics often makes the design-time workload prediction difficult and inefficient. Especially, memory stall time whose variation is significant in memory-bound applications has been mostly neglected or handled in a too simplistic manner in previous works. In this paper, we present a novel online dynamic voltage and frequency scaling (DVFS) method which takes into account both program phase behavior and runtime distribution of memory stall time, as well as computational workload. The online DVFS problem is addressed in two ways: intraphase workload prediction and program phase detection. The intraphase workload prediction is to predict the workload based on the runtime distribution of computational workload and memory stall time in the current program phase. The program phase detection is to identify to which program phase the current instant belongs and then to obtain the predicted workload corresponding to the detected program phase, which is used to set voltage and frequency during the program phase. The proposed method considers leakage power consumption as well as dynamic power consumption by a temperature-aware combined Vdd/Vbb scaling. Compared to a conventional method, experimental results show that the proposed method provides up to 34.6% and 17.3% energy reduction for two multimedia applications, MPEG4 and H.264 decoder, respectively. Loi, I.; Angiolini, F.; Fujita, S.; Mitra, S.; Benini, L., "Characterization and Implementation of Fault-Tolerant Vertical Links for 3-D Networks-on-Chip" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671541&isn... Abstract: Through silicon vias (TSVs) provide an efficient way to support vertical communication among different layers of a vertically stacked chip, enabling scalable 3-D networks-on-chip (NoC) architectures. Unfortunately, low TSV yields significantly impact the feasibility of high-bandwidth vertical connectivity. In this paper, we present a semi-automated design flow for 3-D NoCs including a defect-tolerance scheme to increase the global yield of 3-D stacked chips. Starting from an accurate physical and geometrical model of TSVs: 1) we extract a circuit-level model for vertical interconnections; 2) we use it to evaluate the design implications of extending switch architectures with ports in the vertical direction; moreover, 3) we present a defect-tolerance technique for TSV-based multi-bit links through an effective use of redundancy; and finally, 4) we present a design flow allowing for post-layout simulation of NoCs with links in all three physical dimensions. Experimental results show that a 3-D NoC implementation yields around 10% frequency improvement over a 2-D one, thanks to the propagation delay advantage of TSVs and the shorter links. In addition, the adopted fault tolerance scheme demonstrates a significant yield improvement, ranging from 66% to 98%, with a low area cost (20.9% on a vertical link in a NoC switch, which leads a modest 2.1% increase in the total switch area) in 130 nm technology, with minimal impact on very large-scale integrated design and test flows. TEST Xiang, D.; Zhang, Y., "Cost-Effective Power-Aware Core Testing in NoCs Based on a New Unicast-Based Multicast Scheme" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671538&isn... Abstract: Reuse of network-on-chip (NoC) for test data and test response delivery is attractive. However, previous techniques do not effectively use the bandwidths of the network by delivering test packets to all cores separately, which can make very much test cost and test data volume. The NoC core testing problem is formulated as a unicast-based multicast problem in order to reduce test data delivery time in the NoC. Test response data are forwarded back to the automated test equipment (ATE) via the communication channels using the reverse paths of test data delivery, which are compacted on the way from each processor to the ATE. A new power-aware test scheduling scheme is proposed, which is extended to cases for multiple port ATEs. Test data is further compressed before delivering and a low-power test application scheme is used for the cores because power produced by cores is the bottleneck of NoC test. Experimental results are presented to show the effectiveness of the proposed method in reducing the NoC test cost and test data volume by comparing to the previous methods. Biswas, S.; Blanton, R. D., "Reducing Test Execution Cost of Integrated, Heterogeneous Systems Using Continuous Test Data" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671546&isn... Abstract: Integrated, heterogeneous systems are comprehensively tested to verify whether their performance specifications fall within some acceptable ranges. However, explicitly testing every manufactured instance against all of its specifications can be expensive due to the complex requirements for test setup, stimulus application, and response measurement. To reduce manufacturing test cost, we have developed a methodology that uses binary decision forests and several test-specific enhancements for identifying redundant tests of an integrated system. Feasibility is empirically demonstrated using test data from over 70 000 manufactured instances of an in-production microelectromechanical system accelerometer, and over 4 500 manufactured instances of an RF transceiver. Through our analysis, we have shown that the expensive cold-mechanical test of the accelerometer and nine out of the 22 RF tests of the transceiver are likely redundant. Short Papers ============ Jaffari, J.; Anis, M., "On Efficient LHS-Based Yield Analysis of Analog Circuits" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5671545&isn... Abstract: The Latin hypercube sampling (LHS) has been used as a variance-reduction estimation tool for an efficient sampling-based variability analysis of analog circuits. For a certain estimation confidence interval, a lower number of LHS samples is needed than that of Monte Carlo due to the estimation variance reduction. In this paper, an analysis of variance decomposition of the indicator function, the yield function, reveals strong contribution of interactive terms in the variance of the yield function, leading to limited performance gain of the traditional LHS sampling. In order to improve its efficiency, two correlation-controlled LHS methods are developed to reduce the required number of LHS samples for analog circuit yield estimation.