TCAD Newsletter - March 2011 Issue
Placing you one click away from the best new CAD research!

Regular Papers
==============

ANALOG, MIXED-SIGNAL, AND RF CIRCUITS

Lin, M. P.-H.  Zhang, H.  Wong, M. D. F.  Chang, Y.-W.
Thermal-Driven Analog Placement Considering Device Matching
With the thermal effect, improper analog placements may degrade circuit performance because the thermal impact from power devices can affect electrical characteristics of the thermally-sensitive devices. There is not much previous work that considers the desired placement configuration between power and thermally-sensitive devices for a better thermal profile to reduce the thermally-induced mismatches. This paper first introduces the properties of a desired thermal profile for better thermal matching of the matched devices. It then presents a thermal-driven analog placement methodology to achieve the desired thermal profile and to consider the best device matching under the thermal profile while satisfying the symmetry and the common-centroid constraints. Experimental results based on real analog circuits show that the proposed approach can achieve the best analog circuit performance/accuracy with the least impact due to the thermal gradient, among existing works.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715604

EMERGING TECHNOLOGIES

Rostami, M.  Mohanram, K.
Dual-Vth Independent-Gate FinFETs for Low Power Logic Circuits
This paper describes the electrode work-function, oxide thickness, gate-source/drain underlap, and silicon thickness optimization required to realize dual-Vth independent-gate FinFETs. Optimum values for these FinFET design parameters are derived using the physics-based University of Florida SPICE model for double-gate devices, and the optimized FinFETs are simulated and validated using Sentaurus TCAD simulations.
Dual-Vth FinFETs with independent gates enable series and parallel merge transformations in logic gates, realizing compact low power alternative gates with competitive performance and reduced input capacitance in comparison to conventional FinFET gates. Furthermore, they also enable the design of a new class of compact logic gates with higher expressive power and flexibility than conventional CMOS gates, e.g., implementing 12 unique Boolean functions using only four transistors. Circuit designs that balance and improve the performance of the novel gates are described. The gates are designed and calibrated using the University of Florida double-gate model into conventional and enhanced technology libraries. Synthesis results for 16 benchmark circuits from the ISCAS and OpenSPARC suites indicate that on average at 2 GHz, the enhanced library reduces total power and the number of fins by 36% and 37%, respectively, over a conventional library designed using shorted-gate FinFETs in 32 nm technology.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715611

HIGH-LEVEL SYNTHESIS

Del Barrio, A. A.  Memik, S. O.  Molina, M. C.  Mendias, J. M.  Hermida, R.
A Distributed Controller for Managing Speculative Functional Units in High Level Synthesis
Speculative functional units (SFUs) are arithmetic functional units that operate using a predictor for the carry signal. The carry prediction helps to shorten the critical path of the functional unit. The average case performance of these units is determined by the hit rate of the prediction. In case of mispredictions, the SFUs need to be coordinated by the datapath control mechanism to perform corrections and to maintain the datapath in the correct state. Devising a control mechanism for correcting mispredictions without adversely impacting overall performance is the most important challenge. In this paper, we present techniques for designing a datapath controller for seamless deployment of SFUs in high level synthesis.
We have developed two techniques based on two main control paradigms: centralized and distributed control. The centralized approach stops the execution of the entire datapath for each misprediction and resumes execution once the correct value of the carry is known. The distributed approach decouples the functional unit suffering from the misprediction from the rest of the datapath. Hence, it allows the remainder of the functional units to carry on execution and be at different scheduling states at different times. We tested datapaths utilizing both linear structures and logarithmic structures for speculative arithmetic functional units. Our results show that it is possible to reduce execution time by as much as 38% (33% on average) for linear structures and by as much as 37.2% (25% on average) for logarithmic structures.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715599

MODELING AND SIMULATION

Roy, S.  Dounavis, A.
Transient Simulation of Distributed Networks Using Delay Extraction Based Numerical Convolution
This paper presents a numerical convolution based approach for transient simulation of distributed networks characterized by band limited frequency domain data when terminated with arbitrary nonlinear circuits. The proposed algorithm uses time-frequency decompositions to extract multiple propagation delays of a distributed network and the associated attenuation losses in a piecewise manner, and implements inverse fast Fourier transform to efficiently convert the frequency response into a sum of delayed time domain responses. Numerical examples illustrate that the proposed algorithm shows significantly more accurate results for networks with multiple long delays when compared to existing numerical convolution techniques.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715600

Miettinen, P.  Honkala, M.  Roos, J.  Valtonen, M.
PartMOR: Partitioning-Based Realizable Model-Order Reduction Method for RLC Circuits
This paper presents a robust partitioning-based model-order reduction (MOR) method, PartMOR, suitable for reduction of very large RLC circuits or RLC-circuit parts of a non-RLC circuit. The MOR is carried out on a partitioned circuit, which enables the use of low-order moments and macromodels of few elements, while still preserving good accuracy for the reduction. As the method produces a positive-valued, passive, and stable reduced-order RLC circuit (netlist-in–netlist-out), it can be used in conjunction with any standard analysis tool or circuit simulator without modification. It is shown that PartMOR achieves excellent reduction results in terms of accuracy and reduced CPU time for RLC, RC, and RL circuits.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715610

Cheng, L.  Gupta, P.  Spanos, C. J.  Qian, K.  He, L.
Physically Justifiable Die-Level Modeling of Spatial Variation in View of Systematic Across Wafer Variability
Modeling spatial variation is important for statistical analysis. Most existing works model spatial variation as spatially correlated random variables. We discuss process origins of spatial variability, all of which indicate that spatial variation comes from deterministic across-wafer variation, and purely random spatial variation is not significant. We analytically study the impact of across-wafer variation and show how it gives an appearance of correlation. We have developed a new die-level variation model considering deterministic across-wafer variation and derived the range of conditions under which ignoring spatial variation altogether may be acceptable. Experimental results show that for statistical timing and leakage analysis, the error of our model relative to the exact simulation results is within 2% and 5%, respectively, while the error of the existing distance-based spatial variation model is up to 6.5% and 17%, respectively.
Moreover, our new model is also 6x faster than the spatial variation model for statistical timing analysis and 7x faster for statistical leakage analysis.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715598

PHYSICAL DESIGN

Feng, C.  Zhou, H.  Yan, C.  Tao, J.  Zeng, X.
Efficient Approximation Algorithms for Chemical Mechanical Polishing Dummy Fill
To reduce chip-scale topography variation in the chemical mechanical polishing process, dummy fill is widely used to improve the layout density uniformity. Previous research formulated the density-driven dummy fill problem as a standard linear program (LP). However, solving the huge linear program formed by real-life designs is very expensive and has become a hurdle in deploying the technology. Even though there exist efficient heuristics, their performance cannot be guaranteed. Furthermore, dummy fill can also change the interconnect coupling capacitance, which might lead to a significant influence on circuit delay, crosstalk, and power consumption. In this paper, we develop a dummy fill algorithm that can be applied to solve both the traditional density-driven problem and the problem considering fill-induced coupling capacitance impact. The proposed algorithm is efficient and has provably good performance; it is based on a fully polynomial time approximation scheme by Fleischer for covering LP problems. Moreover, based on the approximation algorithm, we also propose a new greedy iterative algorithm to achieve high quality solutions more efficiently than previous Monte Carlo based heuristic methods. Final experimental results demonstrate the effectiveness and efficiency of our algorithms.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715601

Liu, Y.  Shelar, R. S.  Hu, J.
Simultaneous Technology Mapping and Placement for Delay Minimization
Technology mapping and placement have a significant impact on delays in standard cell-based very large scale integrated circuits.
Traditionally, these steps are applied separately to optimize the delays, possibly since efficient algorithms that allow the simultaneous exploration of the mapping and placement solution spaces are unknown. In this paper, we present an exact polynomial time algorithm for delay-optimal placement of a tree and extend the same to simultaneous technology mapping and placement for the optimal delay in the tree. We extend the algorithm by employing a Lagrangian relaxation technique, which assesses the timing criticality of paths beyond a tree, to optimize the delays in directed acyclic graphs. Experimental results on benchmark circuits in a 70 nm technology show that our algorithms improve timing significantly with remarkably shorter runtimes compared to a competitive approach of iterative conventional timing-driven mapping and multilevel placement.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715609

SYSTEM-LEVEL DESIGN

Lan, Y.-C.  Lin, H.-A.  Lo, S.-H.  Hu, Y. H.  Chen, S.-J.
A Bidirectional NoC (BiNoC) Architecture With Dynamic Self-Reconfigurable Channel
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715603
A bidirectional channel network-on-chip (BiNoC) architecture is proposed to enhance the performance of on-chip communication. In a BiNoC, each communication channel can be dynamically self-reconfigured to transmit flits in either direction. This added flexibility promises better bandwidth utilization, lower packet delivery latency, and higher packet consumption rate. A novel on-chip router architecture is developed to support dynamic self-reconfiguration of the bidirectional traffic flow. This area-efficient BiNoC router delivers better performance and requires a smaller buffer size than that of a conventional network-on-chip (NoC). The flow direction at each channel is controlled by a channel direction control (CDC) algorithm.
Implemented with a pair of finite state machines, this CDC algorithm is shown to be high performance, free of deadlock, and free of starvation. Extensive cycle-accurate simulations using synthetic and real-world traffic patterns have been conducted to evaluate the performance of the BiNoC. These results exhibit a consistent and significant performance advantage over a conventional NoC equipped with hard-wired unidirectional channels.

Nurvitadhi, E.  Hoe, J. C.  Kam, T.  Lu, S.-L. L.
Automatic Pipelining From Transactional Datapath Specifications
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715612
This paper presents a transactional specification framework (T-spec) for describing a datapath and the tool T-piper to synthesize automatically an in-order pipelined implementation with arbitrary user-specified pipeline-stage boundaries. T-spec abstractly views a datapath as executing one transaction at a time, computing the next system states based on the current ones. The synthesized pipeline maintains this semantics, yet allows concurrent execution of multiple overlapped transactions in different pipeline stages, where each stage performs a part of the next-state computation of each transaction. T-spec makes the state reading and writing events in a datapath explicit to enable T-piper to perform exact read-after-write (RAW) hazard analysis between the overlapped transactions. T-piper can automatically generate the pipeline control not only to ensure the correctness of the pipelined executions but also to minimize (using forwarding and speculation) the performance loss due to pipeline stalls in the presence of RAW dependencies. This paper reports design case studies applying T-spec and T-piper to reduced instruction set computing and complex instruction set computing processor pipeline development.
In the latter, we report the results from a rapid design space exploration of 60 generated x86-subset pipelines, varying in pipeline depth, forwarding, and speculative execution, all starting from a single T-spec.

TEST

Wu, S.  Wang, L.-T.  Wen, X.  Jiang, Z.  Tan, L.  Zhang, Y.  Hu, Y.  Jone, W.-B.  Hsiao, M. S.  Li, J. C.-M.  Huang, J.-L.  Yu, L.
Using Launch-on-Capture for Testing Scan Designs Containing Synchronous and Asynchronous Clock Domains
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715613
This paper presents a hybrid automatic test pattern generation (ATPG) technique using the staggered launch-on-capture (LOC) scheme followed by the one-hot LOC scheme for testing delay faults in a scan design containing asynchronous clock domains. Typically, the staggered scheme produces small test sets but needs long ATPG runtime, whereas the one-hot scheme takes short ATPG runtime but yields large test sets. The proposed hybrid technique is intended to reduce test pattern count with acceptable ATPG runtime for multi-million-gate scan designs. In case the scan design contains multiple synchronous clock domains, each group of synchronous clock domains is treated as a clock group and tested using a launch-aligned or a capture-aligned LOC scheme. By combining these schemes together, we found the pattern counts for two large industrial designs were reduced by approximately 1.7X to 2.1X, while the ATPG runtime was increased by 10% to 50%, when compared to the one-hot clocking scheme alone.

SHORT PAPERS
==============

Chen, C.-C.  Kuo, C.-W.  Yang, Y.-J.
Generating Passive Compact Models for Piezoelectric Devices
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715614
In this letter, a model-order-reduction method for generating piezoelectric compact models is presented. An Arnoldi-based technique is used to create compact models from the system matrices generated by a piezoelectric finite-element solver.
The proposed approach also preserves the passivity of the generated compact models. The modeling results of a Rosen-type piezoelectric transformer are presented. The transient and frequency responses of the devices can be accurately simulated by the compact models. Moreover, compared with the full-meshed simulations, the compact models give computational speed-ups of at least two orders of magnitude.

Joo, Y.-P.  Kim, S.  Ha, S.
Fast Communication Architecture Exploration of Processor Pool-Based MPSoC via Static Performance Analysis
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5715602
Multiprocessor systems-on-chip (MPSoCs) are evolving toward processor pool-based architecture that employs a hierarchical on-chip network for inter-processor and intra-processor pool communication. This letter presents a systematic exploration method of the cascaded bus matrix-based on-chip network design for processor pool-based MPSoCs. It uses an evolutionary algorithm to find optimal architectures in terms of on-chip area while satisfying a given performance constraint. Since simulation is too time-consuming to evaluate the performance of complex on-chip networks during architecture exploration, we propose to prune the design space efficiently using two novel static analysis techniques: 1) bandwidth analysis considering task execution dependences, and 2) memory contention analysis for accurate performance estimation. Thanks to fast and accurate evaluation by the proposed analysis techniques, we achieved an order of magnitude speed improvement for the architecture exploration without performance loss, compared with a simulation-based approach.