March 2013 Newsletter

Placing you one click away from the best new CAD research!

CALL FOR NOMINATIONS: IEEE Transactions on CAD Donald O. Pederson Best Paper Award (Deadline: March 1, 2013)

The IEEE Transactions on CAD invites nominations for the 2013 Donald O. Pederson Best Paper Award. All papers that appeared in TCAD between January 2011 and December 2012 (both inclusive) are eligible to be nominated. By MARCH 1, 2013, please send your nomination(s) to tcad@umn.edu. The information required in a nomination is:

Nominator (should not be an author):
Title:
Authors:
Publication information:
- Volume:
- Issue:
- Page(s):
Basis for nomination (max 100 words):

REGULAR PAPERS

EMBEDDED SYSTEMS

Wu, J.; Wang, J.; Li, K.; Zhou, H.; Lv, Q.; Shang, L.; Sun, Y.
Large-Scale Energy Storage System Design and Optimization for Emerging Electric-Drive Vehicles
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461989

Energy consumption and the associated environmental impact are a pressing challenge faced by the transportation sector. Emerging electric-drive vehicles have shown promise for substantial reductions in petroleum use and vehicle emissions. Their success, however, has been hindered by the limitations of energy storage technologies. Existing in-vehicle lithium-ion battery systems are bulky, expensive, and unreliable. Energy storage system (ESS) design and optimization is essential for emerging transportation electrification. This paper presents an integrated ESS modeling, design, and optimization framework targeting emerging electric-drive vehicles. A large-scale ESS modeling solution is first presented, which considers major runtime and long-term battery effects, and uses fast frequency-domain analysis techniques for efficient and accurate characterization of large-scale ESS. The proposed design framework unifies design-time optimization and runtime control.
The framework conducts statistical optimization for ESS cost and lifetime, which jointly considers the variances of ESS due to manufacture tolerance and heterogeneous driver-specific runtime usage. It optimizes ESS design by incorporating complementary energy storage technologies, e.g., lithium-ion batteries and ultracapacitors. Using physical measurements of battery manufacture variation and real-world user driving profiles, our experimental study has demonstrated that the proposed framework effectively explores the statistical design space and produces cost-efficient ESS solutions with statistical system lifetime guarantees.

HIGH-LEVEL SYNTHESIS

Morvan, A.; Derrien, S.; Quinton, P.
Polyhedral Bubble Insertion: A Method to Improve Nested Loop Pipelining for High-Level Synthesis
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461983

High-level synthesis (HLS) allows hardware to be directly produced from behavioral description in C/C++, thus accelerating the design process. Loop pipelining is a key transformation of HLS, as it improves the throughput of the design at the price of a small hardware overhead. However, for small loops, its use often results in poor hardware utilization due to the pipeline latency overhead. Overlapping the iterations of the whole loop nest instead of only overlapping the innermost loop is a way to overcome this difficulty, but currently available techniques are restricted to perfectly nested loops with constant bounds, involving uniform dependences only. Using the polyhedral model, we extend the applicability of the nested loop pipelining transformation by proposing a new legality check and a new loop correction technique, called polyhedral bubble insertion. This method was implemented in a source-to-source compiler targeting HLS, and results on benchmark kernels show that polyhedral bubble insertion is effective in practice on a much larger class of loop nests.

MODELING AND SIMULATION

Yu, W.; Zhuang, H.; Zhang, C.; Hu, G.; Liu, Z.
RWCap: A Floating Random Walk Solver for 3-D Capacitance Extraction of Very-Large-Scale Integration Interconnects
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461990

A floating random walk (FRW) solver, called RWCap, is presented for the capacitance extraction of very-large-scale integration (VLSI) interconnects. An approach, including the numerical characterization of the cross-interface transition probability and weight value, is proposed to accelerate the extraction of structures with multiple dielectric layers. A comprehensive variance reduction scheme based on importance sampling and stratified sampling is proposed to improve the convergence rate of the FRW algorithm. Finally, a space management technique using an octree data structure and a parallel computing technique are presented to further improve the efficiency. Numerical experiments are carried out with test cases generated under the 180- and 45-nm process technologies. They demonstrate that the proposed multidielectric FRW algorithm achieves up to 160× speedup over the FRW algorithm using spherical transition domains to cross dielectric interfaces, with very small memory overhead. The variance reduction techniques bring a further 3× or more speedup without memory overhead or loss of accuracy. RWCap also outperforms other existing FRW algorithms and fast boundary element method solvers in terms of computational time or scalability. The experiments on an 8-core CPU machine show that the parallel RWCap is over 6× faster than its serial-computing version.

Li, B.; Chen, N.; Xu, Y.; Schlichtmann, U.
On Timing Model Extraction and Hierarchical Statistical Timing Analysis
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461978

In this paper, we investigate the challenges of applying statistical static timing analysis in hierarchical design flow, where modules supplied by IP vendors are used to hide design details for IP protection and to reduce the complexity of design and verification. For the three basic circuit types, combinational, flip-flop-based, and latch-controlled, we propose methods for extracting timing models that contain interfacing and compressed internal constraints. Using these compact timing models, the runtime of full-chip timing analysis can be reduced, while circuit details from IP vendors are not exposed. We also propose a method for reconstructing correlation between modules during full-chip timing analysis. This correlation cannot be incorporated into timing models because it depends on the layout of the corresponding modules in the chip. In addition, we investigate how to apply the extracted timing models with the reconstructed correlation to evaluate the performance of the complete design. Experiments demonstrate that, using the extracted timing models and reconstructed correlation, full-chip timing analysis can be several times faster than analyzing the flattened circuit directly, while the accuracy of statistical timing analysis is still well maintained.

PHYSICAL DESIGN

Chin, C.-Y.; Kuan, C.-Y.; Tsai, T.-Y.; Chen, H.-M.; Kajitani, Y.
Escaped Boundary Pins Routing for High-Speed Boards
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461975

Routing for high-speed boards is still achieved manually today. There have recently been some related works to solve this problem; however, a more practical problem has not been addressed. Usually, the packages or components are designed with or without the requirement from board designers, and the boundary pins are usually fixed or advised to follow when the board design starts.
In this paper, we describe this fixed ordering boundary pin routing problem, and propose a practical approach to solve it. Not only do we provide a way to address it, we also plan the wires to preserve the precious routing resources in the limited number of layers on the board and to deal effectively with obstacles. Our approach has different features compared with the conventional shortest-path-based routing paradigm. In addition, we consider length-matching requirements and wire shape resemblance for high-speed signal routes on board. Our results show that we can utilize routing resources very carefully, and can account for the resemblance of nets in the presence of obstacles. Our approach is workable for board buses as well.

Lim, K.-H.; Joo, D.; Kim, T.
An Optimal Allocation Algorithm of Adjustable Delay Buffers and Practical Extensions for Clock Skew Optimization in Multiple Power Mode Designs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461977

Satisfying a clock skew constraint is one of the most important tasks in clock tree synthesis. Moreover, the task becomes much harder to solve when the clock tree is designed in a multiple power mode environment, in which the voltage applied to some design module varies as the power mode changes. Recently, it has been shown that an adjustable delay buffer (ADB), whose delay can be tuned dynamically, can be used to solve the clock skew problem effectively under multiple power modes. However, due to the area or control overhead of ADBs, it is very important to minimize the number of ADBs to be allocated. This paper provides a complete solution to the problem of clock skew optimization using ADBs under multiple power modes. We propose a linear-time algorithm that simultaneously solves the problems of computing: 1) the minimum (optimal) number of ADBs to be used; 2) the location where each ADB is to be placed; and 3) the delay value of each ADB to be assigned to each power mode.
Experimental results show that, in comparison with the previous work, which iteratively performs the ADB allocation, placement, and value assignment, our integrated algorithm produces consistently better designs for all tested benchmarks; it reduces the number of ADBs by 9.27% on average under the skew bound of 30–50 ps, even with shorter clock latencies than those of the previous algorithm for ADB allocation, placement, and delay assignment. To make it practically feasible, we also propose a new ADB design technique and systematic algorithmic solutions to address the problems of discrete delay values, slew rate variation, nonzero initial ADB delay, and a possible exploration of ADB resizing.

Liu, W.; Calimera, A.; Macii, A.; Macii, E.; Nannarelli, A.; Poncino, M.
Layout-Driven Post-Placement Techniques for Temperature Reduction and Thermal Gradient Minimization
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461982

With the continuing scaling of CMOS technology, on-chip temperature and thermal-induced variations have become a major design concern. To effectively limit the high temperature in a chip equipped with a cost-effective cooling system, thermal specific approaches, besides low power techniques, are necessary at the chip design level. The high temperature in hotspots and large thermal gradients are caused by the high local power density and the nonuniform power dissipation across the chip. With the objective of reducing power density in hotspots, we propose two placement techniques that spread cells in hotspots over a larger area. Increasing the area occupied by the hotspot directly reduces its power density, leading to a reduction in peak temperature and thermal gradient. To minimize the introduced overhead in delay and dynamic power, we maintain the relative positions of the coupling cells in the new layout.
We compare the proposed methods in terms of temperature reduction, timing, and area overhead to the baseline method, which enlarges the circuit area uniformly. The experimental results showed that our methods achieve a larger reduction in both peak temperature and thermal gradient than the baseline method. The baseline method, although reducing peak temperature in most cases, has little impact on thermal gradient.

Wu, P.-H.; Lin, M. P.-H.; Chen, T.-C.; Ho, T.-Y.; Chen, Y.-C.; Siao, S.-R.; Lin, S.-H.
1-D Cell Generation With Printability Enhancement
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461981

As process technologies advance to the subwavelength era, the 1-D design style is regarded as one of the most effective ways to continue scaling down the minimum feature size. To improve the printability of 1-D cell design, it is essential to insert dummy patterns and optimize line-end gap distribution for each layer. This paper presents novel 1-D cell generation algorithms that simultaneously minimize 1-D cell area and enhance the printability. Experimental results show that the proposed algorithms can effectively and efficiently reduce the number of diffusion gaps, minimize used routing tracks, insert sufficient dummy patterns, and eliminate stage-like line-end gaps without power and timing overhead. Consequently, the 1-D cell area is minimized and the printability of the cell is enhanced. To the best of our knowledge, this is also the first work in the literature that considers line-end gap distribution during 1-D cell generation.

TEST

Basith, I. I.; Kandalaft, N.; Rashidzadeh, R.; Ahmadi, M.
Charge-Controlled Readout and BIST Circuit for MEMS Sensors
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461973

In this paper, we present a new readout circuit with an integrated built-in self-test (BIST) structure for capacitive microelectromechanical system (MEMS).
In the proposed solution, instead of commonly used voltage control signals to test the device, charge-controlled stimuli are employed to cover a wider range of structural defects. The proposed test solution eliminates the risk of structural collapse in the test phase for gap-varying parallel-plate MEMS devices. Measurement results using a prototype fabricated in TSMC 65-nm CMOS technology indicate that the proposed BIST scheme can successfully detect minor structural defects altering MEMS nominal capacitance.

Pomeranz, I.
Generation of Functional Broadside Tests for Logic Blocks With Constrained Primary Input Sequences
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461987

This paper describes a test generation procedure that produces functional broadside tests for logic blocks whose primary input sequences are constrained. The constraints are created during functional operation by logic blocks that drive the logic block under consideration. Functional broadside tests avoid overtesting of delay faults by creating functional operation conditions during the clock cycles where delay faults are detected. Test generation procedures for functional broadside tests typically assume that the primary input sequences are unconstrained during functional operation. This paper shows that the constraints, which are imposed by a logic block driving the primary inputs of another block, can be time dependent and difficult to represent compactly. The test generation procedure described in this paper addresses this issue by separating the problem of test generation into the generation of constrained primary input sequences for the block under consideration, and the extraction of functional broadside tests from these sequences.

VERIFICATION

Nanshi, K.; Somenzi, F.
Using Abstraction to Guide the Search for Long Error Traces
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461986

Model checking is a formal method for verifying whether a system satisfies a user-defined specification. Compared to simulation, model checking is restricted in capacity. On the other hand, simulation is weak in detecting bugs that require long and complex sequences of events to be exposed. This paper combines model checking and simulation in an abstraction-refinement scheme to mitigate the problems of both methods. Abstraction refinement iteratively constructs a simplified model to verify the original model. While a simplified model mitigates the weakness of model checking, the set of error traces of the simplified model helps guide simulation toward deep bugs. In abstraction refinement, concretization (a process of deriving an error trace in the original model from the abstract ones) is used to invalidate spurious abstract error traces or to refute a property. In this paper, we describe a novel concretization algorithm that combines simulation with satisfiability to efficiently refute properties with very long error traces.

Huang, S.-L.; Lin, W.-H.; Huang, P.-K.; Huang, C.-Y.
Match and Replace: A Functional ECO Engine for Multierror Circuit Rectification
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461976

Functional engineering change order (ECO) is a popular technique for rectifying design errors after synthesis and placement stages. We present a new approach to generating the patch circuits for multierror circuit rectification. In this paper, we propose a two-phase approach of: 1) discovering the functional matches in two circuits followed by 2) determining the final patch circuits from the matches. The ECO engine in this paper discovers functional and structural matches in two circuits by coordinating the SAT-sweeping and the cut-matching algorithms.
Then, the patch selection is conducted by the combinational equivalence checking technique and a linear-time selection heuristic. The experimental results on public benchmark and industrial circuits demonstrate that this ECO engine outperforms state-of-the-art interpolation-based engines.

SHORT PAPERS

Reviriego, P.; Pontarelli, S.; Maestro, J. A.; Ottavi, M.
A Method to Construct Low Delay Single Error Correction Codes for Protecting Data Bits Only
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461988

Error correction codes (ECCs) have been used for decades to protect memories from soft errors. Single error correction (SEC) codes that can correct a 1-bit error per word are a common option for memory protection. In some cases, SEC codes are extended to also provide double error detection and are known as SEC-DED codes. As technology scales, soft errors on registers have also become a concern and, therefore, SEC codes are used to protect registers. The use of an ECC impacts the circuit design in terms of both delay and area. Traditional SEC or SEC-DED codes developed for memories have focused on minimizing the number of redundant bits added by the code. This is important in a memory as those bits are added to each word in the memory. However, for registers used in circuits, minimizing the delay or area introduced by the ECC can be more important. In this paper, a method to construct low delay SEC or SEC-DED codes that correct errors only on the data bits is proposed. The method is evaluated for several data block sizes, showing that the new codes offer significant delay reductions when compared with traditional SEC or SEC-DED codes. The results for the area of the encoder and decoder also show substantial savings compared to existing codes.
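
For readers unfamiliar with SEC codes, here is a minimal sketch of the classic Hamming(7,4) code, which corrects any single-bit error in a 7-bit word. It is a textbook illustration only, not the low-delay construction proposed in the paper above, and the function names are hypothetical. Unlike the proposed codes, this one also locates errors in the parity bits, which is part of what a data-bits-only code can avoid.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword.

    Codeword positions 1..7 hold: p1, p2, d1, p4, d2, d3, d4.
    Illustrative textbook code, not the construction from the paper.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word):
    """Correct at most one flipped bit; return (data_bits, syndrome).

    A syndrome of 0 means no error was detected; otherwise it is the
    1-based position of the bit that was flipped back.
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        w[syndrome - 1] ^= 1  # flip the erroneous bit back
    return [w[2], w[4], w[5], w[6]], syndrome
```

For example, encoding the data bits [1, 0, 1, 1] gives the codeword [0, 1, 1, 0, 0, 1, 1]. If the bit at position 5 is then flipped, decoding returns the original data bits with syndrome 5.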