TCAD Newsletter - October 2010 Issue Placing you one click away from the best new CAD research! SPECIAL SECTION ON THE ACM-IEEE INTERNATIONAL CONFERENCE ON FORMAL METHODS AND MODELS FOR CODESIGN (MEMOCODE) 2009 Chung, E. S.; Hoe, J. C., "High-Level Design and Validation of the BlueSPARC Multithreaded Processor" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580230&isn... Abstract: This paper presents our experiences in using high-level methods to design and validate a 16-way multithreaded microprocessor called BlueSPARC. BlueSPARC is an in-order, high-throughput processor supporting complex features such as privileged-mode operations, memory management, and a nonblocking cache subsystem. Using a high-level design language called Bluespec System Verilog (BSV), our final implementation achieves comparable synthesis quality to a similar commercial microprocessor developed using conventional register transfer level flows, and is capable of running unmodified commercial applications while hosted on a Xilinx XCV2P70 field-programmable gate array (FPGA) at 90 MHz. To validate our implementation, an FPGA-accelerated approach was developed to efficiently check the correct execution of real, nondeterministic multithreaded programs running on the BlueSPARC processor. Together, the high-level language features of BSV along with our validation approach enabled us to achieve a working FPGA-based implementation in less than one man-year. Vasudevan, N.; Edwards, S. A., "Buffer Sharing in Rendezvous Programs" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580213&isn... Abstract: Most compilers focus on optimizing performance, often at the expense of memory, but efficient memory use can be just as important in constrained environments such as embedded systems. This paper presents a memory reduction technique for rendezvous communication, which is applied to the deterministic concurrent programming language SHIM. It focuses on reducing memory consumption by sharing communication buffers among tasks. It determines pairs of buffers that can never be in use simultaneously and use a shared region of memory for each pair. The technique produces a static abstraction of a SHIM program's dynamic behavior, which is then analyzed to find buffers that are never occupied simultaneously. Experiments show the technique runs quickly on modest-sized programs and can sometimes reduce memory requirements by half. Briand, X.; Jeannet, B., "Combining Control and Data Abstraction in the Verification of Hybrid Systems" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580228&isn... Abstract: This paper addresses the verification of hybrid systems built as the composition of a discrete software controller interacting with a physical environment exhibiting a continuous behavior. The goal is to attack the problem of the combinatorial explosion of discrete states that may happen if a complex software controller is considered. It proposes as a solution to extend an existing abstract interpretation technique, namely dynamic partitioning, to hybrid systems described in a symbolic formalism. Dynamic partitioning allows us finely tune the tradeoff between precision and efficiency in a reachability analysis. It shows the effectiveness of the approach by a case study that combines a nontrivial controller specified in the synchronous dataflow programming language Lustre with its physical environment. Bohm, P., "Incremental and Verified Modeling of the PCI Express Protocol" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580232&isn... Abstract: PCI Express is a modern, high-performance communication protocol implementing sophisticated features to meet today's performance demands. Although an off-chip protocol, PCI Express implements many principles of future on-chip communication architectures. It is a highly complex protocol that is hard to verify formally. We present the application of a new approach to the PCI Express transaction and data link layers. The methodology is based on a series of model transformation steps and revises the traditional modeling and verification workflow for designing on-chip protocols. Major parts of PCI Express, including performance-related optimizations and fault-tolerance features, are modeled incrementally to control the complexity and composed to a single model. The work has been accomplished in the Isabelle/HOL theorem prover. By restricting the models to an executable subset of the specification language, we have been able to combine the advantages of specifying in a theorem prover with the advantages of executable models in a functional programming language. FPGAS AND RECONFIGURABLE COMPUTING Lin, M.; Wawrzynek, J.; Gamal, A. E., "Exploring FPGA Routing Architecture Stochastically" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580223&isn... Abstract: This paper proposes a systematic strategy to efficiently explore the design space of field-programmable gate array (FPGA) routing architectures. The key idea is to use stochastic methods to quickly locate near-optimal solutions in designing FPGA routing architectures without exhaustively enumerating all design points. The main objective of this paper is not as much about the specific numerical results obtained, as it is to show the applicability and effectiveness of the proposed optimization approach. To demonstrate the utility of the proposed stochastic approach, we developed the tool for optimizing routing architecture (TORCH) software based on the versatile place and route tool. Given FPGA architecture parameters and a set of benchmark designs, TORCH simultaneously optimizes the routing channel segmentation and switch box patterns using the performance metric of average interconnect power-delay product estimated from placed and routed benchmark designs. Special techniques--such as incremental routing, infrequent placement, multi-modal move selection, and parallelized metric evaluation - are developed to reduce the overall run time and improve the quality of results. Our experimental results have shown that the stochastic design strategy is quite effective in co-optimizing both routing channel segmentation and switch patterns. With the optimized routing architecture, relative to the performance of our chosen architecture baseline, TORCH can achieve average improvements of 24% and 15% in delay and power consumption for the 20 largest Microelectronics Center of North Carolina benchmark designs, and 27% and 21% for the eight benchmark designs synthesized with the Altera Quartus II University Interface Program tool. Additionally, we found that the average segment length in an FPGA routing channel should decrease with technology scaling. Finally, we demonstrate the versatility of TORCH by illustrating how TORCH can be used to optimize other aspects of the routing architecture in an FPGA. Regular Papers ============== MODELING AND SIMULATION Silva, J. M. S.; Phillips, J. R.; Silveira, L. M., "Efficient Simulation of Power Grids" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580215&isn... Abstract: Modern deep sub-micron ultra-large scale integration designs with hundreds of millions of devices require huge grids for power distribution. Such grids, operating with decreasing power supply voltages, are a design limiting factor and accurate analysis of their behavior is of paramount importance as any voltage drops can seriously impact performance or functionality. As power grid models have millions of unknowns, highly optimized special-purpose simulation tools are required to handle the time and memory complexity of solving for their dynamic behavior. In this paper, we propose a hierarchical matrix representation of the power grid model that is both space and time efficient. With this representation, reduced storage matrix factors are efficiently computed and applied in the analysis at every time-step of the simulation. Results show an almost linear complexity growth, namely O(n log^{a}(n)), for some small constant "a", in both space and time, when using this matrix representation. Comparisons of our academic implementation with production-quality code prove this method to be very efficient when dealing with the simulation of large power grid models. Chakraborty, A.; Shi, S. X.; Pan, D. Z., "Stress Aware Layout Optimization Leveraging Active Area Dependent Mobility Enhancement" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580229&isn... Abstract: Starting from the 90 nm technology node, process induced stress has played a key role in the design of high-performance devices. The emergence of source/drain silicon germanium (S/D SiGe) technique as the most important stressing mechanism for p-channel metal-oxide-semiconductor field-effect transistor devices has opened up various optimization possibilities at circuit and physical design stage. In this paper, we exploit the active area dependence of the performance improvement achievable using S/D SiGe technology for late stage engineering change order (ECO) timing optimization. An active area sizing aware cell-level delay model is derived which forms the basis of linear program based optimization of a design for achieving maximum performance or target performance under a timing budget. To control the magnitude of layout perturbation and ensure predictable timing improvement, a set of physical constraints for active area sizing is proposed. Further, an efficient minimum movement legalization algorithm is proposed to remove the overlaps caused by active area sizing of timing critical cells. Results on a wide variety of benchmarks show consistent reduction in the cycle time by up to 6.3%. Predictability of the performance improvement achievable as well as resultant minuscule layout changes make our technique very attractive for late stage ECO optimization and design closure. PHYSICAL DESIGN Lu, Y.; Zhou, H.; Shang, L.; Zeng, X., "Multicore Parallelization of Min-Cost Flow for CAD Applications" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580222&isn... Abstract: Computational complexity has been the primary challenge of many very large scale integration computer-aided design (CAD) applications. The emerging multicore and many-core microprocessors have the potential to offer scalable performance improvements. How to explore the multicore resources to speed up CAD applications is thus a natural question but also a huge challenge for CAD researchers. This paper proposes a methodology to explore concurrency via nondeterministic transactional models, and to program them on multicore processors for CAD applications. Various run-time scheduling implementations on multicore shared-memory machines are discussed and the most efficient one is identified. The proposed methodology is applied to the min-cost flow problem which has been identified as the key problem in many design optimizations, from wire-length optimization in detailed placement to timing-constrained voltage assignment. A concurrent algorithm for min-cost flow has been developed based on the methodology Experiments on voltage island generation in floorplanning have demonstrated its efficiency and scalable speedup over different numbers of cores. SYSTEM-LEVEL DESIGN Arjomand, M.; Sarbazi-Azad, H., "Power-Performance Analysis of Networks-on-Chip With Arbitrary Buffer Allocation Schemes" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580217&isn... Abstract: End-to-end delay, throughput, energy consumption, and silicon area are the most important design metrics of networks-on-chip (NoCs). Although several analytical models have been previously proposed for predicting such metrics in NoCs, very few of them consider the effect of message waiting time in the buffers of network routers for predicting overall power consumptions and none of them consider structural heterogeneity of network routers. This paper introduces two inter-related analytical models to compute message latency and power consumption of NoCs with arbitrary topology, buffering structure, and routing algorithm. Buffer allocation scheme defines the buffering space for each individual channel of the NoC that can be homogenous (all channels having similar buffer structures) or heterogeneous (each channel having its own buffer structure). Here, the buffer allocation scheme can be either homogenous or heterogeneous. We assume no bandwidth sharing of virtual channels for a physical channel, and IP cores generate messages following a Poisson distribution. The results obtained from simulation experiments confirm that the proposed models exhibit acceptable accuracy for different network configurations operating under various working conditions. We have shown that basing our analysis on a Poisson traffic model is still useful for scenarios with real application workloads. Jang, W.; Pan, D. Z., "An SDRAM-Aware Router for Networks-on-Chip" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580227&isn... Abstract: Networks-on-chip (NoCs) may interface with lots of synchronous dynamic random access memories (SDRAM) to provide enough memory bandwidth and guaranteed quality-of-service for future systems-on-chip (SoCs). SDRAM is commonly controlled by a memory subsystem that schedules memory requests to improve memory efficiency and latency. However, a memory subsystem is still a performance bottleneck in the entire NoC. Therefore, memory-aware NoC optimization has attracted considerable attention. This paper presents a NoC router with an explicit SDRAM-aware flow control. Based on priority-based arbitration, our SDRAM-aware flow controller schedules memory requests to prevent bank conflict, data contention, and short turn-around bank interleaving. Moreover, our multi-stage scheduling scheme further improves memory performance and saves NoC hardware costs. Experimental results show that our cost-efficient SDRAM-aware NoC design significantly improves memory latency and utilization compared to the conventional NoC design with no SDRAM-aware router. Sharifi, S.; Rosing, T. S., "Accurate Direct and Indirect On-Chip Temperature Sensing for Efficient Dynamic Thermal Management" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580218&isn... Abstract: Dynamic thermal management techniques require accurate runtime temperature information in order to operate effectively and efficiently. In this paper, we propose two novel solutions for accurate sensing of on-chip temperature. Our first technique is used at design time for sensor allocation and placement to minimize the number of sensors while maintaining the desired accuracy. The experimental results show that this technique can improve the efficiency and accuracy of sensor allocation and placement compared to previous work and can reduce the number of required thermal sensors by about 16% on average. Secondly, we propose indirect temperature sensing to accurately estimate the temperature at arbitrary locations on the die based on the noisy temperature readings from a limited number of sensors which are located further away from the locations of interest. Our runtime technique for temperature estimation reduces the standard deviation and maximum value of temperature estimation errors by an order of magnitude. Yang, H.; Kim, S.; Ha, S., "An MILP-Based Performance Analysis Technique for Non-Preemptive Multitasking MPSoC" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580225&isn... Abstract: For real-time applications, it is necessary to estimate the worst-case performance early in the design process without actual hardware implementation. While the non-preemptive task scheduling is pertinent to multi-core platforms because of easy implementation and high performance, its scheduling anomaly behavior makes the worst-case performance estimation extremely difficult. In this paper, we propose an analysis technique based on mixed integer linear programming (MILP) to estimate the worst-case performance of each task in a non-preemptive multitask application on multi-processor system-on-chip architecture. MILP provides a systematic way to describe the complex interaction among task scheduling, communication architecture, and task execution, which affects the worst-case behavior dynamically. The proposed analysis technique overcomes several limitations that previous work usually has; it allows multiple tasks with different periods and models contention on the communication architecture. We show that the proposed analysis takes affordable computation time to make it of practical value even though it has exponential complexity in theory. The proposed technique estimates a safe bound on task latency statistically, which is demonstrated by extensive random simulations. TEST Miskov-Zivanov, N.; Marculescu, D., "Multiple Transient Faults in Combinational and Sequential Circuits: A Systematic Approach" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580219&isn... Abstract: Transient faults in logic circuits are becoming an important reliability concern for future technology nodes. Radiation-induced faults have received significant attention in recent years, while multiple transients originating from a single radiation hit are predicted to occur more often. Furthermore, some effects, like reconvergent fanout-induced glitches, are more pronounced in the case of multiple faults. Therefore, to guide the design process and the choice of circuit optimization techniques, it is important to model multiple faults and their propagation through logic circuits, while evaluating the changes in error rates resulting from multiple simultaneous faults. In this paper, we show how output error probabilities change with increasing number of simultaneous faults and we also analyze the impact of multiple errors in state flip-flops, during the cycles following the cycle when fault(s) occurred. The results obtained using the proposed framework show that output error probability resulting from multiple-event transient or multiple-bit upsets can vary across different outputs and different circuits by several orders of magnitude. The results also show that the impact of different masking factors also varies across circuits and this information can be valuable for customizing protection techniques. Tseng, T.W.; Huang, Y.J.; Li, J.F., "DABISR: A Defect-Aware Built-In Self-Repair Scheme for Single/Multi-Port RAMs in SoCs" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580226&isn... Abstract: Built-in self-repair (BISR) techniques are widely used to enhance the yield of embedded random access memories (RAMs). Fault-location ability of test algorithms executed by a BISR circuit has heavy impact on the repair efficiency of the BISR circuit. This paper proposes a defect-aware BISR (DABISR) scheme for single-port RAMs (SPRAMs) and multi-port RAMs (MPRAMs) in system chips. Multiple RAMs can share a DABISR such that the area cost of DABISR is drastically reduced. We also present two defect-location algorithms (DLAs) for identification of bridge defects between word-lines and bit-lines of MPRAMs. The DABISR can perform DLAs to locate bridge defects such that it can provide high repair efficiency. For example, simulation results show that if a faulty two-port RAM has 20% inter-port faults, the DLAs can help to gain 8.4-14.4% increase of repair rate for different redundancy configurations. In comparison with an existing shared BISR scheme, however, the DABISR only incurs about 0.34% additional area overhead to support the function of DLAs. Short Papers ============ Tenentes, V.; Kavousianos, X.; Kalligeros, E., "Single and Variable-State-Skip LFSRs: Bridging the Gap Between Test Data Compression and Test Set Embedding for IP Cores" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580216&isn... Abstract: Even though test set embedding (TSE) methods offer very high compression efficiency, their excessively long test application times prohibit their use for testing systems-on-chip (SoC). To alleviate this problem we present two new types of linear feedback shift registers (LFSRs), the Single-State-Skip and the Variable-State-Skip LFSRs. Both are normal LFSRs with the addition of the State-Skip circuit, which is used instead of the characteristic-polynomial feedback structure for performing successive jumps of constant and variable length in their state sequence. By using Single-State-Skip LFSRs for testing single or multiple identical cores and Variable-State-Skip LFSRs for testing multiple non-identical cores we get the well-known high compression efficiency of TSE with substantially reduced test sequences, thus bridging the gap between test data compression and TSE methods. Lo, C.Y.; Hsing, Y.T.; Denq, L.M.; Wu, C.W., "SOC Test Architecture and Method for 3-D ICs" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580221&isn... Abstract: 3-D integration provides another way to put more devices in a smaller footprint. However, it also introduces new challenges in testing. Flexible test architecture named test access control system for 3-D integrated circuits (TACS-3D) is proposed for 3-D integrated circuits (IC) testing. Integration of heterogeneous design-for-testability methods for logic, memory, and through-silicon via (TSV) testing further reduces the usage of test pins and TSVs. To highly reuse pre-bond test circuits in post-bond test, an innovative linking mechanism shares TSVs and test pins of the 3-D IC. No matter how many layers are there in the 3-D IC, a large portion of TSVs and test pins is reserved for data application. Therefore, smaller post-bond test time is expected. A test chip composed of a network security processor platform is taken as an example. Less than 0.4% test overhead increases in area and time between 2-D and 3-D cases. Compared with the instinctively direct access, TACS-3D reveals up to 54% test time improvement under the same TSV usage. Moiseev, K.; Kolodny, A.; Wimer, S., "Interconnect Bundle Sizing Under Discrete Design Rules" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580220&isn... Abstract: The lithography used for 32 nm and smaller very large scale integrated process technologies restricts the admissible interconnect widths and spaces to a small set of discrete values with some interdependencies, so that traditional interconnect sizing by continuous-variable optimization techniques becomes impossible. We present a dynamic programming (DP) algorithm for simultaneous sizing and spacing of all wires in interconnect bundles, yielding the optimal power-delay tradeoff curve. DP algorithm sets the width and spacing of all interconnects simultaneously, thus finding the global optimum. The DP algorithm is generic and can handle a variety of power-delay objectives, such as total power or delay, or weighted sum of both, power-delay product, max delay, and alike. The algorithm consistently yields 6% dynamic power and 5% delay reduction for interconnect channels in industrial microprocessor blocks designed in 32 nm process technology, when applied as a post-layout optimization step to redistribute wires within interconnect channels of fixed width, without changing the area of the original layout. Liu, F.; Song, X.; Tan, Q.; Chen, G., "Formal Analysis of End-Around-Carry Adder in Floating-Point Unit" URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5580224&isn... Abstract: End-around-carry (EAC) adder is extensively used in microprocessor's floating-point units. This paper presents an algebraic characterization of an EAC adder's algorithm. A novel symbolic analysis approach is presented to prove the EAC adder's correctness. Algebraic structures and first-order recursive equations are harnessed in proof derivations. A hybrid prefix/EAC architecture is considered.