TCAD Newsletter - March 2010 Issue
Placing you one click away from the best new CAD research!

Regular Papers
==============

Zheng, H.; "Compositional Reachability Analysis for Efficient Modular Verification of Asynchronous Designs"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419238&isnumber=5419222
Abstract: Compositional verification is essential to address state explosion in model checking. Traditionally, an over-approximate context is needed for each individual component in a system for sound verification. This may cause state explosion for the intermediate results as well as inefficiency for abstraction refinement. This paper presents an opposite approach, a compositional reachability method, which constructs the state space of each component from an under-approximate context gradually until a counter-example is found or a fixpoint in state space is reached. This method has an additional advantage in that counter-examples, if there are any, can be found much earlier, thus leading to faster verification. Furthermore, this modular verification framework does not require complex compositional reasoning rules. The experimental results indicate that this method is promising.

Verma, A. K.; Brisk, P.; Ienne, P.; "Fast, Nearly Optimal ISE Identification With I/O Serialization Through Maximal Clique Enumeration"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419242&isnumber=5419222
Abstract: The last decade has witnessed the emergence of the application-specific instruction-set processor (ASIP) as a viable platform for embedded systems. Extensible ASIPs allow the user to augment a base processor with instruction set extensions (ISEs) that execute on dedicated hardware application-specific functional units (AFUs). Due to the limited number of read and write ports in the register file of the base processor, the size and complexity of AFUs are generally limited.
Recent papers have focused on overcoming these constraints by serializing access to the register file. Exhaustive ISE enumeration methods are not scalable and generally fail for larger applications and register files with a large number of read and write ports. To address this concern, a new approach to ISE identification is proposed. The approach presented in this paper significantly prunes the list of the best possible ISE candidates compared to previous approaches. Experimentally, we observe that the new approach produces optimal results on larger applications where prior approaches either fail or produce inferior results.

Lee, H.; Paik, S.; Shin, Y.; "Pulse Width Allocation and Clock Skew Scheduling: Optimizing Sequential Circuits Based on Pulsed Latches"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419229&isnumber=5419222
Abstract: Pulsed latches, latches driven by a brief clock pulse, offer the same convenience of timing verification and optimization as flip-flop-based circuits, while retaining the advantages of latches over flip-flops. But a pulsed latch that uses a single pulse width has a lower bound on its clock period, limiting its capacity to deal with higher frequencies or operate at lower $V_{dd}$. The limitation still exists even when clock skew scheduling is employed, since the amount of skew that can be assigned and realized is practically limited due to process variation. For the first time, we formulate the problem of allocating pulse widths, from a small set of predefined widths, and scheduling clock skews, within a predefined upper bound on skew, for optimizing pulsed latch-based sequential circuits. We then present an algorithm called PWCS_Optimize (pulse width allocation and clock skew scheduling, PWCS) to solve the problem. The allocated skews are realized through synthesis of local clock trees between pulse generators and latches, and a global clock tree between a clock source and pulse generators.
Experiments with 65-nm technology demonstrate that combining a small number of different pulse widths with clock skews of up to 10% of the clock period yields the minimum achievable clock period for many benchmark circuits. The results have an average figure of merit of 0.86, where 1.0 indicates the minimum clock period, and an average area reduction of 11%. The design flow including PWCS_Optimize, placement and routing, and synthesis of local and global clock trees is presented and assessed with example circuits.

Yan, J. Z.; Chu, C.; "DeFer: Deferred Decision Making Enabled Fixed-Outline Floorplanning Algorithm"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419234&isnumber=5419222
Abstract: In this paper, we present DeFer, a fast, high-quality, scalable, and nonstochastic fixed-outline floorplanning algorithm. DeFer generates a nonslicing floorplan by compacting a slicing floorplan. To find a good slicing floorplan, instead of searching through numerous slicing trees by simulated annealing as in traditional approaches, DeFer considers only one single slicing tree. However, we generalize the notion of slicing tree based on the principle of deferred decision making (DDM). When two subfloorplans are combined at each node of the generalized slicing tree, DeFer does not specify their orientations, the left-right/top-bottom order between them, or the slice line direction. DeFer does not even specify the slicing tree structure for small subfloorplans. In other words, we defer the decisions on these factors, which are specified arbitrarily at an early step in traditional approaches. Because of DDM, one slicing tree actually corresponds to a large number of slicing floorplan solutions, all of which are efficiently maintained in a single shape curve. With the final shape curve, it is straightforward to choose a good floorplan fitting into the fixed outline. Several techniques are also proposed to further optimize the wirelength.
For both fixed-outline and classical floorplanning problems, experimental results show that DeFer achieves the best success rate, the best wirelength, the best runtime, and the best area on average compared with all other state-of-the-art floorplanners.

Cabodi, G.; Garcia, L. A.; Murciano, M.; Nocco, S.; Quer, S.; "Partitioning Interpolant-Based Verification for Effective Unbounded Model Checking"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419228&isnumber=5419222
Abstract: Interpolant-based model checking has been shown to be effective on large verification instances, as it efficiently combines automated abstraction and reachability fixed-point checks. On the other hand, methods based on variable quantification have proved their ability to remove free inputs, thus projecting the search space over state variables. In this paper, we propose an integrated approach which combines the abstraction power of interpolation with techniques that rely on and-inverter graph (AIG) and/or binary decision diagram (BDD) representations of states, directly supporting variable quantification and fixed-point checks. The underlying idea of this combination is to adopt AIG- or BDD-based quantifications to limit and restrict the search space and the complexity of the interpolant-based approach. The exploited strategies, most of which are individually well known, are integrated with a new flavor, specifically designed to improve their effectiveness on difficult verification instances. Experimental results, specifically oriented to hard-to-solve verification problems, show the robustness of our approach.
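The reachability fixed-point check that both this paper and the Zheng paper above build on can be illustrated with a minimal sketch. This is not the authors' AIG/BDD machinery: plain Python sets stand in for symbolic state sets, the one-step `image` plays the role of relational product plus existential quantification, and the `reachable_fixpoint` name is invented for illustration.

```python
# Illustrative sketch only: explicit-state stand-in for the symbolic
# reachability fixed-point computation described in the abstracts above.

def image(states, transitions):
    """One-step image: all successors of `states`.
    Symbolically this is the relational product followed by existential
    quantification of current-state variables; sets make it explicit."""
    return {t for s in states for t in transitions.get(s, ())}

def reachable_fixpoint(init, transitions, bad):
    """Iterate images until no new states appear (fixpoint) or a bad
    state is reached (counterexample exists)."""
    reached = set(init)
    frontier = set(init)
    while frontier:
        if frontier & bad:
            return "counterexample", reached
        frontier = image(frontier, transitions) - reached
        reached |= frontier
    return "fixpoint", reached  # reached set is closed: property holds
```

For example, on a three-state ring `{0: [1], 1: [2], 2: [0]}` starting from state 0, the loop closes at the fixpoint `{0, 1, 2}` when the bad states are unreachable, and reports a counterexample as soon as the frontier touches a bad state.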
Chen, M.; Mishra, P.; "Functional Test Generation Using Efficient Property Clustering and Learning Techniques"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419232&isnumber=5419222
Abstract: Functional verification is one of the major bottlenecks in system-on-chip design due to the combined effects of increasing complexity and the lack of automated techniques for generating efficient tests. Several promising ideas using bounded model checking have been proposed over the years to efficiently generate counterexamples (tests). Existing research has used incremental satisfiability to improve counterexample generation for a single property by sharing knowledge across instances of the same property with incremental bounds. In this paper, we present a framework that can efficiently reduce the overall test generation time by exploiting the similarity among different properties. This paper makes two primary contributions: 1) it proposes novel methods to cluster similar properties; and 2) it develops efficient learning techniques that can significantly reduce the overall test generation time for the properties in a cluster by sharing knowledge across similar test generation instances. Our experimental results using both software and hardware benchmarks demonstrate that our approach can drastically reduce (on average three to five times) the overall test generation time compared to existing methods.

Kinsman, A. B.; Nicolici, N.; "Bit-Width Allocation for Hardware Accelerators for Scientific Computing Using SAT-Modulo Theory"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419231&isnumber=5419222
Abstract: This paper investigates the application of computational methods via Satisfiability Modulo Theory (SMT) to the bit-width allocation problem for finite-precision implementation of numerical calculations, specifically in the context of scientific computing where division frequently occurs.
In contrast to the large body of work targeted at the precision aspect of the problem, this paper addresses the range problem, where employing SMT leads to more accurate bound estimation than other analytical methods provide, in turn yielding smaller bit-widths, and hence a reduction in hardware cost and/or increased parallelism, while maintaining the robustness necessary for scientific applications.

Hashemi, M.; Ghiasi, S.; "Versatile Task Assignment for Heterogeneous Soft Dual-Processor Platforms"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419233&isnumber=5419222
Abstract: Heterogeneous soft multiprocessor systems are likely to find a larger share of the application-specific computing market due to increasing cost and defect rates in foreseeable manufacturing technologies. We study the problem of mapping streaming applications onto heterogeneous soft dual-processor systems, in which the processors' limited memory resources and application throughput form the outstanding constraint and objective, respectively. A key step in the compilation process is task assignment, where tasks are assigned to the processors. We develop a provably-effective algorithm for task assignment. Our algorithm is versatile, in that its formal properties hold for, and hence it is applicable to, a variety of platforms. Measurements of generated code size and throughput of emulated systems validate the effectiveness of our approach. We advance the state of the art by considerably outperforming two recent competitors in terms of both versatility and application throughput.

Palesi, M.; Kumar, S.; Catania, V.; "Leveraging Partially Faulty Links Usage for Enhancing Yield and Performance in Networks-on-Chip"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419230&isnumber=5419222
Abstract: The communication infrastructure of a complex multicore system-on-a-chip accounts for an increasing fraction of the overall chip area.
According to the International Technology Roadmap for Semiconductors, killer defect density does not decrease over successive technology generations. For this reason, the probability that a manufacturing defect affects the communication system is predicted to increase. In this paper, we deal with manufacturing defects which affect the links in a network-on-chip-based interconnection system. The goal of this paper is to show that, by using effective routing functions, supported by appropriate selection policies and with a limited amount of extra logic in the router, it is easy to exploit partially faulty links to improve the performance of the system. We show that partially faulty links, instead of being discarded, can be used at reduced capacity to improve the distribution of traffic over the network, yielding performance and power improvements. We couple an application-specific routing function with a set of selection policies which are aware of the link fault distribution, and evaluate them on both synthetic traffic and a real complex multimedia application. We also present an implementation of the router, augmented with the extra logic, to support both the proposed selection functions and the transmission of messages over partially faulty links. We analyze the router in terms of silicon area, timing, and power dissipation.

Mukherjee, N.; Pogiel, A.; Rajski, J.; Tyszer, J.; "High Volume Diagnosis in Memory BIST Based on Compressed Failure Data"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419237&isnumber=5419222
Abstract: Embedded memories are increasingly identified as having the potential for introducing new yield loss mechanisms at a rate, magnitude, and complexity large enough to demand major changes in fault diagnosis techniques. In particular, time-related or complex read faults that originate in the highest-density areas of semiconductor designs require new methods to diagnose more complex faults affecting large groups of memory cells.
This paper presents a built-in self-test (BIST)-based fault diagnosis scheme that can be used to identify a variety of failures in embedded random-access memory arrays. The proposed solution employs flexible test logic to record test responses at system speed with no interruption of a BIST session. It offers a simple test flow and enables detection of time-related faults. Furthermore, the way test responses are processed allows accurate and time-efficient reconstruction of error bitmaps. The proposed diagnostic algorithms use a number of techniques, including discrete logarithm-based counting, with ring generators acting as very fast event counters and signature analyzers. Experimental results confirm the high diagnostic accuracy of the proposed scheme and its time efficiency.

Pomeranz, I.; Reddy, S. M.; "TOV: Sequential Test Generation by Ordering of Test Vectors"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419236&isnumber=5419222
Abstract: We describe a new approach to test generation for stuck-at faults in synchronous sequential circuits. Under this approach, the input vectors comprising the test sequence are fixed in advance. The process of generating the test sequence consists of ordering the precomputed input vectors such that the resulting test sequence has as high a fault coverage as possible. The advantage of this approach is that its computational complexity is limited by restricting the search space to a given set of input vectors and a given test sequence length. We describe a specific implementation of this approach. Experimental results demonstrate that restricting the search space to a fixed number of precomputed input vectors is sufficient for achieving the highest known fault coverage, or a fault coverage close to it, for benchmark circuits.
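The ordering idea in TOV can be sketched with a toy, stateless fault model. This is not the paper's implementation: in a sequential circuit, which faults a vector detects depends on the circuit state and hence on the order itself, so the real method re-simulates at each step. Here the `order_vectors` and `detects` names are hypothetical and detection is a fixed set per vector, which reduces the ordering step to a greedy set cover.

```python
# Illustrative sketch only (hypothetical, stateless fault model): greedily
# order a fixed pool of precomputed input vectors so each position detects
# as many yet-undetected faults as possible.

def order_vectors(vectors, detects):
    """vectors: list of vector ids; detects: id -> set of faults detected.
    Returns the chosen order and any faults left undetected."""
    remaining = list(vectors)
    undetected = set().union(*(detects[v] for v in vectors))
    order = []
    while remaining and undetected:
        # pick the vector covering the most yet-undetected faults
        best = max(remaining, key=lambda v: len(detects[v] & undetected))
        if not detects[best] & undetected:
            break  # no remaining vector improves coverage
        order.append(best)
        remaining.remove(best)
        undetected -= detects[best]
    return order, undetected
```

Because the search space is a fixed pool of vectors and a bounded sequence length, the cost of each step is bounded by the pool size, which mirrors the complexity argument made in the abstract.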
Das, D.; Killpack, K.; Kashyap, C.; Jas, A.; Zhou, H.; "Pessimism Reduction in Coupling-Aware Static Timing Analysis Using Timing and Logic Filtering"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419240&isnumber=5419222
Abstract: With the continued scaling of technology into nanometer regimes, the impact of coupling-induced delay variations is significant. While several coupling-aware static timers have been proposed, the results are often pessimistic, with many false failures. We present an integrated, iterative timing-filtering and logic-filtering based approach to reduce pessimism. We use a realistic coupling model based on arrival times and slews, and show that non-iterative pessimism reduction algorithms proposed in previous research may give potentially non-conservative timing results. On a functional block from an industrial 65 nm microprocessor, our algorithm produced a maximum pessimism reduction of 11.18% of cycle time over converged timing-filtering analysis that does not consider logic constraints.

Karfa, C.; Sarkar, D.; Mandal, C.; "Verification of Datapath and Controller Generation Phase in High-Level Synthesis of Digital Circuits"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419243&isnumber=5419222
Abstract: A formal verification method for the datapath and controller generation phase of a high-level synthesis (HLS) process is described in this paper. The goal is achieved in two steps. In the first step, the datapath interconnection and the controller finite state machine description generated by a high-level synthesis process are analyzed to obtain the register-transfer operations executed in the datapath for a given control assertion pattern in each control step. In the second step, an equivalence checking method is deployed to establish equivalence between the input and output behaviors of this phase. A rewriting method has been developed for the first step.
Unlike many other reported techniques, our method is capable of validating pipelined and multicycle operations, if any, spanning several states. The correctness and complexity of the presented method have been treated formally. The method is implemented and integrated with an existing HLS tool, called the structured architecture synthesis tool. The experimental results on several HLS benchmarks indicate the effectiveness of the presented method.

Short Papers
============

Musoll, E.; "Hardware-Based Load Balancing for Massive Multicore Architectures Implementing Power Gating"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419241&isnumber=5419222
Abstract: Many-core architectures provide a computation platform with high execution throughput, enabling them to efficiently execute workloads with a significant degree of thread-level parallelism. The burst-like nature of these workloads allows large power savings by power gating the idle cores. In addition, the load balancing of threads to cores also impacts the power and thermal behavior of the processor. Processor implementations of many-core architectures may choose to group several cores into clusters sharing the area overhead, so that the whole cluster is power gated rather than the individual cores. However, the potential for power savings is reduced due to the coarser level of power gating. In this paper, several hardware-based stateless load-balancing schemes are evaluated for these clustered homogeneous multicore architectures in terms of their power and thermal behavior. All these methods can be unified into a parameterized technique that dynamically adjusts to obtain the desired goal (lower power, higher performance, or lower hotspot temperature).

Choi, S. H.; Kang, K.; Dartu, F.; Roy, K.; "Timed Input Pattern Generation for an Accurate Delay Calculation Under Multiple Input Switching"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419235&isnumber=5419222
Abstract: In multiple input switching (MIS) analysis, input signal alignment is one of the key factors that determines the quality and accuracy of the approach. In this paper, we propose a new signal alignment methodology for MIS analysis based on a transistor-level simulator at the core of the static timing analysis. Our proposed methodology searches through the possible input vectors in an efficient order to reduce the number of simulations, and finds a true worst-case signal alignment for both the MIN and the MAX analysis. In our 180 nm simulation setup, the worst-case delay is predicted within 0.5% error for more than 97% of test cases, performing an average of fewer than two simulations per logic gate.

Pomeranz, I.; Reddy, S. M.; "On Test Generation With Test Vector Improvement"
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419239&isnumber=5419222
Abstract: We investigate the introduction of a new step, referred to as test vector improvement, into test generation processes. After a fully specified test vector or a partially specified test cube $t$ is generated at an arbitrary iteration of the test generation process, the test vector improvement step modifies $t$ so as to increase the number of yet-undetected target faults that $t$ detects. This is done in this paper using a simulation-based process. We show that even if $t$ was generated using dynamic test compaction heuristics, it is possible to improve $t$ further. When $t$ is partially specified to accommodate test data compression, the test vector improvement step does not change the number of unspecified bits of $t$. The final result is a smaller test set and/or a higher fault coverage (if the test generation process does not detect all the detectable faults).
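A simulation-based improvement step of this flavor can be sketched as a greedy bit-flipping loop. This is only an illustration under an assumed toy detection function: `detected_by` is hypothetical (a real flow would call a fault simulator), and per the abstract, unspecified bits of the test cube are left untouched so the count of unspecified bits, and hence test data compression, is unaffected.

```python
# Illustrative sketch only (toy detection function, assumed): flip one
# specified bit at a time and keep the flip whenever the vector then
# detects more yet-undetected faults.

def improve_vector(t, detected_by, undetected):
    """t: list of bits 0/1/None, where None marks an unspecified bit that
    must stay unspecified; detected_by(vector) -> set of detected faults."""
    best = list(t)
    best_gain = len(detected_by(best) & undetected)
    improved = True
    while improved:
        improved = False
        for i, bit in enumerate(best):
            if bit is None:
                continue  # never specify an unspecified bit
            trial = list(best)
            trial[i] = 1 - bit  # flip one specified bit
            gain = len(detected_by(trial) & undetected)
            if gain > best_gain:
                best, best_gain, improved = trial, gain, True
    return best
```

The loop terminates because the count of newly detected faults strictly increases with each accepted flip and is bounded by the size of the undetected-fault set.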