January 2012 Newsletter

Placing you one click away from the best new CAD research!
Plain-text version at http://www.umn.edu/~tcad/newsletter/2012-01.txt

WE WISH YOU A HAPPY NEW YEAR AND ALL THE BEST FOR 2012!

Editorial
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6106728

List of Reviewers for 2011
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6106727

SPECIAL SECTION ON PAR-CAD: PARALLEL CAD ALGORITHMS AND CAD FOR PARALLEL ARCHITECTURES/SYSTEMS

Marculescu, D.; Li, P.
Guest Editorial
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6106736

Kapre, N.; DeHon, A.
SPICE2: Spatial Processors Interconnected for Concurrent Execution for Accelerating the SPICE Circuit Simulator Using an FPGA
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106733

Spatial processing of sparse, irregular, double-precision floating-point computation using a single field-programmable gate array (FPGA) enables up to an order-of-magnitude speedup (mean of 2.8x) over a conventional microprocessor for the SPICE circuit simulator. We develop a parallel, FPGA-based, heterogeneous architecture customized for accelerating the SPICE simulator to deliver this speedup. To properly parallelize the complete simulator, we decompose SPICE into its three constituent phases (model evaluation, sparse matrix solve, and iteration control) and customize a spatial architecture for each phase independently. Our heterogeneous FPGA organization mixes very-long-instruction-word, dataflow, and streaming architectures into a cohesive, unified design to match the parallel patterns exposed by our programming framework. This FPGA architecture is able to outperform conventional processors due to a combination of factors, including high utilization of statically scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and streaming, overlapped processing of the control algorithms.
We demonstrate that we can independently accelerate model evaluation by a mean factor of 6.x (1.4-23x) across a range of nonlinear device models, and matrix solve by 2.4x (0.6-13x) across various benchmark matrices, while delivering a mean combined speedup of 2.8x (0.2-11x) for the composite design when comparing a Xilinx Virtex-6 LX760 (40 nm) with an Intel Core i7 965 (45 nm). We also estimate mean energy savings of 8.9x (up to 40.9x) for the same pair of devices.

Sridhar, A.; Vincenzi, A.; Ruggiero, M.; Atienza, D.
Neural Network-Based Thermal Simulation of Integrated Circuits on GPUs
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106739

With the rising challenges in heat removal in integrated circuits (ICs), the development of thermal-aware computing architectures and run-time management systems has become indispensable to the continuation of IC design scaling. These thermal-aware design technologies of the future strongly depend on the availability of efficient and accurate means of thermal modeling and analysis. These thermal models must have not only sufficient accuracy to capture the complex mechanisms that regulate thermal diffusion in ICs, but also a level of abstraction that allows their fast execution for design space exploration. In this paper, we propose an innovative full-chip thermal modeling approach that can handle the scalability problem of transient heat flow simulation in large 2-D/3-D multiprocessor ICs. This is achieved by parallelizing the computation-intensive task of transient temperature tracking using neural networks and exploiting the computational power of massively parallel graphics processing units. Our results show up to 35x run-time speedup compared to state-of-the-art IC thermal simulation tools while keeping the error below 1 °C. The speedups scale with the size of the 3-D multiprocessor ICs, and our proposed method serves as a valuable design space exploration tool.

Villena, J. F.; Silveira, L.
Exploiting Parallelism for Improved Automation of Multidimensional Model Order Reduction
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106740

This paper addresses the issue of automatically generating reduced-order models of very large multidimensional systems. To tackle this problem, we introduce an efficient parallel projection-based model order reduction framework for parameterized linear systems. The underlying methodology is based on an automated multidimensional sample selection procedure that maximizes effectiveness in the generation of the projection basis. The parallel nature of the algorithm is efficiently exploited on both shared- and distributed-memory architectures. This leads to a highly scalable, automatic, and reliable parallel reduction scheme, able to handle very large systems depending on multiple parameters. In addition, the framework is general enough to provide a good approximation regardless of the model's representation or underlying nature, as demonstrated on a variety of benchmark examples. The method provides the potential to tackle, in an automatic fashion, extremely challenging models that would otherwise be difficult to address with existing sequential approaches.

Kim, M.-C.; Lee, D.-J.; Markov, I. L.
SimPL: An Effective Placement Algorithm
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106735

We propose a self-contained, flat, quadratic global placer that is simpler than existing placers and easier to integrate into timing-closure flows. It maintains lower-bound and upper-bound placements that converge to a final solution. The upper-bound placement is produced by a novel look-ahead legalization algorithm. Our placer, SimPL, outperforms mPL6, FastPlace3, NTUPlace3, APlace2, and Capo simultaneously in runtime and solution quality, running 7.10 times faster than mPL6 (when using a single thread) and reducing wirelength by 3% on the ISPD 2005 benchmark suite.
More significant improvements are achieved on larger benchmarks. The new algorithm is amenable to parallelism, and we report empirical studies with SSE2 instructions and up to eight parallel threads.

Gort, M.; Anderson, J. H.
Accelerating FPGA Routing Through Parallelization and Engineering Enhancements
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106729

We present parallelization and heuristic techniques to reduce the run-time of field-programmable gate array (FPGA) negotiated congestion routing. Two heuristic optimizations provide over 3x speedup versus a sequential baseline. In our parallel approach, sets of design signals are assigned to different processor cores and routed concurrently. Communication between cores uses the message passing interface (MPI) protocol. We propose a geographic partitioning of signals into independent sets to help minimize the communication overhead. Our parallel implementation provides approximately 2.3x speedup using four cores and produces deterministic, repeatable results. When combined, the parallel and heuristic techniques provide over 7x speedup with four cores versus the router in the widely used Versatile Place and Route (VPR) FPGA placement/routing framework, with no significant impact on circuit speed or wirelength.

REGULAR PAPERS

LOGIC SYNTHESIS

Agyekum, M. Y.; Nowick, S. M.
Error-Correcting Unordered Codes and Hardware Support for Robust Asynchronous Global Communication
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106737

This paper introduces a new family of error-correcting unordered (ECU) codes for global communication, called Zero-Sum. They combine the timing-robustness of delay-insensitive (i.e., unordered) codes with the fault-tolerance of error-correcting codes (providing 1-bit error correction or 2-bit detection).
Two key features of the codes are that they are systematic, allowing direct extraction of data, and weighted, where the check field is computed as the sum of data index weights. A wide variety of weight assignments is shown to be feasible. Two practical enhancements are also proposed. The Zero-Sum+ code extends error detection to 3-bit errors, or alternatively handles 2-bit detection and 1-bit correction. The Zero-Sum* code supports heuristic 2-bit correction, while still guaranteeing 2-bit detection, under different strategies of weight assignment. Detailed hardware implementations of the supporting components (encoder, completion detection, error corrector) are given, as well as an outline of the system microarchitecture. In comparison to the best alternative systematic ECU code, the basic Zero-Sum code provides better or comparable coding efficiency, with a 5.74%-18.18% reduction in the average number of wire transitions for most field sizes. Several Zero-Sum* codes were also evaluated for their 2-bit error correction coverage; initial results are promising, with the best strategy correcting 52.92%-71.16% of all 2-bit errors for most field sizes, at only a moderate decrease in coding efficiency and increase in wire transitions. Technology-mapped pre-layout implementations of the supporting Zero-Sum code hardware were synthesized with the UC Berkeley ABC tool using a 90 nm industrial standard cell library. Results indicate that they have moderate area and delay overheads. In comparison, supporting hardware for the best nonsystematic ECU codes has 3.82-10.44x greater area for larger field sizes.

MODELING AND SIMULATION

Hajj, I. N.
Extended Nodal Analysis
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106732

This paper presents an extension to the popular nodal and modified nodal formulation methods that allows elements whose characteristic functions include, in addition to voltages and currents, other controlling variables, such as charge, flux, and other physical parameters, to be included in the circuit equation formulation in a straightforward manner. Stamps, similar to nodal and modified nodal circuit element stamps, are developed to include these elements in the circuit matrix equation without the need to derive equivalent circuit models consisting of interconnections of elements characterized only by currents and voltages, as in current practice. The method is applied to derive circuit stamps of memristive, memcapacitive, meminductive, and other complex device models. The method reduces the size of the overall circuit matrix and allows easy model evaluation and linearization during the iterative circuit solution process.

Zhou, Y.; Gad, E.; Nakhla, M. S.; Achar, R.
Structural Characterization and Efficient Implementation Techniques for A-Stable High-Order Integration Methods
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106731

This paper presents structural characterization and performance enhancement strategies for the recently proposed A-stable and L-stable high-order integration methods based on the Obreshkov formula. It is demonstrated that although the Jacobian matrix in these methods is larger than the Jacobian matrix in classical low-order methods, it enjoys a special structure that can be used to develop efficient factorization techniques. In addition, the paper proposes a method to reduce the number of Newton-Raphson iterations needed to converge in the large Jacobian domain.

Wang, Y.; Hu, X.; Cheng, C.-K.; Pang, G. K. H.; Wong, N.
A Realistic Early-Stage Power Grid Verification Algorithm Based on Hierarchical Constraints
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106741

Power grid verification has become an indispensable step in guaranteeing a functional and robust chip design. Vectorless power grid verification methods, which solve linear programming (LP) problems under current constraints, enable worst-case voltage drop predictions at an early stage of design, when the specific waveforms of the current drains are unknown. In this paper, a novel power grid verification algorithm based on hierarchical constraints is proposed. By introducing novel power constraints, the proposed algorithm generates more realistic current patterns and provides less pessimistic voltage drop predictions. A model order reduction-based coefficient computation algorithm reduces the complexity of formulating the LP problems from being proportional to the number of steps to being independent of it. Utilizing the special hierarchical constraint structure, a submodular polyhedron greedy algorithm dramatically reduces the complexity of solving the LP problems from over O(km^3) to roughly O(km log km), where km is the number of variables. Numerical results show that the proposed algorithm provides less pessimistic voltage drop predictions while at the same time achieving dramatic speedups.

SYSTEM-LEVEL DESIGN

Yun, D.; Kim, S.; Ha, S.
A Parallel Simulation Technique for Multicore Embedded Systems and Its Performance Analysis
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106734

A virtual prototyping system is constructed by replacing real processing components with component simulators running concurrently. The performance of such a distributed simulation decreases drastically as the number of component simulators increases. Thus, we propose a novel parallel simulation technique to boost the simulation speed.
In the proposed technique, a simulator wrapper performs time synchronization with the simulation backplane on behalf of the associated component simulator itself. Component simulators send null messages periodically to the backplane to enable parallel simulation without any causality problems. Since excessive communication may degrade the simulation performance, we also propose a novel performance analysis technique to determine an optimal period of null message transfer, considering both the characteristics of the target application and the configuration of the simulation host. Through extensive experiments, we show that the proposed parallel simulation achieves almost linear speedup in the number of processor cores if the frequency of null message transfer is chosen optimally. The proposed analysis technique predicted the simulation performance with more than 90% accuracy in the worst case across the various target applications and simulation environments used in our experiments.

Yang, C.; Orailoglu, A.
Tackling Resource Variations Through Adaptive Multicore Execution Frameworks
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106742

Multicore architectures have been widely adopted to accommodate the rising performance demand in various application domains, ranging from high-end supercomputing to low-end consumer electronics. Yet due to the ever-growing integration density and application complexity, such architectures suffer from an increased level of core availability variation. At runtime, issues such as device failures, heat buildup, and resource competition and preemption can make computational resources unavailable, necessitating execution schedules capable of delivering diverse performance levels to match the varying resource allocations.
The adaptive execution framework introduced in this paper delivers high-quality schedules capable of predictably reconfiguring execution and gracefully degrading performance in the face of resource unavailability. By adhering to a novel band structure, a set of possible execution schedules is compactly generated at compile time, thus delivering predictable responses to runtime resource variations. More importantly, through the exploitation of an extra degree of freedom in the scheduling process, the scheduler can perform task assignments in such a way that adaptivity is embedded within the preoptimized schedules at almost no cost. The efficacy of the proposed technique is confirmed by incorporating it into a conventional, widely adopted scheduling heuristic and verifying it experimentally in the context of single-core degradations.

Daneshtalab, M.; Ebrahimi, M.; Liljeberg, P.; Plosila, J.; Tenhunen, H.
Memory-Efficient On-Chip Network With Adaptive Interfaces
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106730

To achieve higher memory bandwidth in network-based multiprocessor architectures, multiple dynamic random access memories can be accessed simultaneously. In such architectures, not only are resource utilization and latency critical issues, but a reordering mechanism is also required to deliver the response transactions of concurrent memory accesses in order. In this paper, we present a memory-efficient on-chip network architecture that copes with these issues efficiently. Each node of the network is equipped with a novel network interface (NI) to deal with out-of-order delivery, and a priority-based router to decrease the network latency. The proposed NI exploits a streamlined reordering mechanism to handle in-order delivery and utilizes the Advanced eXtensible Interface (AXI) transaction-based protocol to maintain compatibility with existing intellectual property cores.
To improve memory utilization and reduce memory latency, an optimized memory controller is integrated into the presented NI. Experimental results with synthetic test cases demonstrate that the proposed on-chip network architecture provides significant improvements in average network latency (16%), average memory access latency (19%), and average memory utilization (22%).

SHORT PAPERS

Shi, B.; Zhang, Y.; Srivastava, A.
Accelerating Gate Sizing Using Graphics Processing Units
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6106738

In this paper, we investigate the gate sizing problem and develop techniques for improving its runtime by effectively exploiting graphics processing unit (GPU) resources. Theoretically, we investigate a randomized cutting plane-based convex optimization technique that is highly parallelizable and can effectively exploit the single-instruction multiple-data structure imposed by GPUs. To further improve the runtime, we also develop GPU-oriented implementation guidelines that exploit the specific structure that convex gate sizing formulations impose. We implemented our method on an NVIDIA Tesla 10 GPU and obtain 21x to 400x speedups compared to the MOSEK optimization tool running on a conventional CPU. The solution quality of our method is very close to that achieved by MOSEK, since both solve the convex formulation to optimality.
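To illustrate the kind of method the last abstract refers to, here is a generic Kelley-style cutting-plane sketch, not the authors' GPU algorithm: subgradient cuts build a piecewise-linear lower model of a convex objective, and the model is evaluated at many random candidate points in one vectorized operation, which is the data-parallel step a GPU's single-instruction multiple-data hardware would accelerate. The function name and the toy two-variable quadratic "sizing" objective are invented for illustration.

```python
import numpy as np

def cutting_plane_minimize(f, grad, lo, hi, iters=60, samples=4096, seed=0):
    """Minimize a convex f over a box with randomized cutting planes.

    Each iteration adds one subgradient cut; the piecewise-linear lower
    model is then evaluated at `samples` random points in a single
    vectorized step (the SIMD/GPU-friendly part of such schemes).
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    x = (lo + hi) / 2.0                      # start at the box center
    pts, vals, grads = [], [], []
    best_x, best_f = x, f(x)
    for _ in range(iters):
        fx, gx = f(x), grad(x)
        if fx < best_f:
            best_x, best_f = x, fx
        pts.append(x); vals.append(fx); grads.append(gx)
        P, V, G = np.array(pts), np.array(vals), np.array(grads)
        cand = rng.uniform(lo, hi, size=(samples, len(lo)))
        # model(y) = max_i [ f(x_i) + g_i . (y - x_i) ], a lower bound on f;
        # one einsum evaluates every cut at every candidate point at once
        cuts = V[None, :] + np.einsum('nkd,kd->nk',
                                      cand[:, None, :] - P[None, :, :], G)
        x = cand[np.argmin(cuts.max(axis=1))]  # next iterate: model minimizer
    return best_x, best_f

# Toy convex objective standing in for a convex delay/area cost over
# two gate sizes (purely illustrative, not the paper's formulation).
target = np.array([1.5, 2.5])
f = lambda x: float(np.sum((x - target) ** 2))
grad = lambda x: 2.0 * (x - target)
best_x, best_f = cutting_plane_minimize(f, grad, lo=[0, 0], hi=[4, 4])
```

Because the per-candidate model evaluations are independent, the same loop maps naturally onto a GPU kernel; the NumPy version above merely mimics that structure on a CPU.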