Computer Architecture
Z. Torabi; Armin Belghadr
Abstract
Background and Objectives: The residue number system (RNS) is a prominent candidate for high-speed arithmetic applications because of its limited carry propagation, fault tolerance, and parallelism in addition, subtraction, and multiplication. In contrast, comparison, division, scaling, overflow detection, and sign detection are difficult operations in RNS and have received considerable attention in the literature. Efficient comparators facilitate the other hard-to-implement operations and extend the range of RNS applications. Such comparators can replace the straightforward method of comparing RNS numbers (i.e., converting the operands to binary and comparing them with wide-word binary comparators). Methods: The Dynamic Range Partitioning (DRP) method has shown advantages over other methods for comparing unsigned RNS numbers in the 3-moduli sets {2^n, 2^n±1} and {2^n, 2^n-1, 2^(n+1)-1}. In this paper, we employ DRP components to design a unified unit that both detects the sign of the operands and compares them, for the 5-moduli set γ={2^2n, 2^n±1, 2^n±3}. This unit can compare both signed and unsigned RNS numbers in the moduli set γ. Results: Synthesis results reveal a 47% (54%) speed-up, 35% (32%) less area, 25% (24%) lower power dissipation, and 60% (65%) less energy for n=8 (16) compared with the straightforward signed comparator. Conclusion: According to the results of this study, the DRP method for sign detection and comparison outperforms other methods in different moduli sets, including the 5-moduli set γ={2^2n, 2^n±1, 2^n±3}.
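To make the baseline concrete, the following is a minimal Python sketch of the straightforward comparator mentioned in the abstract: both operands are converted back to binary and compared as ordinary integers. It is not the DRP unit proposed in the paper. The function names are illustrative, the reverse conversion uses the textbook Chinese Remainder Theorem, and the signed interpretation (the upper half of the dynamic range encodes negative values) is an assumed convention that the abstract does not spell out.

```python
from math import prod

def gamma_moduli(n):
    """Moduli set gamma = {2^2n, 2^n-1, 2^n+1, 2^n-3, 2^n+3} from the abstract."""
    base = 1 << n
    return [1 << (2 * n), base - 1, base + 1, base - 3, base + 3]

def to_rns(x, moduli):
    """Forward conversion: one residue per modulus (negative x wraps around)."""
    return [x % m for m in moduli]

def to_binary(residues, moduli):
    """Reverse conversion with the Chinese Remainder Theorem (needs Python 3.8+)."""
    dynamic_range = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = dynamic_range // m
        total += r * partial * pow(partial, -1, m)  # modular inverse of partial mod m
    return total % dynamic_range

def compare_unsigned(a, b, moduli):
    """Return -1, 0, or 1 by comparing the reconstructed binary values."""
    x, y = to_binary(a, moduli), to_binary(b, moduli)
    return (x > y) - (x < y)

def to_signed(value, dynamic_range):
    # Assumed convention: values in the upper half of the range are negative.
    return value - dynamic_range if value >= dynamic_range // 2 else value

def compare_signed(a, b, moduli):
    """Signed comparison on top of the same reverse conversion."""
    dynamic_range = prod(moduli)
    x = to_signed(to_binary(a, moduli), dynamic_range)
    y = to_signed(to_binary(b, moduli), dynamic_range)
    return (x > y) - (x < y)
```

For n=8, gamma_moduli(8) gives the moduli 65536, 255, 257, 253, and 259. Comparing to_rns(-5, moduli) with to_rns(3, moduli) yields -1 under the signed interpretation and 1 under the unsigned one, which shows why sign detection has to accompany comparison. The wide reverse conversion in to_binary is the cost that a dedicated RNS comparator aims to avoid.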
Computer Architecture
A. Teymouri; H. Dorosti; M. Ersali Salehi Nasab; S.M. Fakhraie
Abstract
Background and Objectives: The future demands of multimedia and signal processing applications force IC designers to use efficient high-performance techniques in increasingly complex SoCs to achieve higher computing throughput along with better energy/power efficiency. In recent technologies, variation effects and leakage power strongly affect the design specifications, and designers need to consider these parameters at design time. Addressing both challenges while also boosting computation throughput makes the design more difficult. Methods: In this article, we propose a simple serial core for higher energy/power efficiency and use data-level parallel structures to achieve the required computation throughput. Results: The proposed core yields a 35% (75%) energy (power) improvement, and the parallel structure results in 8x higher throughput. The proposed architecture provides 76 MIPS of computation throughput while consuming only 2.7 pJ per instruction. The outstanding feature of this processor is its resiliency against variation effects. Conclusion: A simple serial architecture reduces the effect of variations on design paths; furthermore, the effect of process variation on throughput loss and energy dissipation is negligible. The proposed processor architecture is suitable for energy/power-constrained applications such as the internet of things (IoT) and mobile devices, where it eases energy harvesting for a longer lifetime.
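As a back-of-the-envelope check of how the two reported figures relate, average power can be estimated as throughput multiplied by energy per instruction. The sketch below is only that arithmetic; the resulting number is derived here and is not reported in the abstract, and it assumes the 2.7 pJ figure covers all energy spent per instruction at the 76 MIPS operating point.

```python
def average_power_watts(instructions_per_second, joules_per_instruction):
    """Average power implied by a throughput and an energy-per-instruction figure."""
    return instructions_per_second * joules_per_instruction

# Figures taken from the abstract: 76 MIPS and 2.7 pJ per instruction.
throughput_ips = 76e6
energy_per_instruction_j = 2.7e-12

power_w = average_power_watts(throughput_ips, energy_per_instruction_j)
print(f"Implied average power: {power_w * 1e6:.1f} microwatts")
```

Under these assumptions the implied average power is roughly 205 microwatts.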
Computer Architecture
H. Dorosti
Abstract
Background and Objectives: With the fast growth of the low-power internet of things, power/energy and performance constraints have become more challenging at both design time and operation time. Static and dynamic variations make the situation worse in terms of reliability, performance, and energy consumption. In this work, a novel slack measurement circuit is proposed to enable precise frequency management based on timing violation measurement. Methods: The proposed slack measurement circuit measures the delay difference between the clock pulse edge and a possible transition on the path end-points (the primary outputs of the design). The output of the proposed slack monitoring circuit is a digital code that reflects the current state of the target critical path delay. Converting this digital code to the equivalent delay difference requires the delay of a reference gate, which is the basic unit of the proposed monitor. This monitor enables more precise and efficient frequency management while maintaining correct functionality in low-power mode. Results: Applying this method to a MIPS processor reduces the performance penalty and recovery energy overhead by up to 30% with only 2% additional hardware. Results for benchmark applications in low-power mode show a 7-30% power improvement in normal execution mode. If the application is resilient to errors caused by timing violations, the proposed method achieves a 20-60% power reduction through approximate computation, for as long as the application remains resilient. The performance of the proposed method depends on the degree of application resilience to timing errors. To keep the proposed monitor general across applications, the resilience threshold is user programmable and can be configured according to the requirements of each application. Conclusion: The results show that precise frequency scheduling is more energy/power efficient in managing static and dynamic variations. A monitor capable of measuring the amount of violation enables finer frequency management. In addition, this method helps exploit application resilience by estimating the possible error value from the measured violation amount.
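The conversion from monitor code to delay, and its use in frequency management, can be sketched as follows. This is a minimal illustration under stated assumptions: the code is taken to count reference-gate delays linearly, and the policy of stretching the clock period by the measured violation unless it falls within a user-programmed resilience threshold is a hypothetical stand-in, not the controller described in the paper. All names and numbers are illustrative.

```python
def code_to_delay_ps(code, ref_gate_delay_ps):
    """Convert the monitor's digital code to a delay difference.

    Assumption: the code counts how many reference-gate delays separate the
    clock edge from the end-point transition, so the mapping is linear.
    """
    return code * ref_gate_delay_ps

def next_clock_period_ps(period_ps, violation_code, ref_gate_delay_ps,
                         resilience_threshold_ps=0.0):
    """Hypothetical policy: tolerate small violations, otherwise slow the clock."""
    violation_ps = code_to_delay_ps(violation_code, ref_gate_delay_ps)
    if violation_ps <= resilience_threshold_ps:
        # The application is assumed to tolerate an error of this size.
        return period_ps
    # Stretch the period by exactly the measured violation amount.
    return period_ps + violation_ps
```

With a 1000 ps period, a code of 3, and a 20 ps reference gate, the sketch stretches the period to 1060 ps; with a resilience threshold of 100 ps it leaves the period at 1000 ps. Because the violation is measured rather than merely detected, the adjustment can match its size instead of using a fixed worst-case step.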
Computer Architecture
A. Tajary; E. Tahanian
Abstract
Background and Objectives: Wireless Network on Chip (WNoC) is one of the promising interconnection architectures for future many-core processors. Besides the architectures and topologies of WNoCs, one of the challenges is designing an efficient routing algorithm that uses the provided frequency band to achieve better network latency. Methods: Wireless connections reduce the number of hops needed to send data across a network, which can lead to lower data delivery latency and higher throughput in WNoCs. On the other hand, because wireless links shorten paths, traffic concentrates on them, which can cause congestion around the wireless nodes. This congestion may add delay to data transfers and thereby reduce the network throughput of WNoCs. Although there are good routing algorithms that balance traffic between wired and wireless connections for synthetic traffic patterns, they cannot deal with the dynamic traffic patterns that exist in real-world applications. In this paper, we propose a new routing algorithm that uses the wireless connections as much as possible and, in the case of congestion, uses the wired connections instead. Results: We investigated the proposed method using eight applications from the Parsec benchmark suite. Simulation results show that the proposed method can achieve up to 13.9% higher network throughput with a power consumption reduction of up to 1.4%. Conclusion: In this paper, we proposed an adaptive routing algorithm that uses wireless links to deliver data over the network on chip. We investigated the proposed method on real-world applications. Simulation results show that the proposed method can achieve higher network throughput and lower power consumption.
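The routing rule stated in the abstract (prefer the wireless link, fall back to the wired one under congestion) can be sketched as a per-packet decision. The congestion signal used here, a queue length at the wireless node compared against a threshold, is an assumption made for illustration; the abstract does not say how congestion is detected, and the function and parameter names are hypothetical.

```python
def choose_link(wireless_available, wireless_queue_len, congestion_threshold):
    """Pick the link type for the next hop of a packet.

    Assumption: congestion at the wireless node is approximated by its
    current queue length reaching a fixed threshold.
    """
    if not wireless_available:
        return "wired"
    if wireless_queue_len >= congestion_threshold:
        # The wireless node is congested, so route over wired links instead.
        return "wired"
    # Otherwise use the wireless shortcut as much as possible.
    return "wireless"
```

For example, with a threshold of 4, a packet that sees a wireless queue of length 2 takes the wireless link, while one that sees a queue of length 4 is routed over the wired mesh. Evaluating the rule per packet is what lets a policy of this kind follow traffic that changes at run time.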
Computer Architecture
M. Mosayebi; M. Dehyadegari
Abstract
Background and Objectives: Graph processing is gaining increasing attention in the era of big data. However, graph processing applications are highly memory intensive because of the nature of graphs. Processing-in-memory (PIM) is an old idea that has recently been revisited thanks to advances in technology, specifically the ability to manufacture 3D-stacked chips. PIM proposes enriching memory units with computational capabilities to reduce the cost of data movement between the processor and the memory system. Considering recent advances in the field, this approach appears to be a promising way of dealing with large-scale graph processing. Methods: This paper explores real-world PIM technology to improve graph processing efficiency by reducing irregular access patterns and improving temporal locality using the Hybrid Memory Cube (HMC). We propose NodeFetch, a new method for accessing nodes and their neighbors while processing a graph, implemented by adding a new command to the HMC system. Results: Simulation results on a set of real-world graphs show that the proposed idea achieves a 3.3x speed-up on average and a 69% reduction in energy consumption over the baseline PIM architecture, which is HMC. Conclusion: Most techniques in the field of processing-in-memory employ methods that reduce the movement of data between processor and memory. This paper proposes a method that reduces graph processing execution time and energy consumption by reducing cache misses while processing a graph.
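The access pattern that motivates a node-and-neighbors command can be illustrated with a small software model. The sketch below stores a graph in compressed sparse row (CSR) form and counts memory requests for gathering a node together with its neighbors, first with one request per value and then with a single combined request. The combined_fetch function is a hypothetical model of the idea, not the NodeFetch command itself, whose semantics the abstract does not specify.

```python
class CSRGraph:
    """Directed graph in compressed sparse row form."""

    def __init__(self, num_nodes, edges):
        adjacency = [[] for _ in range(num_nodes)]
        for src, dst in edges:
            adjacency[src].append(dst)
        self.offsets = [0]
        self.neighbors = []
        for nbrs in adjacency:
            self.neighbors.extend(nbrs)
            self.offsets.append(len(self.neighbors))

    def neighbors_of(self, v):
        return self.neighbors[self.offsets[v]:self.offsets[v + 1]]

def host_side_gather(graph, values, v):
    """Baseline model: one request for the node plus one per neighbor value."""
    requests = 1
    gathered = [values[v]]
    for u in graph.neighbors_of(v):
        gathered.append(values[u])  # irregular access: u can be anywhere
        requests += 1
    return gathered, requests

def combined_fetch(graph, values, v):
    """Hypothetical model: memory returns the node and its neighbors at once."""
    gathered = [values[v]] + [values[u] for u in graph.neighbors_of(v)]
    return gathered, 1
```

On a graph with edges (0,1), (0,3), and (1,2) and node values [10, 11, 12, 13], both functions gather [10, 11, 13] for node 0, but the baseline model issues three requests while the combined model issues one. The request count is only a proxy for the irregular accesses and cache misses discussed in the abstract.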