US20160011642A1 - Power and throughput optimization of an unbalanced instruction pipeline - Google Patents
- Publication number
- US20160011642A1 (application No. US 14/860,095)
- Authority
- US
- United States
- Prior art keywords
- stage
- processing
- rate
- constituent
- instruction pipeline
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/32—Means for saving power
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3234—Power saving characterised by the action undertaken
- G06F1/324—Power saving characterised by the action undertaken by lowering clock frequency
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline, look ahead using instruction pipelines
- G06F9/3869—Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3409—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/88—Monitoring involving counting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/885—Monitoring specific for caches
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Advance Control (AREA)
Abstract
A method includes determining a rate of resource occupancy of a constituent stage of an unbalanced instruction pipeline implemented in a processor through profiling an instruction code. The method also includes performing data processing at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.
Description
- This application is a Divisional of prior application Ser. No. 13/089,101, filed Apr. 18, 2011, currently pending, and also claims priority from Indian Provisional Application Serial No. 1129/CHE/2010, filed on Apr. 20, 2010, entitled “POWER AND THROUGHPUT OPTIMIZATION OF AN UNBALANCED INSTRUCTION PIPELINE”, which is incorporated herein by reference in its entirety.
- Embodiments of the disclosure relate to instruction pipelining in processors.
- Instruction pipelining is a technique used in processors (e.g., microprocessors, microcontrollers) to allow for parallel processing of instructions. For example, one instruction is associated with a first stage of an instruction pipeline and another instruction is associated with a second stage of the instruction pipeline. The instruction pipeline allows for “breaking” of the timing associated with a large data path, and provides parallelism in executing the instructions at an increased clock frequency.
- The instruction pipeline offers optimum performance only when the constituent stages are perfectly balanced. A balanced pipeline implies that processing associated with a constituent stage of the pipeline takes a completion time equal to the completion time associated with all other constituent stage(s) of the instruction pipeline. However, there are scenarios (e.g., hard macro(s) such as memory/memories being in the data path of the pipeline, Arithmetic Logic Units (ALU units) such as multipliers, adders, bit shifters and dividers being in a same constituent stage of the pipeline) where a programmer/user is not able to perfectly balance the instruction pipeline. Here, the maximum frequency at which the unbalanced pipeline is clocked is determined through the constituent stage therein offering the maximum delay.
- Assuming no stalls in an unbalanced instruction pipeline, the maximum frequency at which the unbalanced instruction pipeline is clocked is expressed in example Equation (1) as:
- $$f_{max} = \frac{1}{d} \tag{1}$$
- where d is the maximum delay offered by a constituent stage.
- Assuming the time taken for executing N instructions to be (N + n_s) cycles (n_s being the number of constituent stages of the unbalanced instruction pipeline), the effective throughput, E, is expressed in example Equation (2) as:
- $$E = \frac{N}{N + n_s} \times f_{max} \tag{2}$$
- The throughput, E as seen in Equation (2), is the number of instructions per second. Increased throughput is associated with a higher fmax, which implies a lower maximum delay offered by the constituent stage of the unbalanced instruction pipeline.
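As a rough numerical illustration of Equations (1) and (2), the sketch below takes the maximum frequency as the reciprocal of the maximum stage delay d, and the throughput as N instructions completing in (N + ns) cycles at that frequency. The function names and the sample values (a 15 ns slowest stage, four stages, one million instructions) are assumptions chosen for illustration only.

```python
def max_frequency_hz(max_stage_delay_s: float) -> float:
    """Equation (1) as read here: f_max = 1 / d."""
    return 1.0 / max_stage_delay_s


def effective_throughput(n_instructions: int, n_stages: int, f_max_hz: float) -> float:
    """Equation (2) as read here: N instructions complete in (N + ns) cycles."""
    return n_instructions / (n_instructions + n_stages) * f_max_hz


# Illustrative values (assumptions): slowest stage 15 ns, 4 stages, 1,000,000 instructions.
f_max = max_frequency_hz(15e-9)
throughput = effective_throughput(1_000_000, 4, f_max)
print(f"f_max = {f_max / 1e6:.1f} MHz")
print(f"E = {throughput / 1e6:.1f} million instructions per second")
```

For large N the factor N / (N + ns) approaches 1, so the throughput is governed almost entirely by the clock frequency, which is why the maximum stage delay matters so much.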
- The pipeline can be clocked at a frequency higher than that computed based on the max-delay, and when the usage of timing-path involving the max-delay is detected, then the pipeline can be stalled for a number of cycles equivalent to the delay offered by the timing-path. This is known as pipeline stalling.
- With the above approach, the frequency might not be optimal, if the usage of the timing-path involving max-delay is not frequent. It would lead to unnecessary dynamic power dissipation. Hence, there is a need to arrive at an optimum frequency for a given rate of usage of the timing-path involving the maximum delay.
- In one aspect, a method includes determining a rate of resource occupancy of a constituent stage of an unbalanced instruction pipeline implemented in a processor through profiling an instruction code. The method also includes performing data processing associated with the unbalanced instruction pipeline at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.
- In another aspect, a method includes determining a time interval within a processing time associated with a constituent stage of an unbalanced instruction pipeline implemented in a processor based on a change in a processing scenario associated with data processing therein. The method also includes dynamically determining a rate of resource occupancy of the constituent stage periodically with a time period equal to the determined time interval through profiling an instruction code associated therewith. Further, the method includes periodically obtaining a clock frequency associated with the rate of resource occupancy of the constituent stage and performing the data processing at the periodically obtained clock frequency. The clock frequency corresponds to an optimized power consumption and/or a throughput associated with the unbalanced instruction pipeline.
- In yet another aspect, a computing system includes a processor having an unbalanced instruction pipeline implemented therein and a memory configured to store an instruction code associated with processing through the unbalanced instruction pipeline. The computing system also includes a determination module configured to determine a rate of resource occupancy of a constituent stage of the unbalanced instruction pipeline through profiling the instruction code associated with processing through the unbalanced instruction pipeline. The processor is configured to perform data processing at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.
- Other features will be apparent from the accompanying drawings and from the detailed description that follows.
FIG. 1 is a schematic view of a data path and a control path associated with an unbalanced instruction pipeline, according to one or more embodiments.
FIG. 2 is an illustrative view of an example processing scenario associated with the unbalanced instruction pipeline of FIG. 1.
FIG. 3 is a schematic view of logic associated with a pipeline control unit configured to dynamically profile an instruction code associated with a constituent stage of the unbalanced instruction pipeline of FIG. 1.
FIG. 4 is a plot of throughput associated with a constituent stage of the unbalanced instruction pipeline of FIG. 1 as a function of a clock frequency for different example values of the rate of resource occupancy associated with the constituent stage.
FIG. 5 is a schematic view of a computing system including a processor in which the unbalanced instruction pipeline of FIG. 1 is implemented.
FIG. 6 is a process flow diagram detailing the operations involved in a method of performing optimum data processing through the unbalanced instruction pipeline of FIG. 1, according to one or more embodiments.
FIG. 7 is a process flow diagram detailing the operations involved in a method of performing optimum and dynamic data processing through the unbalanced instruction pipeline of FIG. 1, according to one or more embodiments.
- Other features of the present embodiments will be apparent from the accompanying drawings and from the detailed description that follows.
- Disclosed are a method, an apparatus and/or a system to optimize power and throughput in an unbalanced instruction pipeline implemented in a processor associated therewith. Although the present embodiments have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the various embodiments.
FIG. 1 illustrates a data path 162 and a control path 164 associated with an unbalanced instruction pipeline 100, according to one or more embodiments. An instruction code associated with processing through unbalanced instruction pipeline 100 is stored in program memory 102. Program memory 102 is a Read-Only Memory (ROM). In some cases, a data memory (not shown) in the form of a Random Access Memory (RAM) is used to store intermediate results and variables associated with the processing. Program memory 102 may also be configured to store constants associated with the processing. Instructions stored in program memory 102 are decoded through instruction decoder 104, and matching control signals for the pipelined data path 162 are generated.
- The aforementioned operations (e.g., instruction decoding) constitute stage 1 106 of unbalanced instruction pipeline 100. In the example embodiment shown in FIG. 1, unbalanced instruction pipeline 100 is shown to include four stages (e.g., stage 1 106, stage 2 108, stage 3 110, stage 4 112). Unbalanced instruction pipeline 100 may include more or fewer than four stages; the four stages shown in FIG. 1 merely serve as an example. In another example embodiment, stage 1 106 is associated with an instruction fetch operation, stage 2 108 is associated with an instruction decode operation, stage 3 110 is associated with an execute operation, stage 4 112 is associated with a memory access operation, and stage 5 (not shown) is associated with a write back operation.
- Registers are inserted between stages of unbalanced instruction pipeline 100. Specifically, in one or more embodiments, the output of each stage is an input to a flip-flop (e.g., FF1 114, FF2 116, FF3 118, FF4 120). For example, as shown in FIG. 1, D flip-flops are used for the aforementioned purpose. Each D flip-flop is configured to receive the output of the previous stage (e.g., instruction decoder 104 output, output of D flip-flop (Q)) as the D input thereof. Each flip-flop (e.g., FF1 114, FF2 116, FF3 118, FF4 120) is clocked through a clock generation circuit (e.g., CLK GEN 1 132, CLK GEN 2 134, CLK GEN 3 136, CLK GEN 4 138). Program memory 102 also has a clock generation circuit (e.g., CLK GEN 0 130) associated therewith. In an example embodiment, the clock generation circuit includes a crystal oscillator. The clock generation circuits (e.g., CLK GEN 0 130, CLK GEN 1 132, CLK GEN 2 134, CLK GEN 3 136, CLK GEN 4 138) associated with the individual stages are controlled through pipeline control unit 150.
- Unbalanced instruction pipeline 100 may include a data path 162 and a control path 164. As shown in FIG. 1, data path 162 may include flip-flops configured to latch onto and propagate data to succeeding stages. Control path 164 may include control elements (e.g., control element 1 142, control element 2 144, control element 3 146, control element 4 148) configured to control data processing through the stages of unbalanced instruction pipeline 100. For example, the control elements are configured to assert a signal to enable data transfer through data path 162 at an output. Flip-flops may be used as control elements in unbalanced instruction pipeline 100. In one or more embodiments, pipeline control unit 150 is also configured to control clock gating (to be discussed below) and data forwarding through each stage of unbalanced instruction pipeline 100 using the decoded instruction control signals available through the control elements. Further, pipeline control unit 150 is configured to utilize the decoded instruction control signals from each stage of unbalanced instruction pipeline 100 to detect data hazards therein.
- In the example embodiment of FIG. 1, stage 3 110 includes logic associated therewith. Specifically, FIG. 1 illustrates stage 3 110 as including logic 1 122, logic 2 124, and logic 3 126. Also, a multiplexer (MUX 128) may select one of logic 1 122, logic 2 124 and logic 3 126 based on a control signal. It is noted that there may be more logic units associated with stage 3 110. Logic 1 122, logic 2 124 and/or logic 3 126 may be Arithmetic Logic Units (ALUs) (e.g., multiplier, adder, bit shifter, divider). For the sake of convenience in understanding, it is assumed that logic 1 122 is a divider, logic 2 124 is an adder, and logic 3 126 is a multiplier, and that a task completion time associated with logic 1 122 is 15 nanoseconds (15 ns), a task completion time associated with logic 2 124 is 2 ns, and a task completion time associated with logic 3 126 is 5 ns. The task completion times associated with all other stages (e.g., stage 1 106, stage 2 108, stage 4 112) are assumed to be 2 ns.
- Thus, the maximum delay associated with unbalanced instruction pipeline 100/stage 3 110 will be 15 ns. Further, it is assumed that the probability of logic 1 122 being utilized during processing is lower than the probability associated with the use of logic 2 124 and logic 3 126. In other words, MUX 128 is configured to select logic 2 124 or logic 3 126 more frequently than logic 1 122. If unbalanced instruction pipeline 100 is clocked at a frequency associated with the maximum delay in stage 3 110 (e.g., 15 ns due to logic 1 122), the throughput (see, e.g., Equation (2)) associated with unbalanced instruction pipeline 100 is limited, as the clock frequency is limited (e.g., to a maximum of 66.7 MHz) even though the probability of use of logic 1 122 is low.
- Thus, it is beneficial to clock unbalanced instruction pipeline 100 at a frequency higher than the example 66.7 MHz discussed above. For example, unbalanced instruction pipeline 100 may be clocked at a frequency associated with the smallest delay associated with any of the constituent stages (e.g., stage 1 106, stage 2 108, stage 3 110, stage 4 112). In the example scenario discussed above, the smallest delay associated with the stages is 2 ns. Therefore, unbalanced instruction pipeline 100 is clocked at a frequency associated with 2 ns (i.e., 500 MHz).
- Whenever the use of logic 1 122 is required, the execution (or, task completion) associated with stage 3 110 and the previous stages thereof (e.g., stage 2 108, stage 1 106) is stalled for at least a number of clocks corresponding to the delay associated with logic 1 122 (e.g., 15 ns). The minimum number of 2 ns clocks required to cover 15 ns is 8. Thus, execution associated with logic 1 122, logic 2 124, and logic 3 126 of stage 3 110 is stalled for eight clock cycles, one clock cycle and three clock cycles respectively. Stalling is accomplished through gating the clock inputs to the flip-flops associated with stage 3 110 (e.g., FF3 118) and the previous stages thereof (e.g., stage 2 108 and FF2 116, stage 1 106 and FF1 114). In one or more embodiments, new instructions are prevented from entering unbalanced instruction pipeline 100 during the stall.
- Clock gating for the purpose of stalling is controlled by pipeline control unit 150 (to be described below). Clock gating is controlled through control elements (e.g., control element 3 146, control element 2 144, control element 1 142), in association with pipeline control unit 150. At the simplest level, an AND gate (not shown) may be employed for the clock gating. Here, the signal(s) associated with the stages (e.g., stage 3 110, stage 2 108, stage 1 106) that are stalled are inverted and input to the AND gate. The clock signals generated from the clock generation circuits (e.g., CLK GEN 3 136, CLK GEN 2 134, CLK GEN 1 132, CLK GEN 0 130) may also be input to the AND gate. Whenever the signal(s) is high, the inverted input to the AND gate is low and the clock output of the AND gate is also low, regardless of the state of the clock inputs. Clock gating circuits are known to one skilled in the art, and, therefore, discussion of more examples thereof is skipped for the sake of convenience.
- In one embodiment, constituent stages of unbalanced instruction pipeline 100 include multi-cycle paths. Stage 3 110, for example, may include a multi-cycle path through logic 1 122. The multi-cycle path may require more than one clock cycle for completion of the task associated therewith. The task initiation is accomplished through a source flip-flop changing a state thereof, following which the result of the execution is transmitted to a destination flip-flop. The timing checks associated with the aforementioned stall process are part of, for example, a Static Timing Analysis (STA). Also, the multi-cycle path discussed above is defined during the STA by the programmer/user of a computing system executing tasks associated with unbalanced instruction pipeline 100.
- If the probability of use of logic 1 122 for processing is high, the number of stalls increases for every instruction associated with the aforementioned processing. Thus, dynamic power consumption is impacted, as dynamic power is proportional to the number of clock cycles. In addition, unbalanced instruction pipeline 100 has clock buffers, the constituent flip-flop(s) of which toggle at rising/falling edges of clock pulses. This may contribute to increased dynamic power consumption. Therefore, in the abovementioned example, it may be preferable to clock unbalanced instruction pipeline 100 at a frequency lower than 500 MHz.
- It is possible to determine the rate of resource occupancy associated with an instruction/a constituent stage of unbalanced instruction pipeline 100 through profiling an instruction code associated therewith. In the example described above, an instruction is associated with division, multiplication and addition. For example, logic 1 122 is associated with division operations, logic 2 124 is associated with multiplication operations, and logic 3 126 is associated with addition operations. The rate of use (i.e., resource occupancy) of logic 1 122, logic 2 124 and logic 3 126 is expressed in example Equation (3) as:
- $$R_1 = \frac{N_{division}}{N}, \quad R_2 = \frac{N_{multiplication}}{N}, \quad R_3 = \frac{N_{addition}}{N} \tag{3}$$
- where R1, R2 and R3 are the rates of use of logic 1 122, logic 2 124 and logic 3 126 respectively, N is the number of instructions, and Ndivision, Nmultiplication and Naddition are the number of division, multiplication and addition instructions respectively.
- As discussed above, R1, R2 and R3 are obtained through profiling the instruction code associated with processing through stage 3 110 of unbalanced instruction pipeline 100. For example, compiling/executing the instruction code associated therewith yields R1, R2 and R3. Also, the rate of resource occupancy may depend on a system level scenario in which the instruction code is executed. Thus, obtaining the rate of resource occupancy associated with a stage (or, a sub-stage) of unbalanced instruction pipeline 100 may include monitoring utilization of a processor/memory associated therewith. Parameters associated with the aforementioned monitoring also include instruction cache (e.g., instruction cache associated with program memory 102) hits/misses and data cache (e.g., data cache associated with data memory) hits/misses.
- The instruction cache and the data cache may, respectively, allow for increased speed of an instruction fetch process and a data fetch/store process. In order to monitor these parameters, performance counters (or, registers) are employed in the processing/operating environment associated with processing through unbalanced instruction pipeline 100. The performance counters (or, registers) are configured to keep track of the abovementioned processor/memory utilization and/or a number of instruction/data cache hits/misses. The number of stall cycles associated with a clock frequency (e.g., 500/66.7 MHz) is estimated through the delay (e.g., 2 ns/15 ns) associated with the stage of unbalanced instruction pipeline 100, as discussed above.
- In certain scenarios, rate vectors <R> are constant throughout run-time. For example, the instruction code being executed may be associated with a reliability test of a product, which may take values of the same parameters that are approximately close to one another on different days and check for continued reliability. In such scenarios, an initial profiling of the instruction code may suffice to determine the rate vectors <R>. The clock frequency and the number of stall cycles are kept constant for the instruction code. In other scenarios, the rate vectors <R> may not be constant throughout run-time, and are changed dynamically, as will be discussed below.
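The arithmetic in the preceding paragraphs can be checked with a short sketch. It computes the number of 2 ns clock periods needed to cover each task completion time, models the clock-gating AND gate as a Boolean function, and derives a rate vector by counting operations in an instruction trace. The function names and the 10-instruction trace are hypothetical; the delays are the example values given above, keyed by operation rather than by logic unit number.

```python
import math
from collections import Counter

# Task completion times from the example above, in nanoseconds.
DELAY_NS = {"division": 15, "addition": 2, "multiplication": 5}


def cycles_needed(delay_ns: int, clock_period_ns: int) -> int:
    """Smallest whole number of clock periods that covers the delay."""
    return math.ceil(delay_ns / clock_period_ns)


def gated_clock(clock: bool, stall: bool) -> bool:
    """AND gate with an inverted stall input: output stays low while stall is high."""
    return clock and not stall


def rate_vector(trace: list) -> dict:
    """Fraction of instructions in the trace that use each operation."""
    counts = Counter(trace)
    return {op: counts[op] / len(trace) for op in DELAY_NS}


# Hypothetical 10-instruction trace; a real profile would come from the compiled code.
trace = ["addition"] * 6 + ["multiplication"] * 3 + ["division"]
cycles_at_500mhz = {op: cycles_needed(d, 2) for op, d in DELAY_NS.items()}
rates = rate_vector(trace)
print(cycles_at_500mhz)
print(rates)
```

At a 2 ns clock period the divider needs eight periods, the adder one and the multiplier three, which matches the cycle counts quoted in the text.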
FIG. 2 illustrates an example processing scenario, according to one or more embodiments. It is assumed that there is a processor in which unbalanced instruction pipeline 100 is implemented. The processor is configured to support video processing 202 for the first 10 seconds (s) of an operation, audio processing 204 for the next 20 seconds and, again, video processing 206 for the next 10 seconds. Video processing 206 is analogous to video processing 202. As discussed above, R1 is associated with the rate of use of logic 1 122, R2 is associated with the rate of use of logic 2 124, and R3 is associated with the rate of use of logic 3 126. As video processing (202, 206) involves operations (e.g., mathematical operations) that are different from those of audio processing 204, the rate vector <R1> (e.g., (R1, R2, R3)) associated with video processing (202, 206) is different from the rate vector <R2> (e.g., (R1, R2, R3)) associated with audio processing 204, as shown in FIG. 2.
- In the example shown in FIG. 2, the minimum occupancy time associated with audio/video processing (202, 204, 206) is 10 seconds. The minimum occupancy time is then sampled at, for example, every 1 s, which is the interval for estimating <R>. The pre-defined intervals for determining <R> are thus chosen based on the rate at which processing scenarios (e.g., audio processing 204, video processing (202, 206)) change for the processor.
- Thus, <R> (e.g., <R1>, <R2>) is estimated at pre-defined intervals, depending on which the clock frequency and stall cycles are updated in the hardware associated with the processing. As shown in FIG. 2, video processing 202 involves a rate vector <R1> for which clock frequency f1 and the associated stall vector <s1> (e.g., (s1, s2, s3)) are obtained based on maximizing throughput. Here, s1 denotes the number of stall cycles associated with logic 1 122, s2 denotes the number of stall cycles associated with logic 2 124, and s3 denotes the number of stall cycles associated with logic 3 126.
- At the end of the first 10 seconds, the clock frequency and the stall vector are updated in the hardware to f2 and <s2> (e.g., (s1, s2, s3)) respectively to allow for an optimum (e.g., maximum) throughput during audio processing 204. The clock frequency and the stall vector continue to be f2 and <s2> for the next 20 seconds, although the associated rate vector <R2> is still monitored for changes in the rates therein. At the end of the 20 seconds, the clock frequency and the stall vector are switched to f1 and <s1> as audio processing 204 switches to video processing 206. The aforementioned operations, including the calculation of <R>, are performed through pipeline control unit 150 having associated logic.
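The timeline of FIG. 2 can be sketched as a small simulation. The sketch below is a simplification: it assumes the processing scenario is known at each 1 s sampling instant (in the described hardware the rate vector is measured instead), and it uses the symbolic labels f1, f2, <s1> and <s2> because the text gives no numeric values. All identifiers are hypothetical.

```python
# Scenario timeline from the example: (name, duration in seconds).
SCHEDULE = [("video", 10), ("audio", 20), ("video", 10)]

# Symbolic settings per scenario; the source names them but gives no numbers.
SETTINGS = {"video": ("f1", "<s1>"), "audio": ("f2", "<s2>")}


def scenario_at(t_s: int) -> str:
    """Return the scenario active at time t_s (seconds from the start)."""
    start = 0
    for name, duration in SCHEDULE:
        if start <= t_s < start + duration:
            return name
        start += duration
    raise ValueError("time outside the schedule")


def hardware_updates(interval_s: int = 1) -> list:
    """Sample every interval_s seconds and record each change of setting."""
    total = sum(duration for _, duration in SCHEDULE)
    updates = []
    current = None
    for t in range(0, total, interval_s):
        setting = SETTINGS[scenario_at(t)]
        if setting != current:
            updates.append((t, *setting))
            current = setting
    return updates


updates = hardware_updates()
print(updates)
```

Sampling every second while scenarios last at least 10 seconds means a change of scenario is noticed within one interval, so the hardware is reprogrammed only three times over the 40 seconds.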
FIG. 3 illustrates logic associated with pipeline control unit 150 configured to dynamically profile the instruction code associated with stage 3 110 of unbalanced instruction pipeline 100, according to one or more embodiments. As shown in FIG. 3 and as discussed above, decoded instruction control signals (e.g., decoded instruction control (stage 3) 302 associated with stage 3 110) are input to pipeline control unit 150 (e.g., to the aforementioned logic associated with pipeline control unit 150). Counter 1 304, counter 2 306 and counter 3 308 are associated with computing a rate vector <R> associated with a processing scenario. A pre-defined interval for profiling an instruction code associated with the processing scenario is chosen analogous to the example discussed in FIG. 2. It is assumed that there are, on average, M instructions in the pre-defined interval.
- A Look Up Table (LUT) 312 is implemented in the logic associated with pipeline control unit 150 to obtain the clock frequency and stall cycles (or, stall vectors) for different values of rate vector <R>. LUT 312 may be implemented using a multiplexer having inputs to LUT 312 (e.g., <R>=<R1>, <R2>) as select lines thereof. The output of LUT 312 is the clock frequency (e.g., f=f1, f2) and/or the stall vector (e.g., <s>=<s1>, <s2>). At the end of every interval, the counters (e.g., counter 1 304, counter 2 306, counter 3 308) are reset through interval counter 310. Interval counter 310 is also configured to count the pre-defined intervals (e.g., interval period in FIG. 3). Implementations of interval counter 310 are known to one skilled in the art, and, therefore, discussion associated therewith is skipped for the sake of convenience.
- To summarize, in one or more embodiments, at every interval, the hardware associated with processing through unbalanced instruction pipeline 100 is updated with a new frequency and a stall vector, if applicable, based on a change in the rate vector (e.g., <R2>) when compared to the previous rate vector (e.g., <R1>) associated with the previous interval. Then, the counters (e.g., counter 1 304, counter 2 306, counter 3 308) associated with computing <R> (e.g., <R1>, <R2>) are reset to begin the next averaging.
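The counter, interval counter and LUT arrangement described for FIG. 3 can be mimicked in software. The following is a toy model under stated assumptions, not the circuit itself: the interval is measured in instructions rather than in time, the rate vector is quantized to whole percentages to form the LUT key, and the LUT entries and class name are hypothetical.

```python
class PipelineControlModel:
    """Toy software model of the counters, interval counter and LUT of FIG. 3."""

    def __init__(self, lut: dict, interval_instructions: int):
        self.lut = lut                    # quantized <R> -> (frequency, stall vector)
        self.interval = interval_instructions
        self.counters = [0, 0, 0]         # counter 1, counter 2, counter 3
        self.seen = 0                     # stands in for the interval counter
        self.setting = None               # last value written to the "hardware"

    def observe(self, unit: int):
        """Count one decoded instruction that uses logic unit 0, 1 or 2."""
        self.counters[unit] += 1
        self.seen += 1
        if self.seen == self.interval:
            # End of interval: form <R> in percent, look it up, then reset.
            key = tuple(round(100 * c / self.seen) for c in self.counters)
            self.setting = self.lut.get(key, self.setting)
            self.counters = [0, 0, 0]
            self.seen = 0
        return self.setting


# Hypothetical LUT contents with symbolic outputs.
LUT = {(10, 60, 30): ("f1", "<s1>"), (0, 80, 20): ("f2", "<s2>")}
model = PipelineControlModel(LUT, interval_instructions=10)
for unit in [0] + [1] * 6 + [2] * 3:
    setting_a = model.observe(unit)
for unit in [1] * 8 + [2] * 2:
    setting_b = model.observe(unit)
print(setting_a, setting_b)
```

Keeping the previous setting when a rate vector has no LUT entry is a design choice of this sketch; a hardware LUT would more likely map ranges of <R> to a setting so that every input selects some output.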
FIG. 4 illustrates throughput 402 associated with a stage of unbalanced instruction pipeline 100 as a function of clock frequency, f 404, for different example values of the rate vector, <R> 406, according to one or more embodiments. As discussed above, the rate vector, <R> 406 (e.g., <R1>, <R2>), is determined from the compiled instruction code associated therewith. In one or more embodiments, the plot is obtained through a knowledge of stall vector, <s> 410 (e.g., <s1>, <s2>), associated with clock frequency, f 404 (e.g., f1, f2). Increasing clock frequency, f 404, beyond a certain value (e.g., f1, f2) is not required, as throughput 402 may saturate beyond that value. FIG. 4 also shows a table associating <R> 406, f 404, and <s> 410. Clock frequency, f 404, is configurable based on <R> 406. As seen in the discussion associated with FIG. 3, the output of LUT 312 may yield clock frequency, f 404. As a phase-locked loop (PLL) may be used for generation of clock frequency, f 404, the PLL may be programmed to select an appropriate frequency. The PLL is associated with a clock generation circuit (e.g., CLK GEN 3 136, CLK GEN 2 134, CLK GEN 1 132, CLK GEN 0 130) of a stage of unbalanced instruction pipeline 100.
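A throughput-versus-frequency relationship of the kind plotted in FIG. 4 can be approximated with a simple model. The sketch below is an assumption, not the method of the disclosure: it supposes that every instruction uses exactly one of the three stage 3 units, that a unit with delay d occupies the stage for ceil(d × f) cycles at clock frequency f, and that pipeline fill time and hazards can be ignored. The rate vector and the candidate frequencies are hypothetical; the delays are the example values of 15 ns, 2 ns and 5 ns.

```python
DELAY_NS = {"division": 15, "addition": 2, "multiplication": 5}
RATES = {"division": 0.1, "addition": 0.6, "multiplication": 0.3}  # hypothetical <R>
CANDIDATES_MHZ = [50, 100, 200, 250, 400, 500]  # hypothetical PLL settings


def stall_vector(f_mhz: int) -> dict:
    """Whole cycles each unit occupies at f_mhz: ceil(delay_ns * f_mhz / 1000)."""
    return {op: -(-d * f_mhz // 1000) for op, d in DELAY_NS.items()}


def throughput_mips(f_mhz: int) -> float:
    """Modelled throughput: clock rate divided by average cycles per instruction."""
    cycles = stall_vector(f_mhz)
    avg_cycles = sum(RATES[op] * cycles[op] for op in DELAY_NS)
    return f_mhz / avg_cycles


best_mhz = max(CANDIDATES_MHZ, key=throughput_mips)
for f in CANDIDATES_MHZ:
    print(f, stall_vector(f), round(throughput_mips(f), 1))
print("best candidate:", best_mhz, "MHz")
```

With these assumed numbers the 400 MHz candidate yields a higher modelled throughput than 500 MHz, because the divider and multiplier waste less of their final cycle. This illustrates why the highest available clock is not automatically the optimum for a given rate vector.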
FIG. 5 illustrates a computing system 500 including processor 502 in which unbalanced instruction pipeline 100 is implemented, according to one or more embodiments. Computing system 500 may be a personal computer, a laptop, a notebook computer and/or a system utilizing the benefits associated with optimized unbalanced instruction pipeline 100. Computing system 500 may also include a microcontroller with a processor 502. Computing system 500 includes a memory 504 (e.g., program memory 102) configured to store the instruction code associated with processing through unbalanced instruction pipeline 100. Computing system 500 also includes a determination module 506 configured to determine the rate of resource occupancy of a constituent stage of unbalanced instruction pipeline 100 through profiling the instruction code associated with processing through unbalanced instruction pipeline 100. Processor 502 is configured to perform processing associated with unbalanced instruction pipeline 100 at a clock frequency based on an optimum power consumption and/or a throughput associated with unbalanced instruction pipeline 100 for the determined rate of resource occupancy of the constituent stage.
FIG. 6 illustrates a process flow diagram detailing the operations involved in a method of performing optimum data processing through unbalanced instruction pipeline 100, according to one or more embodiments. Operation 602 involves determining a rate of resource occupancy of a constituent stage of unbalanced instruction pipeline 100 implemented in processor 502 through profiling an instruction code associated therewith. Operation 604 then involves performing data processing associated with unbalanced instruction pipeline 100 at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.
FIG. 7 illustrates a process flow diagram detailing the operations involved in a method of performing optimum and dynamic data processing through unbalanced instruction pipeline 100, according to one or more embodiments. Operation 702 involves determining a time interval within a processing time associated with a constituent stage of unbalanced instruction pipeline 100 implemented in processor 502 based on a change in a processing scenario associated with data processing. Operation 704 involves dynamically determining a rate of resource occupancy of the constituent stage periodically with a time period equal to the determined time interval through profiling an instruction code associated therewith. Operation 706 involves periodically obtaining a clock frequency associated with the rate of resource occupancy of the constituent stage. The clock frequency corresponds to optimized power consumption and/or throughput associated with unbalanced instruction pipeline 100. Operation 708 then involves performing the data processing at the periodically obtained clock frequency.
Exemplary embodiments discussed above can be used in high-performance, low-power computing applications. Specifically, the exemplary embodiments may be used in delay-locked loops (DLLs) associated with Global Positioning System (GPS) receivers and in embedded/Digital Signal Processing (DSP) applications requiring large-scale processing. Other applications utilizing the concepts discussed herein are within the scope of the exemplary embodiments.
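The periodic flow of FIG. 7 may be sketched in software as follows. This is a behavioral sketch under stated assumptions, not the disclosed hardware: the class name, the threshold-based look-up table, and the example (frequency, stall cycles) settings are hypothetical. A counter is incremented on each decoded instruction that uses the constituent stage, the rate is looked up at the end of each interval, the setting is reprogrammed only when it changes, and the counter is then reset.

```python
from bisect import bisect_right


class IntervalController:
    """Behavioral model of per-interval occupancy counting and LUT look-up."""

    def __init__(self, thresholds, settings):
        # thresholds: ascending occupancy-rate boundaries (hypothetical values)
        # settings:   one (frequency_mhz, stall_cycles) pair per rate band
        if len(settings) != len(thresholds) + 1:
            raise ValueError("need exactly one setting per rate band")
        self.thresholds = thresholds
        self.settings = settings
        self.count = 0        # uses of the constituent stage in this interval
        self.current = None   # setting currently programmed
        self.updates = 0      # number of times the setting was reprogrammed

    def on_decode(self, uses_stage):
        """Model the control signal from the decoded instruction."""
        if uses_stage:
            self.count += 1

    def end_interval(self, instructions):
        """Look up the setting for this interval, update on change, reset counter."""
        rate = self.count / instructions if instructions else 0.0
        setting = self.settings[bisect_right(self.thresholds, rate)]
        if setting != self.current:
            self.current = setting
            self.updates += 1
        self.count = 0
        return setting
```

For example, with thresholds [0.25, 0.5] and settings [(400, 3), (200, 1), (100, 0)], an interval in which one of ten instructions uses the stage selects (400, 3), and a later interval in which six of ten do selects (100, 0).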
Stage 3 110 of unbalanced instruction pipeline 100 may involve a hard macro (e.g., the data memory discussed above) therein. The divider logic, adder logic, and multiplier logic discussed above are merely for purposes of illustration. Modifications in the constituent elements of stages (e.g., increasing/decreasing the number of constituent elements, varying the constituent elements) of unbalanced instruction pipeline 100 are well within the scope of the exemplary embodiments. In one or more embodiments, it is possible that a constituent stage (e.g., stage 3 110) of unbalanced instruction pipeline 100 may include a single element, which may contribute to the maximum delay associated with unbalanced instruction pipeline 100. Optimization, as discussed above, may then be done based on the aforementioned single element.
Although the present embodiments have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the various embodiments. For example, the various systems, devices, apparatuses, and circuits, etc. described herein may be enabled and operated using hardware circuitry, firmware, software, or any combination of hardware, firmware, or software embodied in a machine-readable medium. The various electrical structures and methods may be embodied using transistors, logic gates, application specific integrated circuit (ASIC) circuitry, or Digital Signal Processor (DSP) circuitry.
In addition, it will be appreciated that the various operations, processes, and methods disclosed herein may be embodied in a machine-readable medium or a machine-accessible medium compatible with a data processing system, and may be performed in any order. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
Claims (16)
1. A method comprising:
determining a rate of resource occupancy of a constituent stage of an unbalanced instruction pipeline implemented in a processor through profiling an instruction code; and
performing data processing associated with the unbalanced instruction pipeline at a maximum throughput and at an optimum clock frequency based on the rate of resource occupancy.
2. The method of claim 1, wherein performing the data processing includes stalling processing associated with at least one of the constituent stage of the unbalanced instruction pipeline and a previous stage, for at least a number of clock cycles corresponding to a delay time associated with the processing through the constituent stage, by gating a clock input to the at least one of the constituent stage and the previous stage.
3. The method of claim 1, further comprising:
determining a time interval within a processing time associated with the constituent stage of the unbalanced instruction pipeline based on a change in a processing scenario associated with processing;
dynamically determining the rate of resource occupancy of the constituent stage periodically with a time period equal to the determined time interval; and
obtaining, at every time interval, the clock frequency associated with the rate of resource occupancy of the constituent stage for performing the data processing associated with the unbalanced instruction pipeline.
4. The method of claim 3, wherein the clock frequency associated with the data processing is higher than a frequency corresponding to the higher delay time associated with the constituent stage.
5. The method of claim 2, further comprising obtaining a number of stall cycles associated with stalling in at least one of the constituent stage of the unbalanced instruction pipeline and a previous stage thereof for at least the number of stall cycles, wherein the number of stall cycles corresponds to a delay time associated with the processing through the constituent stage.
6. The method of claim 5, wherein determining the rate of resource occupancy of the constituent stage of the unbalanced instruction pipeline includes:
inputting a control signal associated with a decoded instruction associated with the processing through the constituent stage to a counter associated therewith;
determining the rate of resource occupancy of the constituent stage through the counter; and
maintaining a Look Up Table (LUT) associated with the counter to map the determined rate of resource occupancy and at least one of the clock frequency and the number of stall cycles associated therewith.
7. The method of claim 6, further comprising:
updating hardware associated with the processing through the constituent stage with the at least one of the clock frequency and the number of stall cycles determined through the LUT when the at least one of the clock frequency and the number of stall cycles varies from a value thereof during a previous time interval; and
resetting the counter at the end of the time interval.
8. The method of claim 6, comprising implementing the LUT through a multiplexer having the rate of resource occupancy as an input and a select line.
9. A method comprising:
determining a time interval within a processing time associated with a constituent stage of an unbalanced instruction pipeline implemented in a processor based on a change in a processing scenario associated with data processing;
dynamically determining a rate of resource occupancy of the constituent stage periodically with a time period equal to the time interval through profiling an instruction code;
periodically obtaining a clock frequency associated with the rate of resource occupancy of the constituent stage, the clock frequency corresponding to an optimized at least one of a power consumption and a throughput associated with the unbalanced instruction pipeline; and
performing the data processing at the periodically obtained clock frequency.
10. The method of claim 9, further comprising obtaining a number of stall cycles associated with stalling processing in at least one of the constituent stage of the unbalanced instruction pipeline and a previous stage thereof for at least the number of stall cycles,
wherein the number of stall cycles corresponds to a delay time associated with the processing through the constituent stage.
11. The method of claim 9, wherein dynamically determining the rate of resource occupancy of the constituent stage includes:
inputting a control signal associated with a decoded instruction associated with the processing through the constituent stage to a counter associated therewith;
determining the rate of resource occupancy of the constituent stage through the counter; and
maintaining a Look Up Table (LUT) associated with the counter to map the determined rate of resource occupancy and at least one of the clock frequency and the number of stall cycles associated therewith.
12. The method of claim 11, further comprising:
updating hardware associated with the processing through the constituent stage with the at least one of the clock frequency and the number of stall cycles determined through the LUT when the at least one of the clock frequency and the number of stall cycles varies from a value thereof during a previous time interval; and
resetting the counter at the end of the time interval.
13. The method of claim 11, comprising implementing the LUT through a multiplexer having the rate of resource occupancy as an input and a select line thereof.
14. A computing system comprising:
a processor having an unbalanced instruction pipeline;
a memory configured to store an instruction code associated with processing through the unbalanced instruction pipeline; and
a determination module configured to determine a rate of resource occupancy of a constituent stage of the unbalanced instruction pipeline through profiling the instruction code associated with processing through the unbalanced instruction pipeline, the processor being configured to perform data processing at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.
15. The computing system of claim 14, further comprising a pipeline control unit configured to control a clock generation circuit associated with the constituent stage of the unbalanced instruction pipeline.
16. The computing system of claim 15, wherein the pipeline control unit further comprises a Look Up Table (LUT) implemented therein configured to map the rate of resource occupancy of the constituent stage determined through the determination module to at least one of the clock frequency and a number of stall cycles,
wherein the number of stall cycles is associated with stalling processing in at least one of the constituent stage of the unbalanced instruction pipeline and a previous stage thereof for at least the number of stall cycles, and
wherein the number of stall cycles corresponds to a delay time associated with the processing through the constituent stage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/860,095 US20160011642A1 (en) | 2010-04-20 | 2015-09-21 | Power and throughput optimization of an unbalanced instruction pipeline |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN1129CH2010 | 2010-04-20 | ||
IN1129/CHE/2010 | 2010-04-20 | ||
US13/089,101 US9141392B2 (en) | 2010-04-20 | 2011-04-18 | Different clock frequencies and stalls for unbalanced pipeline execution logics |
US14/860,095 US20160011642A1 (en) | 2010-04-20 | 2015-09-21 | Power and throughput optimization of an unbalanced instruction pipeline |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/089,101 Division US9141392B2 (en) | 2010-04-20 | 2011-04-18 | Different clock frequencies and stalls for unbalanced pipeline execution logics |
Publications (1)
Publication Number | Publication Date |
---|---|
US20160011642A1 true US20160011642A1 (en) | 2016-01-14 |
Family
ID=44789092
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/089,101 Active 2033-05-01 US9141392B2 (en) | 2010-04-20 | 2011-04-18 | Different clock frequencies and stalls for unbalanced pipeline execution logics |
US14/860,095 Abandoned US20160011642A1 (en) | 2010-04-20 | 2015-09-21 | Power and throughput optimization of an unbalanced instruction pipeline |
Country Status (1)
Country | Link |
---|---|
US (2) | US9141392B2 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114724233A (en) * | 2020-12-21 | 2022-07-08 | 青岛海尔多媒体有限公司 | Method and device for gesture control of terminal equipment and terminal equipment |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3623017A (en) * | 1969-10-22 | 1971-11-23 | Sperry Rand Corp | Dual clocking arrangement for a digital computer |
US4819164A (en) * | 1983-12-12 | 1989-04-04 | Texas Instruments Incorporated | Variable frequency microprocessor clock generator |
US4821229A (en) * | 1985-12-12 | 1989-04-11 | Zenith Electronics Corporation | Dual operating speed switchover arrangement for CPU |
US5179693A (en) * | 1985-03-29 | 1993-01-12 | Fujitsu Limited | System for controlling operation of processor by adjusting duty cycle of performance control pulse based upon target performance value |
US5197126A (en) * | 1988-09-15 | 1993-03-23 | Silicon Graphics, Inc. | Clock switching circuit for asynchronous clocks of graphics generation apparatus |
US5301306A (en) * | 1990-09-19 | 1994-04-05 | U.S. Philips Corporation | Circuit for slowing portion of microprocessor operating cycle in all successive operating cycles regardless of whether a slow device is accessed in the portion of any operating cycle |
US20020104032A1 (en) * | 2001-01-30 | 2002-08-01 | Mazin Khurshid | Method for reducing power consumption using variable frequency clocks |
US7076681B2 (en) * | 2002-07-02 | 2006-07-11 | International Business Machines Corporation | Processor with demand-driven clock throttling power reduction |
US20090019265A1 (en) * | 2007-07-11 | 2009-01-15 | Correale Jr Anthony | Adaptive execution frequency control method for enhanced instruction throughput |
US7809932B1 (en) * | 2004-03-22 | 2010-10-05 | Altera Corporation | Methods and apparatus for adapting pipeline stage latency based on instruction type |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6993669B2 (en) * | 2001-04-18 | 2006-01-31 | Gallitzin Allegheny Llc | Low power clocking systems and methods |
JP2003186567A (en) * | 2001-12-19 | 2003-07-04 | Matsushita Electric Ind Co Ltd | Microprocessor |
US7770034B2 (en) * | 2003-12-16 | 2010-08-03 | Intel Corporation | Performance monitoring based dynamic voltage and frequency scaling |
JP3862715B2 (en) * | 2004-06-01 | 2006-12-27 | 株式会社ソニー・コンピュータエンタテインメント | Task management method, task management device, semiconductor integrated circuit, electronic device, and task management system |
JP2006059068A (en) * | 2004-08-19 | 2006-03-02 | Matsushita Electric Ind Co Ltd | Processor device |
US20060200651A1 (en) * | 2005-03-03 | 2006-09-07 | Collopy Thomas K | Method and apparatus for power reduction utilizing heterogeneously-multi-pipelined processor |
US8219993B2 (en) * | 2008-02-27 | 2012-07-10 | Oracle America, Inc. | Frequency scaling of processing unit based on aggregate thread CPI metric |
CN102119380B (en) * | 2009-06-10 | 2014-04-02 | 松下电器产业株式会社 | Trace processing device and trace processing system |
US8527796B2 (en) * | 2009-08-24 | 2013-09-03 | Intel Corporation | Providing adaptive frequency control for a processor using utilization information |
Non-Patent Citations (3)
Title |
---|
Christine Placek, "Voltage Over-scaling in Unbalanced Pipelines with Adaptive Clocking," April 2009, a Master Degree Thesis, Purdue University, 31 pages. * |
Dropsho et al., "Dynamically Trading Frequency for Complexity in a GALS Microprocessor," Dec. 2004, Proceedings of the 37th International Symposium on Microarchitecture, 12 pages. * |
Swaroop Ghosh & Kaushik Roy, "Exploring high-speed low-power hybrid arithmetic units at scaled supply and adaptive clock-stretching," March 2008, Asia and South Pacific Design Automation Conference, pp. 635-40. * |
Also Published As
Publication number | Publication date |
---|---|
US20110258417A1 (en) | 2011-10-20 |
US9141392B2 (en) | 2015-09-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |