US20160011642A1 - Power and throughput optimization of an unbalanced instruction pipeline - Google Patents

Power and throughput optimization of an unbalanced instruction pipeline

Info

Publication number
US20160011642A1
Authority
US
United States
Prior art keywords
stage
processing
rate
constituent
instruction pipeline
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/860,095
Inventor
Senthilkannan Chandrasekaran
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Texas Instruments Inc
Original Assignee
Texas Instruments Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Texas Instruments Inc filed Critical Texas Instruments Inc
Priority to US14/860,095
Publication of US20160011642A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/324Power saving characterised by the action undertaken by lowering clock frequency
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline, look ahead using instruction pipelines
    • G06F9/3869Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/88Monitoring involving counting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/885Monitoring specific for caches
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A method includes determining a rate of resource occupancy of a constituent stage of an unbalanced instruction pipeline implemented in a processor through profiling an instruction code. The method also includes performing data processing at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.

Description

    FIELD OF TECHNOLOGY
  • This application is a Divisional of prior application Ser. No. 13/089,101, filed Apr. 18, 2011, currently pending, and also claims priority from Indian Provisional Application Serial No. 1129/CHE/2010, filed on Apr. 20, 2010, entitled “POWER AND THROUGHPUT OPTIMIZATION OF AN UNBALANCED INSTRUCTION PIPELINE”, which is incorporated herein by reference in its entirety.
  • Embodiments of the disclosure relate to instruction pipelining in processors.
  • BACKGROUND
  • Instruction pipelining is a technique used in processors (e.g., microprocessors, microcontrollers) to allow for parallel processing of instructions. For example, one instruction is associated with a first stage of an instruction pipeline and another instruction is associated with a second stage of the instruction pipeline. The instruction pipeline allows for “breaking” of the timing associated with a large data path, and provides parallelism in executing the instructions at an increased clock frequency.
  • The instruction pipeline offers optimum performance only when the constituent stages are perfectly balanced. A balanced pipeline implies that processing associated with a constituent stage of the pipeline takes a completion time equal to the completion time associated with all other constituent stage(s) of the instruction pipeline. However, there are scenarios (e.g., hard macro(s) such as memory/memories being in the data path of the pipeline, or Arithmetic Logic Unit (ALU) elements such as multipliers, adders, bit shifters and dividers being in the same constituent stage of the pipeline) where a programmer/user is not able to perfectly balance the instruction pipeline. Here, the maximum frequency at which the unbalanced pipeline can be clocked is determined by the constituent stage offering the maximum delay.
  • Assuming no stalls in an unbalanced instruction pipeline, the maximum frequency at which the unbalanced instruction pipeline is clocked is expressed in example Equation (1) as:
  • fmax = 1/d,    (1)
  • where d is the maximum delay offered by a constituent stage.
  • Assuming the time taken for executing N instructions to be (N + ns) cycles (ns being the number of constituent stages of the unbalanced instruction pipeline), the effective throughput, E, is expressed in example Equation (2) as:
  • E = fmax · N/(N + ns) ≈ fmax,    (2)
  • The throughput E, as seen in Equation (2), is the number of instructions per second. Increased throughput is associated with a higher fmax, which implies a lower maximum delay offered by the constituent stage of the unbalanced instruction pipeline.
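  • For illustration only (not part of the patent text), a minimal Python sketch of Equations (1) and (2); the 15 ns worst-case stage delay and the instruction count below are assumed example values:

```python
def f_max(max_stage_delay_s: float) -> float:
    """Equation (1): maximum clock frequency set by the slowest constituent stage."""
    return 1.0 / max_stage_delay_s

def effective_throughput(f: float, n_instructions: int, n_stages: int) -> float:
    """Equation (2): instructions per second; approaches f when N >> ns."""
    return f * n_instructions / (n_instructions + n_stages)

f = f_max(15e-9)                                 # 15 ns worst-case stage delay -> ~66.7 MHz
print(effective_throughput(f, 10_000, 4) / 1e6)  # ~66.6 M instructions/s, close to fmax
```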
  • The pipeline can be clocked at a frequency higher than that computed based on the maximum delay, and, when use of the timing path involving the maximum delay is detected, the pipeline can be stalled for a number of cycles equivalent to the delay offered by that timing path. This is known as pipeline stalling.
  • With the above approach, the frequency might not be optimal if the timing path involving the maximum delay is used infrequently, as it would lead to unnecessary dynamic power dissipation. Hence, there is a need to arrive at an optimum frequency for a given rate of usage of the timing path involving the maximum delay.
  • SUMMARY
  • In one aspect, a method includes determining a rate of resource occupancy of a constituent stage of an unbalanced instruction pipeline implemented in a processor through profiling an instruction code. The method also includes performing data processing associated with the unbalanced instruction pipeline at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.
  • In another aspect, a method includes determining a time interval within a processing time associated with a constituent stage of an unbalanced instruction pipeline implemented in a processor based on a change in a processing scenario associated with data processing therein. The method also includes dynamically determining a rate of resource occupancy of the constituent stage periodically with a time period equal to the determined time interval through profiling an instruction code associated therewith. Further, the method includes periodically obtaining a clock frequency associated with the rate of resource occupancy of the constituent stage and performing the data processing at the periodically obtained clock frequency. The clock frequency corresponds to an optimized power consumption and/or a throughput associated with the unbalanced instruction pipeline.
  • In yet another aspect, a computing system includes a processor having an unbalanced instruction pipeline implemented therein and a memory configured to store an instruction code associated with processing through the unbalanced instruction pipeline. The computing system also includes a determination module configured to determine a rate of resource occupancy of a constituent stage of the unbalanced instruction pipeline through profiling the instruction code associated with processing through the unbalanced instruction pipeline. The processor is configured to perform data processing at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.
  • Other features will be apparent from the accompanying drawings and from the detailed description that follows.
  • BRIEF DESCRIPTION OF THE VIEWS OF DRAWINGS
  • FIG. 1 is a schematic view of a data path and a control path associated with an unbalanced instruction pipeline, according to one or more embodiments.
  • FIG. 2 is an illustrative view of an example processing scenario associated with the unbalanced instruction pipeline of FIG. 1.
  • FIG. 3 is a schematic view of logic associated with a pipeline control unit configured to dynamically profile an instruction code associated with a constituent stage of the unbalanced instruction pipeline of FIG. 1.
  • FIG. 4 is a plot of throughput associated with a constituent stage of the unbalanced instruction pipeline of FIG. 1 as a function of a clock frequency for different example values of the rate of resource occupancy associated with the constituent stage.
  • FIG. 5 is a schematic view of a computing system including a processor in which the unbalanced instruction pipeline of FIG. 1 is implemented.
  • FIG. 6 is a process flow diagram detailing the operations involved in a method of performing optimum data processing through the unbalanced instruction pipeline of FIG. 1, according to one or more embodiments.
  • FIG. 7 is a process flow diagram detailing the operations involved in a method of performing optimum and dynamic data processing through the unbalanced instruction pipeline of FIG. 1, according to one or more embodiments.
  • Other features of the present embodiments will be apparent from the accompanying drawings and from the detailed description that follows.
  • DETAILED DESCRIPTION
  • Disclosed are a method, an apparatus and/or a system to optimize power and throughput in an unbalanced instruction pipeline implemented in a processor associated therewith. Although the present embodiments have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the various embodiments.
  • FIG. 1 illustrates a data path 162 and a control path 164 associated with an unbalanced instruction pipeline 100, according to one or more embodiments. An instruction code associated with processing through unbalanced instruction pipeline 100 is stored in program memory 102. Program memory 102 may be a Read-Only Memory (ROM). In some cases, a data memory (not shown) in the form of a Random Access Memory (RAM) is used to store intermediate results and variables associated with the processing. Program memory 102 may also be configured to store constants associated with the processing. Instructions stored in program memory 102 are decoded through instruction decoder 104, and matching control signals for the pipelined data path 162 are generated.
  • The aforementioned operations (e.g., instruction decoding) constitute stage 1 106 of unbalanced instruction pipeline 100. In the example embodiment shown in FIG. 1, unbalanced instruction pipeline 100 is shown to include four stages (e.g., stage 1 106, stage 2 108, stage 3 110, stage 4 112). Unbalanced instruction pipeline 100 may include more than four stages or fewer than four stages; the four stages shown in FIG. 1 merely serve as an example. In another example embodiment, stage 1 106 is associated with an instruction fetch operation, stage 2 108 is associated with an instruction decode operation, stage 3 110 is associated with an execute operation, stage 4 112 is associated with a memory access operation, and stage 5 (not shown) is associated with a write back operation.
  • Registers are inserted between stages of unbalanced instruction pipeline 100. Specifically, in one or more embodiments, the output of each stage is an input to a flip-flop (e.g., FF1 114, FF2 116, FF3 118, FF4 120). For example, as shown in FIG. 1, D flip-flops are used for the aforementioned purpose. Each D flip-flop is configured to receive the output of the previous stage (e.g., instruction decoder 104 output, output of D flip-flop (Q)) as the D input thereof. Each flip-flop (e.g., FF1 114, FF2 116, FF3 118, FF4 120) is clocked through a clock generation circuit (e.g., CLK GEN 1 132, CLK GEN 2 134, CLK GEN 3 136, CLK GEN 4 138). Program memory 102 also has a clock generation circuit (e.g., CLK GEN 0 130) associated therewith. In an example embodiment, the clock generation circuit includes a crystal oscillator. The clock generation circuits (e.g., CLK GEN 0 130, CLK GEN 1 132, CLK GEN 2 134, CLK GEN 3 136, CLK GEN 4 138) associated with the individual stages are controlled through pipeline control unit 150.
  • Unbalanced instruction pipeline 100 may include a data path 162 and a control path 164. As shown in FIG. 1, data path 162 may include flip-flops configured to latch onto and propagate data to succeeding stages. Control path 164 may include control elements (e.g., control element 1 142, control element 2 144, control element 3 146, control element 4 148) configured to control data processing through the stages of unbalanced instruction pipeline 100. For example, the control elements may be configured to assert a signal to enable data transfer through data path 162 at an output. Flip-flops may be used as control elements in unbalanced instruction pipeline 100. In one or more embodiments, pipeline control unit 150 is also configured to control clock gating (to be discussed below) and data forwarding through each stage of unbalanced instruction pipeline 100 using the decoded instruction control signals available through the control elements. Further, pipeline control unit 150 is configured to utilize the decoded instruction control signals from each stage of unbalanced instruction pipeline 100 to detect data hazards therein.
  • In the example embodiment of FIG. 1, stage 3 110 includes logic associated therewith. Specifically, FIG. 1 illustrates stage 3 110 as including logic 1 122, logic 2 124, and logic 3 126. Also, a multiplexer (MUX 128) may select one of logic 1 122, logic 2 124 and logic 3 126 based on a control signal. It is noted that there may be more logic units associated with stage 3 110. Logic 1 122, logic 2 124 and/or logic 3 126 may be Arithmetic Logic Unit (ALU) elements (e.g., multiplier, adder, bit shifter, divider). For the sake of convenience in understanding, it is assumed that logic 1 122 is a divider, logic 2 124 is an adder, and logic 3 126 is a multiplier, and that a task completion time associated with logic 1 122 is 15 nanoseconds (15 ns), a task completion time associated with logic 2 124 is 2 ns, and a task completion time associated with logic 3 126 is 5 ns. The task completion times associated with all other stages (e.g., stage 1 106, stage 2 108, stage 4 112) are assumed to be 2 ns.
  • Thus, the maximum delay associated with unbalanced instruction pipeline 100/stage 3 110 will be 15 ns. Further, it is assumed that the probability of logic 1 122 being utilized during processing is lower than the probability associated with the use of logic 2 124 and logic 3 126. In other words, MUX 128 is configured to select logic 2 124 or logic 3 126 more frequently than logic 1 122. If unbalanced instruction pipeline 100 is clocked at a frequency associated with the maximum delay in stage 3 110 (e.g., 15 ns due to logic 1 122), the throughput (see, e.g., Equation (2)) associated with unbalanced instruction pipeline 100 is limited as the clock frequency is limited (e.g., to a maximum of 66.7 MHz), despite the probability of use of logic 1 122 being low.
  • Thus, it is beneficial to clock unbalanced instruction pipeline 100 at a frequency higher than the example 66.7 MHz discussed above. For example, unbalanced instruction pipeline 100 may be clocked at a frequency associated with the smallest delay associated with any of the constituent stages (e.g., stage 1 106, stage 2 108, stage 3 110, stage 4 112). In the example scenario discussed above, the smallest delay associated with the stages is 2 ns. Therefore, unbalanced instruction pipeline 100 may be clocked at a frequency associated with 2 ns (i.e., 500 MHz).
  • Whenever the use of logic 1 122 is required, the execution (or, task completion) associated with stage 3 110 and the previous stages thereof (e.g., stage 2 108, stage 1 106) is stalled for at least a number of clocks corresponding to the delay associated with logic 1 122 (e.g., 15 ns). The minimum number of 2 ns clocks required to cover 15 ns is 8. Thus, the executions associated with logic 1 122, logic 2 124, and logic 3 126 of stage 3 110 are stalled for eight clock cycles, one clock cycle and three clock cycles, respectively. Stalling is accomplished through gating the clock inputs to the flip-flops associated with stage 3 110 (e.g., FF3 118) and the previous stages thereof (e.g., stage 2 108 and FF2 116, stage 1 106 and FF1 114). In one or more embodiments, new instructions are prevented from entering unbalanced instruction pipeline 100 during the stall.
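  • As an illustrative aside (not part of the patent), the cycle counts above follow from dividing each logic delay by the clock period and rounding up; the Python sketch below assumes the example delays and the 2 ns (500 MHz) clock discussed here:

```python
import math

def cycles_needed(logic_delay_ns: float, clock_period_ns: float) -> int:
    """Clock cycles a logic element occupies (ceiling of delay / period)."""
    return math.ceil(logic_delay_ns / clock_period_ns)

CLOCK_PERIOD_NS = 2.0  # 500 MHz clock derived from the fastest 2 ns stages
for name, delay_ns in [("logic 1 (15 ns)", 15.0), ("logic 2 (2 ns)", 2.0), ("logic 3 (5 ns)", 5.0)]:
    print(name, "->", cycles_needed(delay_ns, CLOCK_PERIOD_NS), "cycles")  # 8, 1, 3
```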
  • Clock gating for the purpose of stalling is controlled by pipeline control unit 150 (to be described below). Clock gating is controlled through the control elements (e.g., control element 3 146, control element 2 144, control element 1 142), in association with pipeline control unit 150. At the simplest level, an AND gate (not shown) may be employed for the clock gating. Here, the stall signal(s) associated with the stages (e.g., stage 3 110, stage 2 108, stage 1 106) that are stalled are inverted and input to the AND gate. The clock signals generated from the clock generation circuits (e.g., CLK GEN 3 136, CLK GEN 2 134, CLK GEN 1 132, CLK GEN 0 130) may also be input to the AND gate. Whenever a stall signal is high, the inverted input to the AND gate is low and the clock output of the AND gate is also low, regardless of the state of the clock inputs. Clock gating circuits are known to one skilled in the art, and, therefore, discussion of more examples thereof is skipped for the sake of convenience.
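  • As a trivial illustration of the AND-gate gating described above (a Python truth-table sketch standing in for the hardware, not the patent's circuit):

```python
def gated_clock(clk: int, stall: int) -> int:
    """AND-gate clock gating: the inverted stall signal masks the clock while stalling."""
    return clk & (1 - stall)

for clk in (0, 1):
    for stall in (0, 1):
        # The clock reaches the flip-flops only when stall == 0.
        print(f"clk={clk} stall={stall} -> gated clk={gated_clock(clk, stall)}")
```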
  • In one embodiment, constituent stages of unbalanced instruction pipeline 100 include multi-cycle paths. Stage 3 110, for example, may include a multi-cycle path through logic 1 122. The multi-cycle path may require more than one clock cycle for completion of the task associated therewith. The task initiation is accomplished through a source flip-flop changing a state thereof, following which the result of the execution is transmitted to a destination flip-flop. The timing checks associated with the aforementioned stall process may be part of, for example, a Static Timing Analysis (STA). Also, the multi-cycle path discussed above may be defined during the STA by the programmer/user of a computing system executing tasks associated with unbalanced instruction pipeline 100.
  • If the probability of use of logic 1 122 for processing is high, the number of stalls increases for every instruction associated with the aforementioned processing. Thus, dynamic power consumption is impacted, as the number of clock cycles is proportional to the dynamic power. In addition, unbalanced instruction pipeline 100 has clock buffers and constituent flip-flop(s) that toggle at rising/falling edges of clock pulses. This may contribute to increased dynamic power consumption. Therefore, in the abovementioned example, it is preferable to clock unbalanced instruction pipeline 100 at a frequency lower than 500 MHz.
  • It is possible to determine the rate of resource occupancy associated with an instruction/a constituent stage of unbalanced instruction pipeline 100 through profiling an instruction code associated therewith. In the example described above, an instruction may be associated with division, multiplication or addition. For example, logic 1 122 is associated with division operations, logic 2 124 is associated with multiplication operations, and logic 3 126 is associated with addition operations. The rate of use (i.e., resource occupancy) of logic 1 122, logic 2 124 and logic 3 126 may be expressed in example Equation (3) as:
  • R1 = Ndivision/N, R2 = Nmultiplication/N, R3 = Naddition/N,    (3)
  • where R1, R2 and R3 are the rates of use of logic 1 122, logic 2 124 and logic 3 126 respectively, N is the number of instructions, and Ndivision, Nmultiplication and Naddition are the number of division, multiplication and addition instructions respectively.
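  • A minimal profiling sketch of Equation (3) (illustrative Python, not the patent's mechanism): counting instruction classes in a hypothetical profiled trace and forming the rate vector (R1, R2, R3); the trace format and class names are assumptions:

```python
from collections import Counter

def rate_vector(instruction_trace):
    """Equation (3): per-class usage rates R_i = N_class / N over a profiled trace."""
    counts = Counter(instruction_trace)
    n = len(instruction_trace)
    return tuple(counts.get(op, 0) / n for op in ("div", "mul", "add"))  # (R1, R2, R3)

trace = ["add"] * 60 + ["mul"] * 35 + ["div"] * 5  # hypothetical instruction mix
print(rate_vector(trace))                          # -> (0.05, 0.35, 0.6)
```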
  • As discussed above, R1, R2 and R3 may be obtained through profiling the instruction code associated with processing through stage 3 110 of unbalanced instruction pipeline 100. For example, compiling/executing the instruction code associated therewith yields R1, R2 and R3. Also, the rate of resource occupancy may depend on a system level scenario in which the instruction code is executed. Thus, obtaining the rate of resource occupancy associated with a stage (or, a sub-stage) of unbalanced instruction pipeline 100 may include monitoring utilization of a processor/memory associated therewith. Parameters associated with the aforementioned monitoring also include instruction cache (e.g., instruction cache associated with program memory 102) hits/misses and data cache (e.g., data cache associated with data memory) hits/misses.
  • The instruction cache and the data cache may, respectively, allow for increased speed of an instruction fetch process and a data fetch/store process. In order to monitor these parameters, performance counters (or, registers) are employed in the processing/operating environment associated with processing through unbalanced instruction pipeline 100. The performance counters (or, registers) are configured to keep track of the abovementioned processor/memory utilization and/or a number of instruction/data cache hits/misses. The number of stall cycles associated with a clock frequency (e.g., 500/66.7 MHz) is estimated through the delay (e.g., 2 ns/15 ns) associated with the stage of unbalanced instruction pipeline 100, as discussed above.
  • In certain scenarios, the rate vectors <R> may be constant throughout run-time. For example, the instruction code being executed may be associated with a reliability test of a product, which takes approximately the same parameter values on different days and checks for continued reliability. In such scenarios, an initial profiling of the instruction code may suffice to determine the rate vectors <R>. The clock frequency and the number of stall cycles may then be kept constant for the instruction code. In other scenarios, the rate vectors <R> may not be constant throughout run-time and may be changed dynamically, as will be discussed below.
  • FIG. 2 illustrates an example processing scenario, according to one or more embodiments. It is assumed that there is a processor in which unbalanced instruction pipeline 100 is implemented. The processor is configured to support video processing 202 for the first 10 seconds (s) of an operation, audio processing 204 for the next 20 seconds and, again, video processing 206 for the next 10 seconds. Video processing 206 is analogous to video processing 202. As discussed above, R1 is associated with the rate of use of logic 1 122, R2 is associated with the rate of use of logic 2 124, and R3 is associated with the rate of use of logic 3 126. As video processing (202, 206) involves operations (e.g., mathematical operations) that are different from those of audio processing 204, the rate vector <R1> (e.g., (R1, R2, R3)) associated with video processing (202, 206) is different from the rate vector <R2> (e.g., (R1, R2, R3)) associated with audio processing 204, as shown in FIG. 2.
  • In the example shown in FIG. 2, the minimum occupancy time associated with audio/video processing (202, 204, 206) is 10 seconds. The minimum occupancy time is then sampled at, for example, every 1 s, which is the interval for estimating <R>. The pre-defined intervals for determining <R> are thus chosen based on the rate at which processing scenarios (e.g., audio processing 204, video processing (202, 206)) change for the processor.
  • Thus, <R> (e.g., <R1>, <R2>) is estimated at pre-defined intervals, based on which the clock frequency and the stall cycles are updated in the hardware associated with the processing. As shown in FIG. 2, video processing 202 involves a rate vector <R1> for which clock frequency f1 and the associated stall vector <s1> (e.g., (s1, s2, s3)) are obtained based on maximizing throughput. Here, s1 denotes the number of stall cycles associated with logic 1 122, s2 denotes the number of stall cycles associated with logic 2 124, and s3 denotes the number of stall cycles associated with logic 3 126.
  • At the end of the first 10 seconds, the clock frequency and the stall vector are updated in the hardware to f2 and <s2> (e.g., (s1, s2, s3)) respectively to allow for an optimum (e.g., maximum) throughput during audio processing 204. The clock frequency and the stall vector remain f2 and <s2> for the next 20 seconds, although the associated rate vector <R2> is still monitored for changes in the rates therein. At the end of the 20 seconds, the clock frequency and the stall vector are switched back to f1 and <s1> as audio processing 204 switches to video processing 206. The aforementioned operations, including the calculation of <R>, are performed through pipeline control unit 150 having associated logic.
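  • The interval-driven updates of FIG. 2 could be modelled roughly as below (an illustrative Python sketch; the scenario table, the example frequency/stall values and the hardware-update hook are all assumptions, not the patent's implementation):

```python
# Hypothetical <R> -> (clock frequency, stall vector) table, in the spirit of FIG. 2/FIG. 3.
FREQ_STALL_LUT = {
    "video": (400e6, (6, 1, 2)),  # <R1> -> (f1, <s1>), assumed example values
    "audio": (250e6, (4, 1, 2)),  # <R2> -> (f2, <s2>), assumed example values
}

def run(scenario_per_interval, apply_to_hardware):
    """Re-evaluate the setting every profiling interval; push (f, <s>) only on a change."""
    previous = None
    for scenario in scenario_per_interval:   # one entry per 1 s profiling interval
        setting = FREQ_STALL_LUT[scenario]
        if setting != previous:              # update the hardware only when <R> changes
            apply_to_hardware(*setting)
            previous = setting

run(["video"] * 10 + ["audio"] * 20 + ["video"] * 10,
    lambda f, s: print(f"set f = {f / 1e6:.0f} MHz, stall vector = {s}"))
```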
  • FIG. 3 illustrates logic associated with pipeline control unit 150 configured to dynamically profile the instruction code associated with stage 3 110 of unbalanced instruction pipeline 100, according to one or more embodiments. As shown in FIG. 3 and as discussed above, decoded instruction control signals (e.g., decoded instruction control (stage 3) 302 associated with stage 3 110) are input to pipeline control unit 150 (e.g., to the aforementioned logic associated with pipeline control unit 150). Counter 1 304, counter 2 306 and counter 3 308 are associated with computing a rate vector <R> associated with a processing scenario. A pre-defined interval for profiling an instruction code associated with the processing scenario is chosen analogously to the example discussed in FIG. 2. It is assumed that there are, on average, M instructions in the pre-defined interval.
  • A Look Up Table (LUT) 312 is implemented in the logic associated with pipeline control unit 150 to obtain the clock frequency and stall cycles (or, stall vectors) for different values of the rate vector <R>. LUT 312 may be implemented using a multiplexer having the inputs to LUT 312 (e.g., <R> = <R1>, <R2>) as select lines thereof. The output of LUT 312 is the clock frequency (e.g., f = f1, f2) and/or the stall vector (e.g., <s> = <s1>, <s2>). At the end of every interval, the counters (e.g., counter 1 304, counter 2 306, counter 3 308) are reset through interval counter 310. Interval counter 310 may also be configured to count the pre-defined intervals (e.g., interval period in FIG. 3). Implementations of interval counter 310 are known to one skilled in the art, and, therefore, discussion associated therewith is skipped for the sake of convenience.
  • To summarize, in one or more embodiments, at every interval, the hardware associated with processing through unbalanced instruction pipeline 100 is updated with a new frequency and a stall vector, if applicable, based on a change in the rate vector (e.g., <R2>) when compared to the previous rate vector (e.g., <R1>) associated with the previous interval. Then, the counters (e.g., counter 1 304, counter 2 306, counter 3 308) associated with computing <R> (e.g., <R1>, <R2>) are reset to begin the next averaging.
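  • A rough software model of the FIG. 3 logic (illustrative Python, not the hardware): per-logic counters driven by decoded instruction control signals, an interval count of M instructions that triggers the LUT lookup, and a reset for the next averaging; the class name, the quantization of <R> and the LUT contents are assumptions:

```python
class PipelineControlModel:
    """Counts per-logic use over an interval of M instructions, then looks up (f, <s>)."""

    def __init__(self, m_instructions, lut):
        self.m = m_instructions   # instructions per profiling interval
        self.lut = lut            # maps a quantized <R> to (clock frequency, stall vector)
        self.counts = [0, 0, 0]   # counter 1/2/3 for logic 1/2/3
        self.seen = 0

    def on_decoded_instruction(self, logic_index):
        """Called once per decoded instruction that uses logic 1, 2 or 3 of stage 3."""
        self.counts[logic_index] += 1
        self.seen += 1
        if self.seen == self.m:                                   # interval elapsed
            r = tuple(round(c / self.m, 2) for c in self.counts)  # quantized <R>
            setting = self.lut.get(r)                             # (f, <s>) if tabulated, else None
            self.counts, self.seen = [0, 0, 0], 0                 # reset for the next averaging
            return setting
        return None
```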
  • FIG. 4 illustrates throughput 402 associated with a stage of unbalanced instruction pipeline 100 as a function of clock frequency, f 404, for different example values of the rate vector, <R> 406, according to one or more embodiments. As discussed above, the rate vector, <R> 406 (e.g., <R1>, <R2>), is determined from the compiled instruction code associated therewith. In one or more embodiments, the plot is obtained through a knowledge of the stall vector, <s> 410 (e.g., <s1>, <s2>), associated with clock frequency, f 404 (e.g., f1, f2). Increasing clock frequency, f 404, beyond a certain value (e.g., f1, f2) is not required, as throughput 402 may saturate beyond that value. FIG. 4 also shows a table associating <R> 406, f 404, and <s> 410. Clock frequency, f 404, is configurable based on <R> 406. As seen in the discussion associated with FIG. 3, the output of LUT 312 may yield clock frequency, f 404. When a phase-locked loop (PLL) is used for generation of clock frequency, f 404, the PLL may be programmed to select an appropriate frequency. The PLL may be associated with a clock generation circuit (e.g., CLK GEN 3 136, CLK GEN 2 134, CLK GEN 1 132, CLK GEN 0 130) of a stage of unbalanced instruction pipeline 100.
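  • The FIG. 4 trend can be reproduced roughly as below (an illustrative Python sketch, under the assumption that each logic element of stage 3 occupies ceil(delay / clock period) cycles and that the average cycles per instruction follow the rate vector <R>; the frequencies and rates are example values):

```python
import math

LOGIC_DELAYS_NS = (15.0, 2.0, 5.0)  # logic 1/2/3 delays from the running example

def throughput(f_hz, rates):
    """Instructions per second when stage 3 dominates: f divided by average cycles/instruction."""
    period_ns = 1e9 / f_hz
    cycles = [math.ceil(d / period_ns) for d in LOGIC_DELAYS_NS]  # stall vector <s> at this f
    avg_cycles_per_instr = sum(r * c for r, c in zip(rates, cycles))
    return f_hz / avg_cycles_per_instr

R_RARE_DIVISION = (0.05, 0.35, 0.60)  # assumed <R>: divisions are infrequent
for f_mhz in (66.6, 200.0, 333.0, 500.0):
    print(f"{f_mhz:5.1f} MHz -> {throughput(f_mhz * 1e6, R_RARE_DIVISION) / 1e6:5.1f} M instr/s")
# Throughput rises with f but flattens as the number of stall cycles grows with frequency.
```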
  • FIG. 5 illustrates a computing system 500 including processor 502 in which unbalanced instruction pipeline 100 is implemented, according to one or more embodiments. Computing system 500 may be a personal computer, a laptop, a notebook computer and/or another system utilizing the benefits associated with optimized unbalanced instruction pipeline 100. Computing system 500 may also be a microcontroller having processor 502 therein. Computing system 500 includes a memory 504 (e.g., program memory 102) configured to store the instruction code associated with processing through unbalanced instruction pipeline 100. Computing system 500 also includes a determination module 506 configured to determine the rate of resource occupancy of a constituent stage of unbalanced instruction pipeline 100 through profiling the instruction code associated with processing through unbalanced instruction pipeline 100. Processor 502 is configured to perform processing associated with unbalanced instruction pipeline 100 at a clock frequency based on an optimum power consumption and/or throughput associated with unbalanced instruction pipeline 100 for the determined rate of resource occupancy of the constituent stage.
  • FIG. 6 illustrates a process flow diagram detailing the operations involved in a method of performing optimum data processing through unbalanced instruction pipeline 100, according to one or more embodiments. Operation 602 involves determining a rate of resource occupancy of a constituent stage of unbalanced instruction pipeline 100 implemented in processor 502 through profiling an instruction code associated therewith. Operation 604 then involves performing data processing associated with unbalanced instruction pipeline 100 at a maximum throughput at an optimum clock frequency based on the resource occupancy.
  • FIG. 7 illustrates a process flow diagram detailing the operations involved in a method of performing optimum and dynamic data processing through unbalanced instruction pipeline 100, according to one or more embodiments. Operation 702 involves determining a time interval within a processing time associated with a constituent stage of unbalanced instruction pipeline 100 implemented in processor 502 based on a change in a processing scenario associated with data processing. Operation 704 involves dynamically determining a rate of resource occupancy of the constituent stage periodically with a time period equal to the determined time interval through profiling an instruction code associated therewith. Operation 706 involves periodically obtaining a clock frequency associated with the rate of resource occupancy of the constituent stage. The clock frequency corresponds to optimized power consumption and/or a throughput associated with unbalanced instruction pipeline 100. Operation 708 then involves performing the data processing at the periodically obtained clock frequency.
  • Exemplary embodiments discussed above can be used in high-performance, low power computing applications. Specifically, the exemplary embodiments may be used in delay-locked loops (DLLs) associated with Global Positioning System (GPS) receivers and in embedded/Digital Signal Processing (DSP) applications requiring large-scale processing. Other applications utilizing the concepts discussed herein are within the scope of the exemplary embodiments. Stage 3 110 of unbalanced instruction pipeline 100 may involve a hard macro (e.g., the data memory discussed above) therein. The divider logic, adder logic, and multiplier logic discussed above are merely for purposes of illustration. Modifications in the constituent elements of stages (e.g., increasing/decreasing the number of constituent elements, varying the constituent elements) of unbalanced instruction pipeline 100 are well within the scope of the exemplary embodiments. In one or more embodiments, a constituent stage (e.g., stage 3 110) of unbalanced instruction pipeline 100 may include a single element, which may contribute to the maximum delay associated with unbalanced instruction pipeline 100. Optimization, as discussed above, may then be done based on the aforementioned single element.
  • Although the present embodiments have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the various embodiments. For example, the various systems, devices, apparatuses, circuits, etc. described herein may be enabled and operated using hardware circuitry, firmware, software or any combination of hardware, firmware, or software embodied in a machine readable medium. The various electrical structures and methods may be embodied using transistors, logic gates, Application Specific Integrated Circuit (ASIC) circuitry or Digital Signal Processor (DSP) circuitry.
  • In addition, it will be appreciated that the various operations, processes, and methods disclosed herein may be embodied in a machine-readable medium or a machine accessible medium compatible with a data processing system, and may be performed in any order. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (16)

What is claimed is:
1. A method comprising:
determining a rate of resource occupancy of a constituent stage of an unbalanced instruction pipeline implemented in a processor through profiling an instruction code; and
performing data processing associated with the unbalanced instruction pipeline at a maximum throughput and at an optimum clock frequency based on the rate of resource occupancy.
2. The method of claim 1, wherein performing the data processing includes stalling processing associated with at least one of the constituent stage of the unbalanced instruction pipeline and a previous stage, for at least a number of clock cycles corresponding to a delay time associated with the processing, through the constituent stage by gating a clock input to the at least one of the constituent stage and the previous stage.
3. The method of claim 1, further comprising:
determining a time interval within a processing time associated with the constituent stage of the unbalanced instruction pipeline based on a change in a processing scenario associated with processing;
dynamically determining the rate of resource occupancy of the constituent stage periodically with a time period equal to the determined time interval; and
obtaining, at every time interval, the clock frequency associated with the rate of resource occupancy of the constituent stage for performing the data processing associated with the unbalanced instruction pipeline.
4. The method of claim 3, wherein the clock frequency associated with the data processing is higher than a frequency corresponding to the higher delay time associated with the constituent stage.
5. The method of claim 2, further comprising obtaining a number of stall cycles associated with stalling in at least one of the constituent stage of the unbalanced instruction pipeline and a previous stage thereof for at least the number of stall cycles, wherein the number of stall cycles corresponds to a delay time associated with the processing through the constituent stage.
6. The method of claim 5, wherein determining the rate of resource occupancy of the constituent stage of the unbalanced instruction pipeline includes:
inputting a control signal associated with a decoded instruction associated with the processing through the constituent stage to a counter associated therewith;
determining the rate of resource occupancy of the constituent stage through the counter; and
maintaining a Look Up Table (LUT) associated with the counter to map the determined rate of resource occupancy and at least one of the clock frequency and the number of stall cycles associated therewith.
7. The method of claim 6, further comprising:
updating hardware associated with the processing through the constituent stage with the at least one of the clock frequency and the number of stall cycles determined through the LUT when the at least one of the clock frequency and the number of stall cycles varies from a value thereof during a previous time interval; and
resetting the counter at the end of the time interval.
8. The method of claim 6, comprising implementing the LUT through a multiplexer having the rate of resource occupancy as an input and a select line.
9. A method comprising:
determining a time interval within a processing time associated with a constituent stage of an unbalanced instruction pipeline implemented in a processor based on a change in a processing scenario associated with data processing;
dynamically determining a rate of resource occupancy of the constituent stage periodically with a time period equal to the time interval through profiling an instruction code;
periodically obtaining a clock frequency associated with the rate of resource occupancy of the constituent stage, the clock frequency corresponding to an optimized at least one of a power consumption and a throughput associated with the unbalanced instruction pipeline; and
performing the data processing at the periodically obtained clock frequency.
10. The method of claim 9, further comprising obtaining a number of stall cycles associated with stalling processing in at least one of the constituent stage of the unbalanced instruction pipeline and a previous stage thereof for at least the number of stall cycles,
wherein the number of stall cycles corresponds to a delay time associated with the processing through the constituent stage.
11. The method of claim 9, wherein dynamically determining the rate of resource occupancy of the constituent stage includes:
inputting a control signal associated with a decoded instruction associated with the processing through the constituent stage to a counter associated therewith;
determining the rate of resource occupancy of the constituent stage through the counter; and
maintaining a Look Up Table (LUT) associated with the counter to map the determined rate of resource occupancy and at least one of the clock frequency and the number of stall cycles associated therewith.
12. The method of claim 11, further comprising:
updating hardware associated with the processing through the constituent stage with the at least one of the clock frequency and the number of stall cycles determined through the LUT when the at least one of the clock frequency and the number of stall cycles varies from a value thereof during a previous time interval; and
resetting the counter at the end of the time interval.
13. The method of claim 11, comprising implementing the LUT through a multiplexer having the rate of resource occupancy as an input and a select line thereof.
14. A computing system comprising:
a processor having an unbalanced instruction pipeline;
a memory configured to store an instruction code associated with processing through the unbalanced instruction pipeline; and
a determination module configured to determine a rate of resource occupancy of a constituent stage of the unbalanced instruction pipeline through profiling the instruction code associated with processing through the unbalanced instruction pipeline, the processor being configured to perform data processing at a maximum throughput at an optimum clock frequency based on the rate of resource occupancy.
15. The computing system of claim 14, further comprising a pipeline control unit configured to control a clock generation circuit associated with the constituent stage of the unbalanced instruction pipeline.
16. The computing system of claim 15, wherein the pipeline control unit further comprises a Look Up Table (LUT) implemented therein configured to map the rate of resource occupancy of the constituent stage determined through the determination module to at least one of the clock frequency and a number of stall cycles,
wherein the number of stall cycles is associated with stalling processing in at least one of the constituent stage of the unbalanced instruction pipeline and a previous stage thereof for at least the number of stall cycles, and
wherein the number of stall cycles corresponds to a delay time associated with the processing through the constituent stage.
US14/860,095 2010-04-20 2015-09-21 Power and throughput optimization of an unbalanced instruction pipeline Abandoned US20160011642A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/860,095 US20160011642A1 (en) 2010-04-20 2015-09-21 Power and throughput optimization of an unbalanced instruction pipeline

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
IN1129CH2010 2010-04-20
IN1129/CHE/2010 2010-04-20
US13/089,101 US9141392B2 (en) 2010-04-20 2011-04-18 Different clock frequencies and stalls for unbalanced pipeline execution logics
US14/860,095 US20160011642A1 (en) 2010-04-20 2015-09-21 Power and throughput optimization of an unbalanced instruction pipeline

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US13/089,101 Division US9141392B2 (en) 2010-04-20 2011-04-18 Different clock frequencies and stalls for unbalanced pipeline execution logics

Publications (1)

Publication Number Publication Date
US20160011642A1 true US20160011642A1 (en) 2016-01-14

Family

ID=44789092

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/089,101 Active 2033-05-01 US9141392B2 (en) 2010-04-20 2011-04-18 Different clock frequencies and stalls for unbalanced pipeline execution logics
US14/860,095 Abandoned US20160011642A1 (en) 2010-04-20 2015-09-21 Power and throughput optimization of an unbalanced instruction pipeline

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US13/089,101 Active 2033-05-01 US9141392B2 (en) 2010-04-20 2011-04-18 Different clock frequencies and stalls for unbalanced pipeline execution logics

Country Status (1)

Country Link
US (2) US9141392B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114724233A (en) * 2020-12-21 2022-07-08 青岛海尔多媒体有限公司 Method and device for gesture control of terminal equipment and terminal equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6993669B2 (en) * 2001-04-18 2006-01-31 Gallitzin Allegheny Llc Low power clocking systems and methods
JP2003186567A (en) * 2001-12-19 2003-07-04 Matsushita Electric Ind Co Ltd Microprocessor
US7770034B2 (en) * 2003-12-16 2010-08-03 Intel Corporation Performance monitoring based dynamic voltage and frequency scaling
JP3862715B2 (en) * 2004-06-01 2006-12-27 株式会社ソニー・コンピュータエンタテインメント Task management method, task management device, semiconductor integrated circuit, electronic device, and task management system
JP2006059068A (en) * 2004-08-19 2006-03-02 Matsushita Electric Ind Co Ltd Processor device
US20060200651A1 (en) * 2005-03-03 2006-09-07 Collopy Thomas K Method and apparatus for power reduction utilizing heterogeneously-multi-pipelined processor
US8219993B2 (en) * 2008-02-27 2012-07-10 Oracle America, Inc. Frequency scaling of processing unit based on aggregate thread CPI metric
CN102119380B (en) * 2009-06-10 2014-04-02 松下电器产业株式会社 Trace processing device and trace processing system
US8527796B2 (en) * 2009-08-24 2013-09-03 Intel Corporation Providing adaptive frequency control for a processor using utilization information

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3623017A (en) * 1969-10-22 1971-11-23 Sperry Rand Corp Dual clocking arrangement for a digital computer
US4819164A (en) * 1983-12-12 1989-04-04 Texas Instruments Incorporated Variable frequency microprocessor clock generator
US5179693A (en) * 1985-03-29 1993-01-12 Fujitsu Limited System for controlling operation of processor by adjusting duty cycle of performance control pulse based upon target performance value
US4821229A (en) * 1985-12-12 1989-04-11 Zenith Electronics Corporation Dual operating speed switchover arrangement for CPU
US5197126A (en) * 1988-09-15 1993-03-23 Silicon Graphics, Inc. Clock switching circuit for asynchronous clocks of graphics generation apparatus
US5301306A (en) * 1990-09-19 1994-04-05 U.S. Philips Corporation Circuit for slowing portion of microprocessor operating cycle in all successive operating cycles regardless of whether a slow device is accessed in the portion of any operating cycle
US20020104032A1 (en) * 2001-01-30 2002-08-01 Mazin Khurshid Method for reducing power consumption using variable frequency clocks
US7076681B2 (en) * 2002-07-02 2006-07-11 International Business Machines Corporation Processor with demand-driven clock throttling power reduction
US7809932B1 (en) * 2004-03-22 2010-10-05 Altera Corporation Methods and apparatus for adapting pipeline stage latency based on instruction type
US20090019265A1 (en) * 2007-07-11 2009-01-15 Correale Jr Anthony Adaptive execution frequency control method for enhanced instruction throughput

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Christine Placek, "Voltage Over-scaling in Unbalanced Pipelines with Adaptive Clocking," April 2009, a Master Degree Thesis, Purdue University, 31 pages. *
Dropsho et al., "Dynamically Trading Frequency for Complexity in a GALS Microprocessor," Dec. 2004, Proceedings of the 37th International Symposium on Microarchitecture, 12 pages. *
Swaroop Ghosh & Kaushik Roy, "Exploring high-speed low-power hybrid arithmetic units at scaled supply and adaptive clock-stretching," March 2008, Asia and South Pacific Design Automation Conference, pp. 635-40. *

Also Published As

Publication number Publication date
US20110258417A1 (en) 2011-10-20
US9141392B2 (en) 2015-09-22

Similar Documents

Publication Publication Date Title
US8412974B2 (en) Global synchronization of parallel processors using clock pulse width modulation
US20120079255A1 (en) Indirect branch prediction based on branch target buffer hysteresis
US8281113B2 (en) Processor having ALU with dynamically transparent pipeline stages
Ritpurkar et al. Design and simulation of 32-Bit RISC architecture based on MIPS using VHDL
US20150341032A1 (en) Locally asynchronous logic circuit and method therefor
US20160011642A1 (en) Power and throughput optimization of an unbalanced instruction pipeline
US7065636B2 (en) Hardware loops and pipeline system using advanced generation of loop parameters
US7134036B1 (en) Processor core clock generation circuits
US6920547B2 (en) Register adjustment based on adjustment values determined at multiple stages within a pipeline of a processor
Tina et al. Performance improvement of MIPS Architecture by Adding New features
US20130013902A1 (en) Dynamically reconfigurable processor and method of operating the same
US20140317433A1 (en) Clock control circuit and method
Rohit et al. Implementation of 32-bit RISC processors without interlocked Pipelining on Artix-7 FPGA board
US20130262910A1 (en) Time keeping in unknown and unstable clock architecture
Trivedi et al. Reduced-hardware Hybrid Branch Predictor Design, Simulation & Analysis
Yao et al. A dynamic control mechanism for pipeline stage unification by identifying program phases
TWI477941B (en) Dynamic modulation processing device and its processing method
CN109564459B (en) Method and apparatus for automatic adaptive voltage control
Albeck et al. Energy efficient computing by multi-mode addition
Kumar et al. Performance Analysis of Different types of Adders for High Speed 32 bit Multiply and Accumulate Unit
Sato et al. A Crossbar Switch Circuit Design for Reconfigurable Wave-Pipelined Circuits
US10579414B2 (en) Misprediction-triggered local history-based branch prediction
Selvi et al. Performance analysis of multiplier using various techniques
Yeh et al. Adaptive Pipeline Voltage Scaling in High Performance Microprocessor
Lozano et al. A deeply embedded processor for smart devices

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION