HSC welcomes all external visitors to this site, especially students and members of the academic community. Please use the comments box at the bottom of each page to record any comments or suggestions for improvement.
Need for fixed point representation in DSP applications
Most of the commercially available DSPs doesnt support floating point arithmetic due to cost of extra silicon
imposed on the hardware. For implementing algorithms that need floating point operations in DSPs, one needs to
emulate the floating point operations in software. The software emulated floating point operations are very
costly in terms of cycle usages. This results into degradation of the performance of the algorithm.
Fixed point implementation of algorithms gives a significant improvement in execution speed due to inherent
support for integer mathematics in DSPs. Since, fixed point numbers have reduced range and accuracy, the fixed
point implementation of algorithm may suffer in terms of accuracy. Extra care has to be taken for overflow and
saturation.
Binary representation of fixed point
An N-bit binary number can represent 2N states. These states can be mapped to represent virtually any 2N
entities. In case of fixed point numbers, these states represent rational numbers. A rational number is in
form of a/b. Here a and b are integers. For fixed point representation, b is of form 2n. n decides the position
of binary point. The virtual binary point is not represented.
Concepts
There are many representation formats for representing fixed point numbers. This document usages the format as
prescribed by Texas Instrument.
Format
A fixed point binary number can be represented in form of Qm.n.
- Q: Texas Instruments representation for signed fixed-point numbers.
- m (optional; default=0): The number of bits used to designate the two's complement integer portion of the
number, exclusive of the sign bit.
- n: The number of bits used to designate the two's complement fractional portion of the number, i.e. the
number of bits to the right of the binary point.
So, Q7.8 means that there are 7 bits for representing integer portion, 8 bits for fractional portion and 1 bit
is reserved for sign. Similarly Q15 means there are 0 bits for integer portion, 15-bits for fraction and 1 sign
bit.
Range
The difference of largest and the smallest real number that can be represented using fixed point format is known
as the range of the fixed point format.
For Qm.n format, Range = [-2^m , 2^m - 2^{-n}]
Example:
Q7.8 Range = [ - 2^7, 2^7 - 2^{-8} ] = [-128 , 127.99609]
Q15 Range = [ -2^0 , 2^0 - 2^{-15} ] = [-1, 0.9999695]
Resolution
The difference between two consecutive real numbers in the fixed point format is known as its resolution. The
resolution of fixed point format defines the precision with which numbers can be stored in given Q format.
For Qm.n format, Resolution = 2^-n^
Example:
| Q7.8 | Resolution = 2^{-8} = 0.0039063 |
| Q15 | Resolution = 2^{-15} = 0.0000305 |
Word Length
Word length defines the storage requirement for a fixed point number format. For Qm.n format, Word Length
= m+n+1 bits.
Conversion
Fixed to Float
To convert a number from Qm.n format to floating point:
- Convert the number directly to floating point
- Divide by 2^n
Float to Fixed
To convert a number from floating point to Qm.n format:
- Multiply the floating point number by 2^n
- Round to the nearest integer
Basic Arithmetic
Following sections explain the basic arithmetic over fixed point numbers. It shall be noted that fixed point
numbers are a ratio of two integers, the numerator is kept in storage, the denominator is equal to 2^n.
Addition
\frac{num1}{d} + \frac{num2}{d} = \frac{num1 + num2}{d}
For addition, the two numbers must be in same Qm.n format.
So, adding the two fixed point number is simple integer addition. The result is in same Qm.n format.
unsigned short SATURATE_FLAG = 1; //to be read by functions
unsigned short OVERFLOW_FLAG = 0; //to be written by
functions
signed short Fx_Add(signed short x, signed short y)
{
signed long z;
OVERFLOW_FLAG = 0;
z = (signed long)x + y;
/* check if no overflow happens */
if(z = -32768)
return(x + y);
OVERFLOW_FLAG = 1;
if(SATURATE_FLAG == 1)
if(z
return(0x8000);
else
return(0x7FFF);
else
return(x + y);
}
Subtraction
\frac{num1}{d} - \frac{num2}{d} = \frac{num1 - num2}{d}
For subtraction, the two numbers must be in same Qm.n format. So, subtracting the two fixed point number is simple integer subtraction. The result is in same Qm.n format.
unsigned short SATURATE_FLAG = 1; //to be read by functions
unsigned short OVERFLOW_FLAG = 0; //to be written by
functions
signed short Fx_Sub(signed short x, signed short y)
{
signed long z;
OVERFLOW_FLAG = 0;
z = (signed long)x - y;
/* check if no overflow happens */
if(z = -32768)
return(x - y);
OVERFLOW_FLAG = 1;
if(SATURATE_FLAG == 1)
if(z
return(0x8000);
else
return(0x7FFF);
else
return(x - y);
}
Multiplication
\frac{a}{2^{n1}} * \frac{b}{2^{n2}} = \frac{a.b}{2^{n1+n2}}
Two fixed point number in Qm1.n1 and Qm2.n2 format can be multiplied. Simple * operator will result into Qm.n
number. Here n = n1 + n2 . Since the * operation result into two bits being reserved for sign. As a
convention, the result is shifted left by 1 to have a single sign bit. So, the result is in Qm.n format:
\begin{eqnarray}
m = m1 + m2 \\
n = n1 + n2 + 1
\end{eqnarray}
signed short Fx_Mult(signed short x, signed short y)
{
signed long z;
z = x * y;
z
}
Example
Below is example of fixed point C code for implementing sine function. The algorithm used here is CORDIC.
CORDIC: (COordinate Rotation DIgital Calculation) finds the sine or cosine of an angle iteratively, using only
simple math operations such as add, subtract, compare, shift, and table lookup.
Figure 1: CORDIC
The diagonal blue line is angle \theta above the horizontal. The diagonal red line is the blue line rotated
counter-clockwise by angle \theta_1 . The new X and Y values are related to the old X and Y values as follows:
\begin{eqnarray}
X2 = X1 * \cos(\theta) - Y1 * \sin(\theta) \\
Y2 = X1 * \sin(\theta) + Y1 * \cos(\theta)
\end{eqnarray}
For CORDIC, the final angle \theta_2 is the angle of interest; the angle whose sine or cosine we want to
calculate. The initial angle \theta_1 is set to a convenient value such as 0. Rather than rotating from
\theta_1 to \theta_2 in one fell swoop, we move in steps. With careful choice of step values, the only
math used is shifts and adds.
The equations above can be re-written as:
\begin{eqnarray}
X2 = \cos(\theta) * (X1 - Y1 \tan(\theta) )\\
Y2 = \cos(\theta) * ( X1 \tan(\theta) + Y1 )
\end{eqnarray}
Values for \theta are chosen such that \tan(\theta) is a fractional power of 2:
| \tan(\theta21) = 1/1 | \theta21 = 45 | \cos(\theta21) = 0.707107 |
| \tan(\theta32) = 1/2 | \theta32 = 26.5650 | \cos(\theta32) = 0.894427 |
| \tan(\theta43) = 1/4 | \theta43 = 14.0362 | \cos(\theta43) = 0.970142 |
| \tan(\theta54) = 1/8 | \theta54 = 7.12502 | \cos(\theta54) = 0.992278 |
| \tan(\theta65) = 1/16 | \theta65 = 3.57633 | \cos(\theta65) = 0.998053 |
| \tan(\theta76) = 1/32 | \theta76 = 1.78991 | \cos(\theta76) = 0.999512 |
| \tan(\theta87) = 1/64 | \theta87 = 0.895174 | \cos(\theta87) = 0.999878 |
| \tan(\theta98) = 1/128 | \theta98 = 0.447614 | \cos(\theta98) = 0.999969 |
This lets us replace the multiplication by \tan(\theta)4 with a simple, fast right-shift operation.
The cosine factors fall out, to form an iterative product, i.e.
\cos(\theta21) * \cos(\theta32) * \cos(\theta43) ... * \cos(\theta nn)
Expressing the \theta values in terms of inverse tangents gives the equivalent product series
\prod_{N=0}^{\infty} \frac{2^N}{\sqrt{1 + 2^N}} = 0.607253
The value, to which it converges, 0.607253, is the aggregate constant.
\cos(\theta) terms are ignored and multiplied in the beginning or end.
C Code
signed short Fx_Sine(signed short z)
{
signed short at14? = {0x2000, 0x12e4, 0x09fb, 0x0511,
0x028b,
0x0146, 0x00a3, 0x0051, 0x0029, 0x0014,
0x000a, 0x0005, 0x0003, 0x0001};
signed long k=0x4dba0000;
signed long x, y, x1, y1;
signed short i, z1;
x = k; //holds the cos(z) value
y = 0;//holds the sine(z) value
/* compute only in 1st and 4th quadrant. */
/* Subtracting PI/2 (0x4000) from positive angle values and
Adding
* PI/2 from negative angle values will result into the new
angle
* between PI/2 to PI/2. The sine and cosine terms has to be
* accordingly changed as per trigonometric identities.
*/
if( z >= 0)
z -=0x4000;
else
{
z +=0x4000;
x = -x;
}
for(i = 0; i
{
x1 = x >> i;
y1 = y >> i;
z1 = ati?;
if (z >= 0)
{
x = x - y1;
y = y + x1;
z = z - z1;
}
else
{
x = x + y1;
y = y - x1;
z = z + z1;
}
}
return(x>>16);
}
Performance
The performance of Fixed Point implementation of an algorithm is measured by the SQNR (Signal to Quantization Noise Ratio). SQNR is defined as ratio of energy of actual signal and energy of quantization noise.
SQNR = \frac{\sum S^2}{\sum E^2}
The SQNR for the above Fixed Point C code is 77.464756 dB.
Figure 2: Plot of sine wave
References
[1 ] Eric L. Oberstar, Fixed-Point Representation and Fractional Math, Oberstar Consulting.
[2 ] Randy Yates, Fixed-Point Arithmetic: An Introduction, Digital Signal Labs
[3 ] http://my.execpc.com/~geezer/embed/cordic.htm
[4 ] Wikipedia:Q_
Comments