Benchmarking of mathematical co-processors
Benchmarking of common functions on common processor
While working on speed optimizations of fitspng, I encountered that the power function pow() slows extremely full run. (pow() is used in conversion from CIE XYZ to CIE Luv).
My inspection of glibc sources revealed that numerical functions provided by GNU glibc just only wraps implementation of ones by mathematical co-processor. No own rational approximation is implemented. The hardware support is directly assured by C99 IEEE standard.
Following the track, I created a small test code to test duration of execution various computing functions in GNU system math library. One is practically tests implementation of the functions in mathematical co-processors.
Fortran
program funs integer, parameter :: double = selected_real_kind(15) integer, parameter :: nsteps = 10000 integer :: i,j real(double) :: x,y, s, xsteps = nsteps s = 0.0 do j = 1,nsteps do i = 1,nsteps x = i/xsteps y = x*(1.0_double/3.0_double) s = s + y end do end do write(*,*) s end program funs
C
#include <stdio.h> #include <math.h> int main() { const int nsteps = 10000; int i,j; double xsteps = (double) nsteps; double x,y,s; s = 0.0; for(j = 1; j <= nsteps; j++) for(i = 1; i <= nsteps; i++) { x = (double) i /xsteps; /*!*/ y = pow(x,1.0/3.0); s += y; } return s > 0.0 ? 1 : 0; }
(note that no-effect summations and return values are added to prevent compiler’s over-optimizations)
The function in line /*!*/ have been changed.
Computations has been done on Intel(R) Core Quad and verified on Intel(R) Xeon machines (cca 30% faster). The gcc and gfortran compilers gives the same results, gfortran may give a litte bit faster code).
Results are summarized in following table.
+-----------------+-------------------------------------------+ | Core Quad | +-----------------+--------+--------+-------+---------+-------+ | | [nsec] | [nsec] | ratio | fortran | i686 | +-----------------+--------+--------+-------+---------+-------+ | | -O | -O4 | -O4 | +=================+========+========+=======+=========+=======+ | pow(x,1.0/3.0) |128.56 |123.01 | 7.2 | 123.06 |287.68 | | exp(1/3*log(x)) |120.21 |116.14 | 6.8 | 117.14 |269.20 | | cbtr(x) | 90.15 | 85.79 | 5.0 | q | q | | sinh(x) | 95.15 | 91.80 | 5.4 | 91.99 |235.34 | | log10(x) | 95.74 | 91.14 | 5.4 | 91.73 | 95.35 | | atan2(x,2.0) | 83.94 | 79.22 | 4.6 | 79.09 |106.00 | | log(x) | 79.30 | 74.98 | 4.4 | 74.88 | 95.22 | | cosh(x) | 78.13 | 74.34 | 4.4 | 74.61 |198.99 | | tan(x) | 78.89 | 73.29 | 4.3 | 73.50 |118.17 | | atan(x) | 64.27 | 58.02 | 3.4 | 57.99 |101.92 | | exp(x) | 58.97 | 54.58 | 3.2 | 54.55 |143.93 | | asin(x) | 58.52 | 53.87 | 3.2 | 53.44 |165.31 | | acos(x) | 58.70 | 55.30 | 3.2 | 54.94 |165.32 | | cos(x) | 51.63 | 45.57 | 2.7 | 45.64 | 93.98 | | sin(x) | 50.03 | 44.29 | 2.6 | 44.77 | 92.30 | | sqrt(x) | 47.68 | 37.23 | 2.2 | 39.35 | 34.11 | | fabs(x) | 20.93 | 16.14 | 1.0 | 16.11 | 17.25 | | (1.0/3.0)*x | 23.02 | 17.16 | 1.0 | 17.22 | 17.87 | +-------------------------------------------------------------+
(i686 = Intel(R) Pentium(R) 4 CPU 2.66GHz, 0.376 nsec) (Core Quad = Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 0.4167 ns )
As expected, the optimization flag -O4 just only marginally speed-up the computation, because ones are actually passed to be computed by the co-processor.
From the table, It is directly visible that pow(x,a) is computed as exp(a*log(x)) because 5.5+7.5 is approximately 13.0. Moreover, the results perhaps shows that computing of pow() by direct use of the identity exp(..log) gives a little bit faster computation. Also log10 is computed perhaps as log(x)/log(10) because 1.7+7.5 is 9.5.
Computation times divides functions onto groups by duration:
- fast: arithmetical operations, fabs
- medium: sin,cos, tan, atan, cbtr, sqrt, asin, acos, sinh, cosh, log(10), exp, atan2
- slow: pow
By the way, I changed computation of x^(1/3) from pow(x,1.0/3.0) to cbtr(x) which speeds-up my code just only about few percents. Of course, the slower computation of a general power is not solved by the way.
Finally, computation times looks horribly but I think that a method which computes a function with precision of 15 decimals just only 7 times slower (!) than basic arithmetical operations is a miracle. The real world miracle.:)