On a whim, I compared the performance of Pthreads and OpenMP: a simple double-precision dot product using cblas, written in C. The test code is listed at the end of this post. I first ran it on my netbook with an AMD C-50 APU, compiled with gcc 4.7.0:
AMD: C-50 ( 2 cores )
gcc: 4.7.0 ( gmp-5.0.5 )
real:
serial  1.079 s
OpenMP  0.830 s ( 2 threads )
pthread 1.225 s ( 2 threads )
On a netbook like this, does the thread-launch overhead outweigh the actual work? The pthread version is oddly slow. I even gave it its own global work arrays, but that did not help much. pthread_join seems to eat a fair share of the time: dropping it makes the pthread version about as fast as the others, but then the answer comes out wrong, presumably because tmp_global is read before both threads have finished. For a dot product, memory access is probably the bigger bottleneck anyway; two arrays of 165535 doubles come to roughly 2.6 MB, well beyond this chip's caches.
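To put a number on the launch-overhead theory, the cost of a bare pthread_create/pthread_join pair around a no-op worker can be timed in isolation. A minimal sketch, separate from the test code below (older glibc may also need -lrt for clock_gettime):

#include <stdio.h>
#include <time.h>
#include <pthread.h>

/* no-op worker: anything measured is pure create/join overhead */
static void *noop( void *arg ){ return arg; }

int main(){
    struct timespec t0, t1;
    pthread_t th;
    int i;
    clock_gettime( CLOCK_MONOTONIC, &t0 );
    for( i = 0; i < 1000; i++ ){
        pthread_create( &th, NULL, noop, NULL );
        pthread_join( th, NULL );
    }
    clock_gettime( CLOCK_MONOTONIC, &t1 );
    printf( "create+join: %.3f us per pair\n",
            ( (t1.tv_sec - t0.tv_sec)*1e9 + (t1.tv_nsec - t0.tv_nsec) )
            / 1000.0 / 1000.0 );
    return 0;
}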
For comparison, I also ran it on my lab's cluster with a Sandy Bridge generation Xeon:
serial  0.19 s
OpenMP  0.001 s ( 2 threads )
pthread 0.001 s ( 2 threads )
These runs clearly need more careful measurement ( 0.001 s is at the resolution limit of time's real column ), but at least both parallel versions got faster, as expected. Since Pthreads makes you write considerably more code for the same result, I'm going with OpenMP.
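For numbers finer than time's output, each variant could be timed inside the program instead, e.g. with omp_get_wtime(), since omp.h is already included. A sketch of how the loop in main ( in the listing below ) might be wrapped; it reuses the listing's x, y and loop counter:

double t0, t1;
t0 = omp_get_wtime();
for( i = 0; i < 1000; i++ ){
    omp_dot( x, y );               /* or serial_dot / pthread_dot */
}
t1 = omp_get_wtime();
printf( "elapsed: %f s ( %f s per call )\n", t1 - t0, ( t1 - t0 ) / 1000.0 );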
The test code follows ( serial, pthread and OpenMP performance comparison via a cblas dot product ):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cblas.h>
#include <pthread.h>
#include <omp.h>

#define N 165535
//#define N 8

/* global copies used by the pthread version */
double *x_global;
double *y_global;
double tmp_global;
/* protects tmp_global against concurrent updates from the worker threads */
pthread_mutex_t tmp_lock = PTHREAD_MUTEX_INITIALIZER;
//--------------------------------------------------------
void init( double *x, double *y ){
//--------------------------------------------------------
    size_t i;
    for( i = 0; i < N; i++ ){
        x[i] = i;
        y[i] = i;
    }
}
//--------------------------------------------------------
void serial_dot( double *x, double *y ){
//--------------------------------------------------------
    /* compute once and reuse the result for the optional check */
    double r = cblas_ddot( N, x, 1, y, 1 );
#ifdef SERIAL
    printf( " %lf\n", r );
#endif
    (void)r;  /* result intentionally unused unless SERIAL is defined */
}
//--------------------------------------------------------
void omp_dot( double *x, double *y ){
//--------------------------------------------------------
    size_t i;
    const size_t half = N / 2;
    double tmp = 0.0;
    omp_set_num_threads( 2 );   /* two threads, matching the pthread version */
    /* reduction(+:tmp) avoids a data race on the shared partial sum */
    #pragma omp parallel for reduction(+:tmp)
    for( i = 0; i < 2; i++ ){
        /* the second chunk takes the leftover element when N is odd */
        size_t len = ( i == 0 ) ? half : N - half;
        tmp += cblas_ddot( len, &x[i*half], 1, &y[i*half], 1 );
    }
#ifdef OMP
    printf( "%lf\n", tmp );
#endif
}
//--------------------------------------------------------
void *pddot( void *arg ){
//--------------------------------------------------------
    size_t i = (size_t)arg;   /* thread index passed by value through the void* */
    size_t half = N / 2;
    size_t len = ( i == 0 ) ? half : N - half;
    double part;
#ifdef PTHREAD
    printf( "\tthread[%zu]\n", i );
#endif
    part = cblas_ddot( len, &x_global[i*half], 1, &y_global[i*half], 1 );
    pthread_mutex_lock( &tmp_lock );   /* serialize updates to the shared sum */
    tmp_global += part;
    pthread_mutex_unlock( &tmp_lock );
    return NULL;
}
//--------------------------------------------------------
void pthread_dot( pthread_t threads[2], double *x, double *y ){
//--------------------------------------------------------
    size_t i;
    (void)x; (void)y;   /* this version reads the global copies instead */
    tmp_global = 0;
    for( i = 0; i < 2; i++ ){
        pthread_create( &threads[i], NULL, pddot, (void*)i );
    }
    for( i = 0; i < 2; i++ ){
        /* join makes the sum safe to read; it also costs time on slow cores */
        pthread_join( threads[i], NULL );
    }
#ifdef PTHREAD
    printf( "%lf\n", tmp_global );
#endif
}
//--------------------------------------------------------
int main(){
//--------------------------------------------------------
    int i;
    pthread_t threads[2];
    double *x, *y;
    x = (double*)malloc( sizeof(double)*N );
    y = (double*)malloc( sizeof(double)*N );
    x_global = (double*)malloc( sizeof(double)*N );
    y_global = (double*)malloc( sizeof(double)*N );
    init( x, y );
    memcpy( x_global, x, sizeof(double)*N );
    memcpy( y_global, y, sizeof(double)*N );
    /* uncomment exactly one variant to time it */
    for( i = 0; i < 1000; i++ ){
        //serial_dot( x, y );
        //omp_dot( x, y );
        pthread_dot( threads, x, y );
    }
    free( x );
    free( y );
    free( x_global );
    free( y_global );
    return 0;
}
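For reference, the listing should build along these lines, assuming it is saved as dot.c; the BLAS link flag depends on the installed implementation ( -lcblas here, but -lblas, ATLAS, or -lopenblas are equally common ):

gcc -O2 -fopenmp dot.c -o dot -lcblas -lpthread

The optional -DSERIAL, -DOMP and -DPTHREAD flags only enable the validation printf of the corresponding variant.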