There hasn't been much bright news about AMD lately.
I'm rather fond of AMD because their parts tend to be cheap, but PC makers aren't building attractive products around them. Even though AMD has better GPU technology than Intel, no maker sells a laptop with a full-HD display built on it.
I wish someone would release a cheap model around 15.6 inches.
Right now I'm half tempted to buy a cheap Lenovo, but the 1366x768 screen is a shame. Couldn't some maker switch to AMD parts and make it even cheaper? Come to think of it, there was once a rumor like that about Apple... though whoever did it would be tying their fate to AMD's.
2012/11/19
2012/11/15
A CUDA pitfall? Don't allocate host-side memory statically!
Compiling and running the code below with
nvcc -D DYNAMIC -arch=sm_20 vect_add.cu
nvcc -D STATIC -arch=sm_20 vect_add.cu
reveals something interesting.
It seems that CUDA wants the CPU-side memory to be allocated dynamically. If it isn't, the copy from device back to host fails with error code 11 and the program stops.
Scientific codes tend to allocate their arrays statically, so having to rewrite them one by one is a pain.
GB
Here is a simple vector-addition CUDA code. The only difference between the two builds is the memory allocation: one is static and one is dynamic. Try compiling the code with the following options.
nvcc -D DYNAMIC -arch=sm_20 vect_add.cu
nvcc -D STATIC -arch=sm_20 vect_add.cu
For the static version, the memory copy from device to host should fail with error code 11. I'm not sure why this happens. It feels a bit inconvenient, because most scientific codes allocate memory statically.
================================================
vect_add.cu
================================================
#include<iostream>
#include<cuda.h>
#define N 4
//--------------------------------------------
__global__ void device_hello(
//--------------------------------------------
    const double *a,
    const double *b,
    double *c
){
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    c[i] = a[i] + b[i];
}
//--------------------------------------------
int main(){
//--------------------------------------------
    size_t i;
    size_t n = N;
    double *a, *b, *c;
#ifdef DYNAMIC
    // Device buffers as plain pointers; cudaMalloc fills them in below.
    double *a_cuda, *b_cuda, *c_cuda;
#endif
#ifdef STATIC
    // Device buffers as fixed-size arrays (these actually live in host memory).
    double a_cuda[N], b_cuda[N], c_cuda[N];
#endif
    cudaError_t stat;
#ifdef DYNAMIC
    // Host buffers on the heap; note that the STATIC build leaves a, b and c unallocated.
    a = (double*)malloc(sizeof(double)*n);
    b = (double*)malloc(sizeof(double)*n);
    c = (double*)malloc(sizeof(double)*n);
#endif
    for( i=0; i<n; i++){
        a[i] = 1.0;
        b[i] = i;
        c[i] = 0.0;
    }
    // Allocate device memory and copy the input vectors over.
    cudaMalloc( (void**)&a_cuda, sizeof(double)*n );
    cudaMalloc( (void**)&b_cuda, sizeof(double)*n );
    cudaMalloc( (void**)&c_cuda, sizeof(double)*n );
    cudaMemcpy( &a_cuda[0], a, sizeof(double)*n, cudaMemcpyHostToDevice );
    cudaMemcpy( &b_cuda[0], b, sizeof(double)*n, cudaMemcpyHostToDevice );
    cudaMemcpy( &c_cuda[0], c, sizeof(double)*n, cudaMemcpyHostToDevice );
    // Launch n blocks of one thread each.
    device_hello<<<n,1>>>( a_cuda, b_cuda, c_cuda );
    // Copy the result back; this is the copy that returns error code 11 in the STATIC build.
    stat = cudaMemcpy( &c[0], c_cuda, sizeof(double)*n, cudaMemcpyDeviceToHost );
    std::cout << "stat " << stat << std::endl;
    //for( i=0; i<n; i++){
    //    std::cout << c[i] << std::endl;
    //}
    cudaFree( a_cuda );
    cudaFree( b_cuda );
    cudaFree( c_cuda );
#ifdef DYNAMIC
    free( a );
    free( b );
    free( c );
#endif
    return 0;
}
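For what it's worth, my guess (not checked against the CUDA documentation) is that the failure has less to do with how the host buffers a, b and c are allocated and more with what the STATIC build hands to the CUDA calls: there a_cuda, b_cuda and c_cuda are ordinary arrays sitting in host memory, so cudaMalloc cannot turn them into device allocations (it merely writes a device address into their first bytes), and the final cudaMemcpy then receives a host address where it expects a device pointer. Error code 11 is cudaErrorInvalidValue, which fits that reading. If that is right, statically allocated host arrays are fine; only the device-side buffers have to be pointers filled in by cudaMalloc. Below is a minimal sketch of that pattern; the file name, kernel name and values are made up for illustration and are not part of the post.
================================================
static_host_example.cu (hypothetical)
================================================
#include<iostream>
#include<cuda.h>
#define N 4
//--------------------------------------------
__global__ void add_one( double *c ){
//--------------------------------------------
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    c[i] += 1.0;
}
//--------------------------------------------
int main(){
//--------------------------------------------
    // Host buffer with static storage duration: no malloc anywhere on the host side.
    static double c_host[N] = { 0.0, 1.0, 2.0, 3.0 };
    // Device buffer must be a pointer so that cudaMalloc can store a device address in it.
    double *c_dev = NULL;
    cudaError_t stat;

    cudaMalloc( (void**)&c_dev, sizeof(double)*N );
    cudaMemcpy( c_dev, c_host, sizeof(double)*N, cudaMemcpyHostToDevice );
    add_one<<<N,1>>>( c_dev );
    stat = cudaMemcpy( c_host, c_dev, sizeof(double)*N, cudaMemcpyDeviceToHost );
    std::cout << "stat " << stat << std::endl;   // expected: 0 (cudaSuccess)
    for( size_t i=0; i<N; i++ ){
        std::cout << c_host[i] << std::endl;     // expected: 1 2 3 4
    }
    cudaFree( c_dev );
    return 0;
}
Compiled the same way (nvcc -arch=sm_20 static_host_example.cu), this should print stat 0 followed by 1 2 3 4, even though nothing on the host side is allocated with malloc.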
2012/11/13
clang performance
It does quite well on the Himeno benchmark. In double precision it is about the same as GCC?
http://openbenchmarking.org/result/1203235-SU-GCC47LLVM92
2012/11/07
Pthread vs OpenMP performance comparison (scientific computing)
On a whim I compared the performance of Pthreads and OpenMP: a double-precision dot product using BLAS.
AMD: C-50 (2core)
gcc: 4.7.0 (gmp-5.0.5)
real:
serial 1.079 s
Omp 0.830 s ( 2 thread )
pthread 1.225 s ( 2 thread )
On a netbook like this, maybe the thread-launch overhead dominates? For some reason pthread is slow. For the pthread version I even moved the work buffers into globals, but it didn't get much faster. pthread_join seems to eat quite a bit of time; if I remove it the timing matches the others, but then the answer comes out wrong. For a dot product the memory access is probably the bigger bottleneck anyway.
On my lab's cluster with Sandy Bridge generation Xeons:
serial 0.19 s
Omp 0.001 s ( 2 thread )
pthread 0.001 s ( 2 thread )
I need to measure this more precisely. In any case it sped up as expected. Pthreads require writing more code, so OpenMP it is.
EN
I just wondered which is faster, OpenMP or Pthreads. Here is my measurement.
The test code is shown below. It is a simple double-precision dot product using CBLAS, written in C. The compiler was gcc-4.7.0. I first tested on my netbook with an AMD C-50 APU and got the following result:
real:
serial 1.079 s
Omp 0.830 s ( 2 thread )
pthread 1.225 s ( 2 thread )
I didn't get what I expected: somehow pthread was slower than serial??? For validation I also tested on my university's cluster, which has Sandy Bridge generation Xeons, and got the following result:
serial 0.19 s
Omp 0.001 s ( 2 thread )
pthread 0.001 s ( 2 thread )
This looks OK now, though a more accurate measurement is needed. Pthreads still look slightly slower to me, and they also require more code, so I'm going with OpenMP.
The test code follows.
( serial, pthread and OpenMP performance comparison with a CBLAS dot product )
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<cblas.h>
#include<pthread.h>
#include<omp.h>
#define N 165535
//#define N 8
/* Work buffers shared with the pthread workers. */
double *x_global;
double *y_global;
double tmp_global;
/* Guards tmp_global; the two workers otherwise race on the accumulation below. */
pthread_mutex_t tmp_lock = PTHREAD_MUTEX_INITIALIZER;
//--------------------------------------------------------
void init(
//--------------------------------------------------------
    double *x,
    double *y
){
    size_t i;
    for( i=0; i<N; i++ ){
        x[i] = i;
        y[i] = i;
    }
}
//--------------------------------------------------------
void serial_dot(
//--------------------------------------------------------
    double *x,
    double *y
){
    cblas_ddot( N, x, 1, y, 1 );
#ifdef SERIAL
    printf( " %lf\n", cblas_ddot( N, x, 1, y, 1 ) );
#endif
}
//--------------------------------------------------------
void omp_dot(
//--------------------------------------------------------
    double *x,
    double *y
){
    size_t i;
    const size_t nprocs = omp_get_num_procs();
    double tmp;
    omp_set_num_threads(nprocs);
    tmp = 0.0;
    /* Each iteration handles one half of the vectors. This assumes two cores;
       for odd N the last element is dropped. The reduction clause keeps the
       accumulation race-free. */
#pragma omp parallel for reduction(+:tmp)
    for( i=0; i<nprocs; i++ ){
        tmp += cblas_ddot( N/2, &x[i*N/2], 1, &y[i*N/2], 1 );
    }
#ifdef OMP
    printf("%lf\n",tmp );
#endif
}
//--------------------------------------------------------
void *pddot( void *arg ){
//--------------------------------------------------------
    size_t i;
    double partial;
    i = (size_t)arg;   /* thread index passed by value through the void* argument */
#ifdef PTHREAD
    printf("\tthread[%lu]\n",i);
#endif
    partial = cblas_ddot( N/2, &x_global[i*N/2], 1, &y_global[i*N/2], 1 );
    /* Only the accumulation is locked; the dot products themselves run in parallel. */
    pthread_mutex_lock( &tmp_lock );
    tmp_global += partial;
    pthread_mutex_unlock( &tmp_lock );
    return NULL;
}
//--------------------------------------------------------
void pthread_dot(
//--------------------------------------------------------
    pthread_t threads[2],
    double *x,
    double *y
){
    size_t i;
    tmp_global = 0;
    for( i=0; i<2; i++ ){
        pthread_create( &threads[i], NULL, pddot, (void*)i );
    }
    for( i=0; i<2; i++ ){
        pthread_join( threads[i], NULL );
    }
#ifdef PTHREAD
    printf("%lf\n",tmp_global);
#endif
}
//--------------------------------------------------------
int main(){
//--------------------------------------------------------
    int i;
    pthread_t threads[2];
    double *x, *y;
    x = (double*)malloc(sizeof(double)*N);
    y = (double*)malloc(sizeof(double)*N);
    x_global = (double*)malloc(sizeof(double)*N);
    y_global = (double*)malloc(sizeof(double)*N);
    init( x, y );
    memcpy( x_global, x, sizeof(double)*N );
    memcpy( y_global, y, sizeof(double)*N );
    /* Run the variant under test 1000 times; comment the others in or out. */
    for( i=0; i<1000; i++ ){
        //serial_dot( x, y );
        //omp_dot( x, y );
        pthread_dot( threads, x, y );
    }
    free( x );
    free( y );
    free( x_global );
    free( y_global );
    return 0;
}
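One more note on the slow netbook numbers: besides memory bandwidth, this pthread version pays for creating and joining two threads on every one of the 1000 calls, and it has to guard the shared tmp_global. A sketch of an alternative layout is shown below; everything in it (the ddot_task struct, ddot_worker, pthread_ddot, NTHREADS) is my own naming and not taken from the code above. Each worker writes its partial sum into its own slot, the caller adds the slots up after pthread_join, and no lock is needed at all. It assumes a CBLAS implementation is linked, e.g. something like gcc -O2 -pthread file.c -lcblas, where the exact library name depends on the BLAS installed.
( pthread dot product with per-thread partial sums, no shared accumulator )
#include<stdio.h>
#include<cblas.h>
#include<pthread.h>
#define N        165535
#define NTHREADS 2
/* One task per worker: an input slice and a private slot for the result. */
typedef struct {
    const double *x;
    const double *y;
    size_t        len;
    double        partial;
} ddot_task;
//--------------------------------------------------------
void *ddot_worker( void *arg ){
//--------------------------------------------------------
    ddot_task *t = (ddot_task*)arg;
    t->partial = cblas_ddot( t->len, t->x, 1, t->y, 1 );
    return NULL;
}
//--------------------------------------------------------
double pthread_ddot( const double *x, const double *y, size_t n ){
//--------------------------------------------------------
    pthread_t threads[NTHREADS];
    ddot_task task[NTHREADS];
    size_t    chunk = n / NTHREADS;
    size_t    i;
    double    sum = 0.0;
    for( i=0; i<NTHREADS; i++ ){
        task[i].x   = &x[i*chunk];
        task[i].y   = &y[i*chunk];
        /* the last worker also takes the remainder, so no element is dropped */
        task[i].len = ( i == NTHREADS-1 ) ? n - i*chunk : chunk;
        pthread_create( &threads[i], NULL, ddot_worker, &task[i] );
    }
    for( i=0; i<NTHREADS; i++ ){
        pthread_join( threads[i], NULL );
        sum += task[i].partial;   /* accumulate in the caller: no lock, no race */
    }
    return sum;
}
//--------------------------------------------------------
int main(){
//--------------------------------------------------------
    static double x[N], y[N];   /* static host arrays are fine here */
    size_t i;
    for( i=0; i<N; i++ ){ x[i] = i; y[i] = i; }
    printf( "%lf\n", pthread_ddot( x, y, N ) );
    return 0;
}
Even with that change, the create/join cost per call remains. For many repeated small dot products a persistent pool of worker threads is the natural fix, and that is roughly what OpenMP gives you for free, since typical OpenMP runtimes keep their worker team alive between parallel regions, which is consistent with the timings above.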
2012/11/01
Parallel VTK: a PVTU sample (parallel VTU)
Parallel VTK format (VTU format)
VTU data in XML format can also be written out for parallel use. A small example is easy to write by hand, so I made one.
As shown below, prepare the files
parallel.pvtu
material1.vtu
material2.vtu
in the same directory. Then just open parallel.pvtu in ParaView to display the result.
------------------------------------------------------------------------------------------------------------------------------
parallel.pvtu
------------------------------------------------------------------------------------------------------------------------------
<?xml version="1.0"?>
<VTKFile type="PUnstructuredGrid" version="0.1" byte_order="LittleEndian">
<PUnstructuredGrid GhostLevel="0">
<PPoints>
<PDataArray type="Float32" Name="Position" NumberOfComponents="3"/>
</PPoints>
<PCells>
<PDataArray type="Int32" Name="connectivity" NumberOfComponents="1"/>
<PDataArray type="Int32" Name="offsets" NumberOfComponents="1"/>
<PDataArray type="UInt8" Name="types" NumberOfComponents="1"/>
</PCells>
<PCellData Scalars="Material">
<PDataArray type="Int32" Name="Material" NumberOfComponents="1"/>
</PCellData>
<Piece Source="material1.vtu"/>
<Piece Source="material2.vtu"/>
</PUnstructuredGrid>
</VTKFile>
------------------------------------------------------------------------------------------------------------------------------
material1.vtu
------------------------------------------------------------------------------------------------------------------------------
<?xml version="1.0"?>
<VTKFile type="UnstructuredGrid" version="0.1" byte_order="LittleEndian">
<UnstructuredGrid>
<Piece NumberOfPoints="3" NumberOfCells="1">
<Points>
<DataArray type="Float32" Name="Position" NumberOfComponents="3" format="ascii">
0.0 0.0 0.0
1.0 1.0 0.0
0.0 1.0 0.0
</DataArray>
</Points>
<Cells>
<DataArray type="Int32" Name="connectivity" NumberOfComponents="1" format="ascii">
0 1 2
</DataArray>
<DataArray type="Int32" Name="offsets" NumberOfComponents="1" format="ascii">
3
</DataArray>
<DataArray type="UInt8" Name="types" NumberOfComponents="1" format="ascii">
5
</DataArray>
</Cells>
<CellData Scalars="Material">
<DataArray type="Int32" Name="Material" NumberOfComponents="1" format="ascii">
1
</DataArray>
</CellData>
</Piece>
</UnstructuredGrid>
</VTKFile>
------------------------------------------------------------------------------------------------------------------------------
material2.vtu
------------------------------------------------------------------------------------------------------------------------------
<?xml version="1.0"?>
<VTKFile type="UnstructuredGrid" version="0.1" byte_order="LittleEndian">
<UnstructuredGrid>
<Piece NumberOfPoints="3" NumberOfCells="1">
<Points>
<DataArray type="Float32" Name="Position" NumberOfComponents="3" format="ascii">
0.0 0 0
1.0 0.0 0
1.0 1.0 0
</DataArray>
</Points>
<Cells>
<DataArray type="Int32" Name="connectivity" NumberOfComponents="1" format="ascii">
0 1 2
</DataArray>
<DataArray type="Int32" Name="offsets" NumberOfComponents="1" format="ascii">
3
</DataArray>
<DataArray type="UInt8" Name="types" NumberOfComponents="1" format="ascii">
5
</DataArray>
</Cells>
<CellData Scalars="Material">
<DataArray type="Int32" Name="Material" NumberOfComponents="1" format="ascii">
2
</DataArray>
</CellData>
</Piece>
</UnstructuredGrid>
</VTKFile>
Something like this should appear: the blue one is material1 and the red one is material2.