gpgpu - OpenCL strange behavior


Good day,

I have tried to figure out this problem, but I couldn't. I have the following code on the host:

cl_mem cl_distances = clCreateBuffer(context, CL_MEM_READ_WRITE, 2 * sizeof(cl_uint), NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_distances);

cl_event event;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_workers, &local_workers, 0, NULL, &event);

clWaitForEvents(1, &event);

And on the device:

__kernel void walk(__global uint *distance_results) {
    uint global_size = get_global_size(0);
    uint local_size = get_local_size(0);

    uint global_id = get_global_id(0);
    uint group_id = get_group_id(0);
    uint local_id = get_local_id(0);

    for (uint step = 0; step < 500; step++) {
        if (local_id == 0) {
            distance_results[group_id] = 0;
        }

        barrier(CLK_LOCAL_MEM_FENCE);

        for (uint n = global_id; n < 1000; n += global_size) {
            if (local_id == 0) {
                atomic_add(&distance_results[group_id], 1);
            }
        }

        barrier(CLK_GLOBAL_MEM_FENCE);

        if (global_id == 0) {
            for (uint i = 0; i < (global_size / local_size); i++) {
                printf("step: %d; group: %d; data: %d\n", step, i, distance_results[i]);
            }
        }

        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

So at each "step" I add 1 to distance_results[group_id] 1000 times in total, split across the groups, and then I read the results from the thread with global_id == 0. At each step I should get the following text:

step: 59; group: 0; data: 500
step: 59; group: 1; data: 500

But there are a lot of lines with wrong data:

step: 4; group: 0; data: 500
step: 4; group: 1; data: 210
step: 5; group: 0; data: 500
step: 5; group: 1; data: 214

If I set global_workers to 1 and local_workers to 1, then everything is okay. But if I set global_workers to 2 and local_workers to 1, I get this strange behavior.
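(Here global_workers and local_workers are the size_t values passed to clEnqueueNDRangeKernel above, so the failing configuration is two work groups of one work item each:)

size_t global_workers = 2;  /* total number of work items */
size_t local_workers  = 1;  /* work items per group, so 2 groups of 1 item */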

Do you have any ideas why this can happen?

There are a couple of things going on here, but I think the core problem comes from a common misunderstanding of OpenCL. This call:

barrier(CLK_GLOBAL_MEM_FENCE);

This is not a global barrier. It is a local barrier with a global memory fence. In other words, it still synchronizes between work items within a single work group, but not between work items in different work groups.
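To make that concrete for this configuration (two work groups of one work item each), here is what each barrier call actually does in the kernel above:

/* With global_workers = 2 and local_workers = 1, every work group holds a
 * single work item, so each barrier() is satisfied immediately by that one
 * item. Nothing ever makes group 0 wait for group 1, so group 0 can reach
 * its printf loop while group 1 is still inside its atomic_add loop; that
 * is exactly why the output shows partial counts such as 210 and 214. */
barrier(CLK_GLOBAL_MEM_FENCE);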

The loop in your code that prints the results will only have the correct values for work group 0, since it runs in work group 0. If you want this code to work, the loop that prints the results would have to be in a separate NDRange, with proper synchronization between the NDRanges.
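A minimal sketch of that restructuring, assuming an in-order command queue (which serializes the two NDRanges) and a buffer initialized to zero; the kernel names, the num_groups/step arguments, and the placement of the reset are my own:

__kernel void walk_step(__global uint *distance_results) {
    uint global_size = get_global_size(0);
    uint global_id = get_global_id(0);
    uint group_id = get_group_id(0);

    /* Same counting work as before, but no printing here. */
    if (get_local_id(0) == 0) {
        for (uint n = global_id; n < 1000; n += global_size) {
            atomic_add(&distance_results[group_id], 1);
        }
    }
}

__kernel void print_and_reset(__global uint *distance_results, uint num_groups, uint step) {
    /* Runs as its own single-work-item NDRange after walk_step has
       completed, so every group's writes are visible here. */
    if (get_global_id(0) == 0) {
        for (uint i = 0; i < num_groups; i++) {
            printf("step: %d; group: %d; data: %d\n", step, i, distance_results[i]);
            distance_results[i] = 0;  /* reset for the next step */
        }
    }
}

On the host, the end of each kernel execution becomes the global synchronization point (walk_kernel and print_kernel are the cl_kernel objects for the two kernels above, with their buffer and num_groups arguments already set):

size_t one = 1;
for (cl_uint step = 0; step < 500; step++) {
    clSetKernelArg(print_kernel, 2, sizeof(cl_uint), &step);
    clEnqueueNDRangeKernel(command_queue, walk_kernel, 1, NULL, &global_workers, &local_workers, 0, NULL, NULL);
    clEnqueueNDRangeKernel(command_queue, print_kernel, 1, NULL, &one, &one, 0, NULL, NULL);
}
clFinish(command_queue);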

The memory fence parameter controls which types of memory writes are committed to memory. In your case, you want global fences for both barriers, since you are trying to fence global memory writes, not local memory writes.
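Concretely, the first barrier in your kernel should use a global fence too, since the write it is ordering goes to global memory:

if (local_id == 0) {
    distance_results[group_id] = 0;  /* a write to *global* memory... */
}
barrier(CLK_GLOBAL_MEM_FENCE);       /* ...so CLK_GLOBAL_MEM_FENCE, not
                                        CLK_LOCAL_MEM_FENCE */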

