How do I multiply 100 integers by 100 integers in parallel on a GPU using PyOpenCL?

There are many PyOpenCL examples that perform arithmetic on vectors of size 4. If I need to multiply 100 integers by another 100 integers at once through PyOpenCL on an AMD GPU in a Mac, could someone provide code and explain it? Since the maximum vector size is 16, I would like to know how to ask the GPU to do this task, which requires processing more than 16 integers in parallel.

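I have a FirePro GPU. Does every work-item (thread) perform its task independently? For example, there are 24 compute units, and each compute unit has 255 work-items in a single dimension and 255, 255, 255 work-items for three dimensions. Does this mean my GPU has 6120 independent work-items?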

2016-10-13 Aseem Hegshetye

You should definitely read up on OpenCL's memory model before using it as an API. – Dschoni

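I made a quick example that multiplies two one-dimensional integer arrays entry-wise. Note that if you only want to multiply 100 values, this will not be faster than doing it on the CPU, because copying the data to and from the GPU adds a lot of overhead.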

import pyopencl as cl 
import numpy as np 

# This kernel source is compiled by the OpenCL driver and executed on the device
kernelsource = """
__kernel void multInt(__global int* res,
                      __global const int* a,
                      __global const int* b) {
    // Each work-item handles one element. The number of work-items is the
    // global size given as the second argument of the kernel call on the host.
    int i = get_global_id(0);
    res[i] = a[i] * b[i];
}
"""

# Take the first device of the first platform (on some systems this is the CPU, not the GPU)
device = cl.get_platforms()[0].get_devices()[0]
context = cl.Context([device]) 
program = cl.Program(context, kernelsource).build() 
queue = cl.CommandQueue(context) 

# Prepare the input data as NumPy arrays in host memory (i.e. accessible by the CPU)
N = 100
a_local = np.arange(N, dtype=np.int32)
b_local = np.full(N, 10, dtype=np.int32)

# Prepare the result array in host memory
res_local = np.zeros(N, dtype=np.int32)

# Copy the input data to GPU memory
a_buf = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=a_local)
b_buf = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=b_local)
# Allocate the result buffer in GPU memory
res_buf = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, res_local.nbytes)
# Execute the compiled kernel with N work-items (global size (N,));
# None lets the OpenCL runtime choose the work-group size
program.multInt(queue, (N,), None, res_buf, a_buf, b_buf)
# Copy the result from GPU memory back to host memory (blocking by default)
cl.enqueue_copy(queue, res_local, res_buf)

print("result: {}".format(res_local))
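To make the indexing concrete, here is a minimal CPU-only sketch in plain Python of what the launch with global size (N,) does. The names `mult_int` and `launch` are hypothetical stand-ins invented for this illustration; they are not part of PyOpenCL. The point is that the number of work-items comes from the global size, not from the width of the OpenCL vector types, so 100 elements need no vectors of size 16 at all.

```python
def mult_int(global_id, res, a, b):
    """Body of the multInt kernel for ONE work-item (hypothetical stand-in)."""
    i = global_id  # plays the role of get_global_id(0)
    res[i] = a[i] * b[i]


def launch(kernel, global_size, *args):
    """Emulate a 1-D kernel launch: call the kernel once per global id.

    A real device runs these calls in parallel and in no guaranteed order;
    this loop reproduces only the indexing, not the parallelism.
    """
    for global_id in range(global_size):
        kernel(global_id, *args)


N = 100
a = list(range(N))   # 0, 1, ..., 99
b = [10] * N         # one hundred tens
res = [0] * N        # result storage, one slot per work-item

launch(mult_int, N, res, a, b)
print(res[:5], res[-1])  # prints [0, 10, 20, 30, 40] 990
```

Because each work-item writes only to its own `res[i]`, the work-items do not depend on each other, which is what allows the device to run them in any order.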

As for the PyOpenCL documentation: once you understand how GPGPU programming and the OpenCL programming model work, PyOpenCL is quite straightforward.
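On the second part of the question: the per-dimension work-item sizes and the maximum work-group size reported by a device are limits on a single work-group, not on the total number of work-items in a launch. The global size may be much larger, and the runtime schedules the resulting work-groups onto the compute units. The sketch below assumes an object exposing the PyOpenCL device attributes `max_work_group_size` and `max_work_item_sizes`. To stay runnable without a GPU it uses a stand-in object whose numbers are illustrative (24 and 255 are taken from the question, and the work-group size of 255 is an assumption). The helper names are hypothetical.

```python
import math
from types import SimpleNamespace


def work_group_count(global_size, work_group_size):
    """Number of work-groups needed to cover global_size work-items."""
    return math.ceil(global_size / work_group_size)


def local_size_is_valid(device, local_size):
    """Check a local (work-group) size against the device's per-group limits."""
    per_dim_ok = all(
        n <= limit for n, limit in zip(local_size, device.max_work_item_sizes)
    )
    total_ok = math.prod(local_size) <= device.max_work_group_size
    return per_dim_ok and total_ok


# Stand-in for a pyopencl.Device; the values are illustrative, not measured.
fake_device = SimpleNamespace(
    max_compute_units=24,
    max_work_group_size=255,            # assumption, not stated in the question
    max_work_item_sizes=(255, 255, 255),
)

print(local_size_is_valid(fake_device, (64,)))     # True
print(local_size_is_valid(fake_device, (255, 2)))  # False: 510 items exceed 255
print(work_group_count(100, 64))                   # 2
```

With a real device, the same helpers could be called with `cl.get_platforms()[0].get_devices()[0]` in place of `fake_device`.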

2016-10-26 12:41:14 serbap
