2016-07-11 12 views
2

After successfully installing TensorFlow on a cluster, I immediately ran the MNIST demo to check that everything works, but I ran into a problem: TensorFlow fails with a cuBLAS error when running on CUDA.

python3 -m tensorflow.models.image.mnist.convolutional 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 
Extracting data/train-images-idx3-ubyte.gz 
Extracting data/train-labels-idx1-ubyte.gz 
Extracting data/t10k-images-idx3-ubyte.gz 
Extracting data/t10k-labels-idx1-ubyte.gz 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K20m 
major: 3 minor: 5 memoryClockRate (GHz) 0.7055 
pciBusID 0000:03:00.0 
Total memory: 5.00GiB 
Free memory: 4.92GiB 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K20m, pci bus id: 0000:03:00.0) 
Initialized! 
E tensorflow/stream_executor/cuda/cuda_blas.cc:461] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED 
Traceback (most recent call last): 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 715, in _do_call 
return fn(*args) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 697, in _run_fn 
status, run_metadata) 
    File "/home/gpuusr/local/lib/python3.5/contextlib.py", line 66, in __exit__ 
next(self.gen) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/errors.py", line 450, in raise_exception_on_not_ok_status 
pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136 
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]] 
[[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]] 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
    File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main 
"__main__", mod_spec) 
    File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code 
exec(code, run_globals) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module> 
tf.app.run() 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run 
sys.exit(main(sys.argv)) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 294, in main 
feed_dict=feed_dict) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 372, in run 
run_metadata_ptr) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 636, in _run 
feed_dict_string, options, run_metadata) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 708, in _do_run 
target_list, options, run_metadata) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 728, in _do_call 
raise type(e)(node_def, op, message) 
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136 
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]] 
[[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]] 
Caused by op 'MatMul', defined at: 
    File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main 
"__main__", mod_spec) 
    File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code 
exec(code, run_globals) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module> 
tf.app.run() 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run 
sys.exit(main(sys.argv)) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 221, in main 
logits = model(train_data_node, True) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 213, in model 
hidden = tf.nn.relu(tf.matmul(reshape, fc1_weights) + fc1_biases) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1209, in matmul 
name=name) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1178, in _mat_mul 
transpose_b=transpose_b, name=name) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op 
op_def=op_def) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2260, in create_op 
original_op=self._default_original_op, op_def=op_def) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1230, in __init__ 
self._traceback = _extract_stack() 

Segmentation fault (core dumped) 
+0

To build or run TensorFlow with GPU support, both NVIDIA's CUDA Toolkit (>= 7.0) and cuDNN (>= v2) need to be installed. TensorFlow GPU support also requires an NVIDIA GPU card with Compute Capability >= 3.0. Did you follow the official installation guide? https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html – userfi

+0

Absolutely, yes. My CUDA version is 7.5 and my cuDNN version is v4. –

+0

And is your graphics card's compute capability 3.0 or higher? – userfi
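
The compute capability asked about here can be read from the `major: 3 minor: 5` line in the log posted in the question. As a minimal sketch (the helper name `parse_capability` is hypothetical and not part of TensorFlow; the 3.0 minimum is the one quoted in the comment above), the comparison could be done like this:

```python
import re

# Minimum compute capability quoted in the comment thread (assumption: 3.0).
MIN_CAPABILITY = (3, 0)


def parse_capability(log_line):
    """Extract (major, minor) from a 'major: X minor: Y' log line, or None."""
    match = re.search(r"major:\s*(\d+)\s+minor:\s*(\d+)", log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Line copied from the log in the question.
line = "major: 3 minor: 5 memoryClockRate (GHz) 0.7055"
capability = parse_capability(line)
# Tuples compare element by element, so (3, 5) >= (3, 0) holds.
print(capability, capability >= MIN_CAPABILITY)
```

For the Tesla K20m in the log this gives a capability of 3.5, which meets the stated minimum.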

Answers

1

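I had exactly the same error because CUDA 5.5 came before CUDA 7.5 in my LD_LIBRARY_PATH. I don't understand all the details, but that seems to be where the error comes from. After changing this so that 7.5 is used instead of 5.5, everything works fine.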
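
One way to check for this kind of ordering problem is to list, in search order, the LD_LIBRARY_PATH directories that contain a copy of the library. The following is a minimal sketch and not part of the original answer: the helper name `dirs_providing` is hypothetical, and it only inspects LD_LIBRARY_PATH, while the dynamic linker also consults other locations such as the ld.so cache.

```python
import glob
import os


def dirs_providing(lib_prefix, search_path):
    """Return the directories on search_path that hold lib_prefix*, in order."""
    hits = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            # Empty entries are skipped here for simplicity.
            continue
        # glob.escape keeps special characters in directory names literal.
        pattern = os.path.join(glob.escape(entry), lib_prefix + "*")
        if glob.glob(pattern):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    # The first directory printed is the first one searched on this path.
    for rank, directory in enumerate(dirs_providing("libcublas.so", path)):
        print(rank, directory)
```

If a directory belonging to an older toolkit is printed first, that copy is probably the one being loaded, and reordering LD_LIBRARY_PATH may help.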
