AlexNet distributed TensorFlow performance

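Running AlexNet with distributed TensorFlow does not scale in images/sec. I am using the AlexNet model from alexnet_benchmark.py, with a few modifications for distributed training, on EC2 G2 (NVIDIA GRID K520) instances. It processes 5 6 images/sec on a single GPU, whereas without the distributed code it processes 112 images/sec on a single GPU. This seems very strange. Could you review what might be wrong in this code when it runs distributed? The parameter server does not run on a GPU; the workers are run with the CUDA_VISIBLE_DEVICES prefix.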

    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")

    # Create a cluster from the parameter server and worker hosts.
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

    # Create and start a server for the local task.
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        server.join()
    elif FLAGS.job_name == "worker":
        gpu = FLAGS.task_index % 4

        # Assigns ops to the local worker by default.
        with tf.device(tf.train.replica_device_setter(
                #'/gpu:%d' % i
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                #worker_device='/gpu:%d' % gpu,
                cluster=cluster)):
            summary_op = tf.merge_all_summaries()

            y, x = get_graph()
            y_ = tf.placeholder(tf.float32, [None, NUM_LABELS])

            cross_entropy = tf.reduce_mean(
                -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

            global_step = tf.Variable(0)

            gradient_descent_opt = tf.train.GradientDescentOptimizer(LEARNING_RATE)

            num_workers = len(worker_hosts)
            sync_rep_opt = tf.train.SyncReplicasOptimizer(
                gradient_descent_opt,
                replicas_to_aggregate=num_workers,
                replica_id=FLAGS.task_index,
                total_num_replicas=num_workers)

            train_op = sync_rep_opt.minimize(cross_entropy, global_step=global_step)

            init_token_op = sync_rep_opt.get_init_tokens_op()
            chief_queue_runner = sync_rep_opt.get_chief_queue_runner()

            #saver = tf.train.Saver()
            summary_op = tf.merge_all_summaries()

            init_op = tf.initialize_all_variables()
            saver = tf.train.Saver()

        is_chief = (FLAGS.task_index == 0)

        # Create a "supervisor", which oversees the training process.
        sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                                 #logdir="/tmp/train_logs",
                                 init_op=init_op,
                                 summary_op=summary_op,
                                 saver=saver,
                                 global_step=global_step)
                                 #save_model_secs=600)

        # The supervisor takes care of session initialization, restoring from
        # a checkpoint, and closing when done or an error occurs.
        with sv.managed_session(server.target) as sess:
            if is_chief:
                sv.start_queue_runners(sess, [chief_queue_runner])
                sess.run(init_token_op)

            num_steps_burn_in = 1000
            total_duration = 0
            total_duration_squared = 0
            step = 0

            while step <= 40000:
                print('Iteration %d' % step)
                sys.stdout.flush()
                batch_xs, batch_ys = get_data(BATCH_SIZE)
                train_feed = {x: batch_xs, y_: batch_ys}

                start_time = time.time()
                _, step = sess.run([train_op, global_step], feed_dict=train_feed)
                duration = time.time() - start_time

                if step > num_steps_burn_in:
                    total_duration += duration
                    total_duration_squared += duration * duration

                    if not step % 1000:
                        iterations = step - num_steps_burn_in
                        images_processed = BATCH_SIZE * iterations
                        print('%s: step %d, images processed: %d, images per second: %.3f, time taken: %.2f' %
                              (datetime.now(), iterations, images_processed,
                               images_processed/total_duration, total_duration))
                        sys.stdout.flush()

        sv.stop()
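The timing loop accumulates `total_duration_squared` but never reports it. As a minimal sketch, assuming the intent is to report the mean and standard deviation of the per-step time alongside images/sec, the statistics could be derived from the same running sums. `summarize_timings` is a hypothetical helper that is not part of the posted script, and the step times are made up for illustration.

```python
import math


def summarize_timings(durations, batch_size, burn_in=0):
    """Return (images_per_sec, mean_step_sec, std_step_sec) for steps after burn-in."""
    measured = durations[burn_in:]
    if not measured:
        raise ValueError("no measured steps after burn-in")
    n = len(measured)
    total = sum(measured)
    total_sq = sum(d * d for d in measured)
    mean = total / n
    # Population variance from the running sums; clamp tiny negative rounding error.
    variance = max(total_sq / n - mean * mean, 0.0)
    return batch_size * n / total, mean, math.sqrt(variance)


# Hypothetical step times in seconds; the first two steps are treated as warm-up.
ips, mean, std = summarize_timings([2.0, 1.5, 0.5, 0.5, 0.5, 0.5],
                                   batch_size=32, burn_in=2)
print("%.1f images/sec, %.3f +/- %.3f sec/step" % (ips, mean, std))
```

A large standard deviation relative to the mean would suggest that some steps stall, for example while waiting on the parameter server.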

Source

2016-10-04 Naveen Swamy

Can you collect a timeline (along the lines of https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659) to see where the bottleneck is?

Here is a [timeline](https://github.com/tensorflow/tensorflow/issues/4526#issuecomment-249014238) for a single machine. It shows that AlexNet does little computation relative to IO, which makes it hard to use parallel resources efficiently.

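Your code looks good. Here are a few points to keep in mind:

- The graphs created for the single-node and multi-node cases are different, so comparing them carries some variability. The distributed graph adds queues and synchronization to transfer gradient information between the server and the workers.
- Because AlexNet has a relatively fast forward and backward pass, the overhead of I/O transfers to and from the server becomes more prominent. This may or may not show up with Inception V3 (leaning toward may not).
- Your post mentions that you use separate EC2 instances for the parameter server and the workers. This is the best configuration. Running workers and servers on the same node hurts performance considerably.
- As you add workers, you will have to increase the number of servers that serve them. With Inception this starts to occur after 32 independent workers.
- There is evidence that convergence may be affected after about 16 workers.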
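The point about I/O overhead can be made concrete with a back-of-the-envelope model. This is only a sketch under stated assumptions: the worker pulls every weight and pushes every gradient once per step, transfer and compute do not overlap, AlexNet has roughly 60 million float32 parameters, and the link runs at 1 Gbit/s. The link speed and the compute time are illustrative guesses, not measurements from the post.

```python
def step_time_estimate(num_params, bytes_per_param, bandwidth_bytes_per_sec, compute_sec):
    """Estimate one synchronous training step for a single worker.

    Assumes the worker pulls every parameter and pushes every gradient once
    per step, with no overlap between transfer and compute.
    """
    transfer_bytes = 2 * num_params * bytes_per_param  # pull weights + push gradients
    transfer_sec = transfer_bytes / bandwidth_bytes_per_sec
    return transfer_sec, transfer_sec + compute_sec


# Assumed figures, for illustration only.
ALEXNET_PARAMS = 60_000_000   # roughly 60 million float32 weights
BANDWIDTH = 125_000_000       # 1 Gbit/s link, expressed in bytes per second
COMPUTE_SEC = 0.3             # hypothetical forward + backward time per batch

transfer, total = step_time_estimate(ALEXNET_PARAMS, 4, BANDWIDTH, COMPUTE_SEC)
print("transfer %.2f s, total %.2f s, transfer share %.0f%%"
      % (transfer, total, 100 * transfer / total))
```

Under these assumptions the transfer time dominates the step, which is consistent with a model that has a fast forward and backward pass scaling poorly once a parameter server is involved.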

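My suggestion is to try distributed Inception V3. That topology should show nearly perfect scalability compared with its single-node counterpart. If it does, your hardware setup is good; if it does not, double-check your HW configuration.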

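If you are doing a scalability study, I recommend starting the relative performance collection with one parameter server and one worker on independent instances. Comparing against a single-node run introduces its own variability.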
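A scalability study along these lines can be summarized as speedup and efficiency relative to the one-server, one-worker baseline. The sketch below uses a hypothetical helper, `scaling_report`, and hypothetical throughput numbers; none of them come from the post.

```python
def scaling_report(baseline_ips, measurements):
    """Return rows of (workers, images/sec, speedup, efficiency).

    baseline_ips is the throughput of the 1 parameter server + 1 worker run;
    measurements maps a worker count to its measured images/sec.
    """
    rows = []
    for workers in sorted(measurements):
        ips = measurements[workers]
        speedup = ips / baseline_ips
        rows.append((workers, ips, speedup, speedup / workers))
    return rows


# Hypothetical measurements, not taken from the post.
report = scaling_report(50.0, {1: 50.0, 2: 95.0, 4: 170.0})
for workers, ips, speedup, efficiency in report:
    print("%d workers: %.1f images/sec, speedup %.2fx, efficiency %.0f%%"
          % (workers, ips, speedup, 100 * efficiency))
```

An efficiency that drops quickly as workers are added points at the parameter servers or the network rather than the GPUs.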

Source

2016-10-10 18:54:16
