A simple example of model parallelism in mxnet

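The simple examples in the Gluon tutorials for mxnet are very helpful to those of us who are just getting started with mxnet. As yet, there is no simple example of model parallelism. I have seen the model parallelism example code for LSTM, but I am new to mxnet, and a more streamlined example would help me (and perhaps others). So I created a model parallelism example by working from the regression examples in the Gluon tutorials and mixing in some code from mxnet.gluon.Trainer.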

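However, I am clearly getting something wrong: the gradients do not seem to be updated. Can anyone help identify the problem? The goal is to create a linear regression model with three layers, each held on a different GPU. The model itself is not useful, except as an example showing how initialization and training work under model parallelism when using custom blocks and imperative programming.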

As I understand it, Trainer() is written for data parallelism. It will not work for model parallelism because it requires every parameter to be initialized on every GPU.
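To make that distinction concrete, here is a minimal sketch in plain Python of the two placement schemes. It does not call mxnet: the device names are plain strings standing in for mx.gpu(i), and both helper functions are hypothetical names introduced only for illustration. The parameter names follow the naming pattern used in the code further down.

```python
import re

def model_parallel_placement(param_names, devices):
    """Map each parameter to the single device that owns its layer."""
    placement = {}
    for name in param_names:
        # The layer index is read from names such as '..._mydenseblock1_weight'.
        match = re.search(r"mydenseblock(\d+)_", name)
        if match is None:
            raise ValueError("cannot infer layer index from %r" % name)
        placement[name] = [devices[int(match.group(1))]]
    return placement

def data_parallel_placement(param_names, devices):
    """Map each parameter to every device (one copy per device)."""
    return {name: list(devices) for name in param_names}

# Stand-ins for mx.gpu(0), mx.gpu(1), mx.gpu(2); assumed for illustration only.
devices = ["gpu(0)", "gpu(1)", "gpu(2)"]
names = [
    "sequential0_mydenseblock%d_%s" % (layer, kind)
    for layer in range(3)
    for kind in ("weight", "bias")
]

model_parallel = model_parallel_placement(names, devices)
data_parallel = data_parallel_placement(names, devices)

for name in names:
    print(name, model_parallel[name], data_parallel[name])
```

Under model parallelism each parameter has exactly one data array and one gradient array, so an update loop only ever sees one device per parameter.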

import os 
import numpy as np 
import mxnet as mx 
from mxnet import nd, autograd, gluon 
from mxnet.gluon import Block 

# make some data 
num_inputs = 2 
num_outputs = 1 
num_examples = 10000 

def real_fn(X): 
    return 2 * X[:, 0] - 3.4 * X[:, 1] + 4.2 

X = np.random.normal(0,1, (num_examples, num_inputs)) 
noise = 0.001 * np.random.normal(0,1, (num_examples)) 
y = real_fn(X) + noise 
y = y.reshape(-1,1) 

# configuration 
hidden_layers = 2 
num_gpus = hidden_layers + 1 
ctxList = [mx.gpu(i) for i in range(num_gpus)] 
#ctxList = [mx.gpu() for i in range(num_gpus)] 

#os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine" 
print("\n") 

# ====================================================================== 
class myDenseBlock(Block):
    """
    A custom layer
    """
    def __init__(self, layer_number, size_input, size_output, **kwargs):
        super(myDenseBlock, self).__init__(**kwargs)

        self.layer_number = layer_number
        self.size_input = size_input
        self.size_output = size_output

        with self.name_scope():
            # add parameters to the Block's ParameterDict.
            self.w = self.params.get(
                'weight',
                init=mx.init.Xavier(magnitude=2.24),
                shape=(size_input, size_output),
                grad_req='write')

            self.b = self.params.get(
                'bias',
                init=mx.init.Constant(0.5),
                shape=(size_output,),
                grad_req='write')

    def forward(self, x):
        # move the input to the device that holds this layer's parameters
        x = x.as_in_context(ctxList[self.layer_number])
        with x.context:
            linear = nd.dot(x, self.w.data()) + self.b.data()
            return linear

# ====================================================================== 

# create net 
net = gluon.nn.Sequential()
with net.name_scope():
    # initial layer, with X as input
    net.add(myDenseBlock(0, size_input=2, size_output=2))

    for ii in range(hidden_layers - 1):
        net.add(myDenseBlock(ii + 1, size_input=2, size_output=2))

    # final block, Y is nx1; it sits on the last device (index hidden_layers)
    net.add(myDenseBlock(hidden_layers, size_input=2, size_output=1))


# initialize parameters for different blocks (layers) on different gpus.
params = net.collect_params() 

""" 
The parameters are: 
sequential0_mydenseblock0_weight 
sequential0_mydenseblock0_bias 
sequential0_mydenseblock1_weight 
sequential0_mydenseblock1_bias 
sequential0_mydenseblock2_weight 
sequential0_mydenseblock2_bias 
""" 

print("\ninitializing:") 
for i, param in enumerate(params):
    if 'mydenseblock0' in param:
        params[param].initialize(ctx=ctxList[0])
    elif 'mydenseblock1' in param:
        params[param].initialize(ctx=ctxList[1])
    elif 'mydenseblock2' in param:
        params[param].initialize(ctx=ctxList[2])
    print(" ", i, param, " ", params[param].list_data()[0].context)
print("\n") 

def square_loss(yhat, y): 
    return nd.mean((yhat - y) ** 2) 

def mytrainer(updaters, params, ignore_stale_grad=False):
    #print("\n")
    for i, param in enumerate(params):
        #print(i, param, " ", len(params[param].list_data()), params[param].list_data()[0].context)
        if params[param].grad_req == 'null':
            continue
        if not ignore_stale_grad:
            # every parameter must have received a fresh gradient from backward()
            for data in params[param].list_data():
                if not data._fresh_grad:
                    print(
                        "`%s` on context %s has not been updated" % (params[param].name, str(data.context)))
                    assert False

        for upd, arr, grad in zip(updaters, params[param].list_data(), params[param].list_grad()):
            if not ignore_stale_grad or arr._fresh_grad:
                upd(i, grad, arr)
                arr._fresh_grad = False
                #print ("grad= ", grad)
       #print ("grad= ", grad) 


batch_size = 100 
epochs = 100000 
iteration = -1 

opt = mx.optimizer.create('adam', learning_rate=0.001, rescale_grad = 1/batch_size) 
updaters = [mx.optimizer.get_updater(opt)] 

# the following definition for updaters does not work either 
#updaters = [mx.optimizer.get_updater(opt) for _ in ctxList] 

results = [] 
for e in range(epochs):
    # integer division so the number of batches is an int
    train_groups = np.array_split(np.arange(X.shape[0]), X.shape[0] // batch_size)
    for ii, idx in enumerate(train_groups):
        iteration += 1
        xtrain, ytrain = X[idx, :], y[idx]

        xtrain = nd.array(xtrain)
        xtrain = xtrain.as_in_context(ctxList[0])

        ytrain = nd.array(ytrain).reshape((-1, 1))
        ytrain = ytrain.as_in_context(ctxList[0])

        with autograd.record():
            yhat = net(xtrain)
            # the loss is computed on the last device, where yhat lives
            error = square_loss(yhat, ytrain.as_in_context(ctxList[-1]))

            # Question: does the call to error.backward() go under the indent
            # for autograd.record() or outside the indent? The gluon examples have
            # it both ways

        error.backward()

        mytrainer(updaters, net.collect_params())

        if iteration % 10 == 0:
            results.append([iteration, error.asnumpy().item()])
            print(("epoch= {:5,d}, iter= {:6,d}, error= {:6.3E}").format(
                e, iteration, error.asnumpy().item()))

The code fails at the "if not data._fresh_grad" test in mytrainer(). The output is:

initializing: 
    0 sequential0_mydenseblock0_weight gpu(0) 
    1 sequential0_mydenseblock0_bias gpu(0) 
    2 sequential0_mydenseblock1_weight gpu(1) 
    3 sequential0_mydenseblock1_bias gpu(1) 
    4 sequential0_mydenseblock2_weight gpu(2) 
    5 sequential0_mydenseblock2_bias gpu(2) 

`sequential0_mydenseblock0_weight` on context gpu(0) has not been updated

Using mx.autograd.get_symbol(error).tojson(), I can verify that the computational graph extends only to the parameters on gpu(2) and does not reach the other GPUs.
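One way to read that JSON is to list the variable nodes the graph reaches. The sketch below uses only the standard library and assumes the usual MXNet symbol JSON layout: a top-level "nodes" list in which input variables have "op" equal to "null". The helper name graph_variables and the sample graph are hypothetical, hand-written stand-ins, not real output from this model.

```python
import json

def graph_variables(symbol_json):
    """Return the names of variable (input) nodes in a symbol JSON string."""
    graph = json.loads(symbol_json)
    # Assumption: variables are the nodes whose "op" field is the string "null".
    return [node["name"] for node in graph["nodes"] if node["op"] == "null"]

# Hand-written stand-in for mx.autograd.get_symbol(error).tojson();
# the node names are made up for illustration.
sample = json.dumps({
    "nodes": [
        {"op": "null", "name": "var0", "inputs": []},
        {"op": "null", "name": "var1", "inputs": []},
        {"op": "dot", "name": "dot0", "inputs": [[0, 0, 0], [1, 0, 0]]},
    ],
    "arg_nodes": [0, 1],
    "heads": [[2, 0, 0]],
})

variables = graph_variables(sample)
print(len(variables), "variable nodes:", variables)
```

If the recorded graph covered all three layers, one would expect it to reach a weight and a bias for each layer in addition to the input data, so a much smaller count points to a graph that stops early.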

Source

2017-10-31 John B

Hi. Which version of mxnet are you using? I tried to reproduce the error with mxnet version 1.0.0, but it seems to work properly: the test does not fail. Could you check your mxnet version and update to 1.0.0? – Sergei

Thank you, Sergei. I was using an older version of mxnet, and I can now confirm that this test code works correctly with version 1.0.0. –

Yes, as @sergei's comment says, moving to v1.0.0 fixes this problem.
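A quick way to check the installed version before running the example is sketched below. mx.__version__ is the version string that mxnet exposes; version_tuple and is_at_least are hypothetical helpers written for this illustration, and the parsing is deliberately simple (it keeps only the leading digits of each dotted component).

```python
import re

def version_tuple(version):
    """Turn a version string such as '1.0.0' into a tuple of three or more ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as 'post0'
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)  # treat '1.0' like '1.0.0'
    return tuple(parts)

def is_at_least(version, minimum="1.0.0"):
    """Return True if `version` is the same as or newer than `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)

try:
    import mxnet as mx
    installed = mx.__version__
except (ImportError, OSError):
    installed = None  # mxnet is missing or its native library failed to load

if installed is None:
    print("mxnet is not available")
elif is_at_least(installed):
    print("mxnet", installed, "is 1.0.0 or newer")
else:
    print("mxnet", installed, "is older than 1.0.0; consider upgrading")
```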

Source

2018-02-28 21:38:20
