I run out of memory when training on Google Cloud with the configuration (config.yaml) I chose, which follows https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/criteo_tft/config-large.yaml
That link says it was possible to train on 1 TB of data, and I was impressed enough to give it a try!
My dataset is categorical, so one-hot encoding produces a very large matrix (a 2D numpy array of size 520000 x 4000). I can train on this dataset on a local machine with 32 GB of memory, but I cannot do the same in the cloud!
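For scale, here is a rough back-of-the-envelope estimate of how much memory that array needs. It is only a sketch: it assumes the matrix is held densely in memory, and the element sizes listed are assumptions, since I have not checked which dtype my encoding step actually produces.

```python
# Rough size of a dense 520000 x 4000 array for a few possible element types.
# The element types below are assumptions; the real dtype depends on the encoder.
ROWS, COLS = 520000, 4000

BYTES_PER_ELEMENT = {
    "float64": 8,  # numpy's default floating-point type
    "float32": 4,
    "uint8": 1,    # enough to hold 0/1 one-hot values
}


def dense_size_gb(dtype_name):
    """Return the size in GB (10**9 bytes) of a dense ROWS x COLS array."""
    return ROWS * COLS * BYTES_PER_ELEMENT[dtype_name] / 1e9


for name in ("float64", "float32", "uint8"):
    print("{:>8}: {:.2f} GB".format(name, dense_size_gb(name)))
```

With these assumptions the dense array comes to about 16.64 GB as float64, 8.32 GB as float32, and 2.08 GB as uint8. If every replica loads the whole file itself (which I believe my trainer does, but have not verified), each worker would need that much memory on its own, before any copies made during training.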
Here are my errors:
ERROR 2017-12-18 12:57:37 +1100 worker-replica-1 Using TensorFlow
backend.
ERROR 2017-12-18 12:57:37 +1100 worker-replica-4 Using TensorFlow
backend.
INFO 2017-12-18 12:57:37 +1100 worker-replica-0 Running command:
python -m trainer.task --train-file gs://my_bucket/my_training_file.csv
--job-dir gs://my_bucket/my_bucket_20171218_125645
ERROR 2017-12-18 12:57:38 +1100 worker-replica-2 Using TensorFlow
backend.
ERROR 2017-12-18 12:57:40 +1100 worker-replica-0 Using TensorFlow
backend.
ERROR 2017-12-18 12:57:53 +1100 worker-replica-3 Command
'['python', '-m', u'trainer.task', u'--train-file',
u'gs://my_bucket/my_training_file.csv', '--job-dir',
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:53 +1100 worker-replica-3 Module
completed; cleaning up.
INFO 2017-12-18 12:57:53 +1100 worker-replica-3 Clean up
finished.
ERROR 2017-12-18 12:57:56 +1100 worker-replica-4 Command
'['python', '-m', u'trainer.task', u'--train-file',
u'gs://my_bucket/my_training_file.csv', '--job-dir',
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:56 +1100 worker-replica-4 Module
completed; cleaning up.
INFO 2017-12-18 12:57:56 +1100 worker-replica-4 Clean up
finished.
ERROR 2017-12-18 12:57:58 +1100 worker-replica-2 Command
'['python', '-m', u'trainer.task', u'--train-file',
u'gs://my_bucket/my_training_file.csv', '--job-dir',
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:58 +1100 worker-replica-2 Module
completed; cleaning up.
INFO 2017-12-18 12:57:58 +1100 worker-replica-2 Clean up
finished.
ERROR 2017-12-18 12:57:59 +1100 worker-replica-1 Command
'['python', '-m', u'trainer.task', u'--train-file',
u'gs://my_bucket/my_training_file.csv', '--job-dir',
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:59 +1100 worker-replica-1 Module
completed; cleaning up.
INFO 2017-12-18 12:57:59 +1100 worker-replica-1 Clean up finished.
ERROR 2017-12-18 12:58:01 +1100 worker-replica-0 Command
'['python', '-m', u'trainer.task', u'--train-file',
u'gs://my_bucket/my_training_file.csv', '--job-dir',
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:58:01 +1100 worker-replica-0 Module
completed; cleaning up.
INFO 2017-12-18 12:58:01 +1100 worker-replica-0 Clean up finished.
ERROR 2017-12-18 12:58:43 +1100 service The replica worker 0 ran
out-of-memory and exited with a non-zero status of 247. The replica worker 1
ran out-of-memory and exited with a non-zero status of 247. The replica
worker 2 ran out-of-memory and exited with a non-zero status of 247. The
replica worker 3 ran out-of-memory and exited with a non-zero status of 247.
The replica worker 4 ran out-of-memory and exited with a non-zero status of
247. To find out more about why your job exited please check the logs:
https://console.cloud.google.com/logs/viewer?project=a_project_id........(link to my cloud log)
INFO 2017-12-18 12:58:44 +1100 ps-replica-0 Signal 15 (SIGTERM)
was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-1 Signal 15 (SIGTERM)
was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-0 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-0 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-1 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-1 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-2 Signal 15
(SIGTERM) was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-2 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-2 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-3 Signal 15 (SIGTERM)
was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-5 Signal 15 (SIGTERM)
was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-3 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-3 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-5 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-5 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-4 Signal 15 (SIGTERM)
was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-4 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-4 Clean up finished.
INFO 2017-12-18 12:59:28 +1100 service Finished tearing down
TensorFlow.
INFO 2017-12-18 13:00:17 +1100 service Job failed.
Don't worry about the "Using TensorFlow backend." errors: I get them even when the training job succeeds on other, smaller datasets.
Could someone explain what causes the out-of-memory failure (error 247), and how I should write my config.yaml to avoid it and train on my data in the cloud?
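One observation about the two exit codes in the log, offered as my own reading rather than anything documented by the service: a return code of -9 from a subprocess means the process was killed by signal 9, which is SIGKILL on Linux, and -9 wrapped into the 0-255 range of a process exit status is 247. The helper name below is hypothetical and exists only to show the arithmetic.

```python
SIGKILL = 9  # signal number of SIGKILL on Linux


def as_unsigned_exit_status(returncode):
    """Wrap a subprocess-style return code into the 0-255 exit status range."""
    return returncode % 256


# "returned non-zero exit status -9" in the worker lines of the log
worker_returncode = -SIGKILL

print(as_unsigned_exit_status(worker_returncode))  # 247
```

If that reading is right, the 247 reported by the service and the -9 reported by each worker are the same event: the trainer process being killed from outside, rather than raising a Python exception.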