|
@@ -1,542 +0,0 @@
|
|
-<!--
|
|
|
|
- Licensed to the Apache Software Foundation (ASF) under one or more
|
|
|
|
- contributor license agreements. See the NOTICE file distributed with
|
|
|
|
- this work for additional information regarding copyright ownership.
|
|
|
|
- The ASF licenses this file to You under the Apache License, Version 2.0
|
|
|
|
- (the "License"); you may not use this file except in compliance with
|
|
|
|
- the License. You may obtain a copy of the License at
|
|
|
|
-
|
|
|
|
- http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
-
|
|
|
|
- Unless required by applicable law or agreed to in writing, software
|
|
|
|
- distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
|
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
|
- See the License for the specific language governing permissions and
|
|
|
|
- limitations under the License.
|
|
|
|
--->
|
|
|
|
-
|
|
|
|
-(Copied from https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator)
|
|
|
|
-
|
|
|
|
-CIFAR-10 is a common benchmark in machine learning for image recognition.
|
|
|
|
-
|
|
|
|
-http://www.cs.toronto.edu/~kriz/cifar.html
|
|
|
|
-
|
|
|
|
-Code in this directory focuses on how to use TensorFlow Estimators to train and
|
|
|
|
-evaluate a CIFAR-10 ResNet model on:
|
|
|
|
-
|
|
|
|
-* A single host with one CPU;
|
|
|
|
-* A single host with multiple GPUs;
|
|
|
|
-* Multiple hosts with CPU or multiple GPUs.
|
|
|
|
-
|
|
|
|
-Before trying to run the model, we highly encourage you to read the entire README.
|
|
|
|
-
|
|
|
|
-## Prerequisites
|
|
|
|
-
|
|
|
|
-1. [Install](https://www.tensorflow.org/install/) TensorFlow version 1.2.1 or
|
|
|
|
-later.
|
|
|
|
-
|
|
|
|
-2. Download the CIFAR-10 dataset and generate TFRecord files using the provided
|
|
|
|
-script. The script and associated command below will download the CIFAR-10
|
|
|
|
-dataset and then generate a TFRecord for the training, validation, and
|
|
|
|
-evaluation datasets.
|
|
|
|
-
|
|
|
|
-```shell
|
|
|
|
-python generate_cifar10_tfrecords.py --data-dir=${PWD}/cifar-10-data
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-After running the command above, you should see the following files in the
|
|
|
|
---data-dir (```ls -R cifar-10-data```):
|
|
|
|
-
|
|
|
|
-* train.tfrecords
|
|
|
|
-* validation.tfrecords
|
|
|
|
-* eval.tfrecords
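A quick way to confirm that the conversion step finished is to check for the three files programmatically. The sketch below is not part of the provided scripts: the helper name `missing_tfrecords` is hypothetical, and only the file names come from the list above.

```python
import os

# File names taken from the list above.
EXPECTED_FILES = ('train.tfrecords', 'validation.tfrecords', 'eval.tfrecords')


def missing_tfrecords(data_dir):
    """Return the expected TFRecord files that are absent from data_dir."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]
```

Calling `missing_tfrecords(os.path.join(os.getcwd(), 'cifar-10-data'))` should return an empty list once the generation script has completed.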
|
|
|
|
-
|
|
|
|
-
|
|
|
|
-## Training on a single machine with GPUs or CPU
|
|
|
|
-
|
|
|
|
-Run the training on CPU only. After training, it runs the evaluation.
|
|
|
|
-
|
|
|
|
-```
|
|
|
|
-python cifar10_main.py --data-dir=${PWD}/cifar-10-data \
|
|
|
|
- --job-dir=/tmp/cifar10 \
|
|
|
|
- --num-gpus=0 \
|
|
|
|
- --train-steps=1000
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-Run the model on 2 GPUs using CPU as parameter server. After training, it runs
|
|
|
|
-the evaluation.
|
|
|
|
-```
|
|
|
|
-python cifar10_main.py --data-dir=${PWD}/cifar-10-data \
|
|
|
|
- --job-dir=/tmp/cifar10 \
|
|
|
|
- --num-gpus=2 \
|
|
|
|
- --train-steps=1000
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-Run the model on 2 GPUs using GPU as parameter server.
|
|
|
|
-It will run an experiment, which in a local setting basically means it will
|
|
|
|
-stop training
|
|
|
|
-a couple of times to perform evaluation.
|
|
|
|
-
|
|
|
|
-```
|
|
|
|
-python cifar10_main.py --data-dir=${PWD}/cifar-10-data \
|
|
|
|
- --job-dir=/tmp/cifar10 \
|
|
|
|
- --variable-strategy GPU \
|
|
|
|
- --num-gpus=2
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-There are more command line flags to play with; run
|
|
|
|
-`python cifar10_main.py --help` for details.
|
|
|
|
-
|
|
|
|
-## Run distributed training
|
|
|
|
-
|
|
|
|
-### (Optional) Running on Google Cloud Machine Learning Engine
|
|
|
|
-
|
|
|
|
-This example can be run on Google Cloud Machine Learning Engine (ML Engine),
|
|
|
|
-which will configure the environment and take care of running workers,
|
|
|
|
-parameter servers, and masters in a fault-tolerant way.
|
|
|
|
-
|
|
|
|
-To install the command line tool, and set up a project and billing, see the
|
|
|
|
-quickstart [here](https://cloud.google.com/ml-engine/docs/quickstarts/command-line).
|
|
|
|
-
|
|
|
|
-You'll also need a Google Cloud Storage bucket for the data. If you followed the
|
|
|
|
-instructions above, you can just run:
|
|
|
|
-
|
|
|
|
-```
|
|
|
|
-MY_BUCKET=gs://<my-bucket-name>
|
|
|
|
-gsutil cp -r ${PWD}/cifar-10-data $MY_BUCKET/
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-Then run the following command from the `tutorials/image` directory of this
|
|
|
|
-repository (the parent directory of this README):
|
|
|
|
-
|
|
|
|
-```
|
|
|
|
-gcloud ml-engine jobs submit training cifarmultigpu \
|
|
|
|
- --runtime-version 1.2 \
|
|
|
|
- --job-dir=$MY_BUCKET/model_dirs/cifarmultigpu \
|
|
|
|
- --config cifar10_estimator/cmle_config.yaml \
|
|
|
|
- --package-path cifar10_estimator/ \
|
|
|
|
- --module-name cifar10_estimator.cifar10_main \
|
|
|
|
- -- \
|
|
|
|
- --data-dir=$MY_BUCKET/cifar-10-data \
|
|
|
|
- --num-gpus=4 \
|
|
|
|
- --train-steps=1000
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-
|
|
|
|
-### Set TF_CONFIG
|
|
|
|
-
|
|
|
|
-Assuming that you already have multiple hosts configured, all you need is a
|
|
|
|
-`TF_CONFIG` environment variable on each host. You can set up the hosts manually
|
|
|
|
-or check [tensorflow/ecosystem](https://github.com/tensorflow/ecosystem) for
|
|
|
|
-instructions about how to set up a Cluster.
|
|
|
|
-
|
|
|
|
-The `TF_CONFIG` will be used by the `RunConfig` to know the existing hosts and
|
|
|
|
-their task: `master`, `ps` or `worker`.
|
|
|
|
-
|
|
|
|
-Here's an example of `TF_CONFIG`.
|
|
|
|
-
|
|
|
|
-```python
|
|
|
|
-import json
-
-cluster = {'master': ['master-ip:8000'],
|
|
|
|
- 'ps': ['ps-ip:8000'],
|
|
|
|
- 'worker': ['worker-ip:8000']}
|
|
|
|
-
|
|
|
|
-TF_CONFIG = json.dumps(
|
|
|
|
- {'cluster': cluster,
|
|
|
|
- 'task': {'type': 'master', 'index': 0},
|
|
|
|
- 'model_dir': 'gs://<bucket_path>/<dir_path>',
|
|
|
|
- 'environment': 'cloud'
|
|
|
|
- })
|
|
|
|
-```
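`TF_CONFIG` has to reach the training process as an environment variable, not only as a Python variable. A minimal sketch of one way to do that from Python is shown here. The host names and the `gs://` path are the same placeholders as above; setting the variable through `os.environ` rather than in the shell is an assumption made for illustration.

```python
import json
import os

cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

# Serialize the configuration and expose it to this process and its children.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': cluster,
    'task': {'type': 'master', 'index': 0},
    'model_dir': 'gs://<bucket_path>/<dir_path>',
    'environment': 'cloud',
})

# Reading it back confirms the value is valid JSON.
parsed = json.loads(os.environ['TF_CONFIG'])
print(parsed['task'])
```

The variable must be set before the training script creates its `RunConfig`, since that is where the value is read.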
|
|
|
|
-
|
|
|
|
-*Cluster*
|
|
|
|
-
|
|
|
|
-A cluster spec, which is basically a dictionary that describes all of the tasks
|
|
|
|
-in the cluster. More about it [here](https://www.tensorflow.org/deploy/distributed).
|
|
|
|
-
|
|
|
|
-In this cluster spec we are defining a cluster with 1 master, 1 ps and 1 worker.
|
|
|
|
-
|
|
|
|
-* `ps`: stores the parameters shared by all workers. All workers can
|
|
|
|
- read/write/update the parameters for the model via ps. As some models are
|
|
|
|
- extremely large, the parameters are split across the ps tasks (each ps stores a
|
|
|
|
- subset).
|
|
|
|
-
|
|
|
|
-* `worker`: does the training.
|
|
|
|
-
|
|
|
|
-* `master`: basically a special worker; it does training, but also restores and
|
|
|
|
- saves checkpoints and does evaluation.
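To make "each ps stores a subset" concrete, here is a toy sketch of round-robin placement, one common way to spread variables over parameter servers. It is plain Python for illustration only, not TensorFlow's implementation; the function name, the variable names, and the two-ps cluster are all made up, and this README does not state which placement strategy the example actually uses.

```python
def assign_round_robin(variable_names, num_ps):
    """Map each variable name to a ps task index, cycling through the ps tasks."""
    return {name: i % num_ps for i, name in enumerate(variable_names)}


# Hypothetical variable names, for illustration only.
variables = ['conv1/kernel', 'conv1/bias', 'fc/kernel', 'fc/bias']
placement = assign_round_robin(variables, num_ps=2)

for name in variables:
    print('%s -> /job:ps/task:%d' % (name, placement[name]))
```

With two ps tasks, each one ends up holding half of the variables, so no single host needs to store the whole model.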
|
|
|
|
-
|
|
|
|
-*Task*
|
|
|
|
-
|
|
|
|
-The task defines the role of the current node. In this example the node
|
|
|
|
-is the master at index 0 in the cluster spec. The task will be different for
|
|
|
|
-each node. An example of the `TF_CONFIG` for a worker would be:
|
|
|
|
-
|
|
|
|
-```python
|
|
|
|
-import json
-
-cluster = {'master': ['master-ip:8000'],
|
|
|
|
- 'ps': ['ps-ip:8000'],
|
|
|
|
- 'worker': ['worker-ip:8000']}
|
|
|
|
-
|
|
|
|
-TF_CONFIG = json.dumps(
|
|
|
|
- {'cluster': cluster,
|
|
|
|
- 'task': {'type': 'worker', 'index': 0},
|
|
|
|
- 'model_dir': 'gs://<bucket_path>/<dir_path>',
|
|
|
|
- 'environment': 'cloud'
|
|
|
|
- })
|
|
|
|
-```
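Since only the `task` entry changes from host to host, the per-host values can be generated from a single cluster spec. The sketch below assumes the same placeholder addresses as the examples above; the helper `tf_config_for` is hypothetical and not part of this example's code.

```python
import json

cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}


def tf_config_for(task_type, index, model_dir='gs://<bucket_path>/<dir_path>'):
    """Build the TF_CONFIG JSON string for one task in the cluster."""
    if index >= len(cluster[task_type]):
        raise ValueError('no %s task with index %d' % (task_type, index))
    return json.dumps({
        'cluster': cluster,
        'task': {'type': task_type, 'index': index},
        'model_dir': model_dir,
        'environment': 'cloud',
    })


# One configuration per (task type, index) pair in the cluster spec.
configs = {(task_type, index): tf_config_for(task_type, index)
           for task_type, hosts in cluster.items()
           for index in range(len(hosts))}

for key in sorted(configs):
    print(key, configs[key])
```

Each string would then be exported as `TF_CONFIG` on the matching host.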
|
|
|
|
-
|
|
|
|
-*Model_dir*
|
|
|
|
-
|
|
|
|
-This is the path where the master will save the checkpoints, graph and
|
|
|
|
-TensorBoard files. For a multi-host environment you may want to use a
|
|
|
|
-distributed file system; Google Storage and DFS are supported.
|
|
|
|
-
|
|
|
|
-*Environment*
|
|
|
|
-
|
|
|
|
-By default the environment is *local*; for a distributed setting we need to
|
|
|
|
-change it to *cloud*.
|
|
|
|
-
|
|
|
|
-### Running script
|
|
|
|
-
|
|
|
|
-Once you have a `TF_CONFIG` configured properly on each host you're ready to run
|
|
|
|
-in a distributed setting.
|
|
|
|
-
|
|
|
|
-#### Master
|
|
|
|
-Run this on master:
|
|
|
|
-Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for
|
|
|
|
-40000 steps. It will run evaluation a couple of times during training. The
|
|
|
|
-num_workers argument is used only to update the learning rate correctly. Make
|
|
|
|
-sure the model_dir is the same as the one defined in the TF_CONFIG.
|
|
|
|
-
|
|
|
|
-```shell
|
|
|
|
-python cifar10_main.py --data-dir=gs://path/cifar-10-data \
|
|
|
|
- --job-dir=gs://path/model_dir/ \
|
|
|
|
- --num-gpus=4 \
|
|
|
|
- --train-steps=40000 \
|
|
|
|
- --sync \
|
|
|
|
- --num-workers=2
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-*Output:*
|
|
|
|
-
|
|
|
|
-```shell
|
|
|
|
-INFO:tensorflow:Using model_dir in TF_CONFIG: gs://path/model_dir/
|
|
|
|
-INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 1, '_keep_checkpoint_max': 5, '_task_type': u'master', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd16fb2be10>, '_model_dir': 'gs://path/model_dir/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': intra_op_parallelism_threads: 1
|
|
|
|
-gpu_options {
|
|
|
|
-}
|
|
|
|
-allow_soft_placement: true
|
|
|
|
-, '_tf_random_seed': None, '_environment': u'cloud', '_num_worker_replicas': 1, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
|
|
|
|
- per_process_gpu_memory_fraction: 1.0
|
|
|
|
-}
|
|
|
|
-, '_evaluation_master': '', '_master': u'grpc://master-ip:8000'}
|
|
|
|
-...
|
|
|
|
-2017-08-01 19:59:26.496208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
|
|
|
|
-name: Tesla K80
|
|
|
|
-major: 3 minor: 7 memoryClockRate (GHz) 0.8235
|
|
|
|
-pciBusID 0000:00:04.0
|
|
|
|
-Total memory: 11.17GiB
|
|
|
|
-Free memory: 11.09GiB
|
|
|
|
-2017-08-01 19:59:26.775660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
|
|
|
|
-name: Tesla K80
|
|
|
|
-major: 3 minor: 7 memoryClockRate (GHz) 0.8235
|
|
|
|
-pciBusID 0000:00:05.0
|
|
|
|
-Total memory: 11.17GiB
|
|
|
|
-Free memory: 11.10GiB
|
|
|
|
-...
|
|
|
|
-2017-08-01 19:59:29.675171: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_1/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_2/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_3/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_4/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_5/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_6/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/avg_pool/: (?, 16, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_5/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_6/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/avg_pool/: (?, 32, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_1/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_2/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_3/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_4/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_5/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_6/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/global_avg_pool/: (?, 64)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/fully_connected/: (?, 11)
|
|
|
|
-INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=1; total_num_replicas=1
|
|
|
|
-INFO:tensorflow:Create CheckpointSaverHook.
|
|
|
|
-INFO:tensorflow:Restoring parameters from gs://path/model_dir/model.ckpt-0
|
|
|
|
-2017-08-01 19:59:37.560775: I tensorflow/core/distributed_runtime/master_session.cc:999] Start master session 156fcb55fe6648d6 with config:
|
|
|
|
-intra_op_parallelism_threads: 1
|
|
|
|
-gpu_options {
|
|
|
|
- per_process_gpu_memory_fraction: 1
|
|
|
|
-}
|
|
|
|
-allow_soft_placement: true
|
|
|
|
-
|
|
|
|
-INFO:tensorflow:Saving checkpoints for 1 into gs://path/model_dir/model.ckpt.
|
|
|
|
-INFO:tensorflow:loss = 1.20682, step = 1
|
|
|
|
-INFO:tensorflow:loss = 1.20682, learning_rate = 0.1
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_1/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_2/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_3/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_4/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_5/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_6/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/avg_pool/: (?, 16, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_5/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_6/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/avg_pool/: (?, 32, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_1/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_2/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_3/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_4/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_5/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_6/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/global_avg_pool/: (?, 64)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/fully_connected/: (?, 11)
|
|
|
|
-INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=2; total_num_replicas=2
|
|
|
|
-INFO:tensorflow:Starting evaluation at 2017-08-01-20:00:14
|
|
|
|
-2017-08-01 20:00:15.745881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0)
|
|
|
|
-2017-08-01 20:00:15.745949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:05.0)
|
|
|
|
-2017-08-01 20:00:15.745958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:06.0)
|
|
|
|
-2017-08-01 20:00:15.745964: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:07.0)
|
|
|
|
-2017-08-01 20:00:15.745969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla K80, pci bus id: 0000:00:08.0)
|
|
|
|
-2017-08-01 20:00:15.745975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla K80, pci bus id: 0000:00:09.0)
|
|
|
|
-2017-08-01 20:00:15.745987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:6) -> (device: 6, name: Tesla K80, pci bus id: 0000:00:0a.0)
|
|
|
|
-2017-08-01 20:00:15.745997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:00:0b.0)
|
|
|
|
-INFO:tensorflow:Restoring parameters from gs://path/model_dir/model.ckpt-10023
|
|
|
|
-INFO:tensorflow:Evaluation [1/100]
|
|
|
|
-INFO:tensorflow:Evaluation [2/100]
|
|
|
|
-INFO:tensorflow:Evaluation [3/100]
|
|
|
|
-INFO:tensorflow:Evaluation [4/100]
|
|
|
|
-INFO:tensorflow:Evaluation [5/100]
|
|
|
|
-INFO:tensorflow:Evaluation [6/100]
|
|
|
|
-INFO:tensorflow:Evaluation [7/100]
|
|
|
|
-INFO:tensorflow:Evaluation [8/100]
|
|
|
|
-INFO:tensorflow:Evaluation [9/100]
|
|
|
|
-INFO:tensorflow:Evaluation [10/100]
|
|
|
|
-INFO:tensorflow:Evaluation [11/100]
|
|
|
|
-INFO:tensorflow:Evaluation [12/100]
|
|
|
|
-INFO:tensorflow:Evaluation [13/100]
|
|
|
|
-...
|
|
|
|
-INFO:tensorflow:Evaluation [100/100]
|
|
|
|
-INFO:tensorflow:Finished evaluation at 2017-08-01-20:00:31
|
|
|
|
-INFO:tensorflow:Saving dict for global step 1: accuracy = 0.0994, global_step = 1, loss = 630.425
|
|
|
|
-```
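In sync mode the gradients from all replicas are combined before a single update is applied; the `SyncReplicasV2` lines in the output above report how many replicas are aggregated. The following plain-Python sketch shows the idea using simple averaging. It is an illustration with made-up numbers and a hypothetical function name, not the optimizer's actual code.

```python
def sync_update(params, replica_grads, learning_rate):
    """Average per-replica gradients, then apply one SGD step."""
    num_replicas = len(replica_grads)
    averaged = [sum(g[i] for g in replica_grads) / num_replicas
                for i in range(len(params))]
    return [p - learning_rate * g for p, g in zip(params, averaged)]


# Two replicas, each contributing a gradient for the same two parameters.
params = [0.0, 0.0]
replica_grads = [[1.0, 2.0], [3.0, 4.0]]
new_params = sync_update(params, replica_grads, learning_rate=0.5)
print(new_params)
```

Because the update waits for every replica, one slow worker delays the whole step; that is the usual trade-off against asynchronous training.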
|
|
|
|
-
|
|
|
|
-#### Worker
|
|
|
|
-
|
|
|
|
-Run this on worker:
|
|
|
|
-Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for
|
|
|
|
-40000 steps. It will run evaluation a couple of times during training. Make sure
|
|
|
|
-the model_dir is the same as the one defined in the TF_CONFIG.
|
|
|
|
-
|
|
|
|
-```shell
|
|
|
|
-python cifar10_main.py --data-dir=gs://path/cifar-10-data \
|
|
|
|
- --job-dir=gs://path/model_dir/ \
|
|
|
|
- --num-gpus=4 \
|
|
|
|
- --train-steps=40000 \
|
|
|
|
- --sync
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-*Output:*
|
|
|
|
-
|
|
|
|
-```shell
|
|
|
|
-INFO:tensorflow:Using model_dir in TF_CONFIG: gs://path/model_dir/
|
|
|
|
-INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600,
|
|
|
|
-'_num_ps_replicas': 1, '_keep_checkpoint_max': 5, '_task_type': u'worker',
|
|
|
|
-'_is_chief': False, '_cluster_spec':
|
|
|
|
-<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f6918438e10>,
|
|
|
|
-'_model_dir': 'gs://<path>/model_dir/',
|
|
|
|
-'_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000,
|
|
|
|
-'_session_config': intra_op_parallelism_threads: 1
|
|
|
|
-gpu_options {
|
|
|
|
-}
|
|
|
|
-allow_soft_placement: true
|
|
|
|
-, '_tf_random_seed': None, '_environment': u'cloud', '_num_worker_replicas': 1,
|
|
|
|
-'_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
|
|
|
|
- per_process_gpu_memory_fraction: 1.0
|
|
|
|
- }
|
|
|
|
-...
|
|
|
|
-2017-08-01 19:59:26.496208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
|
|
|
|
-name: Tesla K80
|
|
|
|
-major: 3 minor: 7 memoryClockRate (GHz) 0.8235
|
|
|
|
-pciBusID 0000:00:04.0
|
|
|
|
-Total memory: 11.17GiB
|
|
|
|
-Free memory: 11.09GiB
|
|
|
|
-2017-08-01 19:59:26.775660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
|
|
|
|
-name: Tesla K80
|
|
|
|
-major: 3 minor: 7 memoryClockRate (GHz) 0.8235
|
|
|
|
-pciBusID 0000:00:05.0
|
|
|
|
-Total memory: 11.17GiB
|
|
|
|
-Free memory: 11.10GiB
|
|
|
|
-...
|
|
|
|
-2017-08-01 19:59:29.675171: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_1/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_2/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_3/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_4/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_5/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_6/: (?, 16, 32, 32)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/avg_pool/: (?, 16, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_5/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_6/: (?, 32, 16, 16)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/avg_pool/: (?, 32, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_1/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_2/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_3/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_4/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_5/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_6/: (?, 64, 8, 8)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/global_avg_pool/: (?, 64)
|
|
|
|
-INFO:tensorflow:image after unit resnet/tower_0/fully_connected/: (?, 11)
|
|
|
|
-INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=2; total_num_replicas=2
|
|
|
|
-INFO:tensorflow:Create CheckpointSaverHook.
|
|
|
|
-2017-07-31 22:38:04.629150: I
|
|
|
|
-tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting
|
|
|
|
-for response from worker: /job:master/replica:0/task:0
|
|
|
|
-2017-07-31 22:38:09.263492: I
|
|
|
|
-tensorflow/core/distributed_runtime/master_session.cc:999] Start master
|
|
|
|
-session cc58f93b1e259b0c with config:
|
|
|
|
-intra_op_parallelism_threads: 1
|
|
|
|
-gpu_options {
|
|
|
|
-per_process_gpu_memory_fraction: 1
|
|
|
|
-}
|
|
|
|
-allow_soft_placement: true
|
|
|
|
-INFO:tensorflow:loss = 5.82382, step = 0
|
|
|
|
-INFO:tensorflow:loss = 5.82382, learning_rate = 0.8
|
|
|
|
-INFO:tensorflow:Average examples/sec: 1116.92 (1116.92), step = 10
|
|
|
|
-INFO:tensorflow:Average examples/sec: 1233.73 (1377.83), step = 20
|
|
|
|
-INFO:tensorflow:Average examples/sec: 1485.43 (2509.3), step = 30
|
|
|
|
-INFO:tensorflow:Average examples/sec: 1680.27 (2770.39), step = 40
|
|
|
|
-INFO:tensorflow:Average examples/sec: 1825.38 (2788.78), step = 50
|
|
|
|
-INFO:tensorflow:Average examples/sec: 1929.32 (2697.27), step = 60
|
|
|
|
-INFO:tensorflow:Average examples/sec: 2015.17 (2749.05), step = 70
|
|
|
|
-INFO:tensorflow:loss = 37.6272, step = 79 (19.554 sec)
|
|
|
|
-INFO:tensorflow:loss = 37.6272, learning_rate = 0.8 (19.554 sec)
|
|
|
|
-INFO:tensorflow:Average examples/sec: 2074.92 (2618.36), step = 80
|
|
|
|
-INFO:tensorflow:Average examples/sec: 2132.71 (2744.13), step = 90
|
|
|
|
-INFO:tensorflow:Average examples/sec: 2183.38 (2777.21), step = 100
|
|
|
|
-INFO:tensorflow:Average examples/sec: 2224.4 (2739.03), step = 110
|
|
|
|
-INFO:tensorflow:Average examples/sec: 2240.28 (2431.26), step = 120
|
|
|
|
-INFO:tensorflow:Average examples/sec: 2272.12 (2739.32), step = 130
|
|
|
|
-INFO:tensorflow:Average examples/sec: 2300.68 (2750.03), step = 140
|
|
|
|
-INFO:tensorflow:Average examples/sec: 2325.81 (2745.63), step = 150
|
|
|
|
-INFO:tensorflow:Average examples/sec: 2347.14 (2721.53), step = 160
|
|
|
|
-INFO:tensorflow:Average examples/sec: 2367.74 (2754.54), step = 170
|
|
|
|
-INFO:tensorflow:loss = 27.8453, step = 179 (18.893 sec)
|
|
|
|
-...
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-#### PS
|
|
|
|
-
|
|
|
|
-Run this on ps:
|
|
|
|
-The ps will not do training, so most of the arguments won't affect the execution.
|
|
|
|
-
|
|
|
|
-```shell
|
|
|
|
-python cifar10_main.py --job-dir=gs://path/model_dir/
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-*Output:*
|
|
|
|
-
|
|
|
|
-```shell
|
|
|
|
-INFO:tensorflow:Using model_dir in TF_CONFIG: gs://path/model_dir/
|
|
|
|
-INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 1, '_keep_checkpoint_max': 5, '_task_type': u'ps', '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f48f1addf90>, '_model_dir': 'gs://path/model_dir/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': intra_op_parallelism_threads: 1
|
|
|
|
-gpu_options {
|
|
|
|
-}
|
|
|
|
-allow_soft_placement: true
|
|
|
|
-, '_tf_random_seed': None, '_environment': u'cloud', '_num_worker_replicas': 1, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
|
|
|
|
- per_process_gpu_memory_fraction: 1.0
|
|
|
|
-}
|
|
|
|
-, '_evaluation_master': '', '_master': u'grpc://master-ip:8000'}
|
|
|
|
-2017-07-31 22:54:58.928088: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> master-ip:8000}
|
|
|
|
-2017-07-31 22:54:58.928153: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:8000}
|
|
|
|
-2017-07-31 22:54:58.928160: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker-ip:8000}
|
|
|
|
-2017-07-31 22:54:58.929873: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-## Visualizing results with TensorBoard
|
|
|
|
-
|
|
|
|
-When using Estimators you can also visualize your data in TensorBoard, with no
|
|
|
|
-changes in your code. You can use TensorBoard to visualize your TensorFlow
|
|
|
|
-graph, plot quantitative metrics about the execution of your graph, and show
|
|
|
|
-additional data like images that pass through it.
|
|
|
|
-
|
|
|
|
-You'll see something similar to this if you "point" TensorBoard to the
|
|
|
|
-`job dir` parameter you used to train or evaluate your model.
|
|
|
|
-
|
|
|
|
-Check TensorBoard during training or after it. Just point TensorBoard to the
|
|
|
|
-model_dir you chose on the previous step.
|
|
|
|
-
|
|
|
|
-```shell
|
|
|
|
-tensorboard --logdir="<job dir>"
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-## Warnings
|
|
|
|
-
|
|
|
|
-When running `cifar10_main.py` with the `--sync` argument you may see an error
|
|
|
|
-similar to:
|
|
|
|
-
|
|
|
|
-```python
|
|
|
|
-File "cifar10_main.py", line 538, in <module>
|
|
|
|
- tf.app.run()
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
|
|
|
|
- _sys.exit(main(_sys.argv[:1] + flags_passthrough))
|
|
|
|
-File "cifar10_main.py", line 518, in main
|
|
|
|
- hooks), run_config=config)
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run
|
|
|
|
- return _execute_schedule(experiment, schedule)
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule
|
|
|
|
- return task()
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 501, in train_and_evaluate
|
|
|
|
- hooks=self._eval_hooks)
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 681, in _call_evaluate
|
|
|
|
- hooks=hooks)
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 292, in evaluate
|
|
|
|
- name=name)
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 638, in _evaluate_model
|
|
|
|
- features, labels, model_fn_lib.ModeKeys.EVAL)
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 545, in _call_model_fn
|
|
|
|
- features=features, labels=labels, **kwargs)
|
|
|
|
-File "cifar10_main.py", line 331, in _resnet_model_fn
|
|
|
|
- gradvars, global_step=tf.train.get_global_step())
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py", line 252, in apply_gradients
|
|
|
|
- variables.global_variables())
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped
|
|
|
|
- return _add_should_use_warning(fn(*args, **kwargs))
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning
|
|
|
|
- wrapped = TFShouldUseWarningWrapper(x)
|
|
|
|
-File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__
|
|
|
|
- stack = [s.strip() for s in traceback.format_stack()]
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-This should not affect your training, and should be fixed in upcoming releases.
|
|
|