@@ -19,6 +19,8 @@
Must:
- Apache Hadoop 2.7 or above.
+- TonY library 0.3.2 or above. You can download the latest TonY jar from
+https://github.com/linkedin/TonY/releases.
Optional:
@@ -149,9 +151,106 @@ java org.apache.hadoop.yarn.submarine.client.cli.Cli job run --name tf-job-001 \
--worker_resources memory=3G,vcores=2 \
--num_ps 2 \
--ps_resources memory=3G,vcores=2 \
- --worker_launch_cmd "venv.zip/venv/bin/python --steps 1000 --data_dir /tmp/data --working_dir /tmp/mode" \
- --ps_launch_cmd "venv.zip/venv/bin/python --steps 1000 --data_dir /tmp/data --working_dir /tmp/mode" \
- --container_resources /home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.2-all.jar
+ --worker_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py --steps 1000 --data_dir /tmp/data --working_dir /tmp/mode" \
+ --ps_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py --steps 1000 --data_dir /tmp/data --working_dir /tmp/mode" \
+ --insecure \
+ --conf tony.containers.resources=PATH_TO_VENV_YOU_CREATED/venv.zip#archive,PATH_TO_MNIST_EXAMPLE/mnist_distributed.py,\
+PATH_TO_TONY_CLI_JAR/tony-cli-0.3.2-all.jar
+
+```
+You should then be able to see the links and status of the tasks on the command line:
+
+```
+2019-04-22 20:30:42,611 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: RUNNING
+2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: RUNNING
+2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: RUNNING
+2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for ps 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi
+2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi
+2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 1 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi
+2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: FINISHED
+2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: FINISHED
+2019-04-22 20:30:44,626 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: FINISHED
+
+```
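The `Tasks Status Updated` lines above follow a regular shape, so a script can track task state from the client output. Below is a minimal Python sketch, assuming the log format stays exactly as in the sample; `TaskStatus`, `parse_task_line`, and `latest_statuses` are hypothetical helper names, not part of TonY or Submarine.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class TaskStatus(NamedTuple):
    name: str    # task type, e.g. "worker" or "ps"
    index: int   # task index within its type
    url: str     # container log URL
    status: str  # e.g. "RUNNING" or "FINISHED"


# Pattern modeled on the sample log lines (an assumption, not a documented format).
TASK_LINE = re.compile(
    r"\[TaskInfo\] name: (?P<name>\S+) index: (?P<index>\d+) "
    r"url: (?P<url>\S+) status: (?P<status>\S+)"
)


def parse_task_line(line: str) -> Optional[TaskStatus]:
    """Parse one client log line; return None if it is not a task status line."""
    match = TASK_LINE.search(line)
    if match is None:
        return None
    return TaskStatus(
        match.group("name"),
        int(match.group("index")),
        match.group("url"),
        match.group("status"),
    )


def latest_statuses(lines: Iterable[str]) -> Dict[Tuple[str, int], str]:
    """Return {(name, index): status}, keeping the last status seen per task."""
    latest: Dict[Tuple[str, int], str] = {}
    for line in lines:
        task = parse_task_line(line)
        if task is not None:
            latest[(task.name, task.index)] = task.status
    return latest
```

Feeding the captured client output to `latest_statuses` gives one entry per task, which makes it easy to check whether every worker and parameter server has finished.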
+
+### With Docker
+
+```
+CLASSPATH=$(hadoop classpath --glob):\
+./hadoop-submarine-core/target/hadoop-submarine-core-0.2.0-SNAPSHOT.jar:\
+./hadoop-submarine-yarnservice-runtime/target/hadoop-submarine-yarnservice-runtime-0.2.0-SNAPSHOT.jar:\
+./hadoop-submarine-tony-runtime/target/hadoop-submarine-tony-runtime-0.2.0-SNAPSHOT.jar:\
+/home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.2-all.jar \
+java org.apache.hadoop.yarn.submarine.client.cli.Cli job run --name tf-job-001 \
+ --docker_image hadoopsubmarine/tf-1.8.0-cpu:0.0.3 \
+ --input_path hdfs://pi-aw:9000/dataset/cifar-10-data \
+ --worker_resources memory=3G,vcores=2 \
+ --worker_launch_cmd "export CLASSPATH=\$(/hadoop-3.1.0/bin/hadoop classpath --glob) && cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --variable-strategy=CPU --num-gpus=0 --sync" \
+ --env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
+ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
+ --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
+ --env HADOOP_HOME=/hadoop-3.1.0 \
+ --env HADOOP_YARN_HOME=/hadoop-3.1.0 \
+ --env HADOOP_COMMON_HOME=/hadoop-3.1.0 \
+ --env HADOOP_HDFS_HOME=/hadoop-3.1.0 \
+ --env HADOOP_CONF_DIR=/hadoop-3.1.0/etc/hadoop \
+ --conf tony.containers.resources=/home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.2-all.jar
+```
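The `tony.containers.resources` values in the commands above are comma-separated lists of paths, with `#archive` appended to the virtual environment zip. A stray space inside the list would split it into separate shell arguments, so one option is to assemble the value programmatically. A minimal sketch; `build_container_resources` is a hypothetical helper, and appending `#archive` to every `.zip` path is an assumption modeled on `venv.zip#archive`.

```python
from typing import Iterable


def build_container_resources(paths: Iterable[str]) -> str:
    """Join resource paths into one comma-separated value with no whitespace.

    Assumption: every .zip path should carry the '#archive' suffix,
    mirroring 'venv.zip#archive' in the commands above.
    """
    parts = []
    for path in paths:
        path = path.strip()  # drop accidental surrounding whitespace
        if path.endswith(".zip"):
            path += "#archive"
        parts.append(path)
    return ",".join(parts)


if __name__ == "__main__":
    # Placeholder paths taken from the commands above.
    print(build_container_resources([
        "PATH_TO_VENV_YOU_CREATED/venv.zip",
        "PATH_TO_MNIST_EXAMPLE/mnist_distributed.py",
        "PATH_TO_TONY_CLI_JAR/tony-cli-0.3.2-all.jar",
    ]))
```

The printed string can be passed as a single argument after `--conf tony.containers.resources=`.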
+
+
+### Launch PyTorch Application:
+
+#### Commandline
+
+### Without Docker
+
+You need:
+* A Python virtual environment with PyTorch 0.4.* installed
+* A cluster with Hadoop 2.7 or above.
+
+### Building a Python virtual environment with PyTorch
+
+TonY requires a Python virtual environment zip with PyTorch and any needed Python libraries already installed.
+
+```
+wget https://files.pythonhosted.org/packages/33/bc/fa0b5347139cd9564f0d44ebd2b147ac97c36b2403943dbee8a25fd74012/virtualenv-16.0.0.tar.gz
+tar xf virtualenv-16.0.0.tar.gz
+
+python virtualenv-16.0.0/virtualenv.py venv
+. venv/bin/activate
+pip install torch==0.4.0
+zip -r venv.zip venv
+```
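Because the launch commands reference `venv.zip/venv/bin/python`, it can help to confirm that the archive really contains that path before submitting a job. A minimal sketch, assuming the zip was built with `zip -r venv.zip venv` as above; `has_interpreter` is a hypothetical helper, not part of TonY.

```python
import sys
import zipfile


def has_interpreter(zip_path: str, member: str = "venv/bin/python") -> bool:
    """Return True if the zip archive contains an entry named `member`."""
    with zipfile.ZipFile(zip_path) as archive:
        return member in archive.namelist()


if __name__ == "__main__":
    # Usage: python check_venv.py venv.zip
    path = sys.argv[1] if len(sys.argv) > 1 else "venv.zip"
    if has_interpreter(path):
        print(f"{path}: found venv/bin/python")
    else:
        sys.exit(f"{path}: venv/bin/python is missing")
```

If the check fails, the most likely cause is that the archive was created from inside the `venv` directory, so entries lack the `venv/` prefix.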
+
+### PyTorch version
+
+ - Version 0.4.0+
+
+
+### Installing Hadoop
+
+TonY only requires YARN, not HDFS. Please see the [open-source documentation](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html) on how to set YARN up.
+
+### Get the training examples
+
+Get mnist_distributed.py from https://github.com/linkedin/TonY/tree/master/tony-examples/mnist-pytorch
+
+
+```
+CLASSPATH=$(hadoop classpath --glob):\
+./hadoop-submarine-core/target/hadoop-submarine-core-0.2.0-SNAPSHOT.jar:\
+./hadoop-submarine-yarnservice-runtime/target/hadoop-submarine-yarnservice-runtime-0.2.0-SNAPSHOT.jar:\
+./hadoop-submarine-tony-runtime/target/hadoop-submarine-tony-runtime-0.2.0-SNAPSHOT.jar:\
+/home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.2-all.jar \
+java org.apache.hadoop.yarn.submarine.client.cli.Cli job run --name tf-job-001 \
+ --num_workers 2 \
+ --worker_resources memory=3G,vcores=2 \
+ --num_ps 2 \
+ --ps_resources memory=3G,vcores=2 \
+ --worker_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py" \
+ --ps_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py" \
 --insecure \
 --conf tony.containers.resources=PATH_TO_VENV_YOU_CREATED/venv.zip#archive,PATH_TO_MNIST_EXAMPLE/mnist_distributed.py,\
PATH_TO_TONY_CLI_JAR/tony-cli-0.3.2-all.jar