title: Spark in Kubernetes with OzoneFS menu: main:
parent: Recipes
This recipe shows how Ozone object store can be used from Spark using:
Download latest Spark and Ozone distribution and extract them. This method is
tested with the spark-2.4.0-bin-hadoop2.7
distribution.
You also need the following:
First of all create a docker image with the Spark image creator. Execute the following from the Spark distribution
./bin/docker-image-tool.sh -r myrepo -t 2.4.0 build
Note: if you use Minikube add the -m
flag to use the docker daemon of the Minikube image:
./bin/docker-image-tool.sh -m -r myrepo -t 2.4.0 build
./bin/docker-image-tool.sh
is an official Spark tool to create container images and this step will create multiple Spark container images with the name myrepo/spark. The first container will be used as a base container in the following steps.
Create a new directory for customizing the created docker image.
Copy the ozone-site.xml
from the cluster:
kubectl cp om-0:/opt/hadoop/etc/hadoop/ozone-site.xml .
And create a custom core-site.xml
:
<configuration>
<property>
<name>fs.o3fs.impl</name>
<value>org.apache.hadoop.fs.ozone.BasicOzoneFileSystem</value>
</property>
</configuration>
Note: You may also use org.apache.hadoop.fs.ozone.OzoneFileSystem
without the Basic
prefix. The Basic
version doesn't support FS statistics and encryption zones but can work together with older hadoop versions.
Copy the ozonefs.jar
file from an ozone distribution (use the legacy version!)
kubectl cp om-0:/opt/hadoop/share/ozone/lib/hadoop-ozone-filesystem-lib-legacy-0.4.0-SNAPSHOT.jar .
Create a new Dockerfile and build the image:
FROM myrepo/spark:2.4.0
ADD core-site.xml /opt/hadoop/conf/core-site.xml
ADD ozone-site.xml /opt/hadoop/conf/ozone-site.xml
ENV HADOOP_CONF_DIR=/opt/hadoop/conf
ENV SPARK_EXTRA_CLASSPATH=/opt/hadoop/conf
ADD hadoop-ozone-filesystem-lib-legacy-0.4.0-SNAPSHOT.jar /opt/hadoop-ozone-filesystem-lib-legacy.jar
docker build -t myrepo/spark-ozone
For remote kubernetes cluster you may need to push it:
docker push myrepo/spark-ozone
Download any text file and put it to the /tmp/alice.txt
first.
kubectl port-forward s3g-0 9878:9878
aws s3api --endpoint http://localhost:9878 create-bucket --bucket=test
aws s3api --endpoint http://localhost:9878 put-object --bucket test --key alice.txt --body /tmp/alice.txt
kubectl exec -it scm-0 ozone s3 path test
The output of the last command is something like this:
Volume name for S3Bucket is : s3asdlkjqiskjdsks
Ozone FileSystem Uri is : o3fs://test.s3asdlkjqiskjdsks
Write down the ozone filesystem uri as it should be used with the spark-submit command.
kubectl create serviceaccount spark -n yournamespace
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=yournamespace:spark --namespace=yournamespace
Execute the following spark-submit command, but change at least the following values:
location of the input file (o3fs://...), use the string which is identified earlier with the ozone s3 path <bucketname>
command
bin/spark-submit \
--master k8s://https://kubernetes:6443 \
--deploy-mode cluster \
--name spark-word-count \
--class org.apache.spark.examples.JavaWordCount \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.namespace=yournamespace \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=myrepo/spark-ozone \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--jars /opt/hadoop-ozone-filesystem-lib-legacy.jar \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar \
o3fs://bucket.volume/alice.txt
Check the available spark-word-count-...
pods with kubectl get pod
Check the output of the calculation with kubectl logs spark-word-count-1549973913699-driver
You should see the output of the wordcount job. For example:
...
name: 8
William: 3
this,': 1
SOUP!': 1
`Silence: 1
`Mine: 1
ordered.: 1
considering: 3
muttering: 3
candle: 2
...