Running Apache Spark workloads
What is Spark?
Apache Spark is an open-source, distributed, multi-language data processing system. It performs processing tasks quickly by using in-memory caching, query execution optimization, and other techniques while distributing workloads across multiple managed workers.
More information can be found at https://spark.apache.org/
Running on the platform
Our customers can run Spark workloads in several ways:
as a collection of platform Jobs orchestrated by neuro-flow (see the dedicated repository example)
as a PySpark application with the Driver running inside a Jupyter server (see Jupyter + PySpark + Kubernetes below)
as a PySpark application managed by spark-on-k8s-operator (example yet to be added)
In the latter two cases, contact the cluster manager to make the required configuration and obtain the Kubernetes credentials needed by the Spark Driver. These credentials are supplied as a kubectl config file and should be mounted into the Driver job at the expected path (~/.kube/config).
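Once the job is running, the mounted credentials can be sanity-checked from inside the Driver job with standard kubectl commands (this is a sketch; it requires a live cluster and assumes kubectl is installed in the image):

```shell
# Inside the Driver job: verify the mounted kubeconfig is picked up.
kubectl config view --minify     # prints the cluster endpoint from ~/.kube/config

# The Spark Driver needs permission to create executor pods; expect "yes".
kubectl auth can-i create pods
```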
Jupyter + PySpark + Kubernetes
This section contains an example user flow for submitting workloads from Jupyter with PySpark to the underlying Kubernetes cluster.
Prerequisite: the kubectl credentials file is obtained from the cluster manager and saved as a KUBECONFIG platform secret. The secret name can be different; just make sure you select the proper secret during the first step.
Launch a Jupyter Lab (or Notebook) application from the Dashboard using the pyspark-jupyter container image, attaching the kubectl config at /root/.kube/config.
Open a Notebook and adjust the PySpark context configuration with the namespace, service account name, and Kubernetes API endpoint provided by the cluster manager.
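As a sketch, the configuration step might look like the following. The API endpoint, namespace, service account, executor count, and container image are placeholders; substitute the values your cluster manager provides:

```python
from pyspark.sql import SparkSession

# Placeholder values -- replace with those supplied by your cluster manager.
k8s_api = "k8s://https://10.0.0.1:6443"   # Kubernetes API endpoint (k8s:// prefix required)
namespace = "spark-jobs"                  # namespace where executor pods are created
service_account = "spark"                 # service account with pod-create rights

spark = (
    SparkSession.builder
    .master(k8s_api)
    .appName("pyspark-on-k8s")
    .config("spark.kubernetes.namespace", namespace)
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", service_account)
    .config("spark.executor.instances", "2")
    .config("spark.kubernetes.container.image", "apache/spark-py:latest")
    .getOrCreate()
)
```

The `spark.kubernetes.*` keys are standard Spark-on-Kubernetes configuration properties; the credentials mounted at ~/.kube/config are picked up automatically when the Driver talks to the Kubernetes API.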
Now you can start deploying Spark workloads to the underlying Kubernetes cluster.
Do not forget to tear down the workers with spark.stop().
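A minimal end-to-end check, assuming `spark` is the session configured in the previous step (this runs against the live cluster, so it is not executable outside the Driver job):

```python
# Run a trivial distributed job to confirm executors come up,
# then release the executor pods.
rdd = spark.sparkContext.parallelize(range(100), numSlices=4)
print(rdd.sum())  # sum of 0..99, computed on the executor pods

spark.stop()  # terminates the executor pods in the cluster
```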