Understanding Spark: Part 4: Operations

1. How to specify the resources needed while submitting a Spark application?

  • While submitting an application, the cluster manager it will run on must be specified using the --master parameter:

    • yarn
    • mesos://HOST:PORT
    • local / local[n]
    • spark://HOST:PORT (standalone)
  • Once the master is specified, the number of executors can be set using

    • --num-executors n
  • CPU and memory requirements for the executors can be specified using

    • --executor-memory ( for example 20G )
    • --executor-cores n (on YARN) or --total-executor-cores n (on standalone/Mesos)
  • Driver CPU and memory requirements can be specified using

    • --driver-memory ( for example 2G )
    • --driver-cores n
  • More examples can be found at http://spark.apache.org/docs/latest/submitting-applications.html
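  • For illustration, a submission combining these options might look like the following sketch (the class name and jar path are placeholders; --executor-cores is shown because the master is YARN, while --total-executor-cores is the standalone/Mesos equivalent):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 10 \
      --executor-memory 20G \
      --executor-cores 4 \
      --driver-memory 2G \
      --driver-cores 2 \
      --class com.example.MyApp \
      my-app.jar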

2. What is the difference between deploy-mode client and cluster?

  • deploy-mode is used to specify the location where the driver is going to run.

    • --deploy-mode client/cluster
  • In client mode, the driver runs on the client machine, usually the machine from which the Spark application is submitted. This is the default mode, and the only mode available when running interactively (for example spark-shell).
  • In cluster mode, the driver runs on one of the worker machines in the cluster. This is the recommended mode for batch jobs.
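  • As a quick sketch, the same application (the jar name is a placeholder) submitted in both modes:

    # Driver runs on the submitting machine
    spark-submit --master yarn --deploy-mode client my-app.jar

    # Driver runs on a node inside the cluster
    spark-submit --master yarn --deploy-mode cluster my-app.jar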

3. Can resources be allocated to Spark applications dynamically?

  • Yes, a Spark application can be configured to use resources dynamically. In this case a minimum and maximum number of executors need to be specified. The application can then gracefully decommission executors that are not in use, but cannot go below the minimum number of executors. Similarly, it can request more executors, but cannot exceed the configured maximum. An initial number of executors to start with is also defined.
  • The following properties need to be configured for dynamic allocation:

    • spark.dynamicAllocation.enabled = true
    • spark.shuffle.service.enabled = true
    • spark.dynamicAllocation.initialExecutors = 3 (initial number of executors to run when dynamic allocation is enabled; defaults to spark.dynamicAllocation.minExecutors)
    • spark.dynamicAllocation.minExecutors = 3 (the executor count shrinks back to this number when executors sit idle for more than 60 seconds by default, controlled by spark.dynamicAllocation.executorIdleTimeout)
    • spark.dynamicAllocation.maxExecutors = 30 (maximum number of executors the application can request)
  • With dynamic allocation, Spark requests new executors when tasks have been pending for longer than the time configured by spark.dynamicAllocation.schedulerBacklogTimeout, and it continues to trigger requests every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds while the backlog persists.
  • The number of executors requested grows exponentially: the application adds 1 executor in the first round, then 2, 4, 8 and so on in subsequent rounds.
  • More detailed explanation is given at https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
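  • As a sketch, these properties can also be passed at submit time via --conf (the values mirror the figures above; the jar name is a placeholder):

    spark-submit \
      --master yarn \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.initialExecutors=3 \
      --conf spark.dynamicAllocation.minExecutors=3 \
      --conf spark.dynamicAllocation.maxExecutors=30 \
      my-app.jar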

4. On which port does the web UI for monitoring a Spark application run?

  • Every Spark application has a web UI. The UI is opened on port 4040 of the machine where the driver is running. If port 4040 is not available, it tries the subsequent ports 4041, 4042 and so on.
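  • For example, if the driver runs on the local machine, the UI is usually reachable at http://localhost:4040 (or 4041, 4042, ... if 4040 was already taken). The hostname here is an assumption; the exact URL is printed in the driver logs.

    # Quick check that the application UI is up (hostname/port are assumptions)
    curl -I http://localhost:4040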

5. What does collect() do on an RDD, and what are its drawbacks?

  • collect() on an RDD brings all the partitions of the RDD to the driver, so the complete RDD is available in the driver as a single buffer. The driver can then feed this data to an external library such as pandas or matplotlib for further processing or visualization. This is a very useful technique.
  • The drawback of the call is that if the RDD is very large, the driver may crash with an Out Of Memory error, thereby killing the whole Spark application.
  • So, collect() should be used judiciously, to bring only a small set of data into driver memory. To guard against unexpected failures, spark.driver.maxResultSize can be set to limit the size of data that can be collected.
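  • As a sketch, this limit can be set at submit time (the 2g value is illustrative, and the jar name is a placeholder); an oversized collect() then fails with a clear error instead of exhausting driver memory:

    spark-submit --conf spark.driver.maxResultSize=2g my-app.jar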

6. How to pass additional jar files, data files or Python files to the Spark application?

  • A Spark application may need additional third-party libraries in the form of jar files or Python files. Since the application uses these libraries and their APIs, the files need to be made available to all the executors at run time.
  • JARs can be passed using the --jars option of spark-submit; Python dependencies can be passed with --py-files.
  • The --files and --archives options support specifying file names with a # similar to Hadoop. For example, you can specify --files localtest.txt#appSees.txt; this will upload the file you have locally named localtest.txt into HDFS, but it will be linked to by the name appSees.txt, and your application should use the name appSees.txt to reference it when running on YARN.
  • All the JAR file and additional files passed to the spark application are copied to the working directory for each SparkContext on the executor nodes. This can take a significant amount of space over a long period of time and will need to be cleaned up. With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.
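  • A sketch of a submission that ships extra jars, a Python dependency and a data file (all file names are placeholders):

    spark-submit \
      --master yarn \
      --jars extra-lib1.jar,extra-lib2.jar \
      --py-files helpers.py \
      --files localtest.txt#appSees.txt \
      my_app.py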

8. What are some useful commands for managing applications on YARN?

  • List all applications

    • yarn application -list
  • application status

    • yarn application -status <Application-ID>
  • kill an application

    • yarn application -kill <Application-ID>
  • List applications in specific states

    • yarn application -list -appStates <States>
    • The valid application states are: ALL, NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
  • Check logs for an application

    • yarn logs -applicationId <Application-ID>
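  • A typical workflow might be: find the running application, check its status, then pull its logs (the grep pattern and the application ID are placeholders):

    yarn application -list -appStates RUNNING | grep my-spark-app
    yarn application -status application_1234567890123_0042
    yarn logs -applicationId application_1234567890123_0042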

9. How to submit a spark application to a specific queue in Yarn?

  • To submit a Spark application to a YARN queue, submit the application or start the Spark shell with --master yarn and provide --queue <queue name>. The queue must already be configured and available in YARN, and the user submitting the application must have permission to deploy applications to that queue; check the YARN scheduler configuration for this.
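  • A sketch of such a submission (the queue name and jar are placeholders):

    spark-submit --master yarn --deploy-mode cluster --queue analytics my-app.jar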

10. Is there a mechanism to restart the Spark driver if it fails?

  • Yes, but only when the application runs in cluster mode. By default, YARN will restart the container running the driver. In Spark standalone and Mesos modes, you can provide the --supervise option to restart the driver if it fails.
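  • A sketch of a submission to a standalone cluster with driver supervision (the master URL and jar are placeholders; --supervise requires cluster deploy mode):

    spark-submit --master spark://master-host:7077 --deploy-mode cluster --supervise my-app.jar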