This topic describes how to use the spark-submit CLI to submit a Spark job after connecting E-MapReduce (EMR) Serverless Spark to Elastic Compute Service (ECS).
Prerequisites
Java Development Kit (JDK) V1.8 or later is installed.
If you use a RAM user to submit a Spark job, you must add the RAM user to the Serverless Spark workspace and grant the developer role or a higher role to the RAM user. For more information, see Manage users and roles.
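You can quickly verify the JDK prerequisite on the ECS instance by checking the Java version. This is only a sanity check; the exact output format depends on your JDK distribution:
# Print the installed Java version; 1.8 or later satisfies the prerequisite
java -version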
Procedure
Step 1: Download and install the EMR Serverless spark-submit tool
Click emr-serverless-spark-tool-0.6.3-SNAPSHOT-bin.zip to download the installation package.
Upload the installation package to your ECS instance. For more information, see Upload or download files.
Run the following command to decompress the installation package and install the spark-submit tool:
unzip emr-serverless-spark-tool-0.6.3-SNAPSHOT-bin.zip
Step 2: Configure parameters
In an environment where Spark is installed, if the SPARK_CONF_DIR environment variable is set, you must place the configuration file in the directory specified by SPARK_CONF_DIR. For example, in an EMR cluster, this directory is typically /etc/taihao-apps/spark-conf. Otherwise, an error occurs.
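If this applies to your environment, copy the file to that directory after you configure it in the next step. The following command is a minimal sketch that assumes the typical EMR cluster path mentioned above; adjust the paths to match your environment:
# Copy the configured connection.properties into the directory specified by SPARK_CONF_DIR
sudo cp emr-serverless-spark-tool-0.6.3-SNAPSHOT/conf/connection.properties /etc/taihao-apps/spark-conf/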
Run the following command to modify the configuration of the connection.properties file:
vim emr-serverless-spark-tool-0.6.3-SNAPSHOT/conf/connection.properties
We recommend that you configure the file in the key=value format. Example:
accessKeyId=<yourAccessKeyId>
accessKeySecret=<yourAccessKeySecret>
regionId=cn-hangzhou
endpoint=emr-serverless-spark.cn-hangzhou.aliyuncs.com
workspaceId=w-xxxxxxxxxxxx
The parameters are described as follows:
accessKeyId (required): The AccessKey ID of the Alibaba Cloud account or RAM user that is used to run the Spark job.
accessKeySecret (required): The AccessKey secret of the Alibaba Cloud account or RAM user that is used to run the Spark job.
Important: When you configure the accessKeyId and accessKeySecret parameters, make sure that the user to which the AccessKey pair belongs has read and write permissions on the OSS bucket that is bound to the workspace. You can view the bound OSS bucket on the Spark page by clicking Details in the Actions column of the workspace.
regionId (required): The ID of the region in which EMR Serverless Spark resides. In this example, the China (Hangzhou) region is used.
endpoint (required): The endpoint of EMR Serverless Spark. For more information about endpoints, see Service registration. In this example, the public endpoint of the China (Hangzhou) region is used: emr-serverless-spark.cn-hangzhou.aliyuncs.com. Note: If the ECS instance cannot access the Internet, you must use the virtual private cloud (VPC) endpoint of EMR Serverless Spark.
workspaceId (required): The ID of the EMR Serverless Spark workspace.
Step 3: Submit a Spark job
Run the following command to go to the directory of the spark-submit tool:
cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
Select a job submission method based on the job type.
When you submit a job, you must specify the file resources (such as JAR packages or Python scripts) on which the job depends. These file resources can be stored in OSS or locally, depending on your scenario and requirements. In this topic, OSS resources are used as examples.
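If your dependency files are stored locally and you want to stage them in OSS, one option is the ossutil command-line tool. This is only a sketch and assumes that ossutil is installed and configured on the ECS instance; you can also upload files in the OSS console. The bucket name and path are placeholders:
# Upload the test JAR package to OSS (replace the bucket name and path with your own)
ossutil cp spark-examples_2.12-3.3.1.jar oss://<yourBucketName>/path/to/spark-examples_2.12-3.3.1.jar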
Use spark-submit
spark-submit is a general-purpose job submission tool provided by Spark. It is suitable for jobs launched from Java, Scala, or PySpark.
Spark job launched from Java or Scala
In this example, spark-examples_2.12-3.3.1.jar is used. You can click spark-examples_2.12-3.3.1.jar to download the test JAR package, and then upload the JAR package to OSS. The JAR package is a simple example provided by Spark to calculate the value of pi (π).
./bin/spark-submit --name SparkPi \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--class org.apache.spark.examples.SparkPi \
oss://<yourBucketName>/path/to/spark-examples_2.12-3.3.1.jar \
10000
Spark job launched from PySpark
In this example, DataFrame.py and employee.csv are used. You can click DataFrame.py and employee.csv to download the test files, and then upload the test files to OSS.
Note: The DataFrame.py file contains the code that is used to process data in Object Storage Service (OSS) under the Apache Spark framework.
The employee.csv file contains data such as employee names, departments, and salaries.
./bin/spark-submit --name PySpark \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--conf spark.tags.key=value \
oss://<yourBucketName>/path/to/DataFrame.py \
oss://<yourBucketName>/path/to/employee.csv
Field description:
Supported open source parameters
--name (example: SparkPi): The application name of the Spark job, which is used to identify the job.
--class (example: org.apache.spark.examples.SparkPi): The entry class of the Spark job. This parameter is required only if the Spark job is launched from Java or Scala.
--num-executors (example: 5): The number of executors of the Spark job.
--driver-cores (example: 1): The number of driver cores of the Spark job.
--driver-memory (example: 1g): The size of the driver memory of the Spark job.
--executor-cores (example: 2): The number of executor cores of the Spark job.
--executor-memory (example: 2g): The size of the executor memory of the Spark job.
--files (example: oss://<yourBucketName>/file1,oss://<yourBucketName>/file2): The resource files that the Spark job references. The files can be OSS resources or local files. Separate multiple files with commas (,).
--py-files (example: oss://<yourBucketName>/file1.py,oss://<yourBucketName>/file2.py): The Python scripts that the Spark job references. The scripts can be OSS resources or local files. Separate multiple files with commas (,). This parameter is required only if the Spark job is launched from PySpark.
--jars (example: oss://<yourBucketName>/file1.jar,oss://<yourBucketName>/file2.jar): The JAR packages that the Spark job references. The packages can be OSS resources or local files. Separate multiple files with commas (,).
--archives (example: oss://<yourBucketName>/archive.tar.gz#env,oss://<yourBucketName>/archive2.zip): The archive packages that the Spark job references. The packages can be OSS resources or local files. Separate multiple files with commas (,).
--queue (example: root_queue): The name of the queue on which the Spark job runs. The queue name must be the same as that of a queue in the EMR Serverless Spark workspace.
--proxy-user (example: test): The value that you set overwrites the HADOOP_USER_NAME environment variable. The behavior is the same as that of the open source version.
--conf (example: spark.tags.key=value): A custom parameter of the Spark job.
--status (example: jr-8598aa9f459d****): Queries the status of the Spark job.
--kill (example: jr-8598aa9f459d****): Terminates the Spark job.
Non-open source enhanced parameters
--detach (no value required): Exits the spark-submit tool immediately after the Spark job is submitted, so you do not need to wait for the tool to return the job status.
--detail (example: jr-8598aa9f459d****): Queries the details of the Spark job.
--release-version (example: esr-4.1.1 (Spark 3.5.2, Scala 2.12, Java Runtime)): The Spark engine version to use. Configure this parameter based on the engine version that is displayed in the console.
--enable-template (no value required): Enables the configuration template feature. If you have created a configuration template in Configuration Management, you can specify the template ID by using the spark.emr.serverless.templateId parameter in --conf. For more information about how to create a template, see Manage configurations. The behavior is as follows:
- If you specify only --enable-template, the job applies the default configuration template of the workspace.
- If you specify only --conf spark.emr.serverless.templateId, the job applies the specified template.
- If you specify both --enable-template and --conf spark.emr.serverless.templateId, the template ID specified in --conf overwrites the default template.
- If you specify neither --enable-template nor --conf spark.emr.serverless.templateId, the job does not apply any template configuration.
For a combined command example, see the sketch after this list.
--timeout (example: 60): The timeout period of the job. Unit: seconds.
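The following command is a minimal sketch that combines several of the enhanced parameters with the SparkPi example from the previous section. The release version, template ID, queue name, and bucket name are placeholders that you must replace with values from your own workspace:
# Submit SparkPi with a specified engine version and configuration template, and exit immediately after submission
./bin/spark-submit --name SparkPi \
--queue dev_queue \
--release-version esr-4.1.1 \
--detach \
--enable-template \
--conf spark.emr.serverless.templateId=<yourTemplateId> \
--class org.apache.spark.examples.SparkPi \
oss://<yourBucketName>/path/to/spark-examples_2.12-3.3.1.jar \
10000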
Unsupported open source parameters
--deploy-mode
--master
--repositories
--keytab
--principal
--total-executor-cores
--driver-library-path
--driver-class-path
--supervise
--verbose
Use spark-sql
spark-sql is a tool that is used to run SQL statements or scripts. It is suitable for scenarios in which SQL statements are executed directly.
Example 1: Directly execute an SQL statement
spark-sql -e "SHOW TABLES"
This command lists all tables in the current database.
Example 2: Run an SQL script file
spark-sql -f oss://<yourBucketName>/path/to/your/example.sql
In this example, example.sql is used. You can click example.sql to download the test file, and then upload the test file to OSS.
The parameters are described as follows:
-e (example: -e "SELECT * FROM table"): Executes an SQL statement directly on the CLI.
-f (example: -f oss://path/script.sql): Executes the SQL script file in the specified path.
Step 4: Query a Spark job
Use the CLI
Query the status of a Spark job
cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
./bin/spark-submit --status jr-8598aa9f459d****
Query the details of a Spark job
cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
./bin/spark-submit --detail jr-8598aa9f459d****
Use the UI
On the EMR Serverless Spark page, click Job History in the left-side navigation pane.
On the Job History page, click the Development Jobs tab to view the submitted jobs.
(Optional) Step 5: Terminate a Spark job
cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
./bin/spark-submit --kill jr-8598aa9f459d****
You can terminate only a job that is in the Running state.