This topic describes how to use the spark-submit CLI to submit a Spark job after connecting E-MapReduce (EMR) Serverless Spark to Elastic Compute Service (ECS).
Prerequisites
Java Development Kit (JDK) V1.8 or later is installed.
If you use a RAM user to submit a Spark job, you must add the RAM user to the Serverless Spark workspace and grant the developer role or a higher role to the RAM user. For more information, see Manage users and roles.
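You can quickly verify the JDK prerequisite on the ECS instance by checking the Java version. This is only a sanity check; the exact output format depends on your JDK distribution:
# Print the installed Java version; 1.8 or later satisfies the prerequisite
java -version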
Procedure
Step 1: Download and install the EMR Serverless spark-submit tool
Click emr-serverless-spark-tool-0.6.3-SNAPSHOT-bin.zip to download the installation package.
Upload the installation package to your ECS instance. For more information, see Upload or download files.
Run the following command to decompress the installation package and install the spark-submit tool:
unzip emr-serverless-spark-tool-0.6.3-SNAPSHOT-bin.zip
Step 2: Configure parameters
In an environment where Spark is installed, if the SPARK_CONF_DIR environment variable is set, you must place the configuration file in the directory specified by SPARK_CONF_DIR. For example, in an EMR cluster, this directory is typically /etc/taihao-apps/spark-conf. Otherwise, an error occurs.
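If this applies to your environment, copy the file to that directory after you configure it in the next step. The following command is a minimal sketch that assumes the typical EMR cluster path mentioned above; adjust the paths to match your environment:
# Copy the configured connection.properties into the directory specified by SPARK_CONF_DIR
sudo cp emr-serverless-spark-tool-0.6.3-SNAPSHOT/conf/connection.properties /etc/taihao-apps/spark-conf/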
Run the following command to modify the configuration of the connection.properties file:
vim emr-serverless-spark-tool-0.6.3-SNAPSHOT/conf/connection.properties
We recommend that you configure the file in the key=value format. Example:
accessKeyId=<yourAccessKeyId>
accessKeySecret=<yourAccessKeySecret>
regionId=cn-hangzhou
endpoint=emr-serverless-spark.cn-hangzhou.aliyuncs.com
workspaceId=w-xxxxxxxxxxxx
The parameters are described as follows:
accessKeyId (required): The AccessKey ID of the Alibaba Cloud account or RAM user that is used to run the Spark job.
accessKeySecret (required): The AccessKey secret of the Alibaba Cloud account or RAM user that is used to run the Spark job.
Important: When you configure the accessKeyId and accessKeySecret parameters, make sure that the user to which the AccessKey pair belongs has read and write permissions on the OSS bucket that is bound to the workspace. You can view the bound OSS bucket on the Spark page by clicking Details in the Actions column of the workspace.
regionId (required): The ID of the region in which EMR Serverless Spark resides. In this example, the China (Hangzhou) region is used.
endpoint (required): The endpoint of EMR Serverless Spark. For more information about endpoints, see Service registration. In this example, the public endpoint of the China (Hangzhou) region is used: emr-serverless-spark.cn-hangzhou.aliyuncs.com. Note: If the ECS instance cannot access the Internet, you must use the virtual private cloud (VPC) endpoint of EMR Serverless Spark.
workspaceId (required): The ID of the EMR Serverless Spark workspace.
Step 3: Submit a Spark job
Run the following command to go to the directory of the spark-submit tool:
cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
Select a job submission method based on the job type.
When you submit a job, you must specify the file resources (such as JAR packages or Python scripts) on which the job depends. These file resources can be stored in OSS or locally, depending on your scenario and requirements. In this topic, OSS resources are used as examples.
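If your dependency files are stored locally and you want to stage them in OSS, one option is the ossutil command-line tool. This is only a sketch and assumes that ossutil is installed and configured on the ECS instance; you can also upload files in the OSS console. The bucket name and path are placeholders:
# Upload the test JAR package to OSS (replace the bucket name and path with your own)
ossutil cp spark-examples_2.12-3.3.1.jar oss://<yourBucketName>/path/to/spark-examples_2.12-3.3.1.jar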
Use spark-submit
spark-submit is a general-purpose job submission tool provided by Spark. It is suitable for jobs launched from Java, Scala, or PySpark.
Spark job launched from Java or Scala
In this example, spark-examples_2.12-3.3.1.jar is used. You can click spark-examples_2.12-3.3.1.jar to download the test JAR package, and then upload the JAR package to OSS. The JAR package is a simple example provided by Spark to calculate the value of pi (π).
./bin/spark-submit --name SparkPi \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--class org.apache.spark.examples.SparkPi \
oss://<yourBucketName>/path/to/spark-examples_2.12-3.3.1.jar \
10000
Spark job launched from PySpark
In this example, DataFrame.py and employee.csv are used. You can click DataFrame.py and employee.csv to download the test files, and then upload the test files to OSS.
Note: The DataFrame.py file contains the code that is used to process data in Object Storage Service (OSS) under the Apache Spark framework.
The employee.csv file contains data such as employee names, departments, and salaries.
./bin/spark-submit --name PySpark \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--conf spark.tags.key=value \
oss://<yourBucketName>/path/to/DataFrame.py \
oss://<yourBucketName>/path/to/employee.csv
Field description:
Supported open source parameters
--name (example: SparkPi): The application name of the Spark job, which is used to identify the job.
--class (example: org.apache.spark.examples.SparkPi): The entry class of the Spark job. This parameter is required only if the Spark job is launched from Java or Scala.
--num-executors (example: 5): The number of executors of the Spark job.
--driver-cores (example: 1): The number of driver cores of the Spark job.
--driver-memory (example: 1g): The size of the driver memory of the Spark job.
--executor-cores (example: 2): The number of executor cores of the Spark job.
--executor-memory (example: 2g): The size of the executor memory of the Spark job.
--files (example: oss://<yourBucketName>/file1,oss://<yourBucketName>/file2): The resource files that the Spark job references. The files can be OSS resources or local files. Separate multiple files with commas (,).
--py-files (example: oss://<yourBucketName>/file1.py,oss://<yourBucketName>/file2.py): The Python scripts that the Spark job references. The scripts can be OSS resources or local files. Separate multiple files with commas (,). This parameter is required only if the Spark job is launched from PySpark.
--jars (example: oss://<yourBucketName>/file1.jar,oss://<yourBucketName>/file2.jar): The JAR packages that the Spark job references. The packages can be OSS resources or local files. Separate multiple files with commas (,).
--archives (example: oss://<yourBucketName>/archive.tar.gz#env,oss://<yourBucketName>/archive2.zip): The archive packages that the Spark job references. The packages can be OSS resources or local files. Separate multiple files with commas (,).
--queue (example: root_queue): The name of the queue on which the Spark job runs. The queue name must be the same as that of a queue in the EMR Serverless Spark workspace.
--proxy-user (example: test): The value that you set overwrites the HADOOP_USER_NAME environment variable. The behavior is the same as that of the open source version.
--conf (example: spark.tags.key=value): A custom parameter of the Spark job.
--status (example: jr-8598aa9f459d****): Queries the status of the Spark job.
--kill (example: jr-8598aa9f459d****): Terminates the Spark job.
Non-open source enhanced parameters
--detach (no value required): Exits the spark-submit tool immediately after the Spark job is submitted, so you do not need to wait for the tool to return the job status.
--detail (example: jr-8598aa9f459d****): Queries the details of the Spark job.
--release-version (example: esr-4.1.1 (Spark 3.5.2, Scala 2.12, Java Runtime)): The Spark engine version to use. Configure this parameter based on the engine version that is displayed in the console.
--enable-template (no value required): Enables the configuration template feature. If you have created a configuration template in Configuration Management, you can specify the template ID by using the spark.emr.serverless.templateId parameter in --conf. For more information about how to create a template, see Manage configurations. The behavior is as follows:
- If you specify only --enable-template, the job applies the default configuration template of the workspace.
- If you specify only --conf spark.emr.serverless.templateId, the job applies the specified template.
- If you specify both --enable-template and --conf spark.emr.serverless.templateId, the template ID specified in --conf overwrites the default template.
- If you specify neither --enable-template nor --conf spark.emr.serverless.templateId, the job does not apply any template configuration.
For a combined command example, see the sketch after this list.
--timeout (example: 60): The timeout period of the job. Unit: seconds.
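The following command is a minimal sketch that combines several of the enhanced parameters with the SparkPi example from the previous section. The release version, template ID, queue name, and bucket name are placeholders that you must replace with values from your own workspace:
# Submit SparkPi with a specified engine version and configuration template, and exit immediately after submission
./bin/spark-submit --name SparkPi \
--queue dev_queue \
--release-version esr-4.1.1 \
--detach \
--enable-template \
--conf spark.emr.serverless.templateId=<yourTemplateId> \
--class org.apache.spark.examples.SparkPi \
oss://<yourBucketName>/path/to/spark-examples_2.12-3.3.1.jar \
10000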
Unsupported open source parameters
--deploy-mode
--master
--repositories
--keytab
--principal
--total-executor-cores
--driver-library-path
--driver-class-path
--supervise
--verbose
Use spark-sql
spark-sql is a tool that is used to run SQL statements or scripts. It is suitable for scenarios in which SQL statements are executed directly.
Example 1: Directly execute an SQL statement
spark-sql -e "SHOW TABLES"
This command lists all tables in the current database.
Example 2: Run an SQL script file
spark-sql -f oss://<yourBucketName>/path/to/your/example.sql
In this example, example.sql is used. You can click example.sql to download the test file, and then upload the test file to OSS.
The parameters are described as follows:
-e (example: -e "SELECT * FROM table"): Executes an SQL statement directly on the CLI.
-f (example: -f oss://path/script.sql): Executes the SQL script file in the specified path.
Step 4: Query a Spark job
Use the CLI
Query the status of a Spark job
cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
./bin/spark-submit --status jr-8598aa9f459d****
Query the details of a Spark job
cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
./bin/spark-submit --detail jr-8598aa9f459d****
Use the UI
On the EMR Serverless Spark page, click Job History in the left-side navigation pane.
On the Job History page, click the Development Jobs tab to view the submitted jobs.
(Optional) Step 5: Terminate a Spark job
cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
./bin/spark-submit --kill jr-8598aa9f459d****
You can terminate only a job that is in the Running state.