
E-MapReduce:Use the spark-submit CLI to submit a Spark job

Last Updated: Jun 14, 2025

This topic describes how to use the spark-submit CLI to submit a Spark job after connecting E-MapReduce (EMR) Serverless Spark to Elastic Compute Service (ECS).

Prerequisites

  • Java Development Kit (JDK) 1.8 or later is installed.

  • If you use a RAM user to submit a Spark job, you must add the RAM user to the Serverless Spark workspace and grant the developer role or a higher role to the RAM user. For more information, see Manage users and roles.

Procedure

Step 1: Download and install the EMR Serverless spark-submit tool

  1. Click emr-serverless-spark-tool-0.6.3-SNAPSHOT-bin.zip to download the installation package.

  2. Upload the installation package to your ECS instance. For more information, see Upload or download files.

  3. Run the following command to decompress the installation package and install the spark-submit tool:

    unzip emr-serverless-spark-tool-0.6.3-SNAPSHOT-bin.zip
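
    The decompressed package contains the bin and conf directories that are used in the following steps. You can optionally run the following command to confirm the layout:

    # Confirm that the bin and conf directories referenced later in this topic exist
    ls emr-serverless-spark-tool-0.6.3-SNAPSHOT/bin emr-serverless-spark-tool-0.6.3-SNAPSHOT/conf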

Step 2: Configure parameters

Important

If Spark is installed in your environment and the SPARK_CONF_DIR environment variable is set, you must place the configuration file in the directory specified by SPARK_CONF_DIR. Otherwise, an error occurs. For example, in an EMR cluster, this directory is typically /etc/taihao-apps/spark-conf.
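
The following commands are a minimal sketch of this check. Run them only if SPARK_CONF_DIR is set in your environment, after you have edited connection.properties as described in the next step:

  # Show the directory from which Spark configuration files are read, if any
  echo "$SPARK_CONF_DIR"
  # Place the tool configuration file in that directory
  cp emr-serverless-spark-tool-0.6.3-SNAPSHOT/conf/connection.properties "$SPARK_CONF_DIR"/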

  1. Run the following command to modify the configuration of the connection.properties file:

    vim emr-serverless-spark-tool-0.6.3-SNAPSHOT/conf/connection.properties
  2. Configure the parameters in the key=value format. Example:

    accessKeyId=
    accessKeySecret=
    regionId=cn-hangzhou
    endpoint=emr-serverless-spark.cn-hangzhou.aliyuncs.com
    workspaceId=w-xxxxxxxxxxxx

    The following list describes the parameters:

    • accessKeyId (required): The AccessKey ID of the Alibaba Cloud account or RAM user that is used to run the Spark job.

    • accessKeySecret (required): The AccessKey secret of the Alibaba Cloud account or RAM user that is used to run the Spark job.

      Important

      When you configure the accessKeyId and accessKeySecret parameters, make sure that the user to which the AccessKey pair belongs has read and write permissions on the OSS bucket that is bound to the workspace. You can view the bound OSS bucket on the Spark page by clicking Details in the Actions column of the workspace.

    • regionId (required): The ID of the region in which the workspace resides. In this example, the China (Hangzhou) region is used.

    • endpoint (required): The endpoint of EMR Serverless Spark. For more information about endpoints, see Service registration.

      In this example, the public endpoint of the China (Hangzhou) region is used: emr-serverless-spark.cn-hangzhou.aliyuncs.com.

      Note

      If the ECS instance cannot access the Internet, you must use the virtual private cloud (VPC) endpoint of EMR Serverless Spark.

    • workspaceId (required): The ID of the EMR Serverless Spark workspace.
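
    Because connection.properties stores your AccessKey pair, you can optionally restrict its file permissions. This is a suggested hardening step, not a requirement of the tool:

    # Make the configuration file readable and writable only by its owner
    chmod 600 emr-serverless-spark-tool-0.6.3-SNAPSHOT/conf/connection.properties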

Step 3: Submit a Spark job

  1. Run the following command to go to the directory of the spark-submit tool:

    cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
  2. Select a job submission method based on the job type.

    When you submit a job, you must specify the file resources (such as JAR packages or Python scripts) on which the job depends. These file resources can be stored in OSS or locally, depending on your scenario and requirements. In this topic, OSS resources are used as examples.
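
    If your resources are still stored locally, one way to upload them to OSS is the ossutil CLI, as sketched below. The bucket name and paths are placeholders, and the sketch assumes that ossutil is installed and configured. You can also upload files in the OSS console:

    # Replace <yourBucket> and the local and remote paths with your own values
    ossutil cp /local/path/spark-examples_2.12-3.3.1.jar oss://<yourBucket>/path/to/
    ossutil cp /local/path/DataFrame.py oss://<yourBucket>/path/to/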

    Use spark-submit

    spark-submit is a general job submission tool provided by Spark. It is suitable for jobs launched from Java/Scala and PySpark.

    Spark job launched from Java or Scala

    In this example, spark-examples_2.12-3.3.1.jar is used. You can click spark-examples_2.12-3.3.1.jar to download the test JAR package, and then upload the JAR package to OSS. The JAR package is a simple example provided by Spark to calculate the value of pi (π).

    ./bin/spark-submit --name SparkPi \
    --queue dev_queue \
    --num-executors 5 \
    --driver-memory 1g \
    --executor-cores 2 \
    --executor-memory 2g \
    --class org.apache.spark.examples.SparkPi \
    oss:///path/to/spark-examples_2.12-3.3.1.jar \
    10000

    Spark job launched from PySpark

    In this example, DataFrame.py and employee.csv are used. You can click DataFrame.py and employee.csv to download the test files, and then upload the test files to OSS.

    Note
    • The DataFrame.py file contains the code that is used to process data in Object Storage Service (OSS) under the Apache Spark framework.

    • The employee.csv file contains data such as employee names, departments, and salaries.

    ./bin/spark-submit --name PySpark \
    --queue dev_queue  \
    --num-executors 5 \
    --driver-memory 1g \
    --executor-cores 2 \
    --executor-memory 2g \
    --conf spark.tags.key=value \
    oss:///path/to/DataFrame.py \
    oss:///path/to/employee.csv
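
    If your PySpark entry script imported additional modules, you could ship them by using the --py-files parameter described below. In the following sketch, utils.py is a hypothetical helper script, not part of the downloadable example files:

    # utils.py is a hypothetical helper module imported by DataFrame.py
    ./bin/spark-submit --name PySpark \
    --queue dev_queue \
    --py-files oss:///path/to/utils.py \
    oss:///path/to/DataFrame.py \
    oss:///path/to/employee.csv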

    Field description:

    • Supported open source parameters

      • --name (example: SparkPi): The application name of the Spark job, which is used to identify the job.

      • --class (example: org.apache.spark.examples.SparkPi): The entry class of the Spark job. This parameter is required only if the Spark job is launched from Java or Scala.

      • --num-executors (example: 5): The number of executors of the Spark job.

      • --driver-cores (example: 1): The number of driver cores of the Spark job.

      • --driver-memory (example: 1g): The size of the driver memory of the Spark job.

      • --executor-cores (example: 2): The number of executor cores of the Spark job.

      • --executor-memory (example: 2g): The size of the executor memory of the Spark job.

      • --files (example: oss:///file1,oss:///file2): The resource files that the Spark job references. The files can be OSS resources or local files. Separate multiple files with commas (,).

      • --py-files (example: oss:///file1.py,oss:///file2.py): The Python scripts that the Spark job references. The scripts can be OSS resources or local files. Separate multiple files with commas (,). This parameter applies only to Spark jobs launched from PySpark.

      • --jars (example: oss:///file1.jar,oss:///file2.jar): The JAR packages that the Spark job references. The packages can be OSS resources or local files. Separate multiple files with commas (,).

      • --archives (example: oss:///archive.tar.gz#env,oss:///archive2.zip): The archive packages that the Spark job references. The packages can be OSS resources or local files. Separate multiple files with commas (,).

      • --queue (example: root_queue): The name of the queue in which the Spark job runs. The queue name must be the same as a queue name in the EMR Serverless Spark workspace.

      • --proxy-user (example: test): Overwrites the HADOOP_USER_NAME environment variable with the specified value. The behavior is the same as that of open source spark-submit.

      • --conf (example: spark.tags.key=value): A custom configuration parameter of the Spark job.

      • --status (example: jr-8598aa9f459d****): Queries the status of the specified Spark job.

      • --kill (example: jr-8598aa9f459d****): Terminates the specified Spark job.

    • Non-open source enhanced parameters

      • --detach (no value required): Detaches the spark-submit tool from the job. The tool exits immediately after the Spark job is submitted instead of waiting for and returning the job status.

      • --detail (example: jr-8598aa9f459d****): Queries the details of the specified Spark job.

      • --release-version (example: esr-4.1.1 (Spark 3.5.2, Scala 2.12, Java Runtime)): The Spark version to use. Configure this parameter based on the engine version displayed in the console.

      • --enable-template (no value required): Enables the template feature so that the job uses the default configuration template of the workspace.

        If you have created a configuration template in Configuration Management, you can specify the template ID by using the spark.emr.serverless.templateId key in --conf. For more information about how to create a template, see Manage configurations.

        • If you specify only --enable-template, the job applies the default configuration template of the workspace.

        • If you specify only a template ID by using --conf spark.emr.serverless.templateId, the job applies the specified template.

        • If you specify both --enable-template and --conf spark.emr.serverless.templateId, the template ID specified in --conf overwrites the default template.

        • If you specify neither --enable-template nor --conf spark.emr.serverless.templateId, the job does not apply any template configuration.

      • --timeout (example: 60): The timeout period of the job. Unit: seconds.

      A combined example that uses some of these parameters is provided after the following list of unsupported parameters.

    • Unsupported open source parameters

      • --deploy-mode

      • --master

      • --repositories

      • --keytab

      • --principal

      • --total-executor-cores

      • --driver-library-path

      • --driver-class-path

      • --supervise

      • --verbose
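
    The following command is a sketch that combines several of the enhanced parameters described above with the SparkPi example from earlier in this topic. The queue name, OSS path, and template ID are placeholders that you must replace with your own values:

    # <yourTemplateId> is a placeholder for a configuration template ID created in Configuration Management
    ./bin/spark-submit --name SparkPi \
    --queue dev_queue \
    --detach \
    --timeout 3600 \
    --conf spark.emr.serverless.templateId=<yourTemplateId> \
    --class org.apache.spark.examples.SparkPi \
    oss:///path/to/spark-examples_2.12-3.3.1.jar \
    1000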

    Use the spark-sql method

    spark-sql is a tool specifically used to run SQL queries or scripts. It is suitable for scenarios where SQL statements are directly executed.

    • Example 1: Directly execute an SQL statement

      spark-sql -e "SHOW TABLES"

      This command lists all tables in the current database.

    • Example 2: Run an SQL script file

      spark-sql -f oss:///path/to/your/example.sql

      In this example, example.sql is used. You can click example.sql to download the test file, and then upload the test file to OSS.

      Example content of the example.sql file:

      CREATE TABLE IF NOT EXISTS employees (
          id INT,
          name STRING,
          age INT,
          department STRING
      );
      
      INSERT INTO employees VALUES
      (1, 'Alice', 30, 'Engineering'),
      (2, 'Bob', 25, 'Marketing'),
      (3, 'Charlie', 35, 'Sales');
      
      SELECT * FROM employees;
      

    The following list describes the parameters:

    • -e (example: -e "SELECT * FROM table"): Executes the specified SQL statement directly in the CLI.

    • -f (example: -f oss://path/script.sql): Executes the SQL script file in the specified path.

Step 4: Query a Spark job

Use the CLI

Query the status of a Spark job

cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
./bin/spark-submit --status 

Query the details of a Spark job

cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
./bin/spark-submit --detail 
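
For example, if the tool returned a job run ID in the jr-8598aa9f459d**** format when you submitted the job, pass that ID to the commands above (the ID shown here is a placeholder, not a real job):

./bin/spark-submit --status jr-8598aa9f459d****
./bin/spark-submit --detail jr-8598aa9f459d****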

Use the UI

  1. On the EMR Serverless Spark page, click Job History in the left-side navigation pane.

  2. On the Job History page, click the Development Jobs tab to view the submitted jobs.


(Optional) Step 5: Terminate a Spark job

cd emr-serverless-spark-tool-0.6.3-SNAPSHOT
./bin/spark-submit --kill 
Note

You can terminate only a job that is in the Running state.