Write data to the Firestore database

This page describes the second stage of the migration process, in which you set up a Dataflow pipeline to move data from the Cloud Storage bucket into your destination Firestore with MongoDB compatibility database. The pipeline runs concurrently with the Datastream stream.

Build and deploy the Dataflow template

Clone the repository and check out the main branch:

git clone https://github.com/GoogleCloudPlatform/DataflowTemplates.git
cd DataflowTemplates
git checkout main

Next, build and deploy the template to the Cloud Storage bucket:

mvn clean package -PtemplatesStage \
-DskipTests \
-DprojectId="$PROJECT_ID" \
-DbucketName="$GCS_BUCKET_NAME" \
-DstagePrefix="$GCS_BUCKET_TEMPLATE_PATH" \
-DtemplateName="Cloud_Datastream_MongoDB_to_Firestore" \
-pl v2/datastream-mongodb-to-firestore -am

When the command completes, it prints a message similar to the following:

INFO: Flex Template was staged! gs://

The Cloud Storage location that starts with the gs:// prefix must exactly match the value of the TEMPLATE_FILE_GCS_LOCATION variable that you set earlier.
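If the values don't match, you can reconstruct the expected location from the build parameters before launching the pipeline. The following sketch uses hypothetical bucket and prefix values, and the `flex/` path segment is an assumption about how the templatesStage profile lays out staged Flex Templates; confirm against the path printed by the build:

```shell
# Hypothetical values -- substitute your own bucket and prefix.
GCS_BUCKET_NAME="my-migration-bucket"
GCS_BUCKET_TEMPLATE_PATH="templates"

# Assumed staging layout: gs://<bucket>/<stagePrefix>/flex/<templateName>
TEMPLATE_FILE_GCS_LOCATION="gs://${GCS_BUCKET_NAME}/${GCS_BUCKET_TEMPLATE_PATH}/flex/Cloud_Datastream_MongoDB_to_Firestore"
echo "$TEMPLATE_FILE_GCS_LOCATION"
```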

Start the Dataflow pipeline

The following command starts a new, uniquely named Dataflow pipeline.

DATAFLOW_START_TIME="$(date +'%Y%m%d%H%M%S')"

gcloud dataflow flex-template run "dataflow-mongodb-to-firestore-$DATAFLOW_START_TIME" \
--template-file-gcs-location $TEMPLATE_FILE_GCS_LOCATION \
--region $LOCATION \
--num-workers $NUM_WORKERS \
--temp-location $TEMP_OUTPUT_LOCATION \
--additional-user-labels "" \
--parameters inputFilePattern=$INPUT_FILE_LOCATION,\
inputFileFormat=avro,\
rfcStartDateTime=$START_TIME,\
fileReadConcurrency=10,\
connectionUri=$FIRESTORE_CONNECTION_URI,\
databaseName=$FIRESTORE_DATABASE_NAME,\
shadowCollectionPrefix=shadow_,\
batchSize=500,\
deadLetterQueueDirectory=$DLQ_LOCATION,\
dlqRetryMinutes=10,\
dlqMaxRetryCount=500,\
processBackfillFirst=false,\
useShadowTablesForBackfill=true,\
runMode=regular,\
directoryWatchDurationInMinutes=20,\
streamName=$DATASTREAM_NAME,\
stagingLocation=$STAGING_LOCATION,\
autoscalingAlgorithm=THROUGHPUT_BASED,\
maxNumWorkers=$MAX_WORKERS,\
workerMachineType=$WORKER_TYPE

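The timestamp suffix gives each run a unique job name, which matters because Dataflow job names must be unique among active jobs in a project. A minimal sketch of the naming scheme used above (the echoed value varies by run):

```shell
# Timestamp suffix in the same format as the run command, e.g. 20250115093042
DATAFLOW_START_TIME="$(date +'%Y%m%d%H%M%S')"

# Job name passed to `gcloud dataflow flex-template run`
JOB_NAME="dataflow-mongodb-to-firestore-$DATAFLOW_START_TIME"
echo "$JOB_NAME"
```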
For more information about monitoring the Dataflow pipeline, see Troubleshooting.

What's next

Proceed to Migrate traffic to Firestore.