Write data to the Firestore database
This page describes the second stage of the migration process, in which you set up a Dataflow pipeline that moves data from the Cloud Storage bucket into your destination Firestore with MongoDB compatibility database. This operation runs concurrently with the Datastream stream.
Build and deploy the Dataflow template
Clone the repository and check out the main branch:
git clone https://github.com/GoogleCloudPlatform/DataflowTemplates.git
cd DataflowTemplates
git checkout main
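The build command in the next step reads several environment variables. A minimal sketch with placeholder values (the variable names come from the commands on this page, but every value shown here is illustrative, not fixed by the template):

```shell
# Illustrative values -- substitute your own project, bucket, and prefix.
export PROJECT_ID="my-project"
export GCS_BUCKET_NAME="my-migration-bucket"
export GCS_BUCKET_TEMPLATE_PATH="templates"

# The staged template will land under this Cloud Storage prefix:
echo "gs://${GCS_BUCKET_NAME}/${GCS_BUCKET_TEMPLATE_PATH}"
```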
Next, build and deploy the template to the Cloud Storage bucket:
mvn clean package -PtemplatesStage \
-DskipTests \
-DprojectId="$PROJECT_ID" \
-DbucketName="$GCS_BUCKET_NAME" \
-DstagePrefix="$GCS_BUCKET_TEMPLATE_PATH" \
-DtemplateName="Cloud_Datastream_MongoDB_to_Firestore" \
-pl v2/datastream-mongodb-to-firestore -am
When the command completes, it prints a message similar to the following:
INFO: Flex Template was staged! gs://
The Cloud Storage location starting with the gs:// prefix must exactly match the value of the TEMPLATE_FILE_GCS_LOCATION variable that you set earlier.
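For example, the variable might be set as follows. The bucket and prefix values are illustrative, and the exact path shape can vary, so copy the path verbatim from the build output rather than constructing it by hand:

```shell
# Must match the "Flex Template was staged!" path exactly.
# Bucket and prefix here are placeholder values.
export TEMPLATE_FILE_GCS_LOCATION="gs://my-migration-bucket/templates/flex/Cloud_Datastream_MongoDB_to_Firestore"
echo "$TEMPLATE_FILE_GCS_LOCATION"
```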
Start the Dataflow pipeline
The following commands start a new, uniquely named Dataflow pipeline.
DATAFLOW_START_TIME="$(date +'%Y%m%d%H%M%S')"
gcloud dataflow flex-template run "dataflow-mongodb-to-firestore-$DATAFLOW_START_TIME" \
--template-file-gcs-location $TEMPLATE_FILE_GCS_LOCATION \
--region $LOCATION \
--num-workers $NUM_WORKERS \
--temp-location $TEMP_OUTPUT_LOCATION \
--additional-user-labels "" \
--parameters inputFilePattern=$INPUT_FILE_LOCATION,\
inputFileFormat=avro,\
rfcStartDateTime=$START_TIME,\
fileReadConcurrency=10,\
connectionUri=$FIRESTORE_CONNECTION_URI,\
databaseName=$FIRESTORE_DATABASE_NAME,\
shadowCollectionPrefix=shadow_,\
batchSize=500,\
deadLetterQueueDirectory=$DLQ_LOCATION,\
dlqRetryMinutes=10,\
dlqMaxRetryCount=500,\
processBackfillFirst=false,\
useShadowTablesForBackfill=true,\
runMode=regular,\
directoryWatchDurationInMinutes=20,\
streamName=$DATASTREAM_NAME,\
stagingLocation=$STAGING_LOCATION,\
autoscalingAlgorithm=THROUGHPUT_BASED,\
maxNumWorkers=$MAX_WORKERS,\
workerMachineType=$WORKER_TYPE
For more information about monitoring the Dataflow pipeline, see Troubleshooting.
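Because the pipeline name embeds the start timestamp, you can reconstruct it later to look the job up. A sketch, assuming the naming scheme from the run command above (the timestamp value is an example):

```shell
# Example timestamp in the same format as "date +'%Y%m%d%H%M%S'".
DATAFLOW_START_TIME="20250101120000"
JOB_NAME="dataflow-mongodb-to-firestore-$DATAFLOW_START_TIME"
echo "$JOB_NAME"

# To find the job in your region, filter the job list by this name:
# gcloud dataflow jobs list --region="$LOCATION" --filter="name=$JOB_NAME"
```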
What's next
Proceed to Migrate traffic to Firestore.