How to build unified Batch & Streaming Pipelines with Apache Beam and Dataflow
Welcome to ServerlessToronto.org
Serverless Evolution (since FaaS started)
Serverless is New Agile & Mindset
#1 We started as Back-end FaaS (Serverless) Developers who enjoyed “gluing” other people’s APIs and Managed Services
#2 We build bridges between the Serverless Community (“Dev leg”) and Front-end, Voice-First & UX folks (“UX leg”)
#3 We're obsessed with creating business value (meaningful MVPs, Products), focusing on Outcomes/Impact, NOT Outputs
#4 We achieve agility NOT by “sprinting” faster (like in Scrum) but by working smarter (using bigger building blocks and less Ops)
Disconnect between IT & Business needs
Our group became dedicated to reducing the gap between Business & IT!
Technology is not the point => We are here to create Value
Adopting a Serverless Mindset allowed us to shift the focus from “pimping up our cars” (infrastructure/code) towards “driving” (the business) forward.
Upcoming ServerlessToronto.org Meetups?
We’ll remain online in 2022. We’re working on:
• “What, Why, Who & How of CDPs” w/ SEGMENT
• Joe Emison – his new book coming soon!
• Lak’s 2nd Ed book “DS on the GCP” out on May 3
• AWS Cloud: “Event-driven integration patterns”
Focusing more on Data Engineering, Modern Data Stack, Agility, Leadership and helping Startups, see:
• http://youtube.serverlesstoronto.org/
Your presentations regardless of how big or small ☺
Please rate us on Meetup & tell others about #ServerlessTO user group
Knowledge Sponsor
1. Go to www.manning.com
2. Select *any* e-Book, Video course, or liveProject you want!
3. Add it to your shopping cart (no more than 1 item in the cart)
4. Raffle winners: send me the email address you use in the Manning portal,
5. so the publisher can move the item to your Dashboard, as if purchased.
Fill out the Survey to win: bit.ly/slsto
Feature Presentation
Cloud Dataflow
Overview
©Google Inc. or its affiliates. All rights reserved. Do not distribute.
GCP Data
• BigQuery
• Cloud Dataflow
• Cloud Dataproc
• Cloud Datalab
• Cloud Pub/Sub
• Genomics
Who Wants Real-time Data?
Everyone.
The Lambda Model
[Diagram: Mobile Devices; Tens of Thousands Events/sec, Tens of Billions Events/month, Hundreds of Billions Events/year]
A Unified Model
[Diagram: Mobile Devices; Apache Beam + one of three alternatives (shown as “or ... or” in the original slide); same event volumes]
A Unified Model on Google Cloud Platform
[Diagram: Mobile Devices -> Cloud Pub/Sub -> Cloud Dataflow -> BigQuery; same event volumes]
What is Cloud Dataflow?
[Diagram labels, between SOURCE and SINK:]
• Compute and Storage
• Unbounded
• Bounded
• Resource Management
• Resource Auto-scaler
• Dynamic Work Rebalancer
• Work Scheduler
• Monitoring
• Log Collection
• Graph Optimization
• Auto-Healing
• Intelligent Watermarking
What is Cloud Dataflow?
Here’s a simple graphic showing how Dataflow can integrate and transform data from two sources.
[Diagram: “One discrete job” and “Endless incoming data” both feed Cloud Dataflow]
Why Use Cloud Dataflow?
Deploy, Schedule & Monitor
• Autoscaling mid-job
• Fully-managed and auto-configured
• Auto graph-optimized for best execution path
• Dynamic Work Rebalancing mid-job
Why Use Cloud Dataflow?
[Diagram: “C D” becomes “C+D” (shown twice); “A GBK + B” becomes “A+ GBK + B”]
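The “C D” and “C+D” labels on this slide read like step fusion: two adjacent per-element steps merged into one. The following is a minimal plain-Java sketch of that idea, not Dataflow's optimizer. The class, method, and variable names are all hypothetical, and the interpretation of the diagram is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Illustrative only: names are hypothetical, not Dataflow or Beam APIs.
final class FusionSketch {

    // Unfused: run step C over every element, store the results, then run step D.
    static <T, U, V> List<V> runUnfused(List<T> input, Function<T, U> stepC, Function<U, V> stepD) {
        List<U> intermediate = new ArrayList<>();
        for (T element : input) {
            intermediate.add(stepC.apply(element));
        }
        List<V> output = new ArrayList<>();
        for (U element : intermediate) {
            output.add(stepD.apply(element));
        }
        return output;
    }

    // Fused: compose both steps into one function ("C+D") and make a single pass,
    // so no intermediate collection is built between the steps.
    static <T, U, V> List<V> runFused(List<T> input, Function<T, U> stepC, Function<U, V> stepD) {
        Function<T, V> fused = stepC.andThen(stepD);
        List<V> output = new ArrayList<>();
        for (T element : input) {
            output.add(fused.apply(element));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "bb", "ccc");
        Function<String, Integer> stepC = String::length; // C: measure each element
        Function<Integer, Integer> stepD = n -> n * 10;   // D: scale the measurement
        System.out.println(runUnfused(input, stepC, stepD));
        System.out.println(runFused(input, stepC, stepD));
    }
}
```

Both methods return the same result; the fused version simply avoids materializing the data between the two steps, which is the kind of saving a graph optimizer would be after.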
Why Use Cloud Dataflow?
[Chart labels: 800 RPS, 1200 RPS, 5000 RPS, 50 RPS]
*means 100% cluster utilization by definition
Why Use Cloud Dataflow?
[Diagram: 100 mins. vs. 65 mins.]
Autoscaling at Work
• Start off with 3 workers, things are looking okay
• Re-estimation shows there’s orders of magnitude more work: need 100 workers!
• You have 100 workers but you don’t have 100 pieces of work!
• ...and that’s really the most important part
[Diagram labels: 10 minutes, 3 days, Idle]
Autoscaling + dynamic rebalancing
Now scaling up (and down) is no big deal!
• Add workers
• Work distributes itself
Job starts with 3 workers, scales up to 1000. When all work is done, scale down.
[Chart labels: Waves of splitting; Upscaling cycles and VM startup]
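To make “work distributes itself” concrete, here is a minimal sketch of splitting the unprocessed part of one worker's range so that an idle worker can take half of it. This is an illustration of the general idea only, not Dataflow's actual rebalancing algorithm. The `WorkRange` type, the halving rule, and all names are assumptions.

```java
// Illustrative only: names are hypothetical; this is not Dataflow's actual rebalancing algorithm.
final class WorkSplitSketch {

    /** A half-open range [start, end) of work items owned by one worker. */
    static final class WorkRange {
        final long start;
        final long end;

        WorkRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long size() {
            return end - start;
        }
    }

    /**
     * Splits off the second half of the unprocessed items.
     * position is the index of the next unprocessed item (start <= position <= end).
     * Returns {primary, residual}: the busy worker keeps primary, an idle worker takes residual.
     * The residual is empty when fewer than two items remain.
     */
    static WorkRange[] splitRemaining(WorkRange range, long position) {
        long remaining = range.end - position;
        long splitPoint = remaining < 2 ? range.end : position + remaining / 2;
        return new WorkRange[] {
            new WorkRange(range.start, splitPoint),
            new WorkRange(splitPoint, range.end)
        };
    }

    public static void main(String[] args) {
        // One worker owns items [0, 100) and has processed the first 10 when new workers arrive.
        WorkRange[] parts = splitRemaining(new WorkRange(0, 100), 10);
        System.out.println("busy worker keeps [" + parts[0].start + ", " + parts[0].end + ")");
        System.out.println("idle worker takes [" + parts[1].start + ", " + parts[1].end + ")");
    }
}
```

Applying the same split repeatedly to the resulting ranges gives something like the “waves of splitting” named on the slide: each wave roughly doubles the number of independent pieces of work.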
The “Stack”
• End-user's pipeline
• Libraries: transforms, sources/sinks etc.
• Language-specific SDK (Java, Python, ...)
• Beam model (ParDo, GBK, Windowing…)
• Runner
• Execution environment
// Written in the older Dataflow SDK 1.x style (IO transforms named via .named(...));
// Apache Beam 2.x uses different IO builder methods.
pipeline
    // Read from Pubsub
    .apply(PubsubIO.Read.named("read from PubSub")
        .topic(String.format("projects/%s/topics/%s",
            options.getSourceProject(), options.getSourceTopic()))
        .timestampLabel("ts")
        .withCoder(TableRowJsonCoder.of()))
    // Window of 1 second
    .apply("window 1s",
        Window.into(FixedWindows.of(Duration.standardSeconds(1))))
    // Create KV pairs
    .apply("mark rides", MapElements.via(new MarkRides()))
    // Count them by key
    .apply("count similar", Count.perKey())
    // Format for output
    .apply("format rides", MapElements.via(new TransformRides()))
    // Write to Pubsub
    .apply(PubsubIO.Write.named("WriteToPubsub")
        .topic(String.format("projects/%s/topics/%s",
            options.getSinkProject(), options.getSinkTopic()))
        .withCoder(TableRowJsonCoder.of()));
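The windowing and counting steps of this pipeline can be illustrated without any SDK. The sketch below assigns timestamped events to fixed 1-second windows and counts them per key within each window, in plain Java. It is not the Beam implementation: the `Event` type, the key values, and the assumption of non-negative millisecond timestamps are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a plain-Java stand-in for "Window of 1 second" + "Count them by key".
final class FixedWindowCountSketch {

    /** A timestamped event carrying the key to count by (hypothetical type). */
    static final class Event {
        final String key;
        final long timestampMillis;

        Event(String key, long timestampMillis) {
            this.key = key;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Start of the fixed window that contains the timestamp (non-negative timestamps assumed). */
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    /** Counts events per key inside each fixed window, keyed by window start. */
    static Map<Long, Map<String, Long>> countPerKeyPerWindow(List<Event> events, long windowSizeMillis) {
        Map<Long, Map<String, Long>> counts = new TreeMap<>();
        for (Event event : events) {
            long window = windowStart(event.timestampMillis, windowSizeMillis);
            counts.computeIfAbsent(window, w -> new TreeMap<>())
                  .merge(event.key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("cell-A", 100),
            new Event("cell-A", 900),
            new Event("cell-B", 950),
            new Event("cell-A", 1200));
        // prints {0={cell-A=2, cell-B=1}, 1000={cell-A=1}}
        System.out.println(countPerKeyPerWindow(events, 1000));
    }
}
```

A real streaming runner also has to decide when a window is complete (watermarks, late data); this sketch sidesteps that by working on a finished list.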
Using Dataflow Templates: Launching a Simple Pipeline
[Diagram: Ingest (Cloud Pub/Sub) -> Pipelines (Cloud Dataflow) -> Analytics (BigQuery)]
Pub/Sub to BigQuery
Dataflow templates let you stage your job’s artifacts in Google Cloud Storage.
Launch template jobs via REST API, or Cloud Console.
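As a rough sketch of what a REST launch could look like, the code below only builds the request URL and JSON body as strings; it sends nothing. The endpoint shape (`templates:launch` with a `gcsPath` query parameter) and the `inputTopic` / `outputTableSpec` parameter names follow my reading of the public Dataflow documentation and should be verified there. The project, region, topic, and table values are hypothetical placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative only: builds a template-launch request without sending it.
// Endpoint and parameter names are assumptions to check against the Dataflow REST reference.
final class TemplateLaunchRequestSketch {

    /** URL for launching a template staged at templateGcsPath (assumed endpoint shape). */
    static String launchUrl(String projectId, String region, String templateGcsPath) {
        try {
            return "https://dataflow.googleapis.com/v1b3/projects/" + projectId
                + "/locations/" + region
                + "/templates:launch?gcsPath=" + URLEncoder.encode(templateGcsPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** JSON body with the job name and template parameters (values must not need JSON escaping). */
    static String launchBody(String jobName, String inputTopic, String outputTableSpec) {
        return String.format(
            "{\"jobName\":\"%s\",\"parameters\":{\"inputTopic\":\"%s\",\"outputTableSpec\":\"%s\"}}",
            jobName, inputTopic, outputTableSpec);
    }

    public static void main(String[] args) {
        // Hypothetical project, region, topic and table names.
        System.out.println("POST " + launchUrl(
            "my-project", "us-central1", "gs://dataflow-templates/latest/PubSub_to_BigQuery"));
        System.out.println(launchBody(
            "pubsub-to-bq-demo", "projects/my-project/topics/rides", "my-project:demo.rides"));
    }
}
```

An actual call would also need an OAuth access token in the `Authorization` header; that part is left out here.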
public static void main(String[] args) {
  …
  Pipeline p = Pipeline.create(options);
  // Read the input file as lines of text.
  p.apply(TextIO.Read.named("ReadLines")
          .from(options.getInputFile()))
   // Count the words (composite transform defined elsewhere).
   .apply(new CountWords())
   // Format each result as a line of text.
   .apply(ParDo.of(new FormatAsTextFn()))
   // Write the formatted counts to the output location.
   .apply(TextIO.Write.named("WriteCounts")
          .to(options.getOutput()));
  p.run();
}
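The slide does not show what `CountWords` and `FormatAsTextFn` do internally. As a guess at the logic such steps usually carry, here is a minimal plain-Java sketch that splits lines into words, counts them, and formats each count as text. The tokenizing regex and the `word: count` output format are assumptions, and nothing here uses the SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a plain-Java guess at the count-and-format logic, not the SDK transforms.
final class WordCountSketch {

    /** Splits each line on runs of non-letters and counts the resulting words. */
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("[^\\p{L}]+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    /** Formats each entry as "word: count", in key order. */
    static List<String> formatAsText(Map<String, Long> counts) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            lines.add(entry.getKey() + ": " + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("to be or not to be");
        // prints [be: 2, not: 1, or: 1, to: 2]
        System.out.println(formatAsText(countWords(input)));
    }
}
```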
For further reading/watching
NYC Taxi Tycoon Codelab
Google-provided Dataflow Templates
The World Beyond Batch - Streaming 101
What is the watermark heuristic for PubsubIO on GCP?
Spotify’s Event Delivery Pipeline (Part 1 of 3)
Thank you!
www.ServerlessToronto.org
Reducing the gap between IT and Business needs
