Google Cloud Platform Blog

Product updates, customer stories, and tips and tricks on Google Cloud Platform

Drilling down into Stackdriver Service Monitoring

Monday, July 30, 2018

SLOs: Internally at Google, our Site Reliability Engineering (SRE) teams alert themselves only on customer-facing symptoms of problems, not on every potential cause. This better aligns them with customer interests, lowers their toil, frees them to do value-added reliability engineering, and increases job satisfaction. Stackdriver Service Monitoring lets you set, monitor, and alert on SLOs. Because Istio and App Engine are instrumented in an opinionated way, we know exactly what the transaction counts, error counts, and latency distributions are between services. All you need to do is set your targets for availability and performance, and we automatically generate the graphs for service level indicators (SLIs), compliance with your targets over time, and your remaining error budget. You can configure the maximum allowed drop rate for your error budget; if that rate is exceeded, we notify you and create an incident so that you can take action. To learn more about SLO concepts, including error budgets, we encourage you to read the SLO chapter of the SRE book.
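To make the arithmetic behind these terms concrete, here is a minimal sketch in Python. It is not the Stackdriver API: the `SloWindow` type, the function names, and the sample numbers are all hypothetical, and the calculation assumes a simple request-based availability SLO measured over a single compliance window.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one compliance window (hypothetical type)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: SloWindow) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if window.total_requests == 0:
        return 1.0
    return 1.0 - window.failed_requests / window.total_requests


def remaining_error_budget(window: SloWindow, target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The budget is the number of failures the target tolerates:
    (1 - target) * total_requests.
    """
    allowed_failures = (1.0 - target) * window.total_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no room for any failure.
        return 1.0 if window.failed_requests == 0 else float("-inf")
    return 1.0 - window.failed_requests / allowed_failures


def budget_drop_exceeded(before: float, after: float, max_drop: float) -> bool:
    """True when the budget fell by more than max_drop between two readings."""
    return (before - after) > max_drop


# Assumed sample values: a 99.9% availability target, one million requests,
# and 250 failures in the window.
window = SloWindow(total_requests=1_000_000, failed_requests=250)
sli = availability_sli(window)
budget = remaining_error_budget(window, target=0.999)
print(f"SLI: {sli:.5f}, remaining error budget: {budget:.1%}")
```

With these assumed numbers, the target tolerates about 1,000 failures, so 250 failures leave roughly three quarters of the budget unspent. A drop-rate alert in this sketch is just a comparison of two budget readings against a configured threshold; how the product evaluates that rate internally is not described here.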

Service Dashboard: At some point, you will need to dig deeper into a service’s signals. Maybe you received an SLO alert and there’s no obvious upstream cause. Maybe the service is implicated by the service graph as a possible cause for another service’s SLO alert. Maybe you have a customer complaint outside of an SLO alert that you need to investigate. Or, maybe you want to see how the rollout of a new version of code is going.

The service dashboard provides a single coherent display of all signals for a specific service, all scoped to the same timeframe with a single control, giving you the fastest possible way to get to the bottom of a problem with your service. Service Monitoring lets you dig deep into the service’s behavior across all signals without having to bounce between different products, tools, or web pages for metrics, logs, and traces. The dashboard shows the SLOs in the first tab, the service metrics (transaction rates, error rates, and latencies) in the second tab, and diagnostics (traces, error reports, and logs) in the third tab.

Once you’ve validated an error budget drop in the first tab and isolated anomalous traffic in the second tab, you can drill down further in the diagnostics tab. For performance issues, you can drill down into long-tail traces, and from there easily get into Stackdriver Profiler if your app is instrumented for it. For availability issues, you can drill down into logs and error reports, examine stack traces, and open Stackdriver Debugger if your app is instrumented for it.

Stackdriver Service Monitoring gives you a whole new way to view your application architecture, reason about its customer-facing behaviors, and get to the root of any problems that arise. It takes advantage of infrastructure software enhancements that Google has championed in the open-source world, and leverages the hard-won knowledge of our SRE teams. We think this will fundamentally transform the ops experience of cloud-native and microservice development and operations teams. To learn more, see the presentation and demo with Descartes Labs at GCP Next last week. We hope you will sign up to try it out and share your feedback.
Labels: Compute , Management Tools , Stackdriver
  
