The error qualifier in Go simply means that the function returns an error value if it fails. The underlying code implementing this interface for a Juniper line card differs significantly from the implementation for a Cisco line card, but the caller of the function is insulated from the implementation. The upper-level code imports the library, and when it operates on a line card, it can only perform one of the three actions we specified above.
We then realized that we could apply the same interface to many hardware components—for example, a fan. For certain vendors, the Online() and Offline() functions did nothing because those vendors didn't support turning a fan off; in those cases we used the interface only to check status.
type Fan interface {
    Online() error
    Offline() error
    Status() error
}
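As an illustration (a sketch of our own, with a hypothetical vendor and field names, not the production code), an implementation for such a vendor might look like this:

// hypotheticalFan sketches a Fan implementation for a vendor whose fans
// cannot be switched on or off in software.
type hypotheticalFan struct {
    addr string // management address of the device (hypothetical field)
}

// Online and Offline are deliberate no-ops for this vendor.
func (f *hypotheticalFan) Online() error  { return nil }
func (f *hypotheticalFan) Offline() error { return nil }

// Status is the only method that does real work: a full implementation
// would poll the device and return an error if the fan is unhealthy.
func (f *hypotheticalFan) Status() error {
    return nil
}

The caller is insulated from these differences; it simply checks the returned error, for example: if err := fan.Status(); err != nil { ... }.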
Building upon this line of thought, we realized that we could generalize this interface to define a common interface for all hardware components within a device.
By structuring the code this way, anyone can add a device from a new vendor. Moreover, anyone can add any type of new component as a library. Once the library implements this common interface, it can be registered as a handler for that specific vendor and component.
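As a rough sketch (the names below are ours for illustration, not those of the production system), the common interface and a registry of vendor handlers might look like this:

// Component is a sketch of a common interface every hardware library implements.
type Component interface {
    Online() error
    Offline() error
    Status() error
}

// factories maps a vendor/component pair to a constructor for its handler.
var factories = map[string]func(target string) Component{}

// Register is called by each library to claim a vendor/component pair.
func Register(vendor, component string, f func(target string) Component) {
    factories[vendor+"/"+component] = f
}

// New returns a handler for the given vendor and component, if one is registered.
func New(vendor, component, target string) (Component, bool) {
    f, ok := factories[vendor+"/"+component]
    if !ok {
        return nil, false
    }
    return f(target), true
}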
Deciding what to automate
The system needed to interact with humans at various stages of the automation. To decide what to automate, we drew a flow chart of the normal human-driven repair sequence and boxed the stages we believed we could replace with automation, using the task of replacing a vendor control plane board as an example. Many of the steps have self-explanatory names; these are definitions of some of the more complex ones:
Determine control plane: Find the faulty control plane unit.
Determine state: Is it the master or the backup?
Copy image to control plane: Copy the appropriate software image to the master control plane.
Offline control plane: Take the backup control plane offline.
Toggle mastership: Make the replaced control plane the new master.
Figure 1: Manual workflow for replacing a vendor control plane board
When we needed to carry out this workflow, a Google network engineer performed each step in Figure 1, except for pulling out and replacing the failed control plane, which was handled by someone on-site at the data center.
Once we had defined this task, we created an automated workflow. The goal of the new system was to provide a UI for our hardware engineers in a data center that allowed them to perform one of these operations at a specific time, under specific conditions, and with various automated safety checks, followed by a full device audit at the end of the operation. Previously, a human had performed all of these steps; now a human only needed to perform the "hardware gets replaced" step in Figure 2.
Figure 2: Automated workflow for replacing a vendor control plane board
Automation, before and after
Figure 3: High-level system view.
You can see in Figure 3 what the system looked like after automation. Before automating this workflow, there was a lot of manual work. When an alert initially came in, an engineer would stop traffic to the device and take the bad component offline by hand. Our network operations center (NOC) team would then work with the vendor (for example, Juniper or Cisco) to get a replacement part on-site. Next, we would file a change request in our change management system, noting the date of the operation.
On the day of the operation:
The data center technician clicks "start" on the change management system to begin the repair.
Our system picks up this change and is ready to begin the repair.
The technician clicks “start” on our UI.
An "offline" state machine steps through the stages needed to take the component offline safely (see the sketch after this list).
The UI notifies the technician at each step of the way.
Once the state machine has completed, it notifies the technician, who can safely replace the component.
Once the component is replaced and re-cabled, the technician returns to the UI and starts the "online" state machine, which safely returns the component to production.
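As a simplified sketch (our own illustration, not the production code), an "offline" state machine can be modeled as an ordered list of steps, each of which reports progress to the UI and aborts the workflow if a safety check fails:

// step is one stage of the offline workflow: a name shown in the UI and an
// action that returns an error if its safety check fails.
type step struct {
    name   string
    action func() error
}

// runOffline executes each step in order, notifying the UI as it goes, and
// stops at the first failure so the component is never left in an unsafe state.
func runOffline(steps []step, notify func(msg string)) error {
    for _, s := range steps {
        notify("starting: " + s.name)
        if err := s.action(); err != nil {
            notify("failed: " + s.name + ": " + err.Error())
            return err
        }
        notify("completed: " + s.name)
    }
    return nil
}

The "online" state machine follows the same pattern with the steps reversed.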
When we reviewed our original automation design, we noticed there would be a lot of work involved in building the various systems needed to implement the automated workflow. To facilitate collaboration, we created ticket items for each component of the system, so multiple engineers could work on the project in parallel.
Automation lessons learned
We used an iterative approach in our planning and execution. We first focused on replacing the line card for one vendor, then moved on to multiple vendors and multiple components. Due to the modular design of the code base and the interacting systems, adding more modules and scaling the code horizontally was easy.
For example, adding a new library that handled fan replacements meant simply writing the code for it and ensuring that it implemented the common interface above; the new library then registered itself in the main function, as sketched below.
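Continuing the hypothetical sketches above, that registration in the main function could be as small as:

func main() {
    // Hypothetical: register the new fan handler alongside the existing
    // vendor and component handlers, then start the automation service.
    Register("examplevendor", "fan", func(target string) Component {
        return &hypotheticalFan{addr: target}
    })
    // ... register other handlers and start serving requests.
}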
We had the option to extend or repurpose existing automation systems owned by our software management teams to meet our needs. We had to carefully consider whether to use those systems or build our own, potentially duplicating work if we chose the latter. Ultimately, we built our own automation because the other systems were understaffed. Trying to extend their tools would have disrupted other teams' project work and delayed our own project.
What worked well
Leveraging multiple engineers to automate our internal part of the workflow allowed us to take the project from design to implementation within a short period—about one year.
What didn’t
We haven't yet fully automated our hardware replacement workflow. Doing so involves troubleshooting hardware issues with vendors and persuading them that each individual failure merits a device or component replacement. We work around this gap in our automation by keeping spares on-site for use with our repair automation, and by handling the vendor portion of the process separately and mostly manually through our NOC. We are currently working with our vendor partners toward fully automating this interaction.
Measuring automation success
We can measure the hours our automation saves engineers using Google's production change logging service, which all internal tools use to record changes made to the production environment. The service logs changes made by tools manually invoked by engineers as well as by tools that provide end-to-end automation without manual input. We can therefore compare how long each network repair action took when performed manually with the number of repair actions that today's fully automated system performs; multiplying the historical manual duration of each action by the number of automated actions gives the total time saved. As shown in Figure 4, network hardware repair automation saves us hundreds of hours every month.
Tips for reducing toil through automation
While strategies for eliminating toil must be tailored to your individual environment and use cases, some approaches are universal. Based upon our own experience eliminating toil by automating network repair tasks, we recommend the following:
Measure your toil.
Tackle the biggest sources of toil first, and don't try to solve all problems at once.
Carefully consider whether to enhance existing tools or build new ones. Even if you can partially repurpose another team's work, would creating a tool from scratch actually make more sense cost- or resource-wise?
Take a design-driven approach: start small and iterate quickly. Don't try to design the perfect approach from the start.
Measure your time savings to determine your return on investment.
Automation has proved useful for our team of network site reliability engineers at GCP. Learn more about the practice of SRE and how you might apply its principles to your own network projects.
At this point, you might simply try to run the application by navigating to the main method and clicking the play button:
… and configuring the input arguments to translate some text from English to French by editing the newly created run configuration:
Run the program again, and this time you get the following error:
As you may have already guessed, you’re missing authentication rights to access the Cloud Translation API from your local machine. To overcome this, you’d normally have to go through the following steps:
Create a new service account with the appropriate roles for accessing the service
Update your local run configuration with the necessary environment variables to access the service
Thankfully, the Cloud Tools for IntelliJ plugin can help. In IntelliJ, navigate to the Cloud Tools menu item under “Tools > Google Cloud Tools > Add Cloud libraries …”:
Select the Cloud Translation API and your GCP project, and click “Add Cloud Libraries”:
In the confirmation window that appears, you can see that Cloud Tools for IntelliJ takes care of enabling the API and creating the service account for you:
Lastly, select the run configuration that you created earlier so that the plugin can inject the necessary environment variables for accessing the Cloud Translation service from your local machine:
Run the program again and your input text is successfully translated from English to French using the Cloud Translation service:
The Cloud Tools for IntelliJ plugin also assists with the following:
Adding Java client libraries to your Maven pom.xml if they are not already present
Writing a Bill of Materials (BOM) to your pom.xml to help avoid dependency version conflicts
Detecting and acting on potential misconfigurations, including a missing BOM, through pom.xml file inspections with quick-fixes
The Cloud Tools for IntelliJ plugin provides many more features to help optimize your development workflow, including support for Google App Engine, Stackdriver Debugger, Cloud Repositories, and Cloud Storage. For more information and to leave feedback, please visit the official documentation and GitHub pages:
“We needed a consistent platform to deploy and manage containers on-premise and in the cloud. As Kubernetes has become the industry standard, it was natural for us to adopt Kubernetes Engine on GCP to reduce the risk and cost of our deployments.”
- Dinesh Keswani, Global Chief Technology Officer at HSBC
Cloud Services Platform is technologically and architecturally aligned with the joint hybrid cloud products we've been developing and bringing to market with our partner, Cisco, with whom we have been collaborating closely. Our joint solution, Cisco Hybrid Cloud Platform for Google Cloud, will be generally available next month and is now certified to be consistent with Kubernetes Engine, enabling GCP out of the box.
Today, let’s take a look at aspects of the Cloud Services Platform, and how it lays a foundation for a fully realized cloud infrastructure.
Modernizing application architecture with Istio
Last year, we took a step toward helping organizations move from reactive IT management to proactive service operations—the idea of managing at a higher layer of the stack, enabling greater application awareness and control. In collaboration with several industry partners, we announced Istio, an open-source service mesh that gives operators the controls they need to manage microservices at scale. We are excited to say that open-source Istio will move to version 1.0 shortly, making it ready for production deployments.
Building on that open-source foundation, we are announcing a managed Istio service that you can use to manage services within a Kubernetes Engine cluster. Managed Istio, in alpha, is an Istio-powered service mesh available in Kubernetes Engine, complete with enterprise support. Managed Istio accelerates your journey to service operations with three high-level capabilities:
Service discovery and intelligent traffic management—Managed Istio surfaces all the services running in your cluster and manages network traffic between them. It provides application-level load balancing and sophisticated traffic routing for container and VM workloads, along with health checks, canary and blue/green deployments, and fault-tolerance features such as circuit breaking and timeouts.
Secure, authenticated communications—Managed Istio offers segmentation and granular policy for endpoints, helps with compliance and detection of anomalous behavior, and encrypts traffic by default using mTLS.
Monitoring and management—Understand and troubleshoot the system of services running across Managed Istio, including integration with Stackdriver, our suite of monitoring and management tools.
It's still early days, but we are very excited about Istio and Managed Istio, foundational technologies that will drive the use of containers and microservices, while helping to make your environment much more manageable, scalable and available.
Enterprise-grade Kubernetes, wherever you go
A great path to well-managed applications is undoubtedly containers and microservices, and having a common Kubernetes management layer can help get you there that much faster. Four years ago, we released Kubernetes, and the resulting Kubernetes Engine managed service is battle-tested and growing by leaps and bounds: In 2017 Kubernetes Engine core-hours grew 9X year over year.
Today, we are excited to bring that same managed Kubernetes Engine experience to your on-premise infrastructure. GKE On-Prem, soon to be in alpha, is Google-configured Kubernetes that you can deploy in the environment of your choice. GKE On-Prem makes it easy to install and upgrade Kubernetes and provides access to the following capabilities across GCP and on-premise:
Unified multi-cluster registration and upgrade management
Centralized monitoring and logging with Stackdriver integration
Hybrid Identity and Access Management
GCP Marketplace for Kubernetes applications
Unified cluster management for GCP and on-premise
Professional services and enterprise-grade support
Now, with GKE On-Prem, you can begin to modernize existing applications on-premise, without necessarily moving to the cloud. You gain control of your journey to the cloud at your own pace.
Automatically take control of your Kubernetes workloads
When it comes to managing clusters at scale, it’s imperative to have the right security controls in place and ensure your policies can be easily managed and enforced. Today, we’re pleased to announce GKE Policy Management which delivers centralized capabilities that make it far easier for administrators to configure Kubernetes (wherever it may be running).
With GKE Policy Management, Kubernetes administrators create a single source of truth for their policies that automatically syncs with any enrolled cluster. GKE Policy Management supports policies stored as definitions in a repository, and can also use your existing Google Cloud IAM policies to make it simple to secure your clusters. GKE Policy Management is coming soon to alpha; sign up here to express interest.
A service-centric view of your environment
More than simply making it easier to migrate workloads to the cloud, the technologies in Cloud Services Platform lay the groundwork for improving service operations by giving administrators a service-centric view of their infrastructure, rather than an infrastructure-centric view of their services. Today, we are announcing Stackdriver Service Monitoring, which provides the following new views:
Service graph: A real-time bird’s-eye visualization of the entire environment—see all your microservices, how they communicate, and their dependencies.
Service level objective (SLO) monitoring: Monitor and alert in the same customer-centric, low-toil manner as Google Site Reliability Engineers (SRE) do for our own services.
Service dashboard: All your signals for a given service are in a single place so that you can debug faster and easier than ever before and lower your mean-time-to-resolution (MTTR).
Stackdriver Service Monitoring is designed for workloads running on opinionated Istio infrastructure, as well as App Engine.
When microservices become APIs
Microservices provide a simple, compelling way for organizations to accelerate moving workloads to the cloud, serving as a path towards a larger cloud strategy. Istio enables service discovery, connection and management for microservices. But as soon as those services are needed for internal groups, partners or developers outside of the enterprise, they quickly cross the line and become APIs.
Just as organizations need services management for microservices, they need API management for their APIs. Apigee API Management complements Istio with the robust features of Google Cloud's Apigee API management platform, Apigee Edge, by extending API management natively into the microservices stack. Apigee Edge features include API usage, access, productization, catalog and discovery, plus a developer portal to create a smooth experience for developers and increase API consumption.
Making cloud all it could be
Here at Google, we could never have done what we do today without containers and Kubernetes, but taking a service-oriented view of our operations has been equally critical. In addition to the core capabilities mentioned above, Cloud Services Platform provides access to other new areas of functionality:
GKE serverless add-on lets you run serverless workloads on Kubernetes Engine with a one-step deploy. You can go from source to containers amazingly fast, auto-scale your stateless container-based workloads, and even scale down to zero. Sign up for an early preview for the GKE serverless add-on here.
Knative (pronounced kay-nay-tiv) is a set of open-source serverless components built from the same technology that enables the GKE serverless add-on. Knative lets you create modern, container-based, cloud-native applications by providing the building blocks you need to build and deploy container-based serverless applications anywhere on Kubernetes.
Cloud Build is a fully-managed Continuous Integration/Continuous Delivery (CI/CD) platform that lets you build, test, and deploy software quickly, at scale.
Now, with Cloud Services Platform, we’re excited to bring the full potential of the cloud to you, wherever your workloads may be. For more on Cloud Services Platform, you can read about how it relates to serverless computing.
To publish an API in Apigee Edge, an API developer typically needs to execute the following steps:
Log in to the Apigee Edge user interface with their credentials
Create a new API proxy, configure backend target, add policies
Add a callout policy to select the appropriate business integration process
Save and deploy the API proxy
Access Google Cloud services from the Apigee Edge user interface
API developers want to easily access and connect with Google Cloud services like Cloud Firestore, Cloud Pub/Sub, Cloud Storage, and Cloud Spanner. In each case, there are a few steps to perform to deal with security, data formats, request/response transformation, and even wire protocols for those systems.
Apigee Edge includes a new feature that simplifies interacting with these services and enables connectivity to them through a first-class policy interface that an API developer can simply pick from the policy palette and use. Once configured, these can be reused across all API proxies.
We’re working to expand this feature to cover more Google Cloud services. Simultaneously, we’re working with Informatica to include connections to other software-as-a-service (SaaS) applications and legacy services like hosted databases.
Publish business integration processes as managed APIs
Integration architects, working to connect data and applications across the enterprise, play an important role in packaging and publishing business integration processes as great API products. Working with Informatica, we’ve made this possible within Informatica’s Integration Cloud.
Integration architects who use Informatica's Integration Cloud for Apigee can now author composite services using business integration processes to orchestrate data services and applications, and publish them directly to Apigee Edge as managed APIs. This pattern is useful when the final destination of the API call is an Informatica business integration process.
To use this feature, integration architects need to execute the following steps:
Log in to their Informatica Integration Cloud user interface
Create a new business integration process or modify an existing one
Create a new service of type "Apigee," select the options (policies) presented in the wizard, and publish the process as an API proxy
Apply additional policies to the generated API proxy by logging in to the Apigee Edge user interface.
API documentation can be generated and published on a developer portal, and the API endpoint can be shared with app developers and partners.
APIs are an increasingly central part of organizations’ digital strategy. By working with Informatica, we hope to make APIs even more powerful and pervasive. Click here for more on our partnership with Informatica.
Kubernetes and Docker send Linux signals to your application inside the container to stop it. They send those signals to the process with the process identifier (PID) 1. If you want your application to stop gracefully when needed, you need to properly handle those signals.
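For example, here is a minimal Go program (a generic sketch, not tied to any particular framework) that traps SIGTERM, the signal Kubernetes sends before stopping a container, and shuts down cleanly:

package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Relay SIGTERM and SIGINT to a channel instead of letting them kill
    // the process immediately.
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

    log.Println("running; waiting for a termination signal")
    sig := <-sigs

    // Perform the graceful shutdown here: stop accepting new work, flush
    // buffers, close connections, then exit.
    log.Printf("received %v, shutting down cleanly", sig)
}

For the signal to reach your application at all, the application should run as PID 1 inside the container, which is one reason to prefer the exec form of ENTRYPOINT or CMD over the shell form.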
Docker can cache layers of your images to accelerate later builds. This is a very useful feature, but it introduces some behaviors that you need to take into account when writing your Dockerfiles. For example, you should add the source code of your application as late as possible in your Dockerfile so that the base image and your application’s dependencies get cached and aren’t rebuilt on every build.
Take this Dockerfile as an example:
FROM python:3.5
COPY my_code/ /src
RUN pip install my_requirements
You should swap the last two lines:
FROM python:3.5
RUN pip install my_requirements
COPY my_code/ /src
In the new version, the result of the pip command is cached, so the command is not rerun each time the source code changes.
Reducing the attack surface of your host system is always a good idea, and it’s much easier to do with containers than with traditional systems. Remove everything that the application doesn’t need from your container. Or better yet, include just your application in a distroless or scratch image. You should also, if possible, make the filesystem of the container read-only. This should get you some excellent feedback from your security team during your performance review.
Who likes to download hundreds of megabytes of useless data? Aim to have the smallest images possible. This decreases download times, cold start times, and disk usage. You can use several strategies to achieve that: start with a minimal base image, leverage common layers between images, and make use of Docker's multi-stage build feature.
Tags are how users choose which version of your image they want to use. There are two main ways to tag your images: semantic versioning, or using the Git commit hash of your application. Whichever you choose, document it and clearly set the expectations that users of the image should have. Be careful: while users expect some tags, like the "latest" tag, to move from one image to another, they expect other tags to be immutable, even if they are not technically so. For example, once you have tagged a specific version of your image with something like "1.2.3", you should never move this tag.
7. Carefully consider whether to use a public image
Using public images can be a great way to start working with a particular piece of software. However, using them in production can come with a set of challenges, especially in a high-constraint environment. You might need to control what’s inside them, or you might not want to depend on an external repository, for example. On the other hand, building your own images for every piece of software you use is not trivial, particularly because you need to keep up with the security updates of the upstream software. Carefully weigh the pros and cons of each for your particular use-case, and make a conscious decision.