Monday, January 18, 2021

 

 Risks of Using AI in the Public Cloud

Introduction

The bank robber Willie Sutton was reportedly once asked by a reporter why he robbed banks. His answer was plain and simple: “That’s where the money is.” A bank is a centralized location where many different customers’ valuables are kept under one roof. To some, a bank sounds like a great idea, as a much higher level of security can be implemented (e.g., armed guards, alarm systems, and a stronger safe) than any individual depositor could ever provide on their own. Decades later, a very similar scenario exists for the public cloud. Instead of money, what stands to be stolen now is corporate, medical, and government data, vital intellectual property, and software assets. But unlike the banking industry of yesteryear, where the Federal Deposit Insurance Corporation (FDIC) was created to insure depositors against the loss of their money, there is no corresponding US Government entity to compensate you if your valuable information is stolen or, even worse, altered without your knowledge. This last scenario would most assuredly lead to unreliable machine learning models being built and put into production.


Carnegie Endowment Study

In August of 2020 the Carnegie Endowment for International Peace published an in-depth study of the security and reliability of major cloud service providers (CSPs), entitled Cloud Security: A Primer for Policymakers. Some of the key findings of the eighty-three (83) page report are as follows:

- Cloud security thus far is a series of potential catastrophes narrowly averted

- The downside of creating a Fort Knox is that it becomes an inspiration and dream for criminals to target and that once breached, the impact could be devastating and widespread

- A downside of concentration is that incidents disrupting such CSP services have a much broader effect than narrower potential outages affecting one of their clients might

- Responsibility for risk in the cloud is inherently shared between customers and CSPs

- Vulnerabilities in hypervisors are a crucial issue for CSPs to address. The bad news is that, like all software, hypervisors have vulnerabilities

- Many cloud incidents are not caused by malicious adversaries but rather by human error as well as natural phenomena

- Complexities in the nested set of cloud-provided services thus made CSPs vulnerable to outages in critical regions, despite extensive efforts to build global networks and CSPs’ insistence that each region is logically separate and independent from all others

 

Risk Categories

Using the cloud as a platform on which to build and deploy artificial intelligence applications subjects your organization to a degree of risk that many would find unacceptable, especially in industries that rely on accurate split-second decisions. These inherent risks fall into three broad categories. The first is a history of cyber attacks specifically targeting cloud service providers. The second is the continual discovery of critical vulnerabilities in server CPUs and firmware, as well as in the layers of software infrastructure used pervasively throughout all public clouds. The third is human error, along with the forces of nature, both of which have repeatedly led to widespread disruption of cloud-based services. To better gauge the potential impact on your organization, all we need to do is look at prior, well-documented cases within each of these three core areas of risk.

 

Cyber-Security Breaches

The highly centralized storage of data, intellectual property, and software assets within the public cloud creates a very attractive target for both cyber criminals and foreign state-sponsored bad actors, especially considering the vast “surface area” that is exposed (i.e., hardware, network infrastructure, virtualization layers, software stacks, etc.) and the multitude of different means of attack (aka vectors). The first major cybersecurity breach in the public cloud, appropriately dubbed “Cloud Hopper”, was perpetrated by Chinese state-sponsored actors and involved cloud offerings from both IBM and DXC (formerly the services arm of Hewlett Packard Enterprise). It took place over a four-year period between 2014 and 2018 and involved massive amounts of data being exfiltrated (stolen), not only from the vendors themselves but from some of their leading clients as well.

In 2019 there was a significant breach within the Amazon AWS cloud involving the theft of large amounts of financial data from Capital One, including millions of credit card applications, some complete with Social Security numbers. The perpetrator gained access through a misconfigured cloud-based firewall, which enabled them to take advantage of a known exploit relating to data storage. What is interesting here is that, from a legal perspective, Capital One customers named Amazon AWS as a responsible party, in addition to the bank itself, in a civil lawsuit. A federal judge promptly denied Amazon’s motion to dismiss the case, saying its “negligent conduct” probably “made the attack possible.”

Most recently, in December of 2020, it was revealed that Russian state-sponsored bad actors had breached Microsoft Azure’s security mechanisms. In this incident the hackers stole emails from at least one private-sector company. In a Washington Post article about the break-in, Ellen Nakashima writes: “The intrusions appear to have occurred via a Microsoft corporate partner that handles cloud-access services, those familiar with the matter said. They did not identify either the Microsoft business partner or the company known to have had emails stolen. Like others, these people spoke on the condition of anonymity to discuss what remains a highly sensitive subject.”

Furthermore, a January 29th, 2021 Wall Street Journal article by Robert McMillan and Dustin Volz revealed that “The incident demonstrated how sophisticated attackers could leapfrog from one cloud-computing account to another by taking advantage of little-known idiosyncrasies in the ways that software authenticates itself on the Microsoft service. In many of the break-ins, the SolarWinds hackers took advantage of known Microsoft configuration issues to trick systems into giving them access to emails and documents stored on the cloud.”

 

Vulnerabilities in Hardware and Software Infrastructure

Across many industries, product defects introduced during the manufacturing process are simply unavoidable. This is especially true within the semiconductor, computer, and software fields. These “in-silicon” and software-based defects pose an immense risk to users when they occur in the server CPUs, or in the multiple layers of software infrastructure, that power all major cloud service providers’ offerings. The reason is that in the public cloud, both physical hardware and software infrastructure are shared among multiple customers simultaneously. This model is called “multi-tenancy”, and it is pervasive throughout the entire public cloud service industry. If you are unlucky enough to be sharing a physical machine with a criminal or a state-sponsored bad actor, these defects may enable the attacker to exfiltrate, or subtly alter, your data, intellectual property, or software assets. In the field of deep learning specifically, researchers have shown that changing just a single pixel of an image can cause a predictive model to fail to correctly recognize or classify it, which illustrates how much damage even a subtle alteration of data can do.
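One way to notice that data has been subtly altered is to record a cryptographic fingerprint of every item while it is still trusted, and to compare against that record later. The following is only a minimal sketch of the idea, not a complete defense: the file names, the tiny byte strings standing in for images, and the helper names fingerprint and find_tampered are all hypothetical, and the sketch assumes the trusted manifest is kept somewhere an attacker on the shared machine cannot reach.

```python
import hashlib


def fingerprint(records):
    """Map each record name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in records.items()}


def find_tampered(trusted_manifest, records):
    """Return names that changed, disappeared, or appeared since the manifest was made."""
    current = fingerprint(records)
    changed = {n for n in trusted_manifest if current.get(n) != trusted_manifest[n]}
    added = set(current) - set(trusted_manifest)
    return sorted(changed | added)


# Hypothetical training set: two tiny "images" represented as raw bytes.
dataset = {
    "cat_001.png": bytes([0, 0, 0, 255]),
    "dog_001.png": bytes([10, 20, 30, 255]),
}
manifest = fingerprint(dataset)  # assumed to be stored outside the shared environment

# Simulate an attacker changing a single byte (one pixel channel) of one image.
altered = dict(dataset)
altered["cat_001.png"] = bytes([0, 0, 1, 255])

tampered = find_tampered(manifest, altered)
print(tampered)
```

Even a one-byte change produces a different digest, so the altered image is reported by name. A check like this only detects tampering after the manifest was created; it says nothing about data that was already compromised at that point.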

 

CPU Flaws

In 2017 security researchers discovered two major vulnerabilities in modern microprocessors that would enable attackers to steal other users’ data, such as passwords and other sensitive information, by exploiting a defect in the CPU itself. These exploits, named Spectre and Meltdown, were of particular concern within a multi-tenant public cloud environment. They both violate one of the most fundamental premises of safe computing: “isolation” between applications running at the same time on the same machine. After the vulnerabilities were publicly announced, no security reports were published indicating that they had been leveraged in any attack. Bear in mind, however, that any formal disclosure would have immediately undermined customers’ confidence in the continued use of the public cloud. We will never really know for sure whether the exploits were ever used, as many significant cyber attacks go completely unreported. The reason is that the victimized party does not want any “bad press”, since it may influence customers to discontinue the use of the compromised services or products.

 

Critical Defects in Hypervisors

Hypervisors subdivide and allocate physical hardware, such as CPUs and random access memory (RAM), in order to support the creation of virtual machines (VMs). These VMs are run by multiple customers simultaneously on the very same server within the CSP’s data center. In 2019 the winners of a white-hat hacking competition found several vulnerabilities in virtualization software from VMware, including one that allowed code execution in the hypervisor. Such a flaw would essentially allow a threat actor to escape the confines of the VM and take control of the host machine (server) itself.


Critical Issues with Docker Engine and Pre-Built Containers

Many modern-day applications are composed of modular sets of services that are deployed and run within containerized environments, with Docker being the leading infrastructure technology for accomplishing this. Very often, predictive models are deployed within a Docker container, to be consumed by AI applications using RESTful web services. In 2019 two critical vulnerabilities were published relating to flaws within the Docker environment. One enabled a user to gain higher-level privileges, and the other allowed an attacker to overwrite the host’s runc binary and gain root access to the machine on which the Docker environment itself was installed. Furthermore, a 2020 study by Sonatype found that over 51% of the four million Docker images stored within Docker Hub, the leading repository of publicly available images for download, had at least one critical software vulnerability incorporated into their “pre-built” software stacks.

 

Infrastructure Outages

Perhaps the greatest risks posed to organizations using the public cloud relate not to cyber attacks, which oftentimes go completely unreported, but rather to the basic reliability and survivability of this massively complex environment. Think for a moment of what happened to the unsinkable Titanic, and of what simple human oversights and errors, as well as the elemental destructive forces of nature, can bring about. Notable service interruptions have occurred repeatedly and have been experienced by millions of public cloud users over the course of the past five years. Every organization must decide for itself how much downtime, or service unavailability, it can tolerate. If your tolerance for public cloud system failure is less than a couple of hours per incident, you may want to explore other options for building and deploying AI applications out in the fog (at the very edge of the computing network). Some of the more notable examples of failures within CSP infrastructure, along with the root-cause explanations provided by the cloud vendors themselves, are below.
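To make the question of tolerance concrete, the short sketch below compares a per-incident tolerance with the approximate outage durations described in the examples that follow, and converts each outage into a yearly availability figure. The two-hour tolerance, the function names, and the simplification that each outage is the only one in its year are assumptions made purely for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period during which the service was up."""
    return 1 - downtime_hours / period_hours


def exceeds_tolerance(incident_hours, tolerance_hours):
    """True when a single incident lasts longer than the organization can tolerate."""
    return incident_hours > tolerance_hours


# Approximate incident durations as described in this article, in hours.
incidents = {
    "AWS S3, Feb. 2017": 5,
    "Azure North Europe, June 2018": 11,
    "Azure Texas, Sept. 2018": 4,
    "Google Cloud, Dec. 2020": 1,
}

tolerance = 2  # hypothetical: "a couple of hours per incident"
for name, hours in incidents.items():
    verdict = "exceeds" if exceeds_tolerance(hours, tolerance) else "within"
    print(f"{name}: {hours} h ({verdict} tolerance), "
          f"{availability(hours):.4%} availability if it were the year's only outage")
```

The point of separating the two calculations is that a yearly availability percentage can look excellent while a single incident still runs well past what a split-second decision system can absorb.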

 

Amazon AWS

Feb. 2017 - An AWS outage in the US-EAST-1 region caused failures in many online platforms and organizations, including Airbnb, Signal, Slack, and the U.S. Securities and Exchange Commission, over a five-hour period. One firm later estimated that the downtime caused a loss of $150 million for the S&P 500 companies affected.

Amazon attributed the outage to simple human error, stating: “The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” Amazon went on to state: “While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”
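The failure mode described above, a mistyped input removing far more capacity than intended, is the kind of error that a pre-flight check can refuse to carry out. What follows is a hypothetical sketch of such a guard, not a description of Amazon's actual tooling: the function plan_removal, its parameters, and the server names are invented for illustration.

```python
def plan_removal(fleet, requested, min_capacity, max_batch):
    """Validate a capacity-removal request before anything is taken offline.

    Raises ValueError rather than silently removing more servers than is safe.
    Returns the list of servers that would remain in service.
    """
    targets = set(requested)
    unknown = sorted(targets - set(fleet))
    if unknown:
        raise ValueError(f"unknown servers: {unknown}")
    if len(targets) > max_batch:
        raise ValueError(
            f"refusing to remove {len(targets)} servers at once (limit {max_batch})"
        )
    remaining = len(fleet) - len(targets)
    if remaining < min_capacity:
        raise ValueError(
            f"removal would leave {remaining} servers, below minimum {min_capacity}"
        )
    return [server for server in fleet if server not in targets]


fleet = [f"server-{i}" for i in range(10)]

# A small, intended removal passes the checks.
print(plan_removal(fleet, ["server-1", "server-2"], min_capacity=6, max_batch=3))

# A mistyped request covering most of the fleet is blocked before it runs.
try:
    plan_removal(fleet, [f"server-{i}" for i in range(8)], min_capacity=6, max_batch=3)
except ValueError as err:
    print("blocked:", err)
```

The two limits serve different purposes: the batch limit catches an obviously oversized request, while the minimum-capacity check catches a series of individually small removals that would together starve the service.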

 

Microsoft Azure

June 2018 - Azure customers in Northern Europe experienced an outage lasting many hours after temperatures that were a bit warmer than expected led to automated infrastructure shutdowns. The outage was covered by an article on theregister.com, which stated: “Amid forecasts of heat and fears of water shortage in Ireland on Monday, Microsoft was about to confront a drought of a different kind: An Azure service outage. The disruption, which lasted from 1744 UTC on Tuesday, June 19 to 0430 UTC on Wednesday, June 20, downed a slew of services, as we previously reported. What was not disclosed at the time was the cause of the eleven-hour failure. Microsoft has now revealed that source of its troubles: mildly warm weather. That day in Dublin, where Microsoft's North Europe data center resides, the high temperature reached a pleasant 18°C or about 64°F in Freedom Units. Not exactly a scorcher, and while folks in the Emerald Isle enjoyed a warmer-than-expected evening, it nonetheless all proved too much for the company's kit.”

September 2018 - Lightning strikes caused a failure at an Azure data center in Texas, affecting customers using storage in the local region as well as some Azure services globally. The local region was offline for about four hours.

 

Google Cloud

December 2020 - A catastrophic failure in Google Cloud’s identity management and authentication system disrupted both its platform-as-a-service and software-as-a-service offerings, including its Compute Engine capabilities. The outage was global in scope and lasted about an hour.

Conclusion

Unfortunately, criminals and foreign state-sponsored actors are both tenacious and merciless in their never-ending quest to steal data, intellectual property, and software assets from the public cloud. Worse still, the chance remains that these bad actors may also alter information during long-running cyber-security breaches, negatively affecting the decisions, predictions, and recommendations made by machine learning models deployed into production. Like all mass-produced software products, the very hypervisors and containerization engines that power the public cloud are themselves fraught with vulnerabilities that would allow attackers to take control of shared servers or other resources. Ironically, some of the most severe outages within the public cloud over the past few years were caused by human error, such as an administrator issuing the wrong command from the keyboard, or by temperatures outside a data center being just a bit higher than usual for a given time of year. Taking all of this into account, the continued use of public cloud-based artificial intelligence platforms, tools, services, and programming interfaces poses a clear and present danger to any organization. These factors, taken together, can be especially damaging for those corporations, medical institutions, governmental agencies, and military units that need to process real-time events (including data, voice, and video) generated by sensors, monitors, and other devices at the very edge of the computing network, and then incorporate them into broader AI-based systems.