Risks of Using AI in the Public Cloud
A reporter once asked the bank robber Willie Sutton why he robbed banks. His answer, as famously reported, was plain and simple: “That’s where the money is.” A bank is a centralized location where many different customers’ valuables are co-located under one roof. To many, a bank sounds like a great idea, since a much higher level of security can be implemented (e.g., armed guards, alarm systems, and a stronger safe) than any individual depositor could ever provide on their own. Almost a hundred years later, a very similar scenario exists for the public cloud. But instead of the prospect of money being stolen, it is now corporate, medical, and government data, vital intellectual property, and software assets. And unlike the banking industry of yesteryear, where the Federal Deposit Insurance Corporation (FDIC) was created to reimburse citizens should the bank be robbed, there is no corresponding US Government entity to compensate you if your valuable information is stolen, or even worse, altered without your knowledge. This last scenario would most assuredly lead to unreliable machine learning models being built and put into production.
In August of 2020 the Carnegie Endowment for International Peace published an in-depth study on the security and reliability of major cloud service providers (CSPs), entitled Cloud Security: A Primer for Policymakers. Some of the key findings of the 83-page report are as follows:
- Cloud security thus far is a series of potential catastrophes narrowly averted.
- The downside of creating a Fort Knox is that it becomes an inspiration and dream for criminals to target, and once breached, the impact could be devastating and widespread.
- A downside of concentration is that incidents disrupting such CSP services have a much broader effect than narrower potential outages affecting one of their clients might.
- Responsibility for risk in the cloud is inherently shared between customers and CSPs.
- Vulnerabilities in hypervisors are a crucial issue for CSPs to address. The bad news is that, like all software, hypervisors have vulnerabilities.
- Many cloud incidents are not caused by malicious adversaries but rather by human error as well as natural phenomena.
- Complexities in the nested set of cloud-provided services thus made CSPs vulnerable to outages in critical regions, despite extensive efforts to build global networks and CSPs’ insistence that each region is logically separate and independent from all others.
Utilizing the cloud as a platform on which to build and deploy artificial intelligence applications subjects your organization to a degree of risk that many would find unacceptable, especially in industries that rely upon accurate split-second decisions. These inherent risks fall into three broad categories. The first is a history of cyber attacks specifically targeting cloud service providers. The second is the critical software vulnerabilities continually being discovered in server CPUs’ instruction sets and firmware, as well as in the various layers of software infrastructure used pervasively throughout all public clouds. The last is human error, and the very forces of nature themselves, which have repeatedly led to widespread disruption of cloud-based service offerings. To better gauge the potential impact these matters may have on your organization, we need only look at prior, well-documented cases within each of these three core areas of risk.
The highly centralized storage of data, intellectual property, and software assets within the public cloud creates a very attractive target for both cyber criminals and foreign state-sponsored bad actors, especially considering the vast “surface area” that is exposed (i.e., hardware, network infrastructure, virtualization layers, software stacks, etc.) and the multitude of different means of attack (aka vectors). The first major cybersecurity breach in the public cloud, appropriately dubbed “Cloud Hopper,” was perpetrated by Chinese state-sponsored actors and involved cloud offerings from both IBM and DXC (formerly the services arm of Hewlett Packard Enterprise). It occurred over a four-year period between 2014 and 2018 and involved massive amounts of data being exfiltrated (stolen), not only from the vendors themselves but from some of their leading clients as well.
In 2019 there was a significant breach within the Amazon AWS cloud involving the theft of large amounts of financial data from Capital One, including millions of credit card applications complete with Social Security numbers. The perpetrator gained access through a simple cloud configuration error, a misconfigured web application firewall, which enabled them to take advantage of a known exploit relating to data storage. What is interesting here is that, from a legal perspective, Capital One customers named Amazon AWS as a responsible party, in addition to the bank itself, in a civil lawsuit. A federal judge promptly denied Amazon’s motion to dismiss the case, saying its “negligent conduct” probably “made the attack possible.”
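Although the legal details are unique to this case, the underlying class of problem, a storage-related cloud misconfiguration, is something customers can audit for themselves. The following is a minimal sketch, assuming the boto3 library and already-configured AWS credentials, that flags S3 buckets lacking a full public-access block; it illustrates the general misconfiguration category rather than the specific flaw exploited at Capital One.

    # A minimal audit sketch: flag S3 buckets whose public-access block
    # is missing or only partially enabled. Assumes boto3 is installed
    # and AWS credentials are already configured for this account.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        try:
            response = s3.get_public_access_block(Bucket=name)
            settings = response["PublicAccessBlockConfiguration"]
            if not all(settings.values()):
                print(f"WARNING: {name} leaves some public access unblocked")
        except ClientError as err:
            if err.response["Error"]["Code"] == "NoSuchPublicAccessBlockConfiguration":
                print(f"WARNING: {name} has no public-access block configured")
            else:
                raise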
Most recently, in December of 2020, it was revealed that Russian state-sponsored bad actors had breached Microsoft Azure’s security mechanisms. In this incident the hackers stole emails from at least one private-sector company, if not more. In a Washington Post article about the break-in, Ellen Nakashima writes: “The intrusions appear to have occurred via a Microsoft corporate partner that handles cloud-access services, those familiar with the matter said. They did not identify either the Microsoft business partner or the company known to have had emails stolen. Like others, these people spoke on the condition of anonymity to discuss what remains a highly sensitive subject.”
Furthermore, a January 29th, 2021 Wall Street Journal article, written by Robert McMillan and Dustin Volz, revealed that “The incident demonstrated how sophisticated attackers could leapfrog from one cloud-computing account to another by taking advantage of little-known idiosyncrasies in the ways that software authenticates itself on the Microsoft service. In many of the break-ins, the SolarWinds hackers took advantage of known Microsoft configuration issues to trick systems into giving them access to emails and documents stored on the cloud.”
Across many industries, product defects introduced during the manufacturing process are simply unavoidable. This statement rings especially true within the semiconductor, computer, and software fields. These “in-silicon” and software-based defects pose an immense risk to users when they reside in the server CPUs, or in the multiple layers of software infrastructure, that power all major cloud service providers’ offerings. The reason is that in the public cloud, both physical hardware and software infrastructure are shared among multiple customers simultaneously. This business model is called “multi-tenancy,” and it is pervasive throughout the entire public cloud service industry. If you are unlucky enough to be sharing a physical machine with a criminal or a state-sponsored bad actor, these defects may enable an attacker to exfiltrate, or subtly alter, your data, intellectual property, or software assets. In the field of deep learning specifically, researchers have already shown that changing just one pixel of an image can cause a predictive model to fail to correctly recognize or classify it, so even minute, undetected tampering with data can undermine a model running in a production environment.
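To make that single-pixel fragility concrete, here is a self-contained sketch. It is my own illustration rather than the published one-pixel-attack code (which uses a differential-evolution search); the 8x8 image size, the random linear stand-in for a trained model, and the trial budget are all assumptions made purely for demonstration.

    # Illustrative sketch: search for a single pixel whose alteration
    # flips a toy classifier's prediction.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-in for a trained classifier: scores an 8x8 grayscale
    # image with a fixed random weight matrix, returning class 0 or 1.
    W = rng.normal(size=(2, 64))

    def classify(img):
        return int(np.argmax(W @ img.ravel()))

    def margin(img):
        scores = W @ img.ravel()
        return abs(scores[0] - scores[1])

    def find_one_pixel_flip(img, trials=5000):
        """Randomly alter single pixels until the predicted class changes."""
        original = classify(img)
        for _ in range(trials):
            candidate = img.copy()
            y, x = rng.integers(0, 8, size=2)
            candidate[y, x] = rng.random()  # overwrite exactly one pixel
            if classify(candidate) != original:
                return candidate, (int(y), int(x))
        return None, None

    # Images near the decision boundary are the most fragile, so start
    # from the least-confidently classified image in a random batch.
    image = min((rng.random((8, 8)) for _ in range(2000)), key=margin)
    adversarial, pixel = find_one_pixel_flip(image)
    if adversarial is not None:
        print(f"Changing pixel {pixel} flipped the prediction from "
              f"{classify(image)} to {classify(adversarial)}")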
In 2017 security researchers discovered two major vulnerabilities within microprocessors that would enable attackers to steal other users’ data, such as passwords and other sensitive information, by exploiting a defect in the CPU itself. The flaw could be exploited to particularly damaging effect within a multi-tenant public cloud environment. These exploits were named Spectre and Meltdown. They both violate one of the most fundamental safe-computing premises: process “isolation” between two applications running at the same time within an operating system. After the vulnerabilities were publicly announced, no security reports were published indicating that they had ever been leveraged in an attack; bear in mind, though, that any formal disclosure would have immediately undermined customers’ confidence in the continued use of the public cloud. We will never really know for sure whether the exploits were used, as many significant cyber attacks go completely unreported, the victimized party not wanting any “bad press” that might influence customers to discontinue the use of its compromised services or products.
Hypervisors subdivide and allocate physical hardware, such as CPUs and random access memory (RAM), in order to support the creation of virtual machines (VMs). These VMs are run by multiple customers simultaneously on the very same server within the CSP’s data center. In 2019 the winners of the Pwn2Own white-hat hacking competition found several vulnerabilities in virtualization software from VMware, including one that allowed code execution in the hypervisor, essentially allowing a threat actor to escape the confines of the VM and take control of the host machine (the server) itself.
Many modern-day applications are composed of modular sets of services that are deployed and run within containerized environments, with Docker being the leading infrastructure technology used to accomplish this. Very often, predictive models are deployed within a Docker container to be consumed by AI applications through RESTful web services, as sketched below.
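The sketch below illustrates that serving pattern in deliberately simplified form; the endpoint path, port number, feature names, and stand-in model are assumptions made for the example, not any particular product’s interface.

    # A minimal sketch of a predictive model wrapped in a RESTful
    # service, suitable for packaging in a Docker image.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def predict(features):
        # Stand-in for a real trained model: a fixed linear scoring rule.
        score = 0.3 * features["age"] + 0.7 * features["income"]
        return {"approved": score > 50.0}

    @app.route("/predict", methods=["POST"])
    def predict_endpoint():
        payload = request.get_json(force=True)
        return jsonify(predict(payload))

    if __name__ == "__main__":
        # Inside a container this would bind all interfaces and be mapped
        # to a host port, e.g. docker run -p 8080:8080 <image>.
        app.run(host="0.0.0.0", port=8080)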
In 2019 two critical vulnerabilities were published relating to flaws within the Docker environment: one enabled a user to gain higher-level privileges, and the second actually allowed an attacker to overwrite the host’s runc binary and gain root access to the machine on which the Docker environment itself was installed. Furthermore, in a 2020 study, Sonatype found that over 51% of the four million Docker images stored within Docker Hub, the leading repository of publicly available images for download, had at least one critical software vulnerability incorporated into their “pre-built” software stacks.
Perhaps the greatest risk posed to organizations using the public cloud relates not to cyber attacks, which oftentimes go completely unreported, but rather to the basic reliability and survivability of a massively complex environment. Think for a moment of what happened to the unsinkable Titanic, and of what simple human oversights and errors, as well as the elemental destructive forces of nature, can bring about. Notable service interruptions have been experienced by millions of public cloud users over the course of the past five years. Every organization must decide for itself how much downtime, or lack of service availability, it can tolerate; the short calculation below helps put vendor uptime guarantees in perspective. If your appetite for public cloud system failure is less than a couple of hours per incident, you may want to explore other options for building and deploying AI applications out in the fog (at the very edge of the computing network).
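As a rough yardstick, an availability guarantee translates directly into the downtime it permits. The service levels below are illustrative tiers, not any particular vendor’s published SLA.

    # Back-of-the-envelope arithmetic: convert an availability
    # percentage into the downtime it still permits per year.
    HOURS_PER_YEAR = 24 * 365

    for label, availability in [("99.9%", 0.999),
                                ("99.95%", 0.9995),
                                ("99.99%", 0.9999)]:
        downtime = HOURS_PER_YEAR * (1 - availability)
        print(f"{label} availability still permits about {downtime:.1f} "
              f"hours of downtime per year")

Even a 99.9% guarantee leaves room for nearly nine hours of downtime per year, which is well beyond a "couple of hours per incident" tolerance.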
Some of the more notable examples of failures within CSP infrastructure, along with the root-cause explanations provided by the major cloud vendors themselves, are below.
Feb. 2017 - An AWS outage in the US-EAST-1 region caused failures in many online platforms and organizations, including Airbnb, Signal, Slack, and the U.S. Securities and Exchange Commission, over a five-hour period. One firm later estimated that the downtime caused a loss of $150 million for the S&P 500 companies affected.
Amazon attributed the root cause of this outage to simple human error, saying that “The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” In addition, Amazon went on to state: “While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”
June 2018 - Azure customers in Northern Europe experienced an outage lasting many hours due to temperatures that were a bit warmer than expected, leading to automated infrastructure shutdowns. The outage was covered adeptly by an article on theregister.com, which stated: “Amid forecasts of heat and fears of water shortage in Ireland on Monday, Microsoft was about to confront a drought of a different kind: An Azure service outage. The disruption, which lasted from 1744 UTC on Tuesday, June 19 to 0430 UTC on Wednesday, June 20, downed a slew of services, as we previously reported. What was not disclosed at the time was the cause of the eleven-hour failure. Microsoft has now revealed the source of its troubles: mildly warm weather. That day in Dublin, where Microsoft's North Europe data center resides, the high temperature reached a pleasant 18°C or about 64°F in Freedom Units. Not exactly a scorcher, and while folks in the Emerald Isle enjoyed a warmer-than-expected evening, it nonetheless all proved too much for the company's kit.”
September 2018 - Lightning strikes caused a failure at an Azure data center in Texas, affecting customers using storage in the local region as well as some Azure services globally. The local region was offline for about four hours.
December 2020 – A catastrophic failure in Google Cloud’s identity management and authentication system resulted in the disruption of both its platform-as-a-service and software-as-a-service offerings, including its Compute Engine capabilities. The outage was global in scope and lasted about an hour.
Unfortunately, criminals and foreign state-sponsored actors are both tenacious and merciless in their never-ending quest to steal data, intellectual property, and software assets from the public cloud. Worse still, the chance remains that these bad actors may also alter information during long-running cyber-security breaches, negatively affecting the decisions, predictions, and recommendations made by machine learning models deployed into production. Like all mass-manufactured software products, the very hypervisors and containerization engines that power the public cloud are themselves fraught with vulnerabilities that would allow attackers to take control of shared servers or other resources. Ironically, some of the most severe outages within the public cloud over the past few years were caused by human error, such as an administrator issuing the wrong command from the keyboard, or by temperatures outside a data center being just a bit higher than usual for a given time of year. Taking all of this into account, the continued use of public cloud based artificial intelligence platforms, tools, services, and programming interfaces poses a clear and present danger to any organization. These factors, taken together, can be especially damaging for those corporations, medical institutions, governmental agencies, and military units that need to process, and then incorporate into broader AI-based systems, real-time events (including data, voice, and video) generated by sensors, monitors, and other devices at the very edge of the computing network.