Ulf Mattsson: The Past, Present, and Future of Big Data Security
July 2014
While Apache Hadoop and the craze around Big Data have exploded into the market, there are still far more questions than answers about this new environment. One of the biggest concerns, second perhaps only to ROI, remains security.
This is primarily because many have yet to grasp the paradigm shift from traditional database platforms to Hadoop. Traditional security tools address separation of duties, access control, encryption options, and more, but they are designed for structured, limited environments where data is carefully collected and cultivated. Hadoop, on the other hand, is a massively scalable, highly redundant environment with limited structure and high ingestion volume, designed for access to a vast pool of multi-structured data. What has been missing are new security tools to match.

Another challenge with securing Hadoop comes from the rapid expansion of the environment itself. Since its initial development, new tools and modules have been released not only by Apache, but by nearly every third-party vendor as well. As soon as security is tested and implemented for one module, three more have come out. This makes it very difficult to create an overall security architecture for the entire Hadoop ecosystem as it continues to grow.

Some security tools have been released over the last few years, including Kerberos, which provides strong authentication. But Kerberos does little to protect data flowing in and out of Hadoop, or to prevent privileged users such as DBAs or SAs from abusing the data. While authentication remains an important part of the data security structure in Hadoop, on its own it falls short of adequate data protection.
Another development was the addition of coarse-grained volume or disk encryption, usually provided by data security vendors. This solved one problem, protecting data at rest, but because a primary goal of Hadoop is using the data, it provided little in the grand scheme of Big Data security. Sensitive data in use for analytics, traveling between nodes, sent to other systems, or even just being viewed is subject to full exposure.
Until recently, Big Data technology vendors often left it to customers to secure their environments, as they too felt the burden of limited options. Today, vendors such as Teradata, Hortonworks, and Cloudera have partnered with data security vendors to help fill the security gap. What they’re seeking is advanced functionality equal to the task of balancing security and regulatory compliance with data insights and “big answers”.
The key to this balance lies not in protecting the growing ecosystem, or blanketing entire nodes with volume encryption, but targeting the sensitive data itself at a very fine-grained level, with flexible, transparent security. Applying this security through a comprehensive policy-based system can provide further control and additional options to protect sensitive data, including multiple levels of access to various users or processes. Once secured, the data can travel throughout the Hadoop ecosystem and even to outside systems and remain protected.
The options for fine-grained data security in Hadoop now include encryption (AES or format-preserving), masking, and Vaultless Tokenization. Typically, encryption is the least desirable option: standard strong encryption produces values that are unreadable to the tools and modules in Hadoop; format-preserving encryption is typically much slower than masking or Vaultless Tokenization; and both require complicated cryptographic key management across tens or even hundreds of nodes.
Masking was developed for non-production systems and testing, and has found a home in Hadoop’s early, experimental phase. Individual data elements are either replaced with random values or generalized so that they are no longer identifiable. It is fast, produces values that are readable to systems and processes, and requires no key management. However, because masking was designed for non-production use, it is usually not reversible, and is therefore not ideal for situations where the original data may be needed after it has been masked.
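The masking approach described above can be sketched in a few lines. This is a hypothetical illustration, not any vendor’s implementation: each character is swapped for a random one of the same class, so the result keeps its original format and length but cannot be reversed, and no cryptographic key is involved.

```python
import random
import string

def mask_value(value: str) -> str:
    """Irreversibly replace each character with a random one of the same
    class, preserving length and format (digits stay digits, letters stay
    letters, separators pass through unchanged)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isalpha():
            out.append(random.choice(string.ascii_letters))
        else:
            out.append(ch)  # keep separators like '-' intact
    return "".join(out)

# A masked SSN keeps its shape but carries no path back to the original.
masked = mask_value("378-28-1234")
```

Because the replacement draws fresh random values each time, downstream systems can still read the masked field, but nothing short of retaining the original data will recover it.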
Vaultless Tokenization, similar to masking, also replaces data elements with random values of the same data type and length. It is also much faster than format-preserving encryption, virtually eliminates key management, and is transparent to processes. The added benefit comes from the ability to perform both one-way protection and reversible security. This provides ideal protection for test/dev environments and can also allow retrieval of the original data when required by authorized users or processes.
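A rough sketch of the idea behind Vaultless Tokenization follows, under the simplifying assumption of per-position substitution tables derived from a secret; commercial products use pre-generated random tables and more elaborate schemes. The point it illustrates is the one above: tokens are deterministic, keep the original type and length, require no token vault, and can be reversed by authorized holders of the tables.

```python
import hashlib

DIGITS = "0123456789"

def _table(secret: bytes, position: int) -> str:
    """Derive a per-position digit permutation from the secret key
    (a stand-in for a vendor's pre-generated random tables)."""
    seed = hashlib.sha256(secret + position.to_bytes(4, "big")).digest()
    # Sort digits by their keyed hash to get a deterministic permutation.
    return "".join(sorted(DIGITS,
                          key=lambda d: hashlib.sha256(seed + d.encode()).digest()))

def tokenize(value: str, secret: bytes) -> str:
    # Substitute each digit through its position's table; pass separators through.
    return "".join(_table(secret, i)[int(c)] if c.isdigit() else c
                   for i, c in enumerate(value))

def detokenize(token: str, secret: bytes) -> str:
    # Invert the substitution by looking up each digit's index in the table.
    return "".join(str(_table(secret, i).index(c)) if c.isdigit() else c
                   for i, c in enumerate(token))

key = b"demo-secret"  # hypothetical key for illustration only
tok = tokenize("4111111111111111", key)
assert detokenize(tok, key) == "4111111111111111"  # reversible with the key
```

Because no vault of token-to-value mappings is kept, tokenization can run independently on many nodes without synchronizing state, which is what makes the approach attractive at Hadoop scale.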
Due to the write-once nature of the Hadoop environment (data can be written, read, or deleted, but not updated in place), applying these fine-grained protection methods requires a unique approach. This is typically performed in one of two ways. The first is a secured gateway, situated in front of Hadoop, which parses incoming data to identify sensitive data elements and applies the selected protection method before passing the data on to Hadoop. The second is a secured landing zone, which may be a node or partition within Hadoop that is protected with coarse-grained encryption. Files arrive in the landing zone and are then parsed by one of the processing frameworks in Hadoop (MapReduce, Hive, Pig, etc.), which identifies and protects sensitive data elements before ingesting the data into the main Hadoop cluster. This method utilizes the massively parallel processing of Hadoop to protect data efficiently.
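As a sketch of the landing-zone pattern, a Hadoop Streaming-style mapper could scan each arriving record and protect sensitive elements before the data moves into the main cluster. The CSV field layout and the simple redaction stand-in below are assumptions for illustration; in practice the protect step would call the chosen tokenization, masking, or encryption routine.

```python
import re
import sys

# Hypothetical layout: CSV records of name,ssn,amount land in the landing zone.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def protect(match: "re.Match") -> str:
    # Placeholder for the selected fine-grained method (tokenization, masking, FPE).
    return "XXX-XX-" + match.group(0)[-4:]

def main(stream):
    """Streaming mapper: scan each landing-zone record, protect sensitive
    elements, and emit the cleansed record for ingestion into the cluster."""
    for line in stream:
        yield SSN_RE.sub(protect, line.rstrip("\n"))

if __name__ == "__main__":
    for record in main(sys.stdin):
        print(record)
```

Run as a map-only job, each mapper protects its own slice of the input files in parallel, which is how the landing-zone approach leverages Hadoop’s own processing power.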
As the Hadoop ecosystem expands, security vendors will need to create new tools and integrations to keep pace. Although fine-grained security persists throughout the Hadoop ecosystem, in some cases users or processes will need sensitive information in the clear, which requires the ability to unprotect the data. Integrations within Hadoop applications allow such functions to occur securely within the Hadoop framework, rather than feeding the information in and out of a gateway. There are already integrations with MapReduce, Hive, Pig, and other applications, and data security vendors will need to innovate new solutions for the emerging standards of Hadoop 2.0 and beyond.
In the next five years, the generation of data by ever more people and devices will continue to drive companies toward Hadoop and other Big Data platforms. The requirements for handling extreme levels of volume, velocity, variety, and veracity will only increase, and Big Data will assume more and more critical business functions. As the environment becomes more established, usability and enterprise integration will improve, new data exchange protocols will be used, and a set of security tools will be standardized and made native to platforms.
Laws and regulations relating to privacy and security will also continue to increase, and security will become an even more vital component in Big Data. Companies will be unable to harness the massive amounts of machine-generated data from the Internet of Things without implementing comprehensive data security, first in industrial environments (power grids, etc.) and later in consumer applications (healthcare, etc.). Security will come to be viewed not only in terms of loss prevention, but of value creation, enabling compliant data collection, use, analysis, and monetization. Big Data security itself will evolve, becoming increasingly intelligent and data-driven in its own right. We will see more tools that can translate security event statistics into actionable information. Data security policies will be intricately designed, and likely multi-layered, utilizing a combination of coarse- and fine-grained security methods, access control, authentication, and monitoring.
In the exciting near future, the data is only getting bigger, but we must not allow it to outgrow security.