Secure Communication in Hadoop without Hurting Performance

Apache Hadoop is used for processing big data at many enterprises. A Hadoop cluster is formed by assembling a large number of commodity machines and enables the distributed processing of data. Enterprises store large amounts of important data on the cluster, and different users and teams process this data to produce summaries, generate insights, and derive other useful information.

Figure 1. Typical Hadoop cluster with clients communicating from all sides

Depending on the size of the enterprise and its domain, there can be a large amount of data stored on the cluster and many clients working on that data. Depending on the type of client, its network location, and the data in transit, the security requirements of the communication channel between the client and the servers can vary. Some interactions need to be authenticated. Others need to be authenticated and protected from eavesdropping. How to enable different qualities of protection for different cluster interactions forms the core of this blog post.

The requirement for different qualities of protection

At eBay, we have multiple large clusters that hold hundreds of petabytes of data and are growing by many terabytes daily. We have different types of data. We have thousands of clients interacting with the cluster. Our clusters are kerberized, which means clients have to authenticate via Kerberos.

Some clients are within the same network zone and interact directly with the cluster. In most cases, there is no requirement to encrypt the communication for these clients. In some cases, it is required to encrypt the communication because of the sensitive nature of the data. Some clients are outside the firewall, and it is required to encrypt all communication between these external clients and Hadoop servers. Thus we have the requirement for different qualities of protection.

Figure 2. Cluster with clients with different quality of protection requirements (Lines in red indicate communication channels requiring confidentiality)

Also note that there is communication between the Hadoop servers within the cluster. These servers also fall into the category of clients within the same network zone, i.e., internal clients.

Table 1. Quality of protection based on client location and data sensitivity

| # | Client   | Data                     | QOP                                     |
|---|----------|--------------------------|-----------------------------------------|
| 1 | Internal | Normal data              | Authentication                          |
| 2 | Internal | Sensitive data           | Authentication + Encryption + Integrity |
| 3 | External | Normal or sensitive data | Authentication + Encryption + Integrity |

Hadoop protocols

All Hadoop servers support both RPC and HTTP protocols. In addition, Datanodes support the Data Transfer Protocol.

Figure 3. Hadoop Resource Manager with multiple protocols on different ports

RPC protocol

Hadoop servers mainly interact via the RPC protocol. As an example, file system operations like listing and renaming happen over the RPC protocol.
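
For illustration, here is a minimal sketch (not from the original post) of metadata operations that travel over the Namenode's RPC protocol, using the standard org.apache.hadoop.fs.FileSystem API; the paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RpcOperationsExample {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml/hdfs-site.xml from the classpath,
    // including the hadoop.rpc.protection setting discussed later in this post.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Both calls below are RPCs to the Namenode; no block data is transferred.
    for (FileStatus status : fs.listStatus(new Path("/user/example"))) {
      System.out.println(status.getPath());
    }
    fs.rename(new Path("/user/example/input.txt"), new Path("/user/example/input.old"));
  }
}
```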

HTTP

All Hadoop servers expose their status, JMX metrics, etc. via HTTP. Some Hadoop servers offer additional functionality over HTTP to provide a better user experience. For example, the Resource Manager provides a web interface to browse applications.

Data Transfer Protocol

Datanodes store and serve the blocks via the Data Transfer Protocol.

Figure 4. A Hadoop Datanode with multiple protocols on different ports

Quality of Protection for the RPC Protocol

The quality of protection for the RPC protocol is specified via the configuration property hadoop.rpc.protection. The default value is authentication on a kerberized cluster. Note that hadoop.rpc.protection is effective only in a cluster where hadoop.security.authentication is set to kerberos.
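
As a concrete example, a minimal core-site.xml sketch of these settings might look like the following, using the values discussed above.

```xml
<!-- core-site.xml -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <!-- one of: authentication, integrity, privacy -->
  <name>hadoop.rpc.protection</name>
  <value>authentication</value>
</property>
```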

Hadoop supports Kerberos authentication and quality of protection via the SASL (Simple Authentication and Security Layer) Java library. The Java SASL library supports the three levels of quality of protection shown in Table 2.

Table 2. SASL and Hadoop QOP values

| # | SASL QOP  | Description                                                | hadoop.rpc.protection |
|---|-----------|------------------------------------------------------------|-----------------------|
| 1 | auth      | Authentication only                                        | authentication        |
| 2 | auth-int  | Authentication + Integrity protection                      | integrity             |
| 3 | auth-conf | Authentication + Integrity protection + Privacy protection | privacy               |

If privacy is chosen for hadoop.rpc.protection, data transmitted between client and server will be encrypted. The algorithm used for encryption will be 3DES.

Figure 5. Resource Manager supporting privacy on its RPC port

Quality of protection for HTTP

The quality of protection for HTTP can be controlled by the policies dfs.http.policy and yarn.http.policy; a configuration sketch follows the list below. The policy value can be one of the following:

  • HTTP_ONLY. The interface is served only on HTTP
  • HTTPS_ONLY. The interface is served only on HTTPS
  • HTTP_AND_HTTPS. The interface is served both on HTTP and HTTPS
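
For example, forcing HTTPS for the HDFS and YARN web endpoints might look like the sketch below; it assumes the SSL keystores and truststores are already configured via ssl-server.xml and ssl-client.xml.

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>

<!-- yarn-site.xml -->
<property>
  <name>yarn.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>
```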

Figure 6. Resource Manager supporting privacy on its RPC and HTTP ports

Quality of protection for the Data Transfer Protocol

The quality of protection for the Data Transfer Protocol can be specified in a similar way to that for the RPC protocol. The configuration property is dfs.data.transfer.protection. Like RPC, the value can be one of authentication, integrity, or privacy. Specifying this property makes SASL effective on the Data Transfer Protocol.

Figure 7. Datanode supporting privacy on all its ports

Specifying authentication as the value of dfs.data.transfer.protection requires clients to present a block access token when reading and writing blocks. Setting dfs.data.transfer.protection to privacy results in encrypting the data transfer. The algorithm used for encryption will be 3DES. It is possible to agree upon a different algorithm by setting dfs.encrypt.data.transfer.cipher.suites on both the client and server sides. The only value supported is AES/CTR/NoPadding. Using AES results in better performance and security. It is possible to further speed up encryption and decryption by using Advanced Encryption Standard New Instructions (AES-NI) via the libcrypto.so native library on Datanodes and clients.
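
A minimal hdfs-site.xml sketch of these Data Transfer Protocol settings, with privacy and the AES cipher suite, might look like this.

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.data.transfer.protection</name>
  <value>privacy</value>
</property>
<property>
  <!-- negotiate AES instead of the default 3DES for the encrypted transfer -->
  <name>dfs.encrypt.data.transfer.cipher.suites</name>
  <value>AES/CTR/NoPadding</value>
</property>
```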

Performance impact

Enabling privacy comes at the cost of performance, since encryption and decryption add processing overhead.

Setting hadoop.rpc.protection to privacy encrypts all RPC communication: from clients to Namenodes, from clients to Resource Managers, from Datanodes to Namenodes, from Node Managers to Resource Managers, and so on.

Setting dfs.data.transfer.protection to privacy encrypts all data transfer between clients and Datanodes. The client could be any HDFS client, such as a map task reading data, a reduce task writing data, or a client JVM reading and writing data.

Setting dfs.http.policy and yarn.http.policy to HTTPS_ONLY causes all HTTP traffic to be encrypted. This includes the web UI for Namenodes and Resource Managers, WebHDFS interactions, and others.

While this guarantees the privacy of interactions, it slows down the cluster considerably. The cumulative effect was a fivefold decrease in cluster throughput. We had set the quality of protection for both the RPC and Data Transfer protocols to privacy in our main cluster. We experienced severe delays in processing that resulted in days of backlogs in data processing. Namenode throughput was low, data reads and writes were very slow, and this increased the completion times of individual applications. Since applications were occupying containers for five times longer than before, there was severe resource contention that ultimately caused the application backlogs.

The immediate solution was to change the quality of protection back to authentication for both protocols, but we still required privacy for some of our interactions, so we explored the possibility of choosing a quality of protection dynamically during connection establishment.

Selective encryption

As noted, the RPC protocol and the Data Transfer Protocol internally use SASL to incorporate security. On each side (client and server), SASL can support an ordered list of QOPs. During the client-server handshake, the first common QOP is selected as the QOP of that interaction. Table 2 lists the valid QOP values.

Figure 8. Sequence of client and server agreeing on a QOP based on configured QOPs

In our case, we wanted most client-server interactions to require only authentication and very few client-server interactions to use privacy. To achieve this, we implemented a custom SaslPropertiesResolver. The resolver is consulted during connection setup to determine the list of QOPs to offer for the handshake. The default implementation of SaslPropertiesResolver simply returns the values specified in hadoop.rpc.protection and dfs.data.transfer.protection for the RPC and Data Transfer protocols, respectively.

More details on SaslPropertiesResolver can be found in the related JIRA, HADOOP-10221.

eBay’s implementation of SaslPropertiesResolver

In our implementation of SaslPropertiesResolver, we use a whitelist of IP addresses on the server side. If the client’s IP address is in the whitelist, then the list of QOPs specified in hadoop.rpc.protection/dfs.data.transfer.protection is used. If the client’s IP address is not in the whitelist, then we use the list of QOPs specified in a custom configuration property, hadoop.rpc.protection.non-whitelist. To avoid very long whitelists, we use CIDRs to represent IP address ranges.
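
The sketch below illustrates the idea (it is not our production code). It assumes the SaslPropertiesResolver API from HADOOP-10221, in which the server consults getServerProperties(InetAddress) for each incoming connection, and for brevity it matches exact IP addresses from a hypothetical hadoop.security.qop.whitelist property rather than CIDR ranges.

```java
import java.net.InetAddress;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import javax.security.sasl.Sasl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SaslPropertiesResolver;

public class WhitelistQopResolver extends SaslPropertiesResolver {

  private Set<String> whitelist;            // internal client IPs (simplified; we use CIDRs)
  private Map<String, String> strictProps;  // QOPs offered to non-whitelisted clients

  @Override
  public void setConf(Configuration conf) {
    super.setConf(conf);  // builds the default QOP map from hadoop.rpc.protection
    whitelist = new HashSet<>(Arrays.asList(
        conf.getTrimmedStrings("hadoop.security.qop.whitelist")));       // hypothetical key
    strictProps = toSaslProps(
        conf.getTrimmedStrings("hadoop.rpc.protection.non-whitelist", "privacy"));
  }

  @Override
  public Map<String, String> getServerProperties(InetAddress clientAddress) {
    // Whitelisted (internal) clients are offered the default list, e.g. authentication,privacy;
    // everyone else is offered only the stricter list, e.g. privacy.
    return whitelist.contains(clientAddress.getHostAddress())
        ? getDefaultProperties() : strictProps;
  }

  // Translate Hadoop QOP names into the SASL tokens used during negotiation (see Table 2).
  private static Map<String, String> toSaslProps(String[] qops) {
    StringBuilder saslQops = new StringBuilder();
    for (String qop : qops) {
      if (saslQops.length() > 0) {
        saslQops.append(',');
      }
      saslQops.append("privacy".equals(qop) ? "auth-conf"
          : "integrity".equals(qop) ? "auth-int" : "auth");
    }
    Map<String, String> props = new HashMap<>();
    props.put(Sasl.QOP, saslQops.toString());
    props.put(Sasl.SERVER_AUTH, "true");
    return props;
  }
}
```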

In our clusters, we set hadoop.rpc.protection and dfs.data.transfer.protection on both clients and servers to authentication,privacy. The configuration property hadoop.rpc.protection.non-whitelist was set to privacy.
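
Put together, the server-side configuration looked roughly like the sketch below. The resolver class name and hadoop.rpc.protection.non-whitelist are our custom additions, and the *.saslproperties.resolver.class hook names shown are the ones we believe current Hadoop releases use; verify them against your Hadoop version.

```xml
<!-- core-site.xml -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>authentication,privacy</value>
</property>
<property>
  <name>hadoop.rpc.protection.non-whitelist</name>
  <value>privacy</value>
</property>
<property>
  <name>hadoop.security.saslproperties.resolver.class</name>
  <value>com.ebay.hadoop.security.WhitelistQopResolver</value>  <!-- hypothetical class name -->
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.data.transfer.protection</name>
  <value>authentication,privacy</value>
</property>
<property>
  <name>dfs.data.transfer.saslproperties.resolver.class</name>
  <value>com.ebay.hadoop.security.WhitelistQopResolver</value>
</property>
```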

When whitelisted clients connect to servers, they agree upon authentication as the QOP for the connection, since both clients and servers support the list of QOP values authentication,privacy. The order of QOP values in the list is important: if both clients and servers supported the list privacy,authentication instead, they would agree upon privacy as the QOP for their connections.

Figure 9. Internal clients, external clients, and the cluster

Only clients inside the firewall are in the whitelist. When these clients connect to servers, the QOP used will be authentication, as both clients and servers use authentication,privacy for the RPC and Data Transfer protocols.

The external clients are not in the whitelist. When these clients connect to servers, the QOP used will be privacy, as the servers offer only privacy based on the value of hadoop.rpc.protection.non-whitelist. The clients in this case could be using authentication,privacy or privacy in their configuration. If a client specifies only authentication, the connection will fail, as there will be no common QOP between the client and the servers.

Figure 10. Sequence of Hadoop client and server choosing a QOP using SaslPropertiesResolver

The diagram above depicts the sequence of an external client and a Hadoop server agreeing on a QOP. Both client and server support two potential QOPs, authentication,privacy. When the client initiates the connection, the server consults the SaslPropertiesResolver with the client’s IP address to determine the QOP. The whitelist-based SaslPropertiesResolver checks the whitelist, finds that the client’s IP address is not whitelisted, and hence returns only privacy. The server then offers privacy as the only QOP choice to the client during SASL negotiation, so the QOP of the subsequent connection will be privacy. This requires setting up a secret key based on the cipher suites available at the client and server.

In some cases, internal clients need to transmit sensitive data and prefer to encrypt all communication with the cluster. Such clients can force encryption by setting both hadoop.rpc.protection and dfs.data.transfer.protection to privacy. Even though the servers support authentication,privacy, connections will use the common QOP, privacy.
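
For example, an internal client could force privacy for a single sensitive upload with generic -D options, which override the cluster defaults for that invocation (the file and destination paths are placeholders).

```sh
hadoop fs -D hadoop.rpc.protection=privacy \
          -D dfs.data.transfer.protection=privacy \
          -put sensitive-data.csv /secure/landing/
```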

Table 3. QOP lists at clients and servers, and the negotiated QOP

| # | Client QOP               | Server QOP               | Negotiated QOP | Comments                              |
|---|--------------------------|--------------------------|----------------|---------------------------------------|
| 1 | Authentication, Privacy  | Authentication, Privacy  | Authentication | For normal clients                    |
| 2 | Authentication, Privacy  | Privacy                  | Privacy        | For external clients                  |
| 3 | Privacy                  | Authentication, Privacy  | Privacy        | For clients to transmit sensitive data |

Selective secure communication using Apache Knox

Supporting multiple QOPs on the Hadoop protocols enables selective encryption. This protects any data transfer, whether it carries real data or signaling data like delegation tokens. Another approach is to use a reverse proxy such as Apache Knox. Here the external clients have connectivity only to the Apache Knox servers, and Knox allows only HTTPS traffic. The cluster itself supports only one QOP, authentication.

Figure 11. Protecting data transfer using Reverse Proxy (Apache Knox)

As shown in the diagram, external clients interact with the cluster via Knox servers. The transmission between the client and Knox is encrypted. Knox, being an internal client, forwards the transmission to the cluster in plain text.
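
For example, an external client might list a directory through Knox’s WebHDFS endpoint over HTTPS as shown below; the gateway host, port 8443, and the default topology name are typical Knox defaults and are placeholders here.

```sh
curl -ku 'user:password' \
  'https://knox.example.com:8443/gateway/default/webhdfs/v1/user/example?op=LISTSTATUS'
```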

Selective encryption using a reverse proxy vs. SaslPropertiesResolver

As we discussed, it is possible to achieve a different quality of protection for the client-cluster communication in two ways.

  • With a reverse proxy to receive encrypted traffic
  • By enabling the cluster to support multiple QOPs simultaneously

There are pros and cons with both approaches, as shown in Table 4.

Table 4. Pros and cons of the two approaches to supporting multiple qualities of protection

| Approach               | Pros | Cons |
|------------------------|------|------|
| SaslPropertiesResolver | Natural and efficient. Clients can communicate directly with the Hadoop cluster using the Hadoop commands and native Hadoop protocols. | Maintenance of the whitelist. The whitelist needs to be maintained on the cluster, and care should be taken to keep it small by using CIDRs so that each connection does not require a lookup against a large list of IP addresses. |
| Reverse proxy (Knox)   | Closed cluster. External clients need to communicate only with the Knox servers, so all other ports can be closed to them. | Maintenance of additional components. A pool of Knox servers needs to be maintained to handle traffic from external clients, and depending on the amount of data, quite a lot of Knox servers could be needed. |

Depending on the organization’s requirements, the proper approach should be chosen.

Further work

Enable separate SaslPropertiesResolver properties for client and server

A process can be a client, a server, or both. In some environments, it is desirable to use different logic for choosing the QOP depending on whether the process is the client or the server in a specific interaction. Currently, there is provision to specify only one SaslPropertiesResolver via configuration. If a client needs a different SaslPropertiesResolver, it needs to use a different configuration; if the same process needs different SaslPropertiesResolvers while acting as client and as server, there is no way to do that. It would be a good enhancement to be able to specify different SaslPropertiesResolvers for the client and server roles.

Use ZooKeeper to maintain the whitelist of IP addresses

Currently, the whitelist of IP addresses is maintained in local files. This introduces the problem of updating thousands of machines and keeping them all in sync. Storing the whitelist in a ZooKeeper node may be a better alternative.

Currently, the whitelist is cached by each server during initialization and then reloaded periodically. A better approach may be to reload the whitelist based on an update event.