
Open Source based Hyper-Converged Infrastructures and Hadoop


According to a report from SimpliVity [1], Hyper-Converged Infrastructures are used by more than 50% of the interviewed businesses, with the trend still increasing. But what does this mean for BigData solutions, and for Hadoop especially? Which tools and technologies can be used, and what are the limitations and gains of such a solution?

To build a production-ready and reliable private cloud that supports both on-demand and static Hadoop clusters, I have had great experience with OpenStack, SaltStack and the Sahara plugin for OpenStack.

OpenStack supports Hadoop-on-demand via Sahara; it is also convenient to use VMs and install a Hadoop distribution within them, especially for static clusters with special setups. The OpenStack project provides ready-to-go images [2], for example for Vanilla 2.7.1 based Hadoop installations. As an additional benefit, OpenStack supports Docker [3], which adds an extra layer of flexibility for additional services, like Kafka [4] or Solr [5].
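As a rough sketch, provisioning such an on-demand cluster with the Sahara plugin for the OpenStack client could look like the following; the image, template and network names are placeholders and depend on your environment:

```shell
# Register the prebuilt Vanilla 2.7.1 image with Sahara
# (image name and login user are assumptions for this sketch)
openstack dataprocessing image register sahara-vanilla-2.7.1 --username ubuntu
openstack dataprocessing image tags add sahara-vanilla-2.7.1 --tags vanilla 2.7.1

# Launch a cluster from a previously defined cluster template
openstack dataprocessing cluster create \
  --name demo-hadoop \
  --cluster-template vanilla-2.7.1-template \
  --image sahara-vanilla-2.7.1 \
  --neutron-network private-net
```

The cluster template itself (node group sizes, Hadoop configs) would be defined beforehand in Sahara; the commands above only instantiate it.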

Costs and Investment

The costs of such an infrastructure can vary, depending on the hardware and the future strategy. Separating compute and storage nodes has proven itself in the past and should be used in the future, too. The benefits outweigh the limitations, which mostly come down to having more bare-metal servers than in a highly packed (compute and storage in one server) environment. Additionally, a more stretched-out environment helps to balance peaks and high usage better than packed servers. A typical setup would have 2 controller nodes (for HA reasons), a decent count of compute nodes (high memory and CPU count) and several storage nodes (1 CPU, 8 or 16 GB RAM and plenty of JBOD (just a bunch of disks)). These storage nodes should have two LVM volume groups (or RAIDs, if that feels better) to avoid later conflicts between production and development / staging / QA buildouts.
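A minimal sketch of such a storage-node layout, assuming the JBOD disks show up as /dev/sdb through /dev/sde (device names and volume group names are placeholders):

```shell
# Create physical volumes on the JBOD disks
pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Two separate volume groups: one for production, one for dev/staging/QA,
# so the environments cannot eat each other's capacity later
vgcreate vg-cinder-prod /dev/sdb /dev/sdc
vgcreate vg-cinder-dev  /dev/sdd /dev/sde

# Then point two Cinder LVM backends at them in cinder.conf:
#   [lvm-prod]  volume_group = vg-cinder-prod
#   [lvm-dev]   volume_group = vg-cinder-dev
```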

Technology

Hadoop itself has some limitations, especially in Hyper-Converged Infrastructures, given its demand for data locality in batch processes (MapReduce). In a typical cloud environment, like the one Sahara provides in OpenStack, the storage area is virtualized, and all data is transferred over the network stack. This can be avoided by using VM images for a persistent Hadoop cluster, as a production cluster mostly is. The data storage (HDFS) is then provided within the VM and can be extended by mounting additional volumes into the VM (partitions for the DataNodes, for example). In both implementations, cloud-based via Sahara and VM-based, the use of HDFS caching [6] is recommended. It will dramatically speed up the platform for analytical workloads when combined with columnar storage formats like Parquet or Kudu [7], together with Hive on Spark [8]. To identify bottlenecks, analyzers like Dr. Elephant [9] are very useful and recommended.
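Centralized HDFS caching is managed with the `hdfs cacheadmin` tool; a sketch of pinning a hot Parquet-backed warehouse directory into DataNode memory (pool name and path are examples, not from this setup):

```shell
# Create a cache pool and pin a hot table directory into DataNode memory;
# dfs.datanode.max.locked.memory must be raised in hdfs-site.xml first
hdfs cacheadmin -addPool analytics -mode 0755
hdfs cacheadmin -addDirective -path /warehouse/sales_parquet -pool analytics -replication 2

# Verify what is being cached
hdfs cacheadmin -listDirectives -pool analytics
```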

Hadoop on demand provides much more flexibility than a static cluster does, especially in terms of load peaks, dynamic resource allocation and cost efficiency. But there are some points to consider. The first and most important one is the separation of block storage and computing. Hadoop itself works with various other distributed filesystems, like Ceph [10], but those often rely on Hadoop 1 (MRv1); YARN and MRv2 aren't supported (yet).

The best solution here is to use the standard HDFS layer on top of Cinder [11], which provides good performance with reliability and decent IOPS. The second, also important point is the network layer. Every compute and storage node should have at least bonded 1 GbE uplinks; 10 GbE is better (but more expensive). The network needs to be separated into front- and backend. The front-end link provides access to the services the cluster offers to its users, and the back-end carries inter-cluster communication only. As a third point, the use of in-memory filesystems like Alluxio [12] (formerly Tachyon) may be considered, especially for research clusters, like genome calculations, or NRT applications with high ingestion rates of small data points, as IoT devices typically produce.
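Extending HDFS with a Cinder volume then comes down to attaching the volume to the VM and adding the mount point to the DataNode's data directories; volume, server and mount names below are placeholders:

```shell
# Create a 500 GB block volume and attach it to the DataNode VM
openstack volume create --size 500 datanode01-disk1
openstack server add volume datanode01 datanode01-disk1

# Inside the VM: format, mount, and hand the disk to HDFS
mkfs.ext4 /dev/vdb
mkdir -p /data/1
mount /dev/vdb /data/1

# Then append /data/1 to dfs.datanode.data.dir in hdfs-site.xml:
#   <property>
#     <name>dfs.datanode.data.dir</name>
#     <value>/data/0,/data/1</value>
#   </property>
```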

With these points in mind, streaming-based applications get the most out of this approach, thanks to the high flexibility and the ability to deal with large load peaks by adding computing resources dynamically.

Conclusion

Using Hyper-Converged Infrastructures in the world of BigData tools is trending now and proves the success of the private cloud idea. Large companies like LinkedIn, Google and Facebook have been on this road for years, and the success outweighs the implementation and maintenance considerations.

List of tools used in this article

Openstack:
Sahara:

Saltstack - Openstack:

Links and References:
