The goal of Big Data Block is to remove all technical barriers for end users and miners and simply have the system run: no tweaking and no trying to figure out how to run the data processing.  Users of BDB won't need any technical skills, and no installs or setup of any technical components are required.  We are 100% focused on solving the problem of processing big data.

I do believe the Golems and SONMs of the world might be efficient for the AI/machine learning side of things, as that is more about raw processing power and the movement of data likely has less of an impact.

With this in mind, I thought I would answer a question we have gotten lately: how is Big Data Block different from distributed CPU solutions like Golem or SONM?

First, let me say these distributed CPU projects are great and serve very real needs.  This isn't about better or worse.  It's simply about the difference between using something built to be a generic distributed supercomputer and building a system uniquely designed for big data processing.

Big Data Block does one thing and does it very well. Our system is custom tuned for this one task. Both Golem's and SONM's systems are designed to handle a wide variety of tasks where raw processing power is the goal. One might say they're a jack of all processing jobs, master of none. We are Big Data masters.

To expand on this, I have to get into a bit of the tech behind big data systems based on Hadoop and MapReduce.

Hadoop MapReduce is the heart of the Hadoop system. It provides all the capabilities you need to break big data into manageable chunks, process the data in parallel on your distributed cluster, and then make the data available for user consumption or additional processing.
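To make this concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (the class and variable names are just illustrative). The map step runs independently on each chunk of input, and the reduce step merges the partial counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map step: each mapper receives one split of the input and emits a
  // (word, 1) pair for every word it sees in that split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the framework groups the pairs by word, and each reducer
  // sums the counts for the words it is responsible for.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```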

The first challenge is storing big data. Hadoop Distributed File System (HDFS) solved this challenge.

HDFS provides a distributed way to store big data. Your data is stored in blocks on DataNodes, and you specify the size of each block. For example, if you have 512MB of data and have configured HDFS to create 128MB blocks, HDFS will divide the data into 4 blocks (512/128 = 4) and store them across different DataNodes. It will also replicate each block on different DataNodes, and since commodity hardware is used, storing the data is not a challenge.
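As a rough sketch of that arithmetic, and of the two HDFS settings that control it, here is a small Java example. The dfs.blocksize and dfs.replication properties are normally set cluster-wide in hdfs-site.xml; setting them on a Configuration object here is just for illustration.

```java
import org.apache.hadoop.conf.Configuration;

public class BlockMath {
  public static void main(String[] args) {
    long fileSize  = 512L * 1024 * 1024;  // 512MB of input data
    long blockSize = 128L * 1024 * 1024;  // 128MB HDFS block size

    // 512MB / 128MB = 4 blocks, spread across different DataNodes.
    long blocks = (long) Math.ceil((double) fileSize / blockSize);

    // With a replication factor of 3, HDFS keeps 3 copies of each block.
    int replication = 3;
    System.out.println("blocks = " + blocks + ", stored copies = " + blocks * replication);

    // The same values expressed as HDFS configuration properties.
    Configuration conf = new Configuration();
    conf.setLong("dfs.blocksize", blockSize);
    conf.setInt("dfs.replication", replication);
  }
}
```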

HDFS also solves the scaling problem, since it focuses on horizontal scaling instead of vertical scaling. You can always add extra DataNodes to an HDFS cluster when required, instead of scaling up the resources of your existing nodes. As an example, if you are storing 1TB with HDFS you don't need a single 1TB system; you can spread it across multiple systems with 128GB or even less.

The second challenge is storing the variety of data. This problem is also addressed by HDFS.

With HDFS you can store all kinds of data, whether structured, semi-structured or unstructured, since HDFS performs no schema validation before the data is written. It also follows a write-once-read-many model: you write the data once and can read it many times to find insights.
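As a minimal sketch of what that looks like in practice (the paths and record contents below are made up for illustration), you can write a CSV record and a JSON record into the same cluster with no schema declared anywhere; the structure is only interpreted later, at read time.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MixedDataWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // A structured, CSV-style record.
    try (FSDataOutputStream out = fs.create(new Path("/data/orders/orders.csv"))) {
      out.writeBytes("1001,2018-03-01,49.99\n");
    }

    // A semi-structured JSON record in the same cluster, no schema required.
    try (FSDataOutputStream out = fs.create(new Path("/data/events/clicks.json"))) {
      out.writeBytes("{\"user\":\"abc\",\"page\":\"/home\",\"ts\":1520000000}\n");
    }
  }
}
```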

The third challenge is accessing and processing the data quickly. This is one of the major challenges with big data. To solve it, we move the processing to the data instead of moving the data to the processing. What does this mean, exactly?

Instead of moving the data to the master node and processing it there, YARN sends the processing logic to the various slave nodes, and the data is processed in parallel across those nodes. The processed results are then sent to the master node, where they are merged and the response is sent back to the client.

YARN performs all your processing activities by allocating resources and scheduling tasks.
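Sticking with the word-count sketch from earlier, the driver below is roughly what a client submits. Assuming the cluster is configured to run MapReduce on YARN, YARN allocates the containers and schedules the map and reduce tasks; the client only describes the job, points it at input and output paths in HDFS, and waits for it to finish.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and output live in HDFS; the job jar (the processing logic) is
    // shipped out to the nodes that hold the input blocks.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```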

What this all means, in a nutshell, is that the whole engine processing the data in a big data system runs on all of the machines in the distributed ecosystem.  The data and the logic are managed on each machine that's part of the processing ecosystem.  This makes it possible to spread the load across a very large number of machines, because you remove the bottlenecks of a single data stream, network congestion, and memory limits, which makes for highly efficient processing of large-scale data projects.  Add to this the fact that you need a central management layer, like YARN, to help manage the work across all these nodes.

Where will this run?

The challenge for a system built on the premise of adding additional processing power is that it doesn't actually remove all of the above bottlenecks that cause problems when processing a lot of data.  Most notably, where is the data supposed to go?  If there is a single stream of data that has to be sent to all of these remote CPUs, there is a very real argument that this will be slower, because you are trying to push data down a one-lane road to get it to a superhighway.  You can't really enjoy the superhighway if you are stuck in traffic on the one-lane road.

The other issue is how the data comes back.  You need a system to manage the returning data and combine it all for the end user.  Even assuming you could run this on these distributed supercomputers, how would you get the system set up and then run it?  I don't think that capability exists, but even if it does, and the structure is there, who is setting all this up?

Big Data Block is different from distributed CPU solutions like Golem or SONM.

At Big Data Block we are building something totally centered around one use case: the removal of the technical and fiscal barriers that surround big data processing.  These other projects are more generic and are trying to build something that can do a lot.  That generic approach comes with natural limitations, since the system can't be completely tuned to one specific function.  There's a place for all of us in this ecosystem and I am sure they will have great success.

At Big Data Block we are 100% focused on solving the problem of processing big data.  The whole system is laser-focused on distributing the workload, data, and analysis to all nodes on the network and making that super efficient.  We are not trying to be a catch-all for all processing. Big Data Block is a solution for those who may not have the funding or the technical know-how to process data.

Jason
