When “BIG” data knocks on your door…

Anyone who follows what’s happening in the world of technology will have come across the term “Big Data”, and yes, as one would correctly assume, it means a huge amount of data: data that arrives in ever-increasing volume, at higher and higher velocity, and with a greater variety of information (the three V’s).

Three Vs of Big Data
To put it simply, let’s say big data is just a huge volume of data that needs some extra attention and special care while being stored and processed. If not, it will eat your time and, eventually, your money.

Why Big Data?

This is an important question, simple yet difficult. To answer it, we have to understand what we can derive from big data and where it comes from.

Let’s take a simple example of Big Data in healthcare.

There are millions of patients around the globe, suffering from anything between a simple flu and deadly cancer. Suppose we take only 10,000 ICU patients and collect only their ECG, EEG, PPG, heart rate, temperature, blood pressure, SpO2 and other vital parameters at their respective sampling frequencies. For each patient, we would generate a file (or a couple of files) of about 10 MB per hour.

1 hour — 10 MB
24 hours — 240 MB
5 days — 1.2 GB (assuming we collect only 5 days of data)

If this is only from one patient,
imagine what it would be for 10,000 patients? — 12,000 GB, or 12 TB.
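The arithmetic above can be sketched as a quick back-of-the-envelope calculation (using the same figures as the example, with a simple 1000× conversion between units):

```python
# Back-of-the-envelope estimate of the ICU data volume described above.
MB_PER_HOUR = 10        # vitals data generated per patient per hour
HOURS_PER_DAY = 24
DAYS_COLLECTED = 5
PATIENTS = 10_000

per_patient_mb = MB_PER_HOUR * HOURS_PER_DAY * DAYS_COLLECTED  # 1200 MB = 1.2 GB
total_gb = per_patient_mb * PATIENTS // 1000                   # 12000 GB
total_tb = total_gb // 1000                                    # 12 TB

print(f"{per_patient_mb} MB per patient, {total_gb} GB ({total_tb} TB) total")
```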

So we are able to collect this huge patient data and store it as well, but now what? 

We need to use this data to derive some “value”, something meaningful. Collecting and storing 12 TB of patient data makes no sense unless we process it and extract the meaning hidden inside. How efficiently we can process this data within a given time frame is a whole new interesting topic, and for now, let’s not go into that. In most cases, what we derive from big data is a pattern, the reason being that, in most cases, big data represents a population.

In this example, suppose we find a pattern in how patients went through different critical stages and what their survival rates were; then we can predict, to some extent, whether the 10,001st patient will survive, right? And yes, you guessed it… eventually, we reach “machine learning”. This is in fact the final step in processing big data in most cases. But not always.
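As a toy sketch of that idea, the snippet below derives a survival rate per critical stage from historical records and then uses it as a crude “prediction” for a new patient. The records and the stage names are made up purely for illustration; a real system would use proper machine-learning models on far richer features.

```python
# Derive a survival pattern per critical stage from (made-up) records,
# then look it up for a new patient.
from collections import defaultdict

# Each record: (deepest critical stage reached, did the patient survive?)
records = [
    ("stage_1", True), ("stage_1", True), ("stage_1", False),
    ("stage_2", True), ("stage_2", False), ("stage_2", False),
    ("stage_3", False), ("stage_3", False), ("stage_3", True),
]

counts = defaultdict(lambda: [0, 0])   # stage -> [survived, total]
for stage, survived in records:
    counts[stage][1] += 1
    if survived:
        counts[stage][0] += 1

def survival_rate(stage):
    survived, total = counts[stage]
    return survived / total

# Crude "prediction" for a new patient who reached stage_2:
print(f"stage_2 survival rate: {survival_rate('stage_2'):.2f}")
```

With more patients and more varied data, the same counting idea generalises into the statistical models machine learning is built on.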

In some other cases, say a mobile app, the data will be user taps and scrolls, which, when processed, become user patterns. Those patterns will eventually generate revenue if we deliver what users like (and avoid what they dislike), combined with well-targeted marketing.

Understanding and deriving patterns needs a considerable sample size. Moreover, we cannot derive a pattern from a single type of data; we need a variety of data. The more data we churn through, the more revenue we can generate, but only if we are able to properly extract meaningful insights from that data.

How huge should the data be to be termed Big Data?

Well, it depends on the context. Usually, data measured in gigabytes, terabytes, petabytes, exabytes or anything larger is considered Big Data. But even a small amount of data can become big data if the context calls for it. To be clear, a 200 MB email attachment is big data compared to normal emails, which are usually a few KB in size, and 10 TB of images is big data when you try to process them on a desktop computer.

Challenges associated with big data

Mainly two challenges are associated with big data:

1. Storing, managing and retrieving the data
2. Processing the data

Two simple challenges, but very difficult to tackle, because we are talking about petabytes of data. What if we query the database for a chunk of data and have to wait a couple of minutes for the results? A really bad idea!
And if, after we receive the data and start processing, it takes a couple of days to derive some insights? A really, really bad idea!!

For the first challenge, we can use frameworks like Hadoop for the management and retrieval of big data; on the processing side, of course, efficient programming backed by powerful processors is the only hope. Even while writing a single line of code, we should keep a keen eye on time and space complexity, because the next day we may encounter another petabyte of data!
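A small example of that “mind the complexity” habit: computing an average vital sign from a huge CSV by streaming it line by line, in O(1) memory, instead of loading the whole file at once. The file name and column layout (timestamp, patient_id, heart_rate) are hypothetical.

```python
# Stream a (potentially huge) CSV of vitals and compute a running average,
# holding only two numbers in memory instead of the whole file.
import csv
import io

def mean_heart_rate(lines):
    total, count = 0.0, 0
    for row in csv.DictReader(lines):
        total += float(row["heart_rate"])
        count += 1
    return total / count if count else None

# Usage with an in-memory sample; for real data, pass open("vitals.csv").
sample = io.StringIO(
    "timestamp,patient_id,heart_rate\n"
    "0,p1,72\n"
    "1,p1,75\n"
    "2,p2,90\n"
)
print(mean_heart_rate(sample))  # 79.0
```

The same pattern scales from kilobytes to terabytes, which is exactly the point: the memory cost stays constant no matter how big the input grows.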
