Data vs Big Data: Know the difference

BLYDEN
Apr 29, 2022

You know you’re a nerd if you enjoy analyzing data with the R programming language. R is awesome! You can explore the data, perform data mining, transform the data, and build a model to pull out insights. This works great for MB-scale data sets, but what happens when you try the same on data sets in the TBs and, dare I say it, PBs!? It will take a very long time to process that data, because desktop PCs (even a beefed-up one like mine) are always limited in horsepower (RAM, CPU, disk space, etc.). You need a platform with an abundance of processing resources that can handle large quantities of data. Welcome to big data processing.
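Here is a minimal sketch of that small-data workflow in R; the file and column names are hypothetical, and the data is assumed to fit comfortably in memory:

```r
# Explore, transform, and model a hypothetical "sales.csv" on one PC.
sales <- read.csv("sales.csv")

str(sales)      # explore: column types and a peek at the values
summary(sales)  # explore: distributions and missing values

sales$log_revenue <- log(sales$revenue)  # transform (hypothetical column)

# model: a simple linear regression on hypothetical predictors
fit <- lm(log_revenue ~ region + units_sold, data = sales)
summary(fit)
```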

The modern way to process big data is to leverage distributed computing. Distributed computing coordinates a cluster of machines, pooling and synchronizing their resources (RAM, CPU, disk space, etc.) so they can work on a single job. For example, analyzing a data set at the PB scale requires a management service to break the job into smaller jobs and distribute those pieces across the machines in the cluster. Hadoop has long been the go-to solution for big data processing, but it is limited by the on-premises cost of the hardware, resources, and personnel required to manage the clusters. We’ll look at Google Cloud Platform (GCP) for our example. [Note: GCP is just one of many cloud big data solutions. Other popular options include Azure, AWS, IBM, and Snowflake.]
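Real distributed engines run this pattern across data centers, but the split/distribute/combine idea can be sketched on a single machine with R’s built-in parallel package. The four local workers below stand in for nodes in a cluster:

```r
# Toy single-machine analogue of distributed processing:
# break a job into chunks, hand each chunk to a worker, combine results.
library(parallel)

chunks <- split(1:1e7, cut(1:1e7, 4))  # break one big job into 4 smaller jobs

cl <- makeCluster(4)                        # 4 local workers stand in for 4 nodes
partial_sums <- parLapply(cl, chunks, sum)  # each worker processes its own chunk
stopCluster(cl)

total <- Reduce(`+`, partial_sums)  # combine the partial results into one answer
```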

ETL

Extract, Transform, and Load (ETL) is a popular pattern for analyzing data, especially big data. The main difference between data and big data is, you guessed it, the word BIG, which really changes everything. I can extract a small data set, transform it, and load it into a model or a data processing application, all from one desktop, efficiently. But what if I’m extracting PB-scale data sets from databases? Where would I store the extracted data (a data lake)? What software could transform data at that scale? And where would I load it to derive insights (a data warehouse)? For perspective, large companies can process PBs of data per day. Most people have a 1 TB hard drive in their PC; a single PB equals 1,000 of those drives, so that’s 1,000 consumer hard drives’ worth of data every day. I think you can paint the picture of how complex this problem gets.

Theoretical Example

Let’s say I’m asked to analyze a data set stored in a database, and the data set is 22 TB.

Single PC Method

My PC can’t handle processing 22 TB of data, so I have to write a SQL query against the data set to extract a smaller, more manageable sample and export the results to a CSV. My PC can only manage maybe a tenth of this data, so I have to keep that in mind. Next, I import the sample into R and see what I’m working with. Then I transform the data: creating dummy variables where needed, converting yes/no fields to binary (0s and 1s), deleting incomplete entries, and so on. After the preliminary analysis and transformation, I split the data, train a machine learning model, and then test the model. In the end, I’m hoping that 10% of the data is representative of the full 22 TB; otherwise, my analysis and models are null and void.
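A hedged sketch of that workflow is below. The table, columns, and file names are all hypothetical, and the sampling SQL is dialect-dependent:

```r
# Hypothetical sampling query run against the source database
# (dialect-dependent; e.g. in MySQL):
#   SELECT * FROM orders WHERE RAND() < 0.10;
# ...results exported to orders_sample.csv, then:

df <- read.csv("orders_sample.csv")

df <- df[complete.cases(df), ]                     # delete incomplete entries
df$returned <- ifelse(df$returned == "yes", 1, 0)  # yes/no -> binary 0s and 1s
df$region <- factor(df$region)                     # glm expands factors to dummies

set.seed(42)
train_idx <- sample(nrow(df), floor(0.8 * nrow(df)))  # 80/20 train/test split
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Train a logistic regression model, then test it on held-out data.
model <- glm(returned ~ region + order_value, data = train, family = binomial)
pred  <- predict(model, newdata = test, type = "response")
mean((pred > 0.5) == test$returned)                # rough test accuracy
```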

Big Data Method

I extract the entire 22 TB into GCP’s data lake service, Google Cloud Storage, where I create a bucket to hold the data set. From there, I open Cloud Data Fusion, a fully managed service that streamlines ETL and data pipeline work. From Cloud Data Fusion, I connect to my bucket and perform transformations in a tool called Wrangler. Once the transformations are finished, I load the data into BigQuery, GCP’s data warehouse, where I can perform further analysis and create ML models.
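The bookends of that pipeline can even be driven from R, as a sketch assuming the community googleCloudStorageR and bigrquery packages; the key file, project, bucket, and table names are hypothetical, and the Wrangler transformations happen in the GCP console between the two steps:

```r
library(googleCloudStorageR)
library(bigrquery)

gcs_auth("service-account-key.json")      # authenticate to Google Cloud Storage
gcs_upload("export_part_001.csv",
           bucket = "my-dataset-bucket")  # land the raw extract in the data lake

# ...transform in Cloud Data Fusion's Wrangler, load into BigQuery...

tb <- bq_project_query(
  "my-gcp-project",
  "SELECT region, AVG(order_value) AS avg_value
     FROM analytics.orders_clean
    GROUP BY region"
)                               # BigQuery does the heavy lifting server-side
head(bq_table_download(tb))     # only the small result set comes back to R
```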

Conclusion

People keep telling me that data is the new oil :\ (very cliché). As leaders, we must understand and leverage our data for the benefit of our organizations. But first, we must understand the difference between data and big data, and what systems are available to draw insights and intelligence from the vast amounts of data we hold.


BLYDEN

Project Engineer specializing in design, development, acquisition, data analysis, cyber security, and production.