Data Pipelines

Nate Tsegaw
3 min read · Jun 15, 2021


Throughout the years, as the needs of data scientists grew, tools and workflows developed in tandem to meet ever-increasing demands. Aptly named the data pipeline, this method of processing data has proven to be a staple in the data scientist’s tool-belt.

Image source: https://sarasanalytics.com/blog/what-is-a-data-pipeline

Analogous to an actual pipeline or factory assembly line, data pipelines were created to increase the speed and consistency with which a good is delivered through a pre-configured route (or manner of processing).

So why would one use a pipeline? Simply put, to reduce tedium. Imagine you were given the task of finding, in real time, the most common hashtag on Twitter, and you are equipped with the necessary tools. You are only interested in the hashtag(s) in each tweet, so you want to discard all the extraneous information that would lengthen processing time and take up valuable storage. As tweets constantly come in, you do not want to manually clean them each time you begin your analysis. You want your data ready to analyze as it arrives, through a uniform process. Therefore, you would use a data pipeline, whether you run it locally or outsource it.
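A minimal sketch of that idea in Python, using only the standard library. The tweet stream here is hypothetical sample data, and the regex for hashtags is a simplification — but it shows the core move: each tweet passes through one uniform step that strips it down to just the hashtags and folds them into a running tally.

```python
import re
from collections import Counter

def extract_hashtags(tweet_text):
    """Keep only the hashtags from a raw tweet, discarding everything else."""
    return re.findall(r"#\w+", tweet_text)

def update_counts(counts, tweet_text):
    """One pipeline step: extract hashtags and add them to a running tally."""
    counts.update(tag.lower() for tag in extract_hashtags(tweet_text))
    return counts

# Hypothetical incoming stream of tweets
stream = [
    "Loving this weather #sunny #summer",
    "Back to work #monday",
    "Beach day! #sunny",
]

counts = Counter()
for tweet in stream:
    update_counts(counts, tweet)

print(counts.most_common(1))  # the most common hashtag so far
```

Because every tweet goes through the same function, the analysis-ready tally is always up to date — no manual re-processing when you sit down to analyze.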

Let’s go one step further. You want to find the most common hashtag across several social media platforms that are not centralized. You will have a tough time transforming and unifying the data, since each platform may format it differently.

Data pipelines, like data science itself, are conceptual: rather than being defined by a single method of use, they encompass many methods. The most common one you will come across is the practice of extracting, transforming, and loading your data, otherwise known as ETL. As the data you gather grows in scope and diversity of source, you will want to avoid hand-coding as much as possible to reduce resource costs. To meet this need, many companies build and sell ETL tools to data scientists.
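To make the ETL pattern concrete, here is a toy sketch. Everything in it is hypothetical — the two platform record shapes, their field names, and the in-memory "warehouse" stand in for real sources and a real destination — but the three stages are the real structure: extract pulls raw records, transform normalizes them to one shape, and load writes the cleaned rows out.

```python
import re

# Hypothetical records from two platforms that format their data differently
tweets = [{"text": "Great game tonight #sports"}]
posts = [{"caption": "Sunset vibes #sunset #photography"}]

def extract():
    """Extract: pull raw records from each source, tagged with where they came from."""
    yield from (("twitter", t["text"]) for t in tweets)
    yield from (("instagram", p["caption"]) for p in posts)

def transform(records):
    """Transform: normalize every record to the same shape -- a list of hashtags."""
    for source, text in records:
        yield {"source": source, "hashtags": re.findall(r"#\w+", text)}

def load(rows, store):
    """Load: append the cleaned, uniform rows to a shared destination."""
    store.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```

After the run, every row in `warehouse` has the same shape regardless of which platform it came from — which is exactly the "same page" the next paragraph describes.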

In essence, your goal with ETL is to get all your data “on the same page”. And it is just one subsection of data pipelines.

When creating and/or using your own pipeline, remember to keep the following questions in mind:

  • What does the end-user need to do with the data?
  • How can you maximize efficiency?
  • What are the potential downstream effects of your data pipeline?
SET DOWNSTREAM USERS UP FOR SUCCESS!
