By Sean Fynn, StellarAlgo CTO
Approximate Read Time: 3 minutes
I talk to a lot of customers, business leaders, managers and technical individuals who are looking to make their businesses more data-driven. The discussion usually starts with something along the lines of “Everyone is talking about how advanced analytics or ‘data science’ has helped their business become more profitable through better sales, cost efficiencies, etc.” or “I have a data warehouse; how can I do more? Should I build a data lake? Should I invest in data science role(s)?”
These are all great questions. I usually start by trying to understand where they are in their data development and analytics lifecycles. Of course, everyone is at a different stage: some organizations have full-blown analytics teams with technical talent, others are thinking about investing in a data warehouse or data lake, and others still are looking to work with a strategic vendor to support their journey.
I often begin with questions like “What are your objectives?” and follow up with “Does your current data structure allow you to take action? What actions are important to you? What do you wish you could understand better before making business decisions? Is making course corrections and proactive business decisions quickly important? What does ‘quickly’ mean to you? Where can data help you maximize returns and create measurable and concrete upside?”
Once we have gone over these questions, we have a baseline of downstream use cases for a data infrastructure to support. From here we can move on to one of those original queries: Do I need a data warehouse or a data lake? My answer is that it depends on how you plan on using it. Data lakes and data warehouses are the first part of a multi-step process that could feed into your data science team or your strategic vendor.
What is the difference between a data lake and a data warehouse?
A data warehouse can be viewed as a highly structured subset of your business data. It is organized around a set of business questions that were known at the time the warehouse was designed. This usually forms the basis for canned reporting, structured ad-hoc analysis, and historical reporting. The data is highly cleansed, and security is applied by role. Making changes to a warehouse is an extensive process.
A data lake is very different from a data warehouse. Think of it as a data parking lot for all aspects of your business, where the data is not highly structured. It is designed around the subject areas that you may want data from rather than around specific business questions. It is a starting point, or a staging area, for ad-hoc reporting. Sometimes it even feeds a data warehouse. In the data lake the data is generally in its natural context and uncleansed, and adding more data sources is a relatively simple process. The data lake is often the place where your data science team members begin their journey. Generally, only advanced technical users can leverage data from a data lake.
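To make the contrast above a little more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the record fields, the `sales` table, and the idea that the "lake" is just a list of raw records. It is not a description of any particular product.

```python
import json
import sqlite3

# Hypothetical raw records as they might arrive from two source systems.
raw_events = [
    '{"source": "sales", "customer": "A. Lee ", "amount": "120.50", "items": 2}',
    '{"source": "crm", "customer": "a. lee", "email": "ALEE@EXAMPLE.COM"}',
    '{"source": "sales", "customer": "B. Cho", "amount": "80.00"}',
]

# "Lake": park every record as-is. Adding a new source is just another append,
# and nothing is cleansed or forced into a common shape.
lake = [json.loads(line) for line in raw_events]

# "Warehouse": a fixed schema designed around a known business question
# (here, revenue per customer). Only records that fit the schema are loaded,
# and they are cleansed on the way in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)")
for record in lake:
    if record["source"] == "sales":
        conn.execute(
            "INSERT INTO sales (customer, amount) VALUES (?, ?)",
            (record["customer"].strip(), float(record["amount"])),
        )

# Canned report: total revenue per customer.
revenue = dict(
    conn.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer")
)
```

The lake keeps the CRM record even though no report uses it yet, while the warehouse table only answers the question it was designed for. Supporting a new question in the warehouse would mean changing the schema and the load step.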
I don’t want to knock the data lake movement. However, just parking huge amounts of source system data in a “staging” area doesn’t mean you are ready to convert your data into actions. You will still need to scrub this data and begin to relate it in a way that works towards answering your burning questions. If you already have this built, then it can be a valuable starting point to sit your advanced analytics on top of, or to provide a single integration point for a partner or strategic analytics vendor. Why pay to re-integrate with your source systems?
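As a rough illustration of what "scrub this data and begin to relate it" can mean in practice, the sketch below cleans records from two hypothetical sources and links them by email. The field names, the sample rows, and the helper functions `clean_email` and `parse_amount` are all invented for this example.

```python
# Hypothetical raw records from two source systems parked in a lake.
crm = [
    {"email": " ALEE@Example.com ", "name": "A. Lee"},
    {"email": "bcho@example.com", "name": "B. Cho"},
    {"email": "alee@example.com", "name": "A. Lee"},  # duplicate once cleaned
]
purchases = [
    {"email": "alee@example.com", "amount": "120.50"},
    {"email": "BCHO@example.com", "amount": "80"},
    {"email": "alee@example.com", "amount": "n/a"},  # unusable amount
]


def clean_email(value):
    """Normalize an email so the same person matches across systems."""
    return value.strip().lower()


def parse_amount(value):
    """Return the amount as a float, or None if it cannot be parsed."""
    try:
        return float(value)
    except ValueError:
        return None


# Scrub: keep one customer per normalized email.
customers = {}
for row in crm:
    customers.setdefault(clean_email(row["email"]), row["name"])

# Relate: total spend per known customer, skipping unusable amounts.
spend = {email: 0.0 for email in customers}
for row in purchases:
    email = clean_email(row["email"])
    amount = parse_amount(row["amount"])
    if email in spend and amount is not None:
        spend[email] += amount
```

Real source data will need far more rules than this, but the shape of the work is the same: standardize the keys, drop or flag what cannot be trusted, and then join the sources around the question you want answered.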
It is important to note that questions like “What is the likelihood that customer X will purchase product Y next month?” or “What segment (cluster) of non-customers looks like my best current customers?” will not be answered directly from either the data warehouse or the data lake. You will need to apply more value-added processes to your data in order to build out the necessary models, such as clustering models based upon robust attributes.
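For a sense of what such a value-added step might look like, here is a toy sketch of the "which non-customers look like my best customers" question using plain k-means clustering. The two attributes, their values (assumed to be already scaled to a 0 to 1 range), the names, and the fixed starting centroids are all hypothetical. A real model would use many more attributes and a proper library.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of numbers, starting from the given centroids."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical attributes per person: (purchase frequency, engagement score).
people = {
    "customer_1": (0.9, 0.8),   # one of the best current customers
    "customer_2": (0.8, 0.9),   # one of the best current customers
    "customer_3": (0.2, 0.1),
    "prospect_1": (0.85, 0.75),
    "prospect_2": (0.1, 0.2),
}

names = list(people)
assignments, _ = kmeans(list(people.values()), [(0.0, 0.0), (1.0, 1.0)])
cluster_of = dict(zip(names, assignments))

# Prospects that land in the same cluster as a best customer are look-alikes.
best_cluster = cluster_of["customer_1"]
lookalikes = [
    name for name in names
    if name.startswith("prospect") and cluster_of[name] == best_cluster
]
```

Note that the attributes feeding the model had to be cleaned, related and scaled first, which is exactly the work that neither a warehouse nor a lake does for you on its own.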