Authored by: Ravi Shankar, Senior Vice President at Denodo
Organisations are fast realising the potential of data science and the opportunities it offers, especially given the rapid recent advances in artificial intelligence. Data science is also increasingly business-driven, as organisations use it to gain customer and market insights and to make informed decisions that affect the bottom line.
Every data science project takes place within a data science lifecycle with defined steps. Although most data science projects tend to flow through a similar lifecycle, every project team is different, so every data science lifecycle differs slightly.
Interestingly, many of the stages in a typical data science lifecycle have more to do with data than science. Even before data scientists can engage in science, they have to take several data-related steps:
Determine where the right data is located.
Access the data they need, which requires an understanding of the bureaucracy of the organisation in terms of ownership, credentials, access methods, and access technologies.
Transform the data into a format that is easy or suitable to use.
Combine that data with other data from other sources, bearing in mind that the other data may be formatted differently.
Profile and cleanse the data to eliminate incomplete or inconsistent data points.
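The transform, combine, and cleanse steps above can be sketched in a few lines of code. This is a minimal illustration, not a real pipeline; the source names, field names, and sample records are all invented:

```python
import json

# Hypothetical data from two sources in different formats.
crm_rows = [
    {"customer_id": "C1", "full_name": "Ada Lovelace", "country": "UK"},
    {"customer_id": "C2", "full_name": "Alan Turing", "country": None},  # incomplete
]
web_events_json = '[{"cust": "C1", "clicks": 42}, {"cust": "C3", "clicks": 7}]'

# Transform: normalise the web events to the same key names as the CRM data.
web_rows = [
    {"customer_id": e["cust"], "clicks": e["clicks"]}
    for e in json.loads(web_events_json)
]

# Combine: join web activity onto CRM records by customer_id.
clicks_by_id = {r["customer_id"]: r["clicks"] for r in web_rows}
combined = [
    {**r, "clicks": clicks_by_id.get(r["customer_id"], 0)} for r in crm_rows
]

# Profile and cleanse: drop records with missing values.
clean = [r for r in combined if all(v is not None for v in r.values())]
```

Even in this toy example, most of the effort goes into reconciling formats and eliminating bad records before any science can happen.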
The fact is, most data science projects fail to deliver business value, and many never make it into production. This is largely due to the high diversity of data types that come from a wide variety of sources. Add large data volumes, and data scientists face an incredibly complex task. Providing access to all the enterprise data – as well as the ability to flexibly model it – is crucial to the success of a data science project.
Overcoming Obstacles with Data Virtualisation
We need to efficiently bridge the gap between data and data scientist, and data virtualisation is one modern data integration and data management technology that can do that. Data virtualisation provides data scientists with an integrated, real-time view of the data, across its existing locations, without having to move the data itself into a centralised repository, such as a data lake.
This is possible because data virtualisation forms a data layer over the different data sources. This layer contains only the metadata necessary to access the different data sources, but no actual data. Data virtualisation accelerates data access for data scientists and effectively overcomes the key obstacles in the data science lifecycle.
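The idea of a metadata-only layer can be made concrete with a toy sketch. The class below stores only metadata about each source (a logical name, a location label, and a way to fetch the data); data is retrieved from the source at query time rather than copied into a central repository. All names and fetch functions here are invented for illustration and bear no relation to any particular product's API:

```python
def fetch_from_warehouse():
    # Stand-in for a real connector to a cloud data warehouse.
    return [{"order_id": 1, "amount": 120.0}]

def fetch_from_api():
    # Stand-in for a real connector to a REST service.
    return [{"order_id": 2, "amount": 75.5}]

class VirtualLayer:
    def __init__(self):
        # Metadata only: logical name -> (location label, fetch callable).
        # No actual data is held in the layer itself.
        self.catalog = {}

    def register(self, name, location, fetcher):
        self.catalog[name] = (location, fetcher)

    def query(self, name):
        # Data is pulled from the underlying source at query time.
        location, fetcher = self.catalog[name]
        return fetcher()

layer = VirtualLayer()
layer.register("orders_dw", "cloud warehouse", fetch_from_warehouse)
layer.register("orders_api", "REST service", fetch_from_api)

# A unified, real-time view spanning both sources:
all_orders = layer.query("orders_dw") + layer.query("orders_api")
```

The data scientist sees one logical catalogue; where each table physically lives is a detail hidden behind the layer.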
The following is a breakdown of how data virtualisation is able to provide data scientists with real-time access to the data they need, regardless of its format and location, in a typical data science workflow:
Identifying Useful Data: Data virtualisation provides data scientists with a single unified interface for accessing all types of data, including data residing in data lakes, Presto or Spark systems, social media, or even flat and/or JSON files. Some data virtualisation solutions also offer data catalogues, which enable data scientists to discover data using Google-like search functionality.
Modifying Data into a Useful Format: Some data virtualisation solutions also provide administrative tools that enable data scientists to document data sets for future reference and even share them with other data scientists. Data scientists can use their own notebooks, such as Jupyter, for such operations, or leverage the notebooks included in some data virtualisation solutions. The latter offer highly integrated user interfaces and advanced features such as automatically generated recommendations, driven by artificial intelligence/machine learning (AI/ML) and based on past usage and behaviour.
Analysing Data: With data virtualisation, a data scientist can conduct analysis by executing queries on the data at any point in the workflow, whether identifying useful data or modifying it into different formats.
Preparing and Executing Data Science Algorithms: Advanced data virtualisation solutions provide query optimisers that streamline query performance through a variety of optimisations such as maximising the push-down of processes to the sources. Optimisers may push down only a part of the operation, depending on the best expected results.
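Filter push-down, one common optimisation of this kind, can be illustrated with a small sketch. Rather than pulling every row across the network and filtering in the virtualisation layer, the optimiser rewrites the query so the source database applies the predicate itself. The table, columns, and data below are invented, and SQLite merely stands in for a remote source:

```python
import sqlite3

# A toy "source system" with some sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EMEA", 100.0), ("APAC", 50.0), ("EMEA", 25.0)],
)

def query_without_pushdown(conn, region):
    # Naive plan: fetch everything, then filter in the virtualisation layer.
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    return [r for r in rows if r[0] == region]

def query_with_pushdown(conn, region):
    # Optimised plan: the predicate travels to the source, so only
    # matching rows ever leave it.
    return conn.execute(
        "SELECT region, amount FROM sales WHERE region = ?", (region,)
    ).fetchall()

# Both plans return the same result; only the work distribution differs.
assert query_without_pushdown(conn, "EMEA") == query_with_pushdown(conn, "EMEA")
```

At enterprise data volumes, the difference between these two plans is the difference between moving a handful of rows and moving an entire table.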
Sharing Results with Business Users: Using the data catalogue included in a data virtualisation implementation, data scientists can share their queries and results with other team members, for a more collaborative, iterative workflow. They can also gather feedback from their team at any point in the workflow.
Furthermore, data virtualisation offers different ways for data scientists to share information with business users when the results are ready. For instance, they can publish the data from the data virtualisation solution directly to a specific application like MicroStrategy, Power BI, or Tableau. Users of these tools can connect to the data virtualisation layer and see the results directly using their tool of choice.
Data Virtualisation and the Data Science Lifecycle
Data virtualisation can be strategically deployed at critical phases of the data science lifecycle to accelerate processes and eliminate bottlenecks in data science initiatives. The technology can offer data scientists real-time access to disparate sources of data, help streamline the preparation and analysis process, and, finally, enable easier sharing of results with the wider team.