The following is a guest article from William Merchan, chief strategy officer of DataScience.com.
By 2019, 80% of Global Fortune 1000 companies will have brought IT operations and software development teams together under the umbrella of DevOps, a practice that embraces collaboration and rapid iteration and testing for faster software development.
Companies that have implemented DevOps are seeing increased business efficiency and reduced IT costs. Consequently, many are now applying similar principles to data management in a practice fittingly called DataOps.
DataOps is designed to eliminate common roadblocks to developing and deploying data-intensive applications like the predictive models built by data science teams — Netflix's content recommendation engine, for instance — with a combination of organizational changes and strategic technology adoption.
At companies utilizing DataOps, data scientists and analysts work side by side with data engineers, architects, developers and operations and IT teams. Together, these teams ensure that data is accessible and can be used across a variety of tools in a way that is valuable to the organization as a whole.
In theory, DataOps sounds great for any company that leverages large amounts of data. But in practice, implementing a DataOps model can require sweeping cultural changes and unexpected technology considerations. Below are three areas where teams often struggle to make the transition.
Collaborating throughout the entire data lifecycle
Collaboration is a core tenet of both DevOps and DataOps, but DataOps often involves many more disparate parties than its software development counterpart. That's because DataOps encompasses the entire lifecycle of data at an organization, from collection and storage to data model deployment, meaning there are plenty of opportunities for slow, clunky processes to take root.
In some cases, just getting a data science environment up and running is hard work. It's not uncommon for a data scientist to submit a formal request to IT and wait days or weeks for an environment with the right packages to be built.
And once that data scientist has finished building a model in that environment, he or she must hand the model off to a data engineer to be rewritten into a production stack language and refactored before it is tested and rolled out. This process can take months, depending on the team's production cadence.
It's easy to become entrenched in these processes and not so easy to change them. The key to doing so is adopting technology that enables self-sufficiency. DataOps-driven companies are taking the onus of model deployment off of engineering by giving data scientists the ability to deploy models as APIs.
Engineers can grab API code and place it where it's needed, whether that's in a software application or on a website. Software containers like Docker make it possible for IT to build environment templates in advance that data scientists can use to launch new environments as needed.
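To make that concrete, here is a minimal sketch of a model deployed behind an HTTP endpoint, assuming a scikit-learn model already serialized with joblib; the file name, route and payload format are illustrative rather than any particular platform's API.

```python
# Minimal sketch: expose a trained model as a prediction API with Flask.
# The model file name and the request payload shape are hypothetical.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("churn_model.joblib")  # model trained and saved elsewhere

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g. {"features": [[0.2, 1.5, 3.0]]}
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Once a model is reachable this way, the engineering team integrates an endpoint rather than a codebase.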
All of these technologies make it easier for teams to get work done together, rather than in spite of one another.
Establishing data transparency while maintaining security
It's a mistake to believe that the majority of today's data applications are lightweight or stateless. What many companies are now dealing with are models that require huge amounts of data — like deep learning models — and moving that data from place to place can be both cumbersome and costly.
DataOps promotes data locality, in which analyses use compute resources that are near the data, rather than requiring the data be moved.
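As a rough sketch of what that looks like in practice, the PySpark snippet below runs a heavy aggregation on the cluster where the data already lives and brings back only the small summary; the storage path and column names are hypothetical.

```python
# Sketch: aggregate raw events where they are stored and return only the
# summarized result, instead of copying the full dataset to a workstation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("locality-sketch").getOrCreate()

# Hypothetical location of raw event data already resident in the data lake.
events = spark.read.parquet("s3://example-data-lake/events/")

daily_counts = (
    events
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("event_count"))
)

# Only the aggregated rows leave the cluster.
summary = daily_counts.toPandas()
```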
The reason companies are increasingly grappling with heavyweight data projects is that data science work is inherently creative; it can involve many different types of data in a variety of formats that might be unexpected. To build better, more powerful models, data scientists need unprecedented access to data that, in the past, might have been treated as nonessential.
Now, companies embracing a DataOps mentality are collecting and retaining huge amounts of raw data using next-generation technology such as scalable big data platforms, as opposed to storing it in expensive relational data warehouses.
This brings about another challenge: maintaining data security. Companies are increasingly adopting fine-grained access control to allow data scientists to use the data they need without disrupting production applications.
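What fine-grained access control looks like differs across platforms, but the idea can be sketched in a few lines: each role maps to the columns it may read, and anything outside that set is refused before a query ever runs. The roles and column names below are purely illustrative.

```python
# Illustrative column-level access policy; not any particular product's API.
ACCESS_POLICY = {
    "data_scientist": {"user_id", "event_type", "event_date", "spend"},
    "analyst": {"event_type", "event_date"},
}

def authorized_columns(role, requested):
    """Return the requested columns if the role may read all of them."""
    allowed = ACCESS_POLICY.get(role, set())
    denied = set(requested) - allowed
    if denied:
        raise PermissionError(f"Role '{role}' may not read: {sorted(denied)}")
    return set(requested)

# An analyst asking only for event fields is allowed; adding "spend" would raise.
authorized_columns("analyst", {"event_type", "event_date"})
```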
Utilizing version control for data science projects
In software development, the practice of pushing new code to the master branch of a repository on a regular basis is called continuous integration. It's often paired with automated testing, and DevOps teams embrace it because it helps prevent problems like integration bugs and merge conflicts.
These problems arise when a member of the team has waited too long to commit changes and the code has drifted so far from the master branch that it won't easily integrate.
DataOps borrows this concept and applies it to data science. When a company has hundreds or thousands of data scientists working together — or separately — on many different projects, the ability to track code changes and keep code up to date is essential. In the same vein, deploying multiple versions of one model can also make testing faster and easier.
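As a hedged example of what that testing can look like, the pytest-style check below trains a small model on synthetic data and fails the build if accuracy falls below a floor; the dataset and the 0.8 threshold are placeholders, not a recommended standard.

```python
# Sketch of an automated model test that could run on every commit.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def test_model_meets_accuracy_floor():
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    assert model.score(X_test, y_test) >= 0.8
```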
Too often, data scientists work on their local machines and store code in personal files. This not only reduces reproducibility — meaning data scientists will have to do work over again that has already been done by another team member — but also slows down the process of productionizing work.
A shared repository of code can mitigate these problems before they arise.
The reality of DataOps
There are a lot of considerations to be made when implementing a DataOps model. But with the amount of data across the globe expected to hit 44 zettabytes by 2020, managing and leveraging it efficiently will invariably make or break companies.
Don't let the potential complications of embracing DataOps dissuade you. Agile, collaborative data management is a practice worth cultivating.