[Dec 2022] Core Challenges in MLOps
MLOps involves many moving parts across business, data, code, and model engineering at each phase of the machine learning lifecycle, introducing questions that are unique to ML.
Over the past few years, the adoption of cloud engineering, a better understanding of big data, and the massive popularity of open-source libraries for predictive modeling have made it easy for everyone to be enticed by the possibility of generating user insights or personalizing their “software 1.0” by hiring a team of data scientists.
However, to most companies’ constant surprise, data scientists alone are far from adequately equipped to handle end-to-end deployment, or to monitor and debug these models once they are in production.
It truly takes an army, some of it in-house (data engineers, ML engineers) and some outsourced to A.I. tooling/SaaS companies. While every company’s data team is substantially different, some core challenges are common to everyone.
For a machine learning model to be considered successful, it must be able to generate stakeholder buy-in. It therefore becomes incredibly important to tie models to business KPIs instead of model metrics (F1, recall, precision, ROC AUC). However, with business KPIs changing every so often across different stages, measuring model performance becomes incredibly hard.
For any business to build powerful and reliable ML models, investing effort into creating and maintaining a data catalog is crucial: it tracks metadata and, while debugging, lets you retrieve which data source a model was trained on. Building a data catalog may not seem like a hard task, but the real challenge is building relevancy into data discovery. This is often where a lot of companies give up.

If you instead opt for a commercial solution, most out-of-the-box data-cataloging products do not adapt well to different organizations’ data needs and cost several kidneys and more. Requesting a feature-add can put you on nothing short of an eight-to-ten-month waitlist, optimistically speaking, if the requested feature is even aligned with their product plan.

The final option, building an in-house solution, requires upfront investment and a team with an excellent understanding of user-friendly database design practices, making it a time- and resource-consuming process. To make it even harder, there is a lack of documentation around best practices for creating, managing, and scaling an in-house data-cataloging tool, and around evaluation/compliance metrics, so you can end up with an incomplete catalog, especially with new live data being streamed into the system, making the effort futile at best.
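To make the metadata-tracking part concrete, here is a minimal sketch of what an in-house catalog entry might look like. Everything here (the `CatalogEntry` fields, the `lineage` helper, the run-log shape) is illustrative, not a real tool’s API; a production catalog would sit in a database, not a dict.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CatalogEntry:
    """Minimal metadata record for one dataset in an in-house catalog."""
    name: str
    source: str                              # e.g. upstream table or bucket path
    owner: str
    tags: list = field(default_factory=list) # what relevancy/discovery hangs off of
    registered_at: str = ""

catalog = {}  # name -> CatalogEntry; a real catalog would be a database

def register(entry: CatalogEntry) -> None:
    entry.registered_at = datetime.now(timezone.utc).isoformat()
    catalog[entry.name] = entry

def lineage(training_runs: dict, model_id: str) -> str:
    """While debugging: which data source was this model trained on?"""
    dataset = training_runs[model_id]   # run log maps model -> dataset name
    return catalog[dataset].source
```

Even this toy version shows why the hard part is discovery, not storage: `tags` only help if someone curates them, and the `training_runs` log has to be written at training time or the lineage question becomes unanswerable.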
Your machine-learning model is only as good as your data. For any data science project to be successful, data quality and, more importantly, labeled data quantity are the biggest defining factors. However, best practices on data evaluation, i.e. how to standardize and normalize new incoming data, are still case-by-case considerations. Most training environments need to come pre-loaded with a few checks and balances based on the different stages of model deployment. For example, for a model being tested for production, has a random seed been set so that the data is divided the same way every time the code is run?
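The random-seed check is one of the cheapest to automate. A minimal sketch, using only the standard library (in practice you’d pass `random_state` to something like scikit-learn’s `train_test_split`):

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle-and-split with a fixed seed so the split is reproducible."""
    shuffled = rows[:]                      # copy; leave the caller's list intact
    random.Random(seed).shuffle(shuffled)   # seeded RNG -> same order every run
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train_a, test_a = train_test_split(data)
train_b, test_b = train_test_split(data)
assert train_a == train_b and test_a == test_b  # identical split on every run
```

Without the fixed seed, two runs of the same code quietly evaluate on different test rows, and any metric comparison between them is meaningless.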
While there are many advantages to using commercial feature stores, they can also introduce inflexibility and limit the customization of models, and sometimes you simply don’t need them (more on this in next month’s post). This inspires many to go with open-source solutions and build their own on top of, say, Feast or DVC. While batch features may be easier to maintain, real-time features are sometimes inescapable for several reasons. Real-time features introduce a lot of complexity into the system, especially around back-filling real-time data from streaming sources with data-sharing restrictions. This requires not only technical but also process controls that are often not talked about. Recently, there has been more discussion around Data Contracts; however, they are not yet a commonly accepted practice across organizations.
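To show what the technical half of such a control could look like, here is a toy data-contract check on incoming records. The contract format (column name mapped to an expected type and nullability) is an assumption for illustration; real Data Contract tooling typically enforces this at the schema-registry or pipeline boundary rather than in application code.

```python
# Hypothetical contract: column name -> (expected type, nullable?)
CONTRACT = {
    "user_id":  (int,   False),
    "event_ts": (str,   False),
    "amount":   (float, True),
}

def contract_violations(record: dict) -> list:
    """Return a list of violations for one incoming record (empty list = OK)."""
    problems = []
    for col, (typ, nullable) in CONTRACT.items():
        if col not in record:
            problems.append(f"missing column: {col}")
        elif record[col] is None:
            if not nullable:
                problems.append(f"null in non-nullable column: {col}")
        elif not isinstance(record[col], typ):
            problems.append(f"wrong type for {col}: {type(record[col]).__name__}")
    return problems
```

The process-control half — who owns the contract, who gets paged when a producer breaks it, how back-fills are approved — is exactly the part that code cannot enforce, which is why it so rarely gets discussed.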
There is a lack of well-defined best practices around model version control and project structures at the different stages from exploration to deployment. Cookiecutter is one of the efforts toward developing a unified project structure for cross-team collaboration.
Undefined or poorly defined prerequisites for when to push a model to production can create unnecessary bugs and introduce delays during monitoring and debugging.
Code reviews: how much time should be spent on code review at different stages, especially given that model behavior may not accurately represent live training data, and how frequently should reviews happen? Different companies currently have different systems for this. While some prefer one-off deployment, others have more staged deployments, e.g. test, dev, staging, shadow, and A/B for business-critical pipelines, each with its own review stages and guidelines. However, even the end-to-end tools have no built-in support for this. What makes good-quality production code is, as of now, very much institutional knowledge.
While it is clear to everyone that test-driven development is critical to catch minor errors early in the deployment stage, how much time and effort should be invested in it, given that large samples of data can only be gathered once the model is deployed in production?
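Even before production data exists, some tests are nearly free. A sketch of the kind of pre-deployment checks I mean, written around a hypothetical `predict_proba` stand-in (the real model would be loaded from an artifact):

```python
def predict_proba(features):
    """Stand-in for the real model; assumed to return a probability."""
    score = 0.1 * sum(features)
    return min(max(score, 0.0), 1.0)   # clamp into [0, 1]

def test_output_is_probability():
    # Invariant tests need no labeled data at all.
    for features in ([0, 0], [1, 2, 3], [100.0]):
        p = predict_proba(features)
        assert 0.0 <= p <= 1.0

def test_deterministic():
    # Same input must give the same output -- no hidden randomness at inference.
    assert predict_proba([1, 2]) == predict_proba([1, 2])
```

Invariant and determinism tests like these don’t answer the data-volume question, but they bound it: whatever can be verified without live data should be, so production samples are spent only on what truly needs them.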
Should we use static validation sets to test models in production, which can introduce bias, or dynamic validation sets, which more closely resemble live data and address localized shifts in the data?
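The simplest dynamic variant is a rolling time window: always validate on the most recent samples. A minimal sketch (the `ts`/`window` names and the dict-shaped samples are assumptions for illustration):

```python
def dynamic_validation_split(samples, window=1000):
    """Hold out the most recent `window` samples as validation,
    so the validation set tracks live data instead of a frozen snapshot."""
    samples = sorted(samples, key=lambda s: s["ts"])   # order by event time
    return samples[:-window], samples[-window:]

# Synthetic event stream: 5000 timestamped samples.
data = [{"ts": t, "x": t % 7} for t in range(5000)]
train, val = dynamic_validation_split(data, window=1000)
```

The trade-off shows up immediately: because the window moves, two evaluation runs a week apart score the model on different data, so metrics are no longer directly comparable across time — the mirror image of the static set’s staleness bias.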
Should we use model registries, or change only config files instead of the model, making it easier to debug? For the former, once a model passes all validation checks, it is usually pushed to a model registry.
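For readers who haven’t used one: the registry pattern boils down to “validation gates promotion, and every promotion gets an immutable version.” A toy in-memory sketch (real registries like MLflow’s persist artifacts and stage transitions; everything named here is illustrative):

```python
registry = {}  # model name -> list of promoted versions

def register_model(name, artifact_path, metrics, checks_passed: bool) -> int:
    """Promote a model into the registry only if validation passed.
    Returns the new version number."""
    if not checks_passed:
        raise ValueError("model failed validation; not registered")
    versions = registry.setdefault(name, [])
    versions.append({
        "version": len(versions) + 1,     # monotonically increasing, never reused
        "artifact": artifact_path,
        "metrics": metrics,
    })
    return versions[-1]["version"]
```

The debugging appeal of the config-only alternative is that a rollback is a one-line diff; the registry’s appeal is that the version history itself answers “what exactly was serving last Tuesday?”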
Have clearly defined rule-based tests to make sure model outputs are not incorrect, while factoring in when it is okay to give an incorrect output (e.g. shopping recommendations) versus when it is better to give no output at all (e.g. cancer prescriptions).
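That distinction can be encoded as a guardrail around the model’s output. A hedged sketch — the `high_stakes` flag and the confidence threshold are placeholders for whatever your domain’s real policy is:

```python
def guarded_output(prediction, confidence: float, high_stakes: bool,
                   threshold: float = 0.9):
    """Return the prediction, or None (abstain) when the domain
    cannot tolerate a wrong answer."""
    if confidence >= threshold:
        return prediction
    # Low confidence: a recommender can still show its best guess,
    # but a clinical system should return nothing and defer to a human.
    return None if high_stakes else prediction
```

The hard part is not this function; it is agreeing, per product, on which side of the `high_stakes` line each output sits and what the abstain path does downstream.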
Best practices around code quality, and the need for similar deployment environments. While most data scientists prefer working in Jupyter notebooks, the way code is usually written in notebooks (copy-paste instead of reusable functions) can introduce unnecessary bugs and technical debt, affecting the model as well as the integration code once the notebook owner leaves the team.
While experiment-tracking tools and dashboards have added quite some observability to model runs, contextual changes still remain largely undocumented.
While sandbox tools for stress-testing can be quite useful in some scenarios, in others, e.g. recommender systems, they may not generate any useful information whatsoever.
Deciding which alerts are critical and require a quick migration to a failsafe model (e.g. hate speech, racial or gender bias) and which are mere information to be factored into the next model-configuration phase still requires close human monitoring.
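The routing itself is trivial to write down; what humans must keep deciding is the contents of the critical set. A sketch under that assumption (alert names, model identifiers, and actions are all hypothetical):

```python
# The hard, human-maintained part: which alert types are never tolerable.
CRITICAL = {"hate_speech", "racial_bias", "gender_bias"}

def route_alert(alert_type: str, active_model: str, failsafe_model: str):
    """Critical alerts trigger an immediate switch to the failsafe model;
    everything else is queued for the next model-configuration phase."""
    if alert_type in CRITICAL:
        return failsafe_model, "page_oncall"
    return active_model, "log_for_next_release"
```

Note that the failsafe here is a separate, pre-vetted model kept warm — the switch is only fast if that model is already deployed and its own alerts are quiet.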
This is part of the post MLOps isn’t DevOps for ML for Advent of Data 2022 by Christophe aka blef. You can give him a follow on Mastodon here.
Thanks for reading Data Driven Babe by Abi! Subscribe for free to receive new posts. I send out only one per month.
What papers are you currently reading this month? Would you like for me to share the list with you?
Found it interesting? Useful? This post is publicly available, so feel free to share it with your network. I choose not to be active on LinkedIn or Twitter, but you can also connect with me on Mastodon (@firstname.lastname@example.org)