Cleaning Up the Data Stream With Data Observability

For these two Seattle teams, data observability is crucial to avoiding downtime and keeping tech stacks functioning at full capacity.

Written by Tyler Holmes
Published on Nov. 15, 2021

Picture this: A clear, glistening stream flows steadily through a meadow and directly into a watermill. Suddenly, a powerful storm blows in and knocks over multiple trees growing along the stream’s bank. Chunks of earth, debris and branches wash downstream and into the watermill, clogging the typically reliable source of power and causing an unknown amount of downtime until the system is cleared.

Sound familiar? That’s because it’s a problem data scientists and engineers face on a regular basis — except the stream is the flow of data powering complicated tech stacks that keep organizations functioning smoothly. Without a dedicated overseer to monitor, track and alleviate any data obstructions, downtime becomes far more likely, frustrating data teams, entire organizations and their users alike.

Enter data observability.

By embracing automated methods of data monitoring and system alerts when an issue arises, teams are much more likely to quell data flow errors while simultaneously building reliability into their processes. But is it enough for the most impactful data observability? According to Andrea Gagliano, head of data science, AI and machine learning at Getty Images: no. An additional step is crucial to downtime prevention.

“The key lies within clear data ownership,” Gagliano said. “You can have endless tools and monitoring systems in place, but if it isn’t clear who on your team owns which data, monitoring tools won’t get you very far.”

Built In Seattle caught up with Gagliano and Knock Director of Engineering Kishan Gnanamani to learn more about their data observability best practices, the biggest data challenges they’ve faced recently and the tools helping them overcome any unexpected obstacles that may arise.

 

Andrea Gagliano
Head of Data Science, AI and Machine Learning • Getty Images

 

What is one of the most critical best practices your team follows when it comes to data observability, and why?

At Getty Images, we make a point to assign ownership for each database at every phase of the pipeline, which allows us to protect the health of our analytics while also ensuring that our data is as accurate and as up to date as possible.

 

What are some of the tools your team is using to streamline and automate data observability and monitoring? And what made you decide to use these tools over other options on the market?

We predominantly rely on a combination of Splunk, a data platform used for monitoring and searching through big datasets, and homegrown systems to streamline our data monitoring. Splunk allows us to quickly monitor quality checks and call counts, and it makes it easy to generate reports and alerts. Our homegrown systems, on the other hand, are customized for monitoring our search algorithms. This is invaluable to our team, given how critical the overall search experience is to our customers.

Looking ahead, we are exploring additional monitoring systems to add to our toolbox, such as WhyLabs, an observability platform that could help us better monitor distribution shifts in the data that our AI models rely on.

We make a point to assign ownership for each database at every phase of the pipeline.”

 

What’s the biggest challenge your team has faced with data observability? What have you done to overcome this challenge?

One of the biggest challenges for our team is detecting gradual shifts in data over time. These shifts can be hard to spot, especially when it comes to visual and language data — both essential to our work at Getty Images. But it’s crucial that we keep close tabs on gradual changes in datasets, as they can ultimately impact AI models that rely on data for training down the line.

To overcome this challenge, our team leverages the custom systems I mentioned earlier to monitor our search algorithms and training data on an ongoing basis. Fortunately, we also have the ability to review a breakdown of people across our training datasets — spanning age, gender and ethnicity — which helps us detect changes in our data before biases emerge in our AI models. We’re able to do this because we require model releases for the creative content we license, which allows us, in certain regions, to include self-identified information in our metadata. This further enables our AI team to automatically search across millions of images and quickly identify data shifts and skews.
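Gagliano doesn’t detail Getty’s internal tooling, but the kind of gradual-shift check described above can be sketched with a population stability index (PSI) over category shares. The snippet below is an illustrative, stdlib-only Python sketch, not Getty’s actual system; the age brackets and sample data are hypothetical stand-ins for the demographic metadata mentioned.

```python
import math
from collections import Counter

def category_shares(labels):
    """Normalize a list of category labels into a share-per-category dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def population_stability_index(baseline, current, floor=1e-6):
    """Population Stability Index between two categorical distributions.

    PSI = sum((cur - base) * ln(cur / base)) over all categories.
    A common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 major shift worth investigating before retraining a model.
    """
    categories = set(baseline) | set(current)
    psi = 0.0
    for c in categories:
        b = max(baseline.get(c, 0.0), floor)    # floor avoids log(0)
        cur = max(current.get(c, 0.0), floor)
        psi += (cur - b) * math.log(cur / b)
    return psi

# Hypothetical age-bracket breakdowns from two training-data snapshots.
last_quarter = category_shares(["18-29"] * 40 + ["30-49"] * 40 + ["50+"] * 20)
this_quarter = category_shares(["18-29"] * 55 + ["30-49"] * 35 + ["50+"] * 10)

print(round(population_stability_index(last_quarter, this_quarter), 3))
```

Running a check like this on a schedule, and alerting when the PSI crosses a threshold, is one simple way to catch slow demographic drift before it shows up as model bias.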

 

A group of Knock employees during a team outing.
KNOCK

 

Kishan Gnanamani
Director of Engineering — Data Platform & Analytics • Knock

 

What is one of the most critical best practices your team follows when it comes to data observability, and why?

We stream business metrics-related data, in addition to application performance monitoring data and application logs, into our data observability platform. We also have Slack and email integrations in place to alert teams to any anomalies or errors. Being a product development company that supports more than a million multifamily units on our platform, it is super critical for us to catch business metric anomalies, in addition to application errors and issues, before they impact end customers.

 

What are some of the tools your team is using to streamline and automate data observability and monitoring? And what made you decide to use these tools over other options on the market?

We use New Relic as our data observability and monitoring platform. New Relic is a SaaS offering that uses the standardized Apdex (Application Performance Index) score to set targets for and rate application performance. Features New Relic offers, like log management, error tracking, monitoring capabilities and the ability to integrate with public cloud solutions, along with its pricing model, made us decide to go with the platform.
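Apdex itself is an open, published formula rather than anything New Relic-specific: given a target response time T, samples at or under T count as satisfied, samples between T and 4T count as tolerating (at half weight), and anything slower counts as frustrated. A minimal Python sketch of the score:

```python
def apdex(response_times, threshold):
    """Apdex score: (satisfied + tolerating / 2) / total samples.

    satisfied:  response time <= T
    tolerating: T < response time <= 4T
    frustrated: response time > 4T
    The score ranges from 0 (all frustrated) to 1 (all satisfied).
    """
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)

# Five sample response times in seconds, against a 0.5s target:
# two satisfied, two tolerating, one frustrated.
print(apdex([0.2, 0.4, 0.6, 1.1, 3.0], threshold=0.5))  # 0.6
```

The appeal of a single bounded score is that one alert threshold (say, Apdex below 0.85) works across services with very different absolute latencies, since each service picks its own T.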

Being a product development company, it is super critical for us to catch anomalies before they impact the end customers.”

 

What’s the biggest challenge your team has faced with data observability? What have you done to overcome this challenge?

One of the biggest challenges we are still facing is standardizing the way we stream application metrics and log data into our data observability platform. We are looking at solutions like StatsD to overcome this challenge.
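Part of StatsD’s appeal for standardization is how small its wire format is: each metric is a plain-text UDP datagram of the form `name:value|type`. The sketch below is a stdlib-only illustration of that protocol, not Knock’s implementation, and the metric name is hypothetical.

```python
import socket

class StatsdClient:
    """Minimal StatsD client using the plain-text line protocol over UDP.

    Each datagram is "metric.name:value|type", where type is "c" (counter),
    "g" (gauge) or "ms" (timer in milliseconds). Sends are fire-and-forget:
    UDP never blocks the application, even if no StatsD daemon is listening.
    """

    def __init__(self, host="127.0.0.1", port=8125):  # 8125 is StatsD's default port
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, line):
        self.sock.sendto(line.encode("ascii"), self.addr)
        return line  # returned so callers/tests can inspect what was sent

    def incr(self, metric, value=1):
        return self._send(f"{metric}:{value}|c")

    def gauge(self, metric, value):
        return self._send(f"{metric}:{value}|g")

    def timing(self, metric, ms):
        return self._send(f"{metric}:{ms}|ms")

# Hypothetical business metric: count a lease-application event.
client = StatsdClient()
print(client.incr("app.lease_applications"))  # app.lease_applications:1|c
```

Because every service emits the same three metric types over the same format, a convention like this is one way to standardize how application metrics reach an observability platform.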

 

Responses have been edited for length and clarity. Photography provided by associated companies and Shutterstock.
