diego gómez reflections on my journey through the tech landscape

Understanding Unity Catalog

Across the dozens of engagements I’ve had since joining Databricks, I’ve found that customers often struggle to understand the scope and concept of Unity Catalog. Questions like “Does it store my data?”, “Is it safe?”, “Can I have multiple Unity Catalogs?”, and “Will it break anything?” are more common than I’d like. We sell it as a one-size-fits-all solution, and it truly is one, but it can be a complex idea to grasp.

For customers coming from legacy architectures, this shift is even more significant, as it turns their understanding of how things work on its head, introducing them to a whole new way to govern their data.

I’ve always found it useful to grasp the ‘why’ before diving into the ‘how’, so I’ll do my best to break down what Unity Catalog is (and isn’t).

A short history lesson


The hive_metastore was once a central component of the Databricks platform, serving as the metadata repository for managing data objects and enabling queries. It did its job well enough for its time. But here’s the thing: it was never a top choice for flexibility or ease of use, and it came with some significant limitations.

For starters, hive_metastore was designed as a workspace-level construct responsible for maintaining the metadata repository and governance. This meant that if you had, say, a hundred workspaces sharing a single data source, you’d be stuck managing a hundred identical sets of permissions, one per workspace. Imagine trying to juggle a hundred different keys for a single door. Not exactly efficient or fun.
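To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The function names and the figures (100 workspaces, 20 grants on the shared source) are made up for illustration. All it does is count how many permission entries someone has to keep in sync when every workspace holds its own copy, compared with defining them once in a shared place.

```python
def grants_with_workspace_level_governance(workspaces: int, grants_per_source: int) -> int:
    """Every workspace keeps its own copy of the grants for the shared source."""
    return workspaces * grants_per_source


def grants_with_shared_governance(grants_per_source: int) -> int:
    """The grants are defined once and shared by every workspace."""
    return grants_per_source


# Illustrative numbers only: 100 workspaces, 20 grants on one shared data source.
workspace_level = grants_with_workspace_level_governance(workspaces=100, grants_per_source=20)
shared = grants_with_shared_governance(grants_per_source=20)

print(workspace_level)  # 2000
print(shared)           # 20
```

The gap grows linearly with the number of workspaces, and so does the chance that one of those copies drifts out of sync.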

[Diagram: workspaces before Unity Catalog]

And then, there was the matter of the two-level namespace restriction. This setup made managing permissions at scale more of a headache than needed. Picture this: a schema with a hundred tables beneath it, each serving different use cases. Your options? Either grant access to the entire schema, exposing all its tables, or meticulously assign access to each table one by one. Not exactly a recipe for scalable success.

[Diagram: access permissions on hive_metastore]

The situation was further complicated by the challenge of controlling access to the underlying storage, particularly with DBFS, which was the default location for hive_metastore. Access to external object storage was managed through instance profiles, but these profiles operated at the cluster level, not the user level. So every user on a cluster ended up with the same access permissions, whether they needed them or not. This lack of granularity made it hard to enforce precise security controls and left you with a setup that wasn’t tailored to individual needs.

And let’s not forget that hive_metastore was showing its age in other ways, too. It lacked key features like data lineage, visibility into access patterns, and data discovery, all of which are expected of a centralized governance solution in today’s data-driven environments.


This is where Unity Catalog steps in. Unlike its predecessor, Unity Catalog is an account-level construct, allowing metadata and permissions to be shared across multiple workspaces. This shift enables centralized governance at scale by making data management more streamlined and efficient.

But it doesn’t stop there. It also introduces an additional level in the namespace: catalogs. This extra layer allows for more precise permissions, enhancing security and governance, and makes it much easier to organize and manage data across diverse environments and use cases. For example, having three catalogs (dev, stg, and prod) with identical schema and table structures makes it easy to test the same code in each environment, changing only the catalog name.
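Here is a minimal Python sketch of that dev/stg/prod idea. The schema and table names (sales, orders) and the helper names are hypothetical, and the code only builds query strings, so it runs anywhere. The point is that the three-level name catalog.schema.table lets the environment become a single parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """A table addressed through the three-level namespace: catalog.schema.table."""
    catalog: str
    schema: str
    table: str

    def fqn(self) -> str:
        # Fully qualified name; assumes plain identifiers that need no quoting.
        return f"{self.catalog}.{self.schema}.{self.table}"


def daily_orders_query(catalog: str) -> str:
    """Build the same query for any environment; only the catalog changes."""
    orders = TableRef(catalog, "sales", "orders")  # hypothetical schema and table
    return f"SELECT order_date, COUNT(*) AS orders FROM {orders.fqn()} GROUP BY order_date"


for env in ("dev", "stg", "prod"):
    print(daily_orders_query(env))
```

In a notebook or job you would typically read the catalog name from a job parameter and hand the resulting string to spark.sql, so promoting code from dev to prod means changing one value rather than editing queries.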

In short, while hive_metastore had its time in the spotlight, Unity Catalog represents a significant leap forward, offering the flexibility, security, and scalability that modern data environments require.

Metastore


[Diagram: Unity Catalog architecture]

Alright, now that we’ve got the background covered, let’s dive into the nuts and bolts of getting Unity Catalog up and running. If you’ve heard someone talking about “enabling UC” and felt a bit lost, don’t worry — you’re not alone. What they’re really talking about is creating a metastore.

So, what exactly is a metastore? Think of it as the brain behind Unity Catalog — the top-level container that holds all the metadata. But here’s the kicker: a metastore isn’t exactly where your data lives. It’s a logical construct, kind of like the blueprint of a building. It tells Unity Catalog where everything is and how it should be managed, but it doesn’t actually store any of your data.

This distinction can trip people up because, naturally, when you hear the word “store”, you might think it’s holding your data. But nope! The metastore is more like a master directory that keeps everything organized and in its proper place. The actual data remains in whatever cloud object storage you’ve chosen to store it in.

You’re probably asking yourself, “Will attaching a metastore to my workspace disable hive_metastore or break anything?” The answer is no. Unity Catalog and hive_metastore can coexist without any issues. Your clusters can keep running in their legacy modes, your data will stay intact, and your job references won’t suddenly switch over. Think of it as adding a new tool to your toolbox, not replacing the one you’re already comfortable with, at least not yet.
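One way to picture the coexistence: once a workspace is attached, the legacy metastore shows up as a catalog named hive_metastore, sitting next to your Unity Catalog catalogs. The Python sketch below is a simplified model of how a table name gets completed, not Databricks’ actual resolver. It assumes the workspace’s default catalog is still hive_metastore, and the schema and table names are hypothetical.

```python
LEGACY_CATALOG = "hive_metastore"  # the legacy metastore, seen as a catalog


def resolve(name: str, default_catalog: str = LEGACY_CATALOG) -> str:
    """Expand a table name to catalog.schema.table.

    Simplified model: a two-part name is completed with the default catalog,
    and a three-part name is returned unchanged.
    """
    parts = name.split(".")
    if len(parts) == 3:
        return name
    if len(parts) == 2:
        return f"{default_catalog}.{name}"
    raise ValueError(f"expected schema.table or catalog.schema.table, got {name!r}")


# An existing job that still uses a two-level name keeps pointing at the legacy metastore.
print(resolve("sales.orders"))       # hive_metastore.sales.orders
# New code can opt into a Unity Catalog catalog by naming it explicitly.
print(resolve("prod.sales.orders"))  # prod.sales.orders
```

Under that assumption, old two-level references and new three-level ones can live side by side in the same workspace.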

Creating and configuring the metastore


With that out of the way, we can move on to creating the metastore. There are two requirements: you must be an account administrator, and any workspace you plan to enable must be on the Premium plan or above.

In your account console, click Catalog and then select Create Metastore.

On this screen, you only need to provide a name and select the region where your workspace resides. Remember that a metastore is tied to a cloud region and only stores metadata, your data’s blueprint.

There’s an option to add storage configurations to choose the default storage location for the metastore, but I’d highly recommend skipping this for now. Adding this might lead you to lose track of where your data is stored, making it harder to manage later on.

With the metastore in place, the next step is to attach it to your workspaces. Ignore the pop-up messages and march ahead like a fearless explorer — you’re not defusing a bomb here, so don’t worry, nothing’s going to blow up.
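If you like to double-check before clicking, the prerequisites above can be sketched as a small pre-flight check in Python. Everything here is hypothetical: the Workspace class, the preflight function, the plan tier names, and the example regions are mine, and nothing calls a Databricks API. It simply encodes the three conditions from this section: account admin, Premium plan or above, and a workspace in the same region as the metastore.

```python
from dataclasses import dataclass


@dataclass
class Workspace:
    name: str
    region: str
    plan: str  # for example "standard", "premium", "enterprise"


# "Premium plan or above"; the exact tier names are an assumption.
ELIGIBLE_PLANS = {"premium", "enterprise"}


def preflight(is_account_admin: bool, metastore_region: str, workspace: Workspace) -> list[str]:
    """Return the problems that would block attaching the metastore (empty if none)."""
    problems = []
    if not is_account_admin:
        problems.append("caller is not an account administrator")
    if workspace.plan.lower() not in ELIGIBLE_PLANS:
        problems.append(f"workspace {workspace.name} is not on the Premium plan or above")
    if workspace.region != metastore_region:
        problems.append(
            f"workspace {workspace.name} is in {workspace.region}, "
            f"but the metastore is in {metastore_region}"
        )
    return problems


ws = Workspace(name="analytics", region="us-east-1", plan="premium")
print(preflight(True, "us-east-1", ws))   # []
print(preflight(False, "eu-west-1", ws))  # two problems: not an admin, region mismatch
```

An empty list means none of the three conditions would block the attachment.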

And just like that, your workspace is enabled on Unity Catalog! But wait — where does your data get stored? We’ll work out the details in the next post. Stay tuned!

