Why Data Management Matters in Unity Catalog
When designing your data’s structure in Unity Catalog, there’s often a fine line between simplicity and scalability. I’ve seen many customers struggle in this area — especially those coming from more rigid, legacy systems — largely because it’s so open-ended. With no hard-and-fast rules, it becomes tricky to know what’s “right” or “wrong.” Without a clear path forward, it’s easy to end up with an architecture that feels disjointed and difficult to expand.
Setting up a solid foundation for your data organization is crucial to its long-term success. Done right, it will simplify governance, improve access control, and make your life a whole lot easier. But if done poorly, it can slow you down and hinder innovation. When your data’s structure isn’t thoughtfully designed, simple changes — like onboarding a new team, granting access to a new dataset or adjusting permissions — turn into time-consuming exercises in frustration.
As Benjamin Franklin once said (and Databricks engineers often find themselves muttering): “By failing to prepare, you are preparing to fail.” To save you from the pain of untangling messy permissions and poorly structured data, I’ll walk you through a familiar case study that highlights key pitfalls and best practices, helping you avoid common mistakes and future-proof your data setup in Unity Catalog.
Laying the groundwork
Let’s take a look at DIP Inc., a fictional company that’s just set up its three workspaces (dev, stg, and prod) and is now ready to start moving data into the platform. Sounds easy, right? Well, not so fast.
Their data includes sensitive PII (Personally Identifiable Information) that, for compliance reasons, needs to be both physically and logically isolated from their end users. If it isn’t, they could face a massive fine, or worse, lose their license to operate. No pressure.
On top of that, they’re worried about the dozens of teams that will be accessing the platform. Not everyone should have access to all environments, and certainly not all the data.
DIP’s team is now tasked with the fun job of designing a catalog structure that ticks all the boxes, keeps the auditors happy, and ensures that nobody accidentally gets more access than they should—without someone calling in the lawyers.
The perils of default storage
When setting up the metastore (as we discussed when first creating it), there was an option to define a default storage location, where data would automatically be stored unless a specific location was defined. Sounds convenient, right? Well, it’s a convenience you’ll want to avoid.
Imagine this: the team lead, who had never laid eyes on Databricks before, creates a catalog for super-sensitive production data—data no one should have access to. But because they didn’t explicitly define its storage location, the data ends up in the default bucket—right next to random test data from other environments. Now, anyone with access to the default storage location could not only view these production files but, even worse, overwrite them. Not exactly what you’d call secure.
This brings us to our first key lesson in designing your catalog structure: avoid setting a default storage location. Doing so forces you to explicitly define where each catalog’s data should be stored, ensuring you maintain complete control over physical isolation. This allows you to separate storage buckets for different environments—or even for specific use cases.
For example, here’s how DIP Inc. could create their catalogs for different environments, ensuring each is physically isolated under a single metastore:
CREATE EXTERNAL LOCATION dev_and_stg
URL 'abfss://dip@dipstorage.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL dip_credential);
CREATE CATALOG IF NOT EXISTS dev
MANAGED LOCATION 'abfss://dip@dipstorage.dfs.core.windows.net/dev/';
-- Stores data under the same container but a different path.
CREATE CATALOG IF NOT EXISTS stg
MANAGED LOCATION 'abfss://dip@dipstorage.dfs.core.windows.net/stg/';
CREATE EXTERNAL LOCATION production
URL 'abfss://prod@dipprodstorage.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL dip_credential);
-- Stores data in a different storage account and container.
-- (Azure storage account names allow only lowercase letters and digits.)
CREATE CATALOG IF NOT EXISTS prod
MANAGED LOCATION 'abfss://prod@dipprodstorage.dfs.core.windows.net/';
With this approach, each environment has a catalog stored in its own designated, isolated bucket—ensuring compliance with the company’s expectations.
There’s no need to go with a single catalog approach either. You can create additional catalogs for other types of data, each with its own managed location. This ensures data is physically isolated by purpose or sensitivity, giving DIP Inc. the peace of mind that the right data is in the right place.
Tying catalogs to workspaces: a smarter approach
Managing permissions for these new catalogs might seem like a hassle, right? After all, 90% of our users are strictly part of the development team, so it seems pointless to assign them permissions for the staging and production catalogs if they won’t even be using them.
But remember how, in our first blog post, we discussed that Unity Catalog is an account-level construct? This gives it the ability to support what’s called workspace-catalog binding. Even though multiple workspaces use the same metastore, this feature allows catalogs to be tied to specific workspaces.
This means the dev catalog is only accessible in the development workspace, stg in staging, and prod in production. Of course, catalogs can still be shared across workspaces if needed, but this approach significantly reduces the overhead of managing permissions.
Here’s how easy it is to bind a catalog to a workspace:
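Binding isn’t done in SQL — it’s configured in the catalog’s workspace settings UI or through the Databricks REST API. The sketch below is a hedged illustration of the two API calls involved: the endpoint paths follow the public Unity Catalog REST API (v2.1), while the workspace ID is a hypothetical placeholder. It only builds the requests; you would send them against your workspace URL with your own authentication.

```python
import json

# Sketch of the two Unity Catalog REST calls used to bind a catalog to a
# workspace. The workspace ID below is a placeholder for illustration.
def isolate_catalog(catalog_name):
    # Step 1: switch the catalog from OPEN (visible in every workspace
    # attached to the metastore) to ISOLATED.
    return {
        "method": "PATCH",
        "path": f"/api/2.1/unity-catalog/catalogs/{catalog_name}",
        "body": json.dumps({"isolation_mode": "ISOLATED"}),
    }

def bind_catalog(catalog_name, workspace_ids):
    # Step 2: assign the workspaces that are allowed to see the catalog.
    return {
        "method": "PATCH",
        "path": f"/api/2.1/unity-catalog/workspace-bindings/catalogs/{catalog_name}",
        "body": json.dumps({"assign_workspaces": workspace_ids}),
    }

# Bind the dev catalog to a (hypothetical) development workspace ID.
reqs = [isolate_catalog("dev"), bind_catalog("dev", [1234567890])]
for r in reqs:
    print(r["method"], r["path"])
```

Send each request with the HTTP client of your choice, authenticated as a metastore admin or the catalog owner. Once bound, the catalog simply stops appearing in workspaces it isn’t assigned to.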
The key lesson: use workspace-catalog binding to minimize the number of permissions you need to assign. DIP Inc.’s developers are already thanking us for the weeks of work we’ve saved them!
Taking isolation one step further
Now we’re ready to go one level deeper. We’re just one step away from tables and actually getting hold of the data.
Remember how we discussed that DIP Inc. has sensitive PII data for historical purposes that needs to be completely separated in a dedicated storage bucket?
One of the great things about Unity Catalog is that, just like catalogs, schemas (and even tables, though that’s often too granular) can be physically isolated if needed.
DIP Inc. has two options here: inherit the storage location from the parent catalog, or override it by specifying a dedicated storage location for the sensitive data.
CREATE EXTERNAL LOCATION dip_pii
URL 'abfss://dip-pii@dipisolatedstorage.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL dip_pii_no_access);
-- Overrides the parent's location.
CREATE SCHEMA IF NOT EXISTS prod.dip_pii MANAGED LOCATION
'abfss://dip-pii@dipisolatedstorage.dfs.core.windows.net/';
-- Inherits the parent catalog's location.
CREATE SCHEMA IF NOT EXISTS prod.managed_schema;
Now, all tables under dip_pii will be stored in the isolated container, while everything remains organized under the same catalog.
Organizing data across layers: the smart approach
As you’ve probably noticed by now, the main premise is that storage can be controlled at different levels, and we’ve been crafting our catalog structure to streamline permission management and reduce its pain. Spoiler alert: this is what makes or breaks your environment.
Let’s say we’re ready to start designing specific use cases. For example, DIP Inc. will be ingesting telemetry data from cloud object storage on a regular basis. When managing a lakehouse, a common (and smart) approach is to follow the Medallion Architecture:
- Bronze for raw data
- Silver for cleaned data
- Gold for transformed, ready-to-use data
Now, I won’t go into the depths of Medallion Architecture here (that’s a topic for another day), but keep the basic premise in mind.
Here’s the thing: data engineers need access to all three layers (bronze, silver, and gold) to do their job. But data analysts? They only need access to the gold tables where the final, polished data lives.
Imagine this setup involves 30 tables, 10 in each layer. Thanks to Unity Catalog’s three-level namespace, DIP Inc. has a couple of options for organizing this:
- Create three separate schemas—one for each layer (bronze, silver, gold)—and store the tables accordingly.
- Create a single schema and separate the tables by layer within it.
Here’s the catch: If you store everything under a single schema, you can give your data engineers schema-level access. But when it comes to analysts, you’d need to grant access to each of the 10 gold tables individually. That’s not only time-consuming, but also a recipe for mistakes—like forgetting to update permissions when new tables are added.
A much better approach? Divide the tables into three separate schemas—one for each layer.
Here’s how DIP Inc. could create their schemas and grant access to both groups:
CREATE SCHEMA IF NOT EXISTS dev.telemetry_bronze;
CREATE SCHEMA IF NOT EXISTS dev.telemetry_silver;
CREATE SCHEMA IF NOT EXISTS dev.telemetry_gold;
-- Grant data engineers the ability to use and create tables in all three layers.
-- (Both groups also need USE CATALOG on dev.)
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA dev.telemetry_bronze TO `data-engineers`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA dev.telemetry_silver TO `data-engineers`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA dev.telemetry_gold TO `data-engineers`;
-- Grant data analysts read access to the gold schema only. SELECT at the
-- schema level also covers tables added in the future.
GRANT USE SCHEMA, SELECT ON SCHEMA dev.telemetry_gold TO `data-analysts`;
This way, you can grant analysts access to just the gold schema, and any new tables added to it will automatically be available to them. No extra work, no hassle.
Final thoughts
Before we wrap things up, let’s quickly recap the key ways you can isolate and manage your data in Unity Catalog:
- Catalog-level isolation – assign separate storage locations for each catalog to physically isolate environments (e.g., dev, stg, prod).
- Schema-level isolation – use schemas to group tables by layer (e.g., bronze, silver, gold) or by use case, and manage access with fine-grained permissions.
- Table-level isolation – although usually too granular for most cases, tables can also be isolated with their own storage location if necessary.
When designing your catalogs, don’t focus solely on physical storage. Remember, storage can be controlled at any level—catalog, schema, or table. Instead, think ahead and design with permissions in mind. The goal is to streamline the number of permissions you’ll need to manage over time, which will ultimately make life easier for both admins and users. Trust me, setting this up the right way from the start will save you a mountain of headaches down the road.
And just like that, you’ve built a solid foundation for designing your catalogs and schemas in Unity Catalog! Next up, we’ll dive into tables, volumes, and how to get your data ready for action. Stay tuned—there’s plenty more to come!