Understanding Unity Catalog: How to store your data
One of the most common misconceptions I hear from customers is that Unity Catalog actually stores your data. And I get it — storing data sounds like a simple, straightforward concept. But once you’re working in a distributed, multi-cloud world, the details become a lot more nuanced.
Let’s clear this up right from the start: Unity Catalog doesn’t store your data. It holds the metadata — the blueprints that describe where your data lives, its structure, and the permissions attached to it. Your actual data? That remains tucked away safely in your cloud provider’s object storage.
Now that we’ve set the record straight, let’s get into the real work: how do you configure your data storage properly?
Where will your data be stored?
The first step is getting your cloud provider’s infrastructure ready. There are only two things we need to set up: a place to store our data, and a way to access it securely. That’s it.
When it comes to choosing where to store your data, Unity Catalog requires that you use Azure Data Lake Storage Gen2 — and this isn’t just a recommendation, it’s a hard requirement. In practice, that means making sure your storage account has hierarchical namespace enabled.
Follow the steps to create the storage account and a container (bucket). If you already have a storage account set up, feel free to use that.
So what does this mean? In simpler terms, hierarchical namespace allows your data to be organized in a directory-like structure, much like a traditional file system. More importantly, it enables Unity Catalog to manage permissions with precision, controlling who can access what and limiting actions down to specific operations or data. This level of control is essential for the fine-grained governance and security that Unity Catalog is designed to deliver.
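If you prefer to script this step rather than click through the portal, here is a minimal sketch using the Azure SDK for Python (azure-identity and azure-mgmt-storage). The resource group, account, and container names are placeholders for illustration.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku, BlobContainer

subscription_id = "<subscription-id>"
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Create an ADLS Gen2 account: a StorageV2 account with hierarchical namespace enabled.
poller = client.storage_accounts.begin_create(
    resource_group_name="rg-unity-catalog",    # placeholder
    account_name="myunitycatalogstorage",      # placeholder, must be globally unique
    parameters=StorageAccountCreateParameters(
        location="westeurope",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,  # hierarchical namespace: this is what makes it ADLS Gen2
    ),
)
account = poller.result()

# Create a container to hold your data.
client.blob_containers.create(
    resource_group_name="rg-unity-catalog",
    account_name="myunitycatalogstorage",
    container_name="unity-catalog-data",
    blob_container=BlobContainer(),
)
```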
Now that we’ve sorted out where your data will live, the next step is ensuring that it’s accessed securely.
How will your data be accessed?
The “way to access it” differs depending on the cloud provider you’re using. In Azure, you can either use a system-assigned managed identity (Azure automatically creates and manages the identity for a single resource) or a user-assigned managed identity (you create and manage the identity yourself and can share it across multiple resources).
Search for the Access Connector for Azure Databricks and follow the steps to set it up. Don’t forget to copy the Resource ID — we’ll need it later.
System-assigned identities are simpler because their lifecycle is completely managed for you, while user-assigned identities give you more flexibility and control, especially if you want to reuse one across several resources.
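If you want to automate this step too, one option is the generic ARM resources API, since the connector is just an Azure resource of type Microsoft.Databricks/accessConnectors. This is only a rough sketch: the resource names are placeholders and the API version is an assumption you should verify against the provider.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import GenericResource, Identity

subscription_id = "<subscription-id>"
res_client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

poller = res_client.resources.begin_create_or_update(
    resource_group_name="rg-unity-catalog",
    resource_provider_namespace="Microsoft.Databricks",
    parent_resource_path="",
    resource_type="accessConnectors",
    resource_name="uc-access-connector",
    api_version="2023-05-01",  # assumption: check the provider's supported API versions
    parameters=GenericResource(
        location="westeurope",
        identity=Identity(type="SystemAssigned"),  # or attach a user-assigned identity instead
        properties={},
    ),
)
connector = poller.result()
print(connector.id)                     # the Resource ID you'll paste into Databricks later
print(connector.identity.principal_id)  # the identity you'll grant storage roles to
```

Of course, you can just as easily copy the Resource ID and the identity’s principal ID from the portal once the connector is created.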
Who gets access to your cloud storage?
Now that all the necessary infrastructure is set up, it’s time to make sure your managed identity has the permissions it needs to access the storage container. This step allows Databricks, as a platform, to interact with your data. Keep in mind that at this stage, per-user permissions aren’t a factor yet — we’re just granting platform-level access to the storage.
There are several ways to approach this, and your choice largely depends on your security needs and how fine-grained your permissioning requirements are. Let’s break it down:
Granting access to the entire storage account
If you’re looking for a broad-level approach, the simplest method is to grant access to the entire storage account. To do this, assign your managed identity the Storage Blob Data Contributor role at the Storage Account level. This gives Databricks the ability to read, write, and manage blob containers and data across your account.
Granting access to specific container(s)
If you need more granular control, you can limit the permissions to specific containers. There are two key roles required for this:
- Storage Blob Delegator at the Storage Account level.
- Storage Blob Data Contributor at the Storage Container level.
This allows you to maintain tighter security by restricting access to only the containers that Databricks needs to work with, without opening up the entire storage account.
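Whichever option you choose, the role assignments themselves can be scripted. Below is a sketch using azure-mgmt-authorization that looks up the built-in roles by name and assigns them to the connector’s managed identity; the scope strings and principal ID are placeholders.

```python
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"
auth_client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

def assign_role(scope: str, role_name: str, principal_id: str) -> None:
    """Look up a built-in role by display name and assign it to a principal at the given scope."""
    role_def = next(iter(
        auth_client.role_definitions.list(scope, filter=f"roleName eq '{role_name}'")
    ))
    auth_client.role_assignments.create(
        scope,
        str(uuid.uuid4()),  # role assignment names are arbitrary GUIDs
        RoleAssignmentCreateParameters(
            role_definition_id=role_def.id,
            principal_id=principal_id,
            principal_type="ServicePrincipal",  # managed identities show up as service principals
        ),
    )

principal_id = "<access-connector-managed-identity-principal-id>"  # placeholder

account_scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/rg-unity-catalog"
    "/providers/Microsoft.Storage/storageAccounts/myunitycatalogstorage"
)
container_scope = account_scope + "/blobServices/default/containers/unity-catalog-data"

# Option 1: broad access to the whole storage account.
assign_role(account_scope, "Storage Blob Data Contributor", principal_id)

# Option 2: container-scoped access instead.
assign_role(account_scope, "Storage Blob Delegator", principal_id)
assign_role(container_scope, "Storage Blob Data Contributor", principal_id)
```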
Additional permissions for file event notifications
In addition to data access, Databricks needs to be able to process files efficiently, and this is where file notifications come into play. To allow Databricks to subscribe to file event notifications, you’ll need to assign the Storage Queue Data Contributor role at the Storage Account level. This allows Databricks to capture notifications emitted by your cloud provider and react to file changes in real time.
Optional (but recommended) permissions for automatic event configuration
To take full advantage of Databricks’ capabilities, it’s also highly recommended to grant a couple of additional roles:
- Storage Account Contributor at the Storage Account level.
- EventGrid EventSubscription Contributor at the Resource Group level.
These roles allow the platform to set up file events on your behalf, automating the process and saving you from configuring them manually for each location. While this step is optional, it’s important to note that if you don’t grant these permissions, you’ll have to handle event configuration yourself, and you may miss out on critical features in the future. For maximum flexibility and to future-proof your setup, it’s a good idea to assign these roles.
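If you’re scripting the setup, the same assign_role helper sketched earlier covers these event-related roles too; the resource group scope below is again a placeholder.

```python
rg_scope = f"/subscriptions/{subscription_id}/resourceGroups/rg-unity-catalog"

# Required for subscribing to file event notifications.
assign_role(account_scope, "Storage Queue Data Contributor", principal_id)

# Optional, but lets Databricks configure file events on your behalf.
assign_role(account_scope, "Storage Account Contributor", principal_id)
assign_role(rg_scope, "EventGrid EventSubscription Contributor", principal_id)
```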
Networking considerations
Properly configuring your network is a key part of setting up infrastructure for Unity Catalog. In most production-ready environments, you’ll likely want to block public internet access to your storage by using firewall rules or private endpoints. While we’re not diving into that here, it’s definitely something to keep in mind as you move forward with your setup.
Another important point to consider is cross-region storage. While it’s technically possible to have your metastore in one region and your storage containers in another, it’s not something I’d recommend. This setup can introduce latency issues and rack up high egress costs when data is transferred across regions.
For the best performance and cost efficiency, it’s always better to keep your metastore and storage in the same region, or to use Delta Sharing as an alternative: it lets you securely share data across regions or even clouds without moving it, avoiding these issues altogether.
Storage credentials and external locations: what’s the deal?
To fully understand how Unity Catalog accesses your secure data, it’s important to grasp two key concepts: storage credentials and external locations.
Think of these as mappings to the physical objects we’ve just created. They serve as an interface that allows fine-grained permissioning on a per-user or per-group basis, offering a level of control that wasn’t possible before.
Storage credentials: your key to controlled access
A storage credential maps directly to the managed identity we previously created.
To set this up, go to your Unity Catalog-enabled workspace, click on Catalog, and follow the steps. The only thing you’ll need is the Resource ID from the access connector we set up earlier, as seen below.
Through storage credentials, you’re no longer just granting broad platform-level access; instead, you can specify which users or teams have authority to access certain storage paths, bringing much-needed control and security to your data management process.
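You can create the credential through the UI as described above, or script it with the Databricks SDK for Python. In this sketch the credential and connector names are placeholders, and note that the managed-identity payload class has been renamed across SDK versions (older releases call it AzureManagedIdentity).

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import catalog

w = WorkspaceClient()  # picks up authentication from the environment or ~/.databrickscfg

credential = w.storage_credentials.create(
    name="uc-adls-credential",  # placeholder name
    azure_managed_identity=catalog.AzureManagedIdentityRequest(
        access_connector_id=(
            "/subscriptions/<subscription-id>/resourceGroups/rg-unity-catalog"
            "/providers/Microsoft.Databricks/accessConnectors/uc-access-connector"
        )
    ),
    comment="Maps to the access connector's managed identity",
)
print(credential.name)
```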
External locations: secure storage made simple
An external location takes the concept of storage credentials further by linking a specific cloud object storage URI to that credential. In simpler terms, an external location ties a storage path (like a folder or bucket in your cloud storage) to the access controls defined by a storage credential.
The external location will reference the previously created storage credential and a path to your container in the form of: abfss://<my-container-name>@<my-storage-account>.dfs.core.windows.net/<path>
Here’s the beauty of it: when you assign a user or group permissions to use an external location, they can access the storage paths associated with that location without needing direct access to the underlying storage credential itself. This ensures that users won’t have direct access to the sensitive storage credentials, reducing the risk of mismanagement and enhancing data security.
One important thing to note is that each external location is unique to its path in cloud storage. The URI you assign to an external location can’t overlap with another’s path. This prevents conflicts and ensures that access controls remain distinct and manageable.
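Putting it together, here is a sketch with the same SDK: the external location binds the container path to the credential created above, and a grant then lets a (hypothetical) group use it without ever seeing the credential itself.

```python
location = w.external_locations.create(
    name="raw-data",  # placeholder name
    url="abfss://unity-catalog-data@myunitycatalogstorage.dfs.core.windows.net/raw",
    credential_name="uc-adls-credential",
    comment="Raw landing zone",
)

# Grant a group the right to read and write files under this location.
w.grants.update(
    securable_type=catalog.SecurableType.EXTERNAL_LOCATION,
    full_name=location.name,
    changes=[
        catalog.PermissionsChange(
            principal="data-engineers",  # hypothetical group
            add=[catalog.Privilege.READ_FILES, catalog.Privilege.WRITE_FILES],
        )
    ],
)
```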
Querying tables: no keys, no problem!
Thanks to Unity Catalog, users can query specific tables without needing access to the underlying storage credentials or external locations. Previously, with the hive_metastore, you had to grant direct storage access (at the platform or cluster level), which made managing access more complex and less secure.
Now, users can interact with data – run queries, retrieve results – without ever touching the underlying storage itself. This adds an extra layer of security and simplicity, allowing for much more controlled access without compromising functionality.
In short, users no longer need direct storage permissions; they only need access to the data they’re authorized to see. This makes data management more streamlined and secure.
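In practice, that means a user in a notebook or SQL warehouse simply queries the table by its three-level name; the catalog, schema, and table below are hypothetical.

```python
# Runs on a Unity Catalog-enabled cluster; `spark` is predefined in Databricks notebooks.
df = spark.sql("SELECT * FROM main.sales.orders LIMIT 10")
df.show()

# Equivalent DataFrame API call:
spark.table("main.sales.orders").limit(10).show()
```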
Understanding how to store and secure your data is just the beginning of building a robust data governance framework. But what’s next? Designing and organizing your catalogs to take full control of your data. More on that in the next post!
Enjoyed reading?
Subscribe to the mailing list to receive future content.