Secure data sharing from a data lake with Snowflake in minutes
Many organisations have implemented data lakes in the past, with varying outcomes. One of the recurring challenges with data lakes is security and governance. In this blog post I will demonstrate how easy it is to securely share specific data from a data lake with Snowflake.
Securing a data lake can be difficult: data with multiple security classifications might be stored in a single file, and if there are tens of thousands of files in the data lake, how do you determine who has access to what data?
Another data lake challenge is sharing data with customers and/or suppliers directly from a data lake. In this example we will use records containing information from multiple departments stored in a single file, but we’ll only be sharing the data for the HR department.
For this blog I have generated sample data stored in the Parquet file format. Each record consists of three fields: ‘id’, ‘name’ and ‘department’. Please note that the names were generated and that any match with a real person is coincidental.
The Parquet file is stored in an S3 bucket called ‘parquetdl’. The bucket is protected by an IAM role and security policy. For more information on how to set this up, please refer to the Snowflake documentation on configuring access to Amazon S3.
Creating a stage
We will now proceed with making the connection between the S3 bucket and Snowflake. Snowflake refers to storage services like S3, GCS or Azure Blob storage as a ‘stage’. In a newly created database named ‘Employees’ we create a stage object which points to the ‘parquetdl’ S3 bucket.
A new window opens and we have to fill in some variables in the ‘create stage’ command.
Replace <stage_name> with the name for this stage; in this case we name the stage ‘parquetdl’. Also replace <url> with the URL of the S3 bucket, <key> with the AWS secret access key and <id> with the AWS access key ID. Once these are replaced, press ‘Create Stage’.
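Filled in, the generated command looks roughly like the sketch below (the URL is an assumption based on the bucket name; keep your real key and ID out of source control):

```sql
-- Sketch of the generated 'create stage' command; <id> and <key>
-- stand for your actual AWS access key ID and secret access key.
CREATE STAGE parquetdl
  URL = 's3://parquetdl/'
  CREDENTIALS = (AWS_KEY_ID = '<id>' AWS_SECRET_KEY = '<key>');
```

For production use, a storage integration is generally preferable to embedding credentials directly in the stage definition.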
Reading data from the stage
Open a new worksheet and execute a ‘LIST’ command to see if we’re able to read from the stage. Please note: the @ symbol in Snowflake indicates that a stage is being referenced.
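The command itself is a one-liner, using the stage name we chose earlier:

```sql
-- List the files Snowflake can see on the stage.
LIST @parquetdl;
```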
The output should look like this:
We now have successfully connected Snowflake to the S3 bucket ‘parquetdl’ and we’re almost ready to share this data.
Defining an external table
Snowflake external tables allow us to query data, stored in a data lake, as if it were a normal Snowflake table. This allows us to query for just the people in the HR department. In order to define an external table, which we name ‘emprecords’, we issue the following command.
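A sketch of that command, assuming the stage from earlier and a file pattern matching Parquet files:

```sql
-- External table over the Parquet files on the 'parquetdl' stage.
-- The PATTERN value is an assumption: it matches any file ending in '.parquet'.
CREATE OR REPLACE EXTERNAL TABLE emprecords
  LOCATION = @parquetdl
  FILE_FORMAT = (TYPE = PARQUET)
  PATTERN = '.*[.]parquet';
```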
Snowflake will scan through the specified stage and add all files that match the pattern to the table. When we then query the external table we get all the data in the Parquet file(s) in a single column named ‘value’. This column is of a special type called ‘variant’ and can be used to store and access semi-structured data such as JSON or Parquet natively in Snowflake. The cool thing about this is that variant columns can also be queried with SQL, allowing structured and semi-structured data to be combined easily.
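For example, individual fields inside the ‘value’ column can be pulled out with the colon notation and cast to regular SQL types (the field names follow the three Parquet fields described earlier):

```sql
-- Extract typed columns from the 'value' variant column.
SELECT value:id::INT            AS id,
       value:name::STRING       AS name,
       value:department::STRING AS department
FROM emprecords;
```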
Sharing the HR data
As stated before we are only going to share the data for the HR department. For data sharing it is recommended to use secure views, so that will be our next step. We’ll define a view, over our external table, which only returns the records for HR.
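A minimal sketch of such a view, assuming the view name ‘hr_employees’ (the filter predicate uses the ‘department’ field from the Parquet records):

```sql
-- Secure view that exposes only the HR records from the external table.
CREATE SECURE VIEW hr_employees AS
SELECT value:id::INT      AS id,
       value:name::STRING AS name
FROM emprecords
WHERE value:department::STRING = 'HR';
```

A secure view, unlike a regular view, hides its definition from consumers and prevents optimizer-based data leakage, which is why it is the recommended object type for sharing.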
Let’s see what our newly created view looks like.
The view contains all the data we need, so we’re now ready to define a ‘share’ in order to share the data. A share object in Snowflake acts as a container of shareable objects and is also used to define permissions.
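Sketched out, creating the share involves defining the share object, granting it access down to the secure view, and adding the consumer account (the share, database, schema and view names here are assumptions):

```sql
-- Create the share and grant it access to the secure view.
CREATE SHARE hr_share;
GRANT USAGE  ON DATABASE employees               TO SHARE hr_share;
GRANT USAGE  ON SCHEMA   employees.public        TO SHARE hr_share;
GRANT SELECT ON VIEW     employees.public.hr_employees TO SHARE hr_share;

-- Allow the consumer account to access the share.
ALTER SHARE hr_share ADD ACCOUNTS = sfseeurope_eu_demo74;
```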
In this example we explicitly list another Snowflake account (sfseeurope_eu_demo74) as the consumer of this share, but it’s also possible to publish data shares on the public data marketplace or a private exchange. That’s however beyond the scope of this blog.
Consuming the share
At this moment we’re able to consume the shared data in the ‘sfseeurope_eu_demo74’ Snowflake account. After logging in, we issue a couple of statements to define a database from the share, and then we can query the shared data.
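On the consumer side those statements look roughly like this (<producer_account> stands for the producing account’s identifier; the database and view names are the assumed ones from the producer side):

```sql
-- Create a local database on top of the inbound share, then query it.
CREATE DATABASE hr_data FROM SHARE <producer_account>.hr_share;

SELECT * FROM hr_data.public.hr_employees;
```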
To underscore the fact that we’re now in the consumer account (demo74) I’ve used the classic web user interface for a screenshot of the results of the SQL statements listed above.
And, as you can see, the data consumer is only able to read the HR records in the secure view.
Snowflake data sharing gives direct, real-time and secure access to shared data. So, what happens if new data emerges in the data lake? In order to showcase this I’ve generated a second Parquet file with additional employee records. This new file, creatively called ‘EmpRecords2.parquet’, is stored in the same S3 bucket as the one we used before.
Then, in the producer account, we issue a simple refresh statement. Please note that these refreshes can also be automated with Amazon SQS or Azure Event Grid; see the Snowflake documentation on refreshing external tables for details. For now, we’ll do it manually.
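The manual refresh is a single statement against the external table:

```sql
-- Rescan the stage and update the external table's file metadata.
ALTER EXTERNAL TABLE emprecords REFRESH;
```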
Snowflake will then rescan the stage and update the files for the external table.
As shown, Snowflake has picked up the new Parquet file and registered it for the external table. We’ll now head straight over to our consumer account to see what happens when we query the shared data again.
As shown above: two new records (rows 8 and 9) were added automatically to the secure view and, more importantly, they were also immediately accessible to the consumer.
A high level overview of what we have done so far is shown below. With just a few steps we were able to share data, residing in a data lake, in a controlled and secure way without any data movement.
Snowflake is all about choice: you can use Snowflake as your data lake, and/or you can use Snowflake to complement an existing data lake. In the latter case you can still be part of the Snowflake data cloud, where you can share or consume data, securely and quickly, with other parties such as suppliers, vendors or other business partners. It also allows you to enforce a more fine-grained security access layer over the data lake.
What’s next?
Currently Snowflake supports several file formats for (semi-)structured data: CSV (and other delimited files), JSON, ORC, Avro, Parquet and XML. But this is not the end of the road: support for Delta Lake and Apache Iceberg has recently been announced.