Sharing COVID-19 data with Snowflake in minutes

Harke Harkema
6 min readFeb 8, 2021

--

Photo by Ines Álvarez Fdez on Unsplash

Every organisation shares data with suppliers, business partners, regulators, customers, or other business units. But traditional ways of sharing data are costly and painful. Different data formats, delivery methods, multiple copies, and the associated security aspects all contribute to increased cost and risk.

Snowflake’s unique architecture powers the Secure Data Sharing technology. Snowflake is the only platform that allows multiple parties to collaborate on a single copy of the data. No copies, no moving data, no ETL, no delays. All protected with a single security model.

In this blog, I will demonstrate how data can be easily shared with others via Snowflake. As an example, I will use a public Covid dataset from The National Institute for Public Health and Environment in the Netherlands ‘RIVM’. This dataset is refreshed daily and published on the RIVM website.

The Covid dataset we use here in this blog is just an example, however Snowflake is actually used to distribute Covid data for instance in California or via Starschema.

Getting the data into Snowflake

Before we can share any data it needs to be in a Snowflake account. An easy and convenient way to load data is by using a stage. A stage is a cloud location such as Amazon S3, Microsoft Azure, or Google Cloud Storage. In this example, we use an AWS lambda function with a few lines of Python to fetch the dataset from the RIVM website and copy it into an S3 stage.

Please note: variables are denoted as ${variable_name}, replace them according to your needs.

Once the Lambda function has been executed successfully the dataset ‘COVID-19_aantallen_gemeente_per_dag.csv’ is stored in S3. We proceed in Snowflake. First, we create a warehouse, a database, and a schema.

Now we will define the stage so that data can be read into Snowflake.

Please note that with the second query Snowflake queries the data in S3 directly. This enables use cases where Snowflake acts as a query engine on top of data lakes in S3, Azure, or GCP. However that scenario was not the intention of this blog, so we will proceed with loading the data into Snowflake. We need a table and a file format for that. File formats are used to specify the properties of the incoming data.

Now we’re reading to load the data into Snowflake

Now we have the data in Snowflake we can share it.

Publishing the dataset to an exchange

At the moment data exchanges are a private preview feature. More information can be found here

Before we can list datasets on the exchange we need to set up a Provider Profile. This can be done via the new Snowflake User interface below the menu item ‘Manage’. In the Provider Profile, you can enter things like a logo, company name, contact email, and other properties related to your organisation. For now, we assume that this profile has been set up.

We navigate to our demo exchange called ‘SE Sandbox’

No listings yet

We haven’t published any data sets yet, so ‘My Listings’ will show nothing yet. Clicking ‘New Listing’ will present us with a wizard that will guide us through the process. The first step is to select the ‘Listing Type’. The Standard listing type is ok for now. Then we need to select which tables, view of functions we would like to share. We select the COVID_NL_MUNI_DAY database.

Step 1: Data

The next step in the wizard is ‘Basic information’ where we have the option to describe the dataset.

Step 2: Basic information

When we click ‘Next’ we’re taken to the details section. Here we can specify more details about the dataset. For instance sample queries. We click ‘Add’ to add a sample query.

Step 3a: usage example

Once we have entered the title and a SQL query we hit ‘Add’ again to take us back to the Details section of the wizard.

Step 3: Details

It is possible to add more Usage Examples but for now, we’re ok and click ‘Next’

The last section of the wizard is about Region Availability. Snowflake can replicate the data for the exchange automatically across clouds and regions, but since my two Snowflake accounts both reside in the EU (Frankfurt) region, there is no need to replicate the data across the globe.

Step 4: Region Availability

We press ‘Publish’ to publish our dataset on the exchange.

The Covid-19 listing

Obtaining the Covid dataset from the exchange

The dataset is now listed on the exchange and is ready to be consumed by other Snowflake accounts that have access to the exchange. For the next step, we have switched to another Snowflake account. When we navigate to the ‘SE Sandbox’ exchange we will find our newly created listing.

The Covid-19 listing

We click on ‘Get Data’ and are prompted to give the database a name in the share and specify which role(s) also have access to this database. Please note that these settings are only applied at the consumer side.

When we accept the terms and press ‘Create Database’ we’re good to go. We’re now able to query the shared Covid data directly. In order to do so, we open a new worksheet, set the context (Database, Role and Warehouse) and enter the sample query.

Conclusion

We covered a lot of ground today. First we fetched the data with an AWS lambda function from a public source, then we copied the data into Snowflake. Next we published the data on a data exchange, after that we switched to another account and finally we could query the data.

The benefits of sharing data with Snowflake can be categorised into three topics. The first one is Cost Reduction, a Snowflake data share is always in the right format and up to date, so no data transfers or API calls have to be made. The second benefit is Unlock New Business Insights, with Snowflake data sharing an organisation can join external data with internal data leading to new insights. And a third benefit is Strengthened Business Relationships, securely sharing data with partners will improve collaboration in your business eco system.

If you want to learn more about Snowflake or data sharing please go to snowflake.com or contact your local snowflake representatives.

--

--