December 1, 2021

Optimising your Data Analytics Workflow as a Data Professional 

Chris Hui

As Data Analytics Professionals, we're always looking for new tools and technologies to streamline our workflow.

With datasets getting larger and messier, every bit of optimisation we can add to our workflow matters.

In this article, we'll cover the following topics and how you can get ahead as a Data Professional.

  • A Standard Operating Workflow as a Data Analytics Professional
  • Using Data Integrations (AWS S3) and Dask to streamline your data loads
  • Visualising your SQL on-the-go with SQL Cells and Viz Cells

Pre-Requisites to Understanding this Article

Readers should have a basic understanding of the technologies covered below (Pandas, Dask, AWS S3 and SQL) and how to import these into your Deepnote environment.

The Standard Operating Workflow as a Data Analytics Professional ⚙️

A Standard Operating Workflow (SOW) is composed of four (4) key phases:

1. Data Source Ingestion

2. Data Source Preparation

3. Data Source Analysis

4. Data Analysis Production Pipeline / Dashboard Visualisation

This varies based on technology and company maturity, but the majority of corporations will follow the standard data analysis workflow outlined above.

We'll focus on the first 3 steps, as the last step is generally a dashboard published in Tableau / Power BI via a Tableau or Power BI Server, which isn't easily accessible for all individuals.

All data analysis starts with sourcing the right data and information from a Data Source: generally a Data Warehouse, or a standalone extract in .json or .csv format.

One of the more common pain points you'll face (especially when it comes to shared notebooks) is referencing the same libraries and data sources.

This is especially the case for Quick and Dirty Data Analysis (QDA), where a CSV can be loaded in but the experiment isn't repeatable because the data source isn't loaded into the Notebook by default.

This can cause inefficiencies due to versioning and data quality concerns - so how can we get around this?

Well, let's introduce you to Integrations in Deepnote and how they can help streamline your workflow.

 

What is an Integration? 🤷

Security Standards and Data Sources are always a messy topic. We've all been in a scenario where we've accessed a Shared Notebook and happened to stumble across credentials stored in a file (or heaven forbid 🙅) as variables stored in the Notebook. This presents a security risk and is not necessarily representative of best practice. 📕

Integrations at Deepnote tackle this by providing a secure and standardised way of connecting to a Data Source: all your credentials are stored in an encrypted (AES-256-CBC), securely managed database and only need to be configured once before being re-used across multiple projects. In short, we now have a secure method for collaborating on a Shared Notebook without having to worry about whether everyone is connected to the right data source! Better yet, once your Notebook is shared with your team members, all the integrations are shared as well, further streamlining how they can interact with your analysis.

With this in mind, let's connect to a public S3 Bucket here where we can load in and access some Citibike data.

 

Setting up your AWS S3 Integration ⚙️:

In order to set up your S3 Integration, you're going to need the 3 things listed below:

  • AWS S3 Bucket Name
  • AWS Access Key ID
  • AWS Secret Access Key

If you're not sure how to create an S3 Bucket, follow this link.

If you're not sure how to access your AWS Credentials, follow this link.

Once you've got your bucket set up, please upload the following file available here to follow along with the rest of this article.

Note: If you want to share your AWS S3 Bucket with others, you'll need to ensure you set your Bucket Policy to public. If you're unsure how to do this, please reference the following link.
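If you'd prefer to do this programmatically, here's a minimal sketch of applying a public-read policy with boto3. The bucket name is a placeholder, and your bucket's "Block Public Access" settings also need to allow public policies for this to take effect.

```python
import json
import boto3

# Hypothetical bucket name; replace with your own.
bucket = "my-citibike-bucket"

# A minimal public-read policy: anyone may GET objects in the bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicRead",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }],
}

# Requires that "Block Public Access" is disabled for this bucket.
boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```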

Once you've added your S3 Bucket Name and associated keys, you should see the Integration listed under "Active Integrations" in the left-hand panel of your Deepnote instance, and you can access it just like any other file path on your local system.
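As a quick sanity check, you can list the mounted files from inside the notebook. The path below assumes an integration named s3-citibike; the exact mount path is shown on the integration card itself.

```python
import os

# Hypothetical mount path for the S3 integration; the real path is displayed
# under "Active Integrations" in the left-hand panel.
print(os.listdir("/datasets/s3-citibike"))
```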

Now how about Integrations and Dask? How do these work? 🤔 ⚙️

If you're familiar with Pandas DataFrames, one of the major pain points you'll run into is that for large datasets you either have to load the data in small chunks for processing, or risk loading the whole file at once and hitting an out-of-memory error when the size of the data outstrips the memory available.

With Dask, however, there's a solution at hand through the power of parallel processing.

We'll perform a mini experiment where we'll be able to see the difference in performance between loading data via Pandas and loading data via Dask.

Let's access the S3 Bucket we integrated with earlier and load up the Citibike Data as per below:
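A minimal sketch of the Pandas load might look like this, where the mount path and file name are placeholders for the Citibike extract you uploaded earlier:

```python
import time
import pandas as pd

# Hypothetical path to the uploaded Citibike extract inside the S3 integration.
path = "/datasets/s3-citibike/citibike_trips.csv"

start = time.time()
df = pd.read_csv(path)  # reads the entire file into memory
print(f"pandas: {len(df):,} rows in {time.time() - start:.2f}s")
```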

So it takes us ~1.5 seconds to load in 577K rows of data in Pandas. This isn't bad, but what if we were to load this via Dask?

Let's take a look.
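A comparable sketch with Dask, using the same placeholder path:

```python
import time
import dask.dataframe as dd

start = time.time()
# Builds a lazy, row-wise partitioned DataFrame; no data is read yet.
ddf = dd.read_csv("/datasets/s3-citibike/citibike_trips.csv")
print(f"Dask: 'loaded' in {time.time() - start:.3f}s")

# The actual read only happens when a computation is triggered,
# e.g. ddf.head() or ddf.compute().
```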

1.5 seconds (Pandas 🐼) vs 0.013 Seconds (Dask) 🤯

Not 2x, or 3x. But 118x Faster with Dask.

Now the question is: what makes Dask so much quicker than Pandas?

Well, it comes down to parallel processing and lazy evaluation.

When we load our data through the Pandas CSV reader, the whole file is read in one large chunk. Dask, on the other hand, creates a DataFrame that is row-wise partitioned, i.e. rows are grouped into partitions by their index. Dask is also lazy: dd.read_csv only builds a task graph, and the actual reading is deferred until a computation is triggered, which is why the load above appears to finish almost instantly.

This is further emphasised below, where we can see that our Dask DataFrame is composed of 4 distinct partitions, as opposed to our single Pandas DataFrame.
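You can confirm this yourself by inspecting the Dask DataFrame; partition counts depend on file size and block size, so treat the numbers here as illustrative:

```python
# Number of row-wise partitions Dask has split the file into.
print(ddf.npartitions)

# Each partition is itself a regular pandas DataFrame once computed.
first_partition = ddf.get_partition(0).compute()
print(type(first_partition), len(first_partition))
```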

Essentially, the number of partitions can be interpreted as the number of smaller Pandas dataframes that the Dask dataframe has been split into.

The more partitions we have, the more tasks we will need for each computation.

Now the key tip to remember here:

1. Understand the Size of your Data and the need for performance and scalability.

From a scalability and performance lens, Dask operates far quicker than Pandas. From a functionality lens, however, Pandas offers more than Dask. If your dataset is no more than a few Gigabytes (i.e. small data), Pandas is still sufficient from a functionality lens, but once your datasets get larger than this, Dask is what you'll want to be using.
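A practical corollary: if the data does fit comfortably in memory, you can always collapse the Dask DataFrame back into a single Pandas DataFrame to regain the full Pandas API, for example:

```python
# Materialise the lazy Dask DataFrame as one in-memory pandas DataFrame.
pdf = ddf.compute()
print(type(pdf))  # <class 'pandas.core.frame.DataFrame'>
```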

Visualising your SQL on-the-go with SQL Cells, Plotly and Viz Cells 📈

Now, during your work as a Data Analyst you're going to be dealing with Python and SQL regularly. You'll use both interchangeably for your analysis depending on the task at hand. One of the most common issues we face is visualising the results of our SQL queries. Options like Mode Analytics enable us to visualise our SQL, but what if we want to do this within our Deepnote instance?

Do we have to revert to extracting the data out and then passing this to Plotly / Seaborn?

Nope.

We can actually make use of SQL Cells and Viz Cells to write SQL directly and visualise the results without having to write any code.

Let's explore how below.

Note: For all standard Databases (BigQuery / Snowflake / MongoDB etc.), SQL Cells will default to the relevant SQL syntax. There is an exception, however, for CSV file references: when a CSV is referenced via a SQL Cell, the SQL syntax uses DuckDB notation.

 

SQL Cells on-the-go 🏃

SQL Cells are a special cell (🦠) type in Deepnote that act as an interface between your Integration (Snowflake, Redshift, BigQuery, MongoDB etc.) and your Deepnote instance, enabling you to write SQL directly and store the returned results in a Pandas DataFrame.

However, SQL Cells can also be applied to CSV files, meaning we can write SQL directly against a CSV and Deepnote treats it just like a standard Table Reference!

Let's refer back to our Citibike Dataset where we'll seek to understand differences in Customer Behaviour:

"Analysing Citibike user behaviour in New York (2013), do there exist any obvious differences in bike usage patterns when comparing Subscribers vs. non-Subscribers, viewed over a 24-hour window?" 🚴

We'll visualise this over two steps:

  1. We'll write the code to query our CSV via a SQL Cell (a sketch of this query is shown after this list)
  2. We'll use the Visualize Cell capability to bring our data to life so we can clearly see the trends
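For step 1, the query you'd place in the SQL Cell might look roughly like the one below. Since SQL Cells over CSVs use DuckDB notation, this sketch runs the same query through the duckdb Python library; the file path and the column names (starttime, usertype) are assumptions based on the public Citibike schema, so adjust them to your own extract.

```python
import duckdb

# Count trips per hour of day for each user type (Subscriber vs. Customer).
query = """
    SELECT
        usertype,
        EXTRACT(hour FROM CAST(starttime AS TIMESTAMP)) AS hour_of_day,
        COUNT(*) AS trips
    FROM read_csv_auto('/datasets/s3-citibike/citibike_trips.csv')
    GROUP BY usertype, hour_of_day
    ORDER BY usertype, hour_of_day
"""

# Like a SQL Cell, the result comes back as a DataFrame.
df_hourly = duckdb.query(query).to_df()
```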


Now that we've got our Query written, let's visualise this so the trends are easier to see.

Straight away, we can see that behaviour differs between our customer groups, with two sharp peaks for Subscribers (8AM / 6PM) which coincide with when people start and end their respective workdays. Casual Customers, on the other hand, show a much more gradual uptake across the day. Perhaps this will be more obvious if we compare weekdays vs. weekends?

Let's take another look.
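Under the same assumptions as before, a weekday vs. weekend breakdown could be sketched like this (DuckDB's dayofweek returns Sunday = 0 through Saturday = 6):

```python
import duckdb

# Aggregate trips by weekday vs. weekend for each user type.
query_daytype = """
    SELECT
        usertype,
        CASE WHEN dayofweek(CAST(starttime AS TIMESTAMP)) IN (0, 6)
             THEN 'Weekend' ELSE 'Weekday' END AS day_type,
        COUNT(*) AS trips
    FROM read_csv_auto('/datasets/s3-citibike/citibike_trips.csv')
    GROUP BY usertype, day_type
    ORDER BY usertype, day_type
"""

df_daytype = duckdb.query(query_daytype).to_df()
```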


It looks like our hypothesis that differences exist between Casual Customers and Subscribers holds true! It wasn't as apparent when we compared this on an hourly basis, but when we analyse it on an aggregated weekday vs. weekend basis, we can clearly see that Casual Customers are far more active on the weekend than our Subscribers!

This brings us to our second and final tip:

2. You needn't use Matplotlib / Seaborn if you want simple, interpretable visuals, as Deepnote's Viz Cells enable more interactive exploration.

From a data viz completeness lens, Seaborn and Altair offer many more nuanced graphics. From a no-code simplicity standpoint, however, Deepnote's Viz Cells have excellent graphical capabilities that enable any citizen data scientist to quickly mock up visuals without any Matplotlib formatting or tuning.

 

Recap on what we've covered 🎉

We've covered a number of concepts in this article, so let's do a quick recap of what you've now learnt:

 

  1. Standard Operating Workflows - You've covered what a Standard Operating Workflow looks like as an Analytics Professional.
  2. AWS S3 Integrations - You've learnt how to connect to an S3 Integration and access it in Deepnote.
  3. Dask vs. Pandas - You've learnt the differences between Dask and Pandas and when to use each.
  4. Visualisation - You've learnt how you can use Viz Cells alongside SQL Cells to reach quick insights.