Sankey generator webapp with python, pandas, plotly and Streamlit

Jelena Ristic
5 min read · Mar 26, 2022

Abstract flows in gradient colour on black background
lilzidesigns, Unsplash

Ever since I discovered python and the plotly library, I have been in love with sankey diagrams. If you are not familiar with this type of data representation, it is a diagram that represents flows within a given system. The flows look like ribbons, and the width of each ribbon is proportional to its value.

This type of data visualisation was named after Irish Captain Matthew Henry Phineas Riall Sankey, who used it in 1898 to show the energy efficiency of a steam engine (source: Wikipedia). One of the earlier uses of this type of flow representation is credited to the French civil engineer and infographics pioneer Charles Minard who, in 1869, represented Napoleon’s invasion of Russia in a beautiful flow map combining topography, weather data, time span, and the movements and numbers of French soldiers. This diagram is often cited as an example of beauty, functionality, truthfulness and insight.

Charles Minard’s flow map of Napoleon’s invasion of Russia, 1869 / Wikipedia, public domain

The plotly library offers an interactive way to plot a sankey with your data, allowing you to rearrange the flows and aggregate elements (or nodes) before downloading it as a png. Here is an example of the output using data from a thread of tweets and representing actors and their respective screen times in Quentin Tarantino’s films:

Sankey diagram of actors and their respective screen times in Quentin Tarantino’s films / Jelena Ristic, 2022

The actors that feature in more than one film have coloured flows and the selection tool allows you to aggregate the nodes and display the new visualisation, for instance per actor:

In my opinion one sankey is worth a thousand bar charts. After playing for a while with python, pandas and plotly in a Jupyter notebook, I decided to use Streamlit to create and publish my own webapp, so that I could generate sankey diagrams whenever needed, even without access to my own computer, for example at work.

The process is quite straightforward, and success ultimately depends on how you format your data.

Basic sankey diagram

According to the plotly documentation (code shown below), a basic sankey diagram requires two sets of data: nodes and links.

import plotly.graph_objects as go

fig = go.Figure(data=[go.Sankey(
    node = dict(
        pad = 15,
        thickness = 20,
        line = dict(color = "black", width = 0.5),
        label = ["A1", "A2", "B1", "B2", "C1", "C2"],
        color = "blue"
    ),
    link = dict(
        source = [0, 1, 0, 2, 3, 3], # indices correspond to labels, e.g. A1, A2, A1, B1, ...
        target = [2, 3, 3, 4, 4, 5],
        value = [8, 4, 2, 8, 4, 2]
    ))])

fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()

The output:

Basic sankey diagram generated by the code above. Source: plotly
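
The source and target lists hold positions in the label list rather than names, which is easy to misread. As a minimal sketch in plain python, reusing the values from the plotly snippet above, each link can be spelled out by name:

```python
# Values copied from the basic plotly example.
labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
source = [0, 1, 0, 2, 3, 3]
target = [2, 3, 3, 4, 4, 5]
value = [8, 4, 2, 8, 4, 2]

# Position i describes one ribbon: labels[source[i]] -> labels[target[i]].
readable_links = [
    f"{labels[s]} -> {labels[t]}: {v}"
    for s, t, v in zip(source, target, value)
]

for line in readable_links:
    print(line)
```

The first line printed is "A1 -> B1: 8", which is the widest ribbon leaving A1 in the diagram.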

Setting up your data files

You will need 2 CSV files: one for nodes and one for links.

The NODES file contains 3 columns with the following headers: “ID”, “label” and “color”. It lists all the elements (nodes) that you intend to link.

Screenshot of the node csv file with headers used for the Tarantino sankey diagram
Example of my node csv file for the Tarantino sankey above

The LINKS file lists all the links between the elements in your NODES file. It contains 4 columns with the following headers: “source”, “target”, “value” and “link color”. The “source” column holds the IDs of the nodes that will be shown on the left, and the “target” column the IDs of the nodes that will be shown on the right. The “value” column sets the width of each link and the “link color” column its colour. In the example below, the “source” column lists the actors and the “target” column the films. The “value” is the number of seconds spent on screen, and the “link color” colours the flow ribbon between the source and target nodes.

Screenshot of the links csv file for the Tarantino sankey diagram shown earlier
Example of my links csv file for the Tarantino sankey above
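
To show how two such files might turn into the node and link inputs that plotly expects, here is a minimal sketch with pandas. The inline CSV text, the sample rows and the helper name build_sankey_inputs are made up for illustration; only the column headers come from the description above. The sketch assumes that the “source” and “target” columns hold values from the “ID” column, and converts them to row positions because plotly reads source and target as positions in the label list.

```python
import io

import pandas as pd

# Hypothetical sample data; with real files use pd.read_csv("your_file.csv").
NODES_CSV = """ID,label,color
0,Actor A,#1f77b4
1,Actor B,#ff7f0e
2,Film X,#7f7f7f
3,Film Y,#7f7f7f
"""

# rgba colours contain commas, so they must be quoted inside a CSV file.
LINKS_CSV = """source,target,value,link color
0,2,120,"rgba(31, 119, 180, 0.5)"
0,3,45,"rgba(31, 119, 180, 0.5)"
1,3,300,"rgba(255, 127, 14, 0.5)"
"""


def build_sankey_inputs(nodes: pd.DataFrame, links: pd.DataFrame):
    """Return (node, link) dicts shaped like the arguments of go.Sankey."""
    # Map each node ID to its row position in the NODES file.
    position = {node_id: i for i, node_id in enumerate(nodes["ID"])}
    node = dict(label=nodes["label"].tolist(), color=nodes["color"].tolist())
    link = dict(
        source=links["source"].map(position).tolist(),
        target=links["target"].map(position).tolist(),
        value=links["value"].tolist(),
        color=links["link color"].tolist(),
    )
    return node, link


nodes = pd.read_csv(io.StringIO(NODES_CSV))
links = pd.read_csv(io.StringIO(LINKS_CSV))
node, link = build_sankey_inputs(nodes, links)
print(link["source"], link["target"], link["value"])
```

The two dicts can then be passed on as go.Sankey(node=node, link=link), in the same way as in the basic example.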

A few tips to organise your data

  1. The LINKS “value” only takes integers and floats.
  2. Dismiss weak or minor connections between nodes as a sankey diagram can quickly become crowded. Too many links may make it difficult to read.
  3. Choose your colours wisely! If you still need to represent complex relationships and expect a lot of intertwining, opt for rgba colours for the “link color” column and play with the alpha value to add transparency. The overlapping and crossing flows will be much easier to read.
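
The three tips above can be sketched in a few lines of pandas. The helper names hex_to_rgba and clean_links, the min_value threshold and the sample rows are hypothetical and not taken from the webapp; the column names are the LINKS headers described earlier.

```python
import pandas as pd


def hex_to_rgba(hex_color: str, alpha: float = 0.5) -> str:
    """Turn '#rrggbb' into an 'rgba(r, g, b, a)' string (tip 3)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({r}, {g}, {b}, {alpha})"


def clean_links(links: pd.DataFrame, min_value: float = 0) -> pd.DataFrame:
    """Keep rows with a numeric value (tip 1) of at least min_value (tip 2)."""
    out = links.copy()
    # Anything that is not a number becomes NaN and is dropped below.
    out["value"] = pd.to_numeric(out["value"], errors="coerce")
    out = out.dropna(subset=["value"])
    return out[out["value"] >= min_value].reset_index(drop=True)


# Made-up rows: one non-numeric value and one very weak link.
links = pd.DataFrame({
    "source": [0, 0, 1, 1],
    "target": [2, 3, 2, 3],
    "value": ["120", "n/a", "3", "45.5"],
    "link color": ["#1f77b4", "#1f77b4", "#ff7f0e", "#ff7f0e"],
})

cleaned = clean_links(links, min_value=10)
cleaned["link color"] = cleaned["link color"].map(hex_to_rgba)
print(cleaned)
```

With these sample rows, the “n/a” row and the link of value 3 are dropped, and the two remaining links get half-transparent colours. The right threshold depends entirely on your data, so treat min_value as something to experiment with.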

Streamlit-powered python sankey webapp

I opted for Streamlit because it offers a quick way to publish a python-based webapp and I didn’t want to spend more time on front-end development. Streamlit offers it all on a plate for you to pick and choose. The documentation is straightforward, and I easily built an interface that allows me to upload my nodes and links files (read with pandas), display the data, and customise the sankey (orientation, background colour, font colour and size, width and height of the diagram, etc.). Publishing is easy too: I linked Streamlit to my GitHub profile and published the app from its GitHub repository. Any changes I make on GitHub are reflected directly in the app, which is quite neat.

Streamlit sankey diagram generator interface / Jelena Ristic, 2022

Click here to go to the webapp. Click here for the source code, and here is another article on the topic:

Thanks for reading and happy sankey-ing!!

Jelena Ristic

Creative t(h)inker, former museum curator, and python enthusiast delving into the wondrous realm of digital humanities. Reach me at hello@jelenaristic.info