Sankey generator webapp with python, pandas, plotly and Streamlit
Ever since I discovered python and the plotly library, I have been in love with sankey diagrams. If you are not familiar with this type of data representation, it is a diagram that represents flows within a given system. The flows look like ribbons, and the width of each ribbon is proportional to its value.
This type of data visualisation was named after Irish Captain Matthew Henry Phineas Riall Sankey, who used it in 1898 to show the energy efficiency of a steam engine (source: Wikipedia). One of the earlier uses of this type of flow representation is credited to the French civil engineer and infographics pioneer Charles Minard who, in 1869, represented Napoleon’s invasion of Russia in a beautiful flow map combining topography, weather data, time span, and movements and number of French soldiers. This diagram is often quoted as an example of beauty, functionality, truthfulness and insight.
The plotly library offers an interactive way to plot a sankey with your data, allowing you to rearrange the flows and aggregate elements (or nodes) before downloading it as a png. Here is an example of the output using data from a thread of tweets and representing actors and their respective screen times in Quentin Tarantino’s films:
The actors that feature in more than one film have coloured flows and the selection tool allows you to aggregate the nodes and display the new visualisation, for instance per actor:
In my opinion one sankey is worth a thousand bar charts. After playing for a while with python, pandas and plotly in a Jupyter notebook, I decided to use Streamlit to create and publish my own webapp, so that I could generate sankey diagrams whenever I needed them, even without access to my own computer, for example at work.
Building the app is quite straightforward; success ultimately lies in how you format your data.
Basic sankey diagram
According to the plotly documentation (code shown below), a basic sankey diagram requires two sets of data: nodes and links.
import plotly.graph_objects as go
fig = go.Figure(data=[go.Sankey(
    node = dict(
        pad = 15,
        thickness = 20,
        line = dict(color = "black", width = 0.5),
        label = ["A1", "A2", "B1", "B2", "C1", "C2"],
        color = "blue"
    ),
    link = dict(
        source = [0, 1, 0, 2, 3, 3], # indices correspond to labels, eg A1, A2, A1, B1, ...
        target = [2, 3, 3, 4, 4, 5],
        value = [8, 4, 2, 8, 4, 2]
    ))])
fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
The output:
Setting up your data files
You will need 2 CSV files: one for nodes and one for links.
The NODES file contains 3 columns with the following headers: “ID”, “label” and “color”. It lists all the elements (nodes) that you intend to link.
The LINKS file contains all the links between the elements listed in your NODES file. It has 4 columns with the following headers: “source”, “target”, “value”, “link color”. In the “source” column you put the IDs of the nodes that will be shown on the left, and in the “target” column the IDs of the nodes that will be shown on the right. The “value” column sets the width of the ribbon between the two nodes and the “link color” column sets its colour. In the example below, the “source” column lists the actors and the “target” column the films. The “value” is the number of seconds spent on screen, and the “link color” colours the flow ribbon between the source and target nodes.
A few tips to organise your data
- The LINKS “value” column only accepts numbers (integers or floats).
- Leave out weak or minor connections between nodes: a sankey diagram can quickly become crowded, and too many links make it difficult to read.
- Choose your colours wisely! If you still need to represent complex relationships and expect a lot of intertwining, opt for rgba colours for the “link color” column and play with the alpha value to add transparency. The overlapping and crossing flows will be much easier to read.
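The three tips above can be sketched as a small clean-up step applied to the LINKS data before plotting. This is only an illustration: the function names hex_to_rgba and tidy_links, the threshold of 60 and the default alpha of 0.4 are my own choices, and the sketch assumes the “link color” column holds hex colours such as #1f77b4.

```python
import pandas as pd


def hex_to_rgba(hex_color, alpha=0.4):
    """Convert '#RRGGBB' into an 'rgba(r,g,b,a)' string with transparency."""
    hex_color = hex_color.lstrip("#")
    if len(hex_color) != 6:
        raise ValueError("expected a colour like '#1f77b4'")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({r},{g},{b},{alpha})"


def tidy_links(links, min_value=0.0, alpha=0.4):
    """Return a cleaned copy of the LINKS table."""
    links = links.copy()
    # Tip 1: "value" must be numeric; anything else becomes NaN and is dropped.
    links["value"] = pd.to_numeric(links["value"], errors="coerce")
    links = links.dropna(subset=["value"])
    # Tip 2: leave out weak links so the diagram stays readable.
    links = links[links["value"] >= min_value].copy()
    # Tip 3: make the ribbons semi-transparent so crossings are easier to follow.
    links["link color"] = links["link color"].map(lambda c: hex_to_rgba(c, alpha))
    return links.reset_index(drop=True)


# Hypothetical rows: one valid link, one non-numeric value, one weak link.
raw = pd.DataFrame({
    "source": [0, 0, 1],
    "target": [2, 3, 3],
    "value": ["1200", "n/a", "5"],
    "link color": ["#1f77b4", "#1f77b4", "#ff7f0e"],
})
clean = tidy_links(raw, min_value=60)
```

Only the first row survives here: the second has no usable value and the third falls below the threshold.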
Streamlit-powered python sankey webapp
I opted for Streamlit as it offers a quick way to publish a python-based webapp and I didn’t want to spend more time on front-end development. Streamlit offers it all on a plate for you to pick and choose. The documentation is straightforward, and I easily built an interface that allows me to upload my nodes and links files (with pandas reading the csv files), display the data, and customise the sankey (orientation, background colour, font colour and size, width and height of the diagram, etc.). Publishing is easy too. I linked Streamlit to my GitHub profile and published the app from its GitHub repository, so any change I make on GitHub is reflected directly in the app, and that is quite neat.
Click here to go to the webapp. Click here for the source code, and here is another article on the topic:
Thanks for reading and happy sankey-ing!!