If something doesn't work, you can report a problem.
This notebook provides you with basic tools for a data-driven inspection of your collection using data visualisation. Plotting several charts should help you identify the nature of the content stored in your collection, supporting the analysis of your corpus or even the planning of amendments to the content you collected. Moreover, this notebook aims to foster data/source criticism by raising awareness of the origin of Impresso metadata and the quality of digitised materials.
These visualisations can be useful for revealing trends or aspects of your collection that you did not realise before or for presenting your collection to different audiences.
In this notebook, you will learn how to:
Quick tip: Code in Jupyter notebooks must be run in sequence. If something does not work, double-check if you forgot to run a previous cell.
If you work with Google Colab, there are two ways to upload your csv file, using the following code:
from google.colab import files
If your csv file is stored on your computer:
uploaded = files.upload()
If your csv file is stored online:
url = "https://example.com/mydata.csv"
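Both routes end in a Pandas dataframe. The sketch below is an illustration, not part of the original notebook. It assumes that files.upload() returns a dictionary mapping each uploaded file name to its raw bytes. The helper name load_first_upload and the toy rows are hypothetical.

```python
import io

import pandas as pd


def load_first_upload(uploaded):
    """Load the first uploaded csv file into a dataframe.

    `uploaded` is assumed to look like the value returned by
    google.colab's files.upload(): {file_name: file_bytes}.
    """
    file_name, file_bytes = next(iter(uploaded.items()))
    return pd.read_csv(io.BytesIO(file_bytes))


# Stand-in for a real upload, so the sketch also runs outside Colab (made-up rows)
fake_upload = {"mydata.csv": b"mediaUid,publicationDate\nGDL,1900-01-01\nJDG,1901-05-12\n"}
df_from_upload = load_first_upload(fake_upload)

# For a file stored online, pandas accepts the URL directly:
# df_from_url = pd.read_csv(url)
```

In Colab you would pass the real value of uploaded instead of fake_upload.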
We use the Pandas function read_csv to load the csv file as a dataframe.
# Import .csv file retrieved from Web App and load the file as Pandas dataframe
import pandas as pd
dataframe = pd.read_csv('file.csv', delimiter=',', on_bad_lines='skip')
Now you can inspect your Pandas dataframe by running dataframe.info().
# This shows how many columns and entries you have, the names of the columns and the type of values in each column.
dataframe.info()
You can also preview the first rows of your dataframe by running dataframe.head()
# Try here
To create charts, we are going to use a library called Plotly. Let's import it first.
import plotly.express as px
from plotly.colors import sequential, qualitative
You can now start exploring your collection by plotting charts.
Below, we demonstrate how to plot a bar chart showing how many articles you have in your collection by newspaper. Let's do it step-by-step:
# Count values in column 'mediaUid' (the newspaper identifier) of dataframe
newspaper_counts = dataframe['mediaUid'].value_counts().reset_index()
# Visualise your counts
newspaper_counts
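To see what this counting step produces, here is a minimal sketch on made-up data. The name of the counts column produced by value_counts().reset_index() depends on the Pandas version, so this variant names both columns explicitly. That is an assumption about what is convenient, not a requirement of the notebook.

```python
import pandas as pd

# Toy data standing in for a collection export (values are made up)
toy = pd.DataFrame({"mediaUid": ["GDL", "GDL", "JDG", "GDL", "JDG", "IMP"]})

# Naming both columns explicitly keeps the result stable across pandas versions
toy_counts = (
    toy["mediaUid"]
    .value_counts()
    .rename_axis("mediaUid")
    .reset_index(name="count")
)
print(toy_counts)
```

The result has one row per newspaper, sorted from the most to the least frequent.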
We now plot the dataframe newspaper_counts. We set the x-axis to column 'mediaUid' and the y-axis to column 'count'.
# Plot bar chart using Plotly for dataframe 'newspaper_counts'
fig = px.bar(newspaper_counts, x='mediaUid', y='count', title='Count of Articles by Newspaper')
# Define bar colour as teal
fig.update_traces(marker_color='teal')
# Show figure
fig.show()
You can also plot a pie chart with the exact same values.
# Pie charts will have 'names' and 'values' instead of axis x and y
fig = px.pie(newspaper_counts, names='mediaUid', values='count', title='Count of Articles by Newspaper', color_discrete_sequence=sequential.Teal)
# Layout options to improve readability: include text inside chart. Also include %
fig.update_traces(textposition='inside', textinfo='percent+label')
# Show figure
fig.show()
Question: which chart do you think is more useful or more efficient in communicating the data? What are the pros and cons of each of them?
We can plot a treemap to visualise the number of articles by newspaper along with their content provider.
# Count values based on columns 'providerCode' and 'mediaUid'
np_provider_counts = dataframe[['providerCode', 'mediaUid']].value_counts().reset_index()
# Take a look at the dataframe with your counts.
np_provider_counts
It is worth noticing that the dataframe looks different. You have now counted values based on two columns, rather than a single column.
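Counting on two columns returns one row per combination of values. As a small sketch with made-up provider and newspaper codes, the same table can be built with groupby, which some readers find easier to follow:

```python
import pandas as pd

# Made-up codes: provider P1 holds newspaper A, provider P2 holds B and C
toy = pd.DataFrame({
    "providerCode": ["P1", "P1", "P2", "P2", "P2"],
    "mediaUid": ["A", "A", "B", "B", "C"],
})

# One row per (provider, newspaper) pair with the number of items in that pair
pair_counts = (
    toy.groupby(["providerCode", "mediaUid"])
    .size()
    .reset_index(name="count")
)
print(pair_counts)
```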
# Plot a treemap
fig = px.treemap(np_provider_counts, path=[px.Constant("all"),'providerCode', 'mediaUid'], values='count',
color='count', color_continuous_scale=px.colors.sequential.Teal,
title='Count of articles by Newspaper and their Content Providers')
fig.show()
You can see the temporal distribution of articles for a given newspaper using a histogram. In the example below, we look at values for the newspaper Gazette de Lausanne (GDL).
Bar charts are used to represent categorical data. Histograms, in contrast, represent continuous data such as ages, daily temperatures, etc. For dates, histograms can be particularly useful for time-series analysis, to help identify trends over long periods. Be aware that Gazette de Lausanne will only show up in the chart below if your collection contains at least one content item from it.
# Create a dataframe with values for Gazette de Lausanne (GDL)
df_GDL = dataframe[dataframe['mediaUid'] == 'GDL'].copy()
# Define histogram axis
fig = px.histogram(df_GDL, x='publicationDate', title='Distribution of Articles over time for Gazette de Lausanne')
# Data is aggregated in bins based on month (size in milliseconds)
fig.update_traces(xbins=dict(size=30*24*60*60*1000)) # 1 month in ms | To aggregate by day, use 1 day = 24*60*60*1000
# Define colour
fig.update_traces(marker_color='Teal')
# Show figure
fig.show()
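The 30-day bin size used above only approximates a month, so bins slowly drift relative to calendar months. If exact calendar months matter, one option is to count per month in Pandas before plotting. This is a sketch on made-up dates, and it assumes that publicationDate holds date strings that Pandas can parse.

```python
import pandas as pd

# Made-up publication dates
toy = pd.DataFrame({"publicationDate": ["1900-01-03", "1900-01-28", "1900-02-14", "1900-04-01"]})

# Parse the strings as dates; unparseable values become NaT instead of raising
dates = pd.to_datetime(toy["publicationDate"], errors="coerce")

# Group by calendar month and count the content items in each
monthly_counts = dates.dt.to_period("M").value_counts().sort_index()
print(monthly_counts)
```

Months without any item (March 1900 here) do not appear in the result, which is worth remembering when reading a chart built from it.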
You can also generate a list of histograms, one for each newspaper in your collection by writing a for loop.
# Get the earliest and latest dates in dataset
earliest_date = dataframe['publicationDate'].min()
latest_date = dataframe['publicationDate'].max()
# Get unique list of newspapers
newspapers = dataframe['mediaUid'].unique()
# Create separate histograms for each newspaper using a 'for loop'
for newspaper in newspapers:
    df_newspaper = dataframe[dataframe['mediaUid'] == newspaper].copy()
    fig = px.histogram(df_newspaper, x='publicationDate', title=f'Distribution of Articles over Time for {newspaper}')
    # Data is aggregated in bins based on month (size in milliseconds)
    fig.update_traces(xbins=dict(size=30*24*60*60*1000)) # 1 month in ms
    # Always set start of x-axis based on the earliest date in dataset and end based on the latest.
    fig.update_layout(
        xaxis_range=[earliest_date, latest_date],
        # Always set minimum value of y-axis as 0 and max value as 60 (customise as necessary)
        yaxis_range=[0, 60]
    )
    # Define colour
    fig.update_traces(marker_color='Teal')
    # Show figure
    fig.show()
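The fixed upper limit of 60 on the y-axis may cut off bars in larger collections. One possible way to choose a shared limit from the data is sketched below on made-up rows. The 10% headroom and the variable name shared_y_max are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up rows for two newspapers, A and B
toy = pd.DataFrame({
    "mediaUid": ["A", "A", "A", "B", "B"],
    "publicationDate": ["1900-01-03", "1900-01-28", "1900-02-14", "1900-01-05", "1900-03-09"],
})
toy["month"] = pd.to_datetime(toy["publicationDate"]).dt.to_period("M")

# Number of items per newspaper and calendar month
per_month = toy.groupby(["mediaUid", "month"]).size()

# Largest monthly count across all newspapers, plus some headroom
shared_y_max = int(per_month.max() * 1.1) + 1
```

The value could then replace 60 in yaxis_range. Because the histograms use 30-day bins rather than calendar months, the tallest bar may differ slightly from this estimate.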
To facilitate comparison, you can display the histograms in dashboard format.
# Import IPython display and plotly.graph_objects modules to create a dashboard
from IPython.display import display, HTML
import plotly.graph_objects as go
# Plot histograms in a dashboard, side by side, showing the number of articles over time by newspaper
dashboard_html = """
<html>
<head>
<style>
.container {{
display: flex;
flex-wrap: wrap;
gap: 20px;
justify-content: center; /* Center the items in the container */
}}
.chart-item {{
width: 45%; /* Adjust as needed to have two columns with gap */
box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
transition: 0.3s;
padding: 10px;
box-sizing: border-box; /* Include padding and border in the element's total width and height */
}}
.chart-item:hover {{
box-shadow: 0 8px 16px 0 rgba(0,0,0,0.2);
}}
h2 {{
text-align: center;
width: 100%; /* Make sure title takes full width */
}}
</style>
</head>
<body>
<h2>Data Overview Dashboard</h2>
<div class="container">
{newspaper_histograms}
</div>
</body>
</html>
"""
# Re-generate plots and capture their HTML output
# Histograms for each newspaper
newspaper_histograms_html = ""
newspapers = dataframe['mediaUid'].unique()
min_date = dataframe['publicationDate'].min()
max_date = dataframe['publicationDate'].max()
for newspaper in newspapers:
    newspaper_df = dataframe[dataframe['mediaUid'] == newspaper]
    # Always set start of x-axis based on the earliest date in dataset and end based on the latest.
    fig = px.histogram(newspaper_df, x='publicationDate',
                       labels={'x':'Date', 'y':'Number of Articles'},
                       title=f'Count of content items over time for {newspaper}',
                       range_x=[min_date, max_date])
    fig.update_traces(xbins=dict(size=30*24*60*60*1000)) # 1 month in ms
    fig.update_traces(marker_color='Teal')
    fig.update_layout(yaxis_range=[0, 60])
    newspaper_histograms_html += f'<div class="chart-item">{fig.to_html(full_html=False, include_plotlyjs="cdn")}</div>'
# Populate the dashboard HTML template
full_dashboard_html = dashboard_html.format(
newspaper_histograms=newspaper_histograms_html,
)
# Display the dashboard
display(HTML(full_dashboard_html))
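The doubled curly braces in the CSS part of the template are there because the template is filled in with str.format. A short sketch with a reduced, hypothetical template shows the difference between the two kinds of braces:

```python
# Doubled braces are literal braces; single braces are placeholders
template = """
<style>
  .chart-item {{ width: 45%; }}
</style>
<div class="container">
{charts}
</div>
"""

page = template.format(charts='<div class="chart-item">chart goes here</div>')
print(page)
```

After formatting, the CSS rule has ordinary single braces and the placeholder has been replaced by the chart markup.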
To visualise the temporal distribution of articles for the whole collection, you can plot a histogram as follows:
Tip: by clicking on the name of the newspapers in the right-hand side column, you can select specific newspapers you want to visualise
# Plot chart to see distribution of articles over time
earliest_latest_dated_per_np = dataframe.groupby('mediaUid').agg({'publicationDate': ['min', 'max']})
earliest_latest_dated_per_np.columns = ['earliest_date', 'latest_date']
# varying the nbins value provides more or less precision
nbins_value = 366
# stacked histogram
fig = px.histogram(dataframe, x="publicationDate", color='mediaUid', title=f'Distribution of Articles over Time', nbins=nbins_value)
fig.show()
In Part 4 of this notebook, you will plot several charts based on your collection's available metadata, such as the countries where articles were published, languages, article sizes and the newspaper page on which each article appears.
At the end of this session, you will be able to load your charts into a dashboard.
# count values on column 'countryCode' and plot a treemap
country_counts = dataframe['countryCode'].value_counts().reset_index()
country_counts.columns = ['countryCode', 'count']
fig = px.treemap(country_counts, path=['countryCode'], values='count',
color='count', color_continuous_scale='Teal',
title='Distribution of Countries')
fig.show()
# count values on column 'providerCode' and plot a treemap
content_provider_counts = dataframe['providerCode'].value_counts().reset_index()
content_provider_counts.columns = ['providerCode', 'count']
fig = px.treemap(content_provider_counts, path=['providerCode'], values='count',
color='count', color_continuous_scale='Teal',
title='Distribution of Content Providers')
fig.show()
# count values on column 'languageCode' and plot a treemap
language_counts = dataframe['languageCode'].value_counts().reset_index()
language_counts.columns = ['languageCode', 'count']
fig = px.treemap(language_counts, path=['languageCode'], values='count',
color='count', color_continuous_scale='Teal',
title='Distribution of Languages')
fig.show()
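The three treemap cells above repeat the same counting step for different columns. As a sketch, the step could be wrapped in a small helper. The function name count_values and the toy rows are hypothetical and not part of the notebook.

```python
import pandas as pd


def count_values(df, column):
    """Return a two-column dataframe: the distinct values of `column` and their counts."""
    return (
        df[column]
        .value_counts()
        .rename_axis(column)
        .reset_index(name="count")
    )


# Made-up rows for illustration
toy = pd.DataFrame({
    "countryCode": ["CH", "CH", "LU"],
    "languageCode": ["fr", "de", "fr"],
})
counts_by_column = {col: count_values(toy, col) for col in ["countryCode", "languageCode"]}
```

Each resulting dataframe can be passed to px.treemap with path=[column] and values='count', as in the cells above.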
# count values in column 'type' and plot pie chart showing distribution
type_counts = dataframe['type'].value_counts().reset_index()
type_counts.columns = ['type', 'count']
fig = px.pie(type_counts, values='count', names='type',
color_discrete_sequence=px.colors.sequential.Teal,
title='Distribution of Article Types')
fig.show()
Note: If you are running this notebook with the default dataset, this plot will only show a single category "article", and is therefore not very informative. However, it could become more meaningful when used with another dataset.
# Plot chart showing distribution of values in column 'transcriptLength' and mark the mean and the median
fig = px.histogram(dataframe, x='transcriptLength', title='Distribution of Article Sizes', color_discrete_sequence=['Teal'])
fig.add_vline(x=dataframe['transcriptLength'].mean(), line_dash="dash", line_color="red", annotation_text=f"Mean: {dataframe['transcriptLength'].mean():.2f}", annotation_position="top right")
fig.add_vline(x=dataframe['transcriptLength'].median(), line_dash="dash", line_color="black", annotation_text=f"Median: {dataframe['transcriptLength'].median():.2f}", annotation_position="top left")
fig.show()
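Marking both the mean and the median is useful because article lengths are often skewed: a few very long items pull the mean upwards while the median stays close to the typical article. A minimal sketch with made-up lengths:

```python
from statistics import mean, median

# Four short articles and one very long one (made-up lengths)
lengths = [200, 250, 300, 350, 5000]

mean_length = mean(lengths)      # pulled upwards by the single long article
median_length = median(lengths)  # the middle value, unaffected by the outlier

print(mean_length, median_length)
```

Here the mean is 1220 while the median is 300, so a large gap between the two lines in the chart hints at a skewed distribution.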
# Violin plots showing distribution of transcript length per media code
import plotly.graph_objects as go
fig = go.Figure()
for title in dataframe['mediaUid'].unique():
    fig.add_trace(go.Violin(x=dataframe['mediaUid'][dataframe['mediaUid'] == title],
                            y=dataframe['transcriptLength'][dataframe['mediaUid'] == title],
                            name=title,
                            box_visible=True,
                            meanline_visible=True))
fig.update_layout(title_text="Distribution of Article Sizes per Title")
fig.show()
# Violin plots showing distribution of article page per media code
fig = go.Figure()
# Remove brackets from 'pageNumbers' column and convert to integer
dataframe["pageNumbers"] = (
dataframe["pageNumbers"]
.astype(str)
.str.replace(r"[\[\]]", "", regex=True)
.str.split(",", n=1)
.str[0]
.replace("nan", pd.NA)
.astype("Int64")
)
sorted_page_num = dataframe.sort_values(by='pageNumbers').dropna(subset=['pageNumbers'])
for title in sorted_page_num['mediaUid'].unique():
    fig.add_trace(go.Violin(x=sorted_page_num['mediaUid'][sorted_page_num['mediaUid'] == title],
                            y=sorted_page_num['pageNumbers'][sorted_page_num['mediaUid'] == title],
                            name=title,
                            box_visible=True,
                            meanline_visible=True))
# Change the yaxis range value to have different views of your distribution
fig.update_layout(title_text="Distribution of Article Pages per Title", yaxis_range=[-5,25], height=800)
fig.update_yaxes(dtick=1)
fig.show()
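The cleaning step in the cell above keeps only the first page number of each article. The sketch below shows the same idea on made-up values, using a regular expression instead of string replacement. It assumes that page numbers are stored as text such as "[1, 2]", which is what the bracket removal above suggests.

```python
import pandas as pd

# Made-up raw values: single page, several pages, bare number, missing, empty list
raw_pages = pd.Series(["[3]", "[1, 2]", "7", None, "[]"])

# Take the first run of digits in each value; rows without digits become missing
first_page = raw_pages.astype(str).str.extract(r"(\d+)", expand=False)

# Convert to a nullable integer type so missing values are preserved
first_page = pd.to_numeric(first_page, errors="coerce").astype("Int64")
print(first_page)
```

Rows that end up missing can then be dropped with dropna before plotting, as the cell above does.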
# Count values in column 'isOnFrontPage' and plot a bar chart
is_on_front_page_counts = dataframe['isOnFrontPage'].value_counts().reset_index()
is_on_front_page_counts.columns = ['isOnFrontPage', 'count']
fig = px.bar(is_on_front_page_counts, x='isOnFrontPage', y='count',
labels={'isOnFrontPage':'On Front Page', 'count':'Number of Articles'},
title='Count of Articles on Front Page', color_discrete_sequence=['Teal'])
fig.show()
Try visualising in full-screen by clicking on the arrow in the output cell (see figure):
# @title This cell will create a dashboard and plot several charts about your collection
# create a dashboard showing all the charts
import pandas as pd
from IPython.display import display, HTML
import plotly.graph_objects as go
# Combine all figures into a single dashboard layout
dashboard_html = """
<html>
<head>
<style>
.container {{
display: flex;
flex-wrap: wrap;
gap: 20px;
justify-content: center; /* Center the items in the container */
}}
.chart-item {{
width: 45%; /* Adjust as needed */
box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
transition: 0.3s;
padding: 10px;
box-sizing: border-box; /* Include padding and border in the element's total width and height */
}}
.chart-item:hover {{
box-shadow: 0 8px 16px 0 rgba(0,0,0,0.2);
}}
h2 {{
text-align: center;
width: 100%; /* Make sure title takes full width */
}}
</style>
</head>
<body>
<h2>Data Overview Dashboard</h2>
<div class="container">
<div class="chart-item">{bar_chart}</div>
<div class="chart-item">{country_treemap}</div>
<div class="chart-item">{content_provider_treemap}</div>
<div class="chart-item">{language_treemap}</div>
<div class="chart-item">{type_pie_chart}</div>
<div class="chart-item">{size_histogram}</div>
<div class="chart-item">{pages_histogram}</div>
<div class="chart-item">{front_page_bar_chart}</div>
</div>
</body>
</html>
"""
# Re-generate plots and capture their HTML output
# Bar chart of newspaper counts
newspaper_counts = dataframe['mediaUid'].value_counts().reset_index()
newspaper_counts.columns = ['mediaUid', 'Count'] # Rename for clarity in plot
fig_newspaper_bar = px.bar(newspaper_counts, x='mediaUid', y='Count',
labels={'mediaUid':'Newspaper', 'Count':'Number of Articles'},
title='Count of content items by newspaper title')
fig_newspaper_bar.update_traces(marker_color='Teal')
bar_chart_html = fig_newspaper_bar.to_html(full_html=False, include_plotlyjs='cdn')
# Treemap for countries
country_counts = dataframe['countryCode'].value_counts().reset_index()
country_counts.columns = ['countryCode', 'count']
fig_country_treemap = px.treemap(country_counts, path=['countryCode'], values='count',
title='Distribution of Countries', color='count', color_continuous_scale='Teal')
country_treemap_html = fig_country_treemap.to_html(full_html=False, include_plotlyjs='cdn')
# Treemap for content providers
content_provider_counts = dataframe['providerCode'].value_counts().reset_index()
content_provider_counts.columns = ['providerCode', 'count']
fig_content_provider_treemap = px.treemap(content_provider_counts, path=['providerCode'], values='count',
title='Distribution of Content Providers', color='count', color_continuous_scale='Teal')
content_provider_treemap_html = fig_content_provider_treemap.to_html(full_html=False, include_plotlyjs='cdn')
# Treemap for languages
language_counts = dataframe['languageCode'].value_counts().reset_index()
language_counts.columns = ['languageCode', 'count']
fig_language_treemap = px.treemap(language_counts, path=['languageCode'], values='count',
title='Distribution of Languages', color='count', color_continuous_scale='Teal')
language_treemap_html = fig_language_treemap.to_html(full_html=False, include_plotlyjs='cdn')
# Pie chart for article types
type_counts = dataframe['type'].value_counts().reset_index()
type_counts.columns = ['type', 'count']
fig_type_pie = px.pie(type_counts, values='count', names='type',
title='Distribution of Article Types', color_discrete_sequence=px.colors.sequential.Teal)
type_pie_chart_html = fig_type_pie.to_html(full_html=False, include_plotlyjs='cdn')
# Histogram for article sizes
fig_size_hist = px.histogram(dataframe, x='transcriptLength', title='Distribution of Article Sizes', color_discrete_sequence=['Teal'])
fig_size_hist.add_vline(x=dataframe['transcriptLength'].mean(), line_dash="dash", line_color="red", annotation_text=f"Mean: {dataframe['transcriptLength'].mean():.2f}", annotation_position="top right")
fig_size_hist.add_vline(x=dataframe['transcriptLength'].median(), line_dash="dash", line_color="black", annotation_text=f"Median: {dataframe['transcriptLength'].median():.2f}", annotation_position="top left")
size_histogram_html = fig_size_hist.to_html(full_html=False, include_plotlyjs='cdn')
# Histogram for article pages
dataframe['pageNumbers'] = pd.to_numeric(dataframe['pageNumbers'], errors='coerce')
fig_pages_hist = px.histogram(dataframe.dropna(subset=['pageNumbers']), x='pageNumbers', title='Distribution of Article Pages', color_discrete_sequence=['Teal'])
fig_pages_hist.add_vline(x=dataframe['pageNumbers'].mean(), line_dash="dash", line_color="red", annotation_text=f"Mean: {dataframe['pageNumbers'].mean():.2f}", annotation_position="top right")
fig_pages_hist.add_vline(x=dataframe['pageNumbers'].median(), line_dash="dash", line_color="black", annotation_text=f"Median: {dataframe['pageNumbers'].median():.2f}", annotation_position="top left")
pages_histogram_html = fig_pages_hist.to_html(full_html=False, include_plotlyjs='cdn')
# Bar chart for front page status
is_on_front_page_counts = dataframe['isOnFrontPage'].value_counts().reset_index()
is_on_front_page_counts.columns = ['isOnFrontPage', 'count']
fig_front_page_bar = px.bar(is_on_front_page_counts, x='isOnFrontPage', y='count',
labels={'isOnFrontPage':'On Front Page', 'count':'Number of Articles'},
title='Count of Articles on Front Page', color_discrete_sequence=['Teal'])
front_page_bar_chart_html = fig_front_page_bar.to_html(full_html=False, include_plotlyjs='cdn')
# Populate the dashboard HTML template
full_dashboard_html = dashboard_html.format(
bar_chart=bar_chart_html,
country_treemap=country_treemap_html,
content_provider_treemap=content_provider_treemap_html,
language_treemap=language_treemap_html,
type_pie_chart=type_pie_chart_html,
size_histogram=size_histogram_html,
pages_histogram=pages_histogram_html,
front_page_bar_chart=front_page_bar_chart_html)
# Display the dashboard
display(HTML(full_dashboard_html))
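The template above needs one named placeholder per chart. If you expect to add or remove charts often, an alternative is to build the chart items from a list. This is only a sketch: the function build_dashboard is hypothetical, and the placeholder strings stand in for the output of fig.to_html(...).

```python
def build_dashboard(title, chart_snippets):
    """Wrap each HTML snippet in a chart-item div and join them into one page body."""
    items = "\n".join(
        f'<div class="chart-item">{snippet}</div>' for snippet in chart_snippets
    )
    return f'<h2>{title}</h2>\n<div class="container">\n{items}\n</div>'


# Placeholder strings stand in for real chart HTML
snippets = ["<p>bar chart</p>", "<p>treemap</p>", "<p>pie chart</p>"]
body = build_dashboard("Data Overview Dashboard", snippets)
print(body)
```

Adding a chart then only means appending one more snippet to the list.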
Be mindful of people with colour blindness when choosing the colours of your visuals. The internet is full of tools to generate accessible palettes, which you can consult before deciding on the best colours for your visuals.
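As one possible starting point, the sketch below assigns colours from the Okabe-Ito palette, a widely used colour-blind-friendly set of eight colours, to newspaper codes. The helper assign_colours and the example codes are hypothetical.

```python
from itertools import cycle

# Okabe-Ito palette
OKABE_ITO = [
    "#E69F00",  # orange
    "#56B4E9",  # sky blue
    "#009E73",  # bluish green
    "#F0E442",  # yellow
    "#0072B2",  # blue
    "#D55E00",  # vermillion
    "#CC79A7",  # reddish purple
    "#000000",  # black
]


def assign_colours(categories, palette=OKABE_ITO):
    """Map each category to a palette colour, reusing colours if categories outnumber them."""
    return dict(zip(categories, cycle(palette)))


colour_map = assign_colours(["GDL", "JDG", "IMP"])
print(colour_map)
```

Such a mapping can be passed to Plotly Express charts through the color_discrete_map argument, together with color='mediaUid'.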
That's it for now! Next, you can explore:
Writing - Original draft: Caio Mello, Pauline Conti, Martin Grandjean. Conceptualization: Martin Grandjean, Caio Mello, Pauline Conti. Software: Roman Kalyakin. Writing - Review & Editing: Cao Vy. Validation: Kirill Veprikov. Datalab editorial board: Caio Mello (Managing), Emanuela Boros, Estelle Bunout, Cao Vy, Pauline Conti, Marten Düring, Martin Grandjean, Juri Opitz. Data curation & Formal analysis: Maud Ehrmann, Emanuela Boros, Pauline Conti, Simon Clematide, Juri Opitz, Andrianos Michail. Methodology: Caio Mello, Pauline Conti. Supervision: Martin Grandjean. Funding acquisition: Maud Ehrmann, Simon Clematide, Marten Düring, Raphaëlle Ruppen Coutaz.
This notebook is published under CC BY 4.0 License
For feedback on this notebook, please send an email to info@impresso-project.ch
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
All Impresso code is published open source under the GNU Affero General Public License v3 or later.