Skip to main content
  1. Projects/

UCL's Changing Ethnic Diversity

data visualisation plotly socioeconomic function R

All files used for this project can be accessed through this GitHub link.

This visualisation was made in R and used as promotional material for the UCL Data Visualisation Society during our Freshers’ Fair. I also a created a function for anyone looking to design a similar visualisation.

Visualisation and Brief Analysis

This visualisation is not meant to be an in-depth analysis from which we can draw conclusions, but rather an eye-catcher to spark discussion on UCL’s ethnic diversity - has the university become more ethnically diverse over time?

Most strikingly, we see that the dominant ethnic group in UCL has shifted from majority White students in 2007/08 to majority Asian students in 2021/22. This could have been driven by a convergence of factors: rising wealth in Asian countries driving greater international mobility, active recruitment of international students from these countries by the university, the allure of London as a city, and perhaps more equal access to higher education between ethnic groups within the UK.

However, traditionally underrepresented groups, specifically Black and Mixed students, have only seen slight increases in representation across the years. This signals that there remains more to be done (not just by UCL, but by the entire higher education system) to encourage more participation from these groups.

Now, we do concede that in an attempt to make the visualisation more readable, we have grouped ethnicities into larger categories and this obscures the differences within each category. We have also grouped both UK and non-UK students together, which makes it difficult to distinguish where exactly the changes stem from and conduct fine-grained analysis.

Rationale

This visualisation is essentially a stacked bar chart cropped in the shape of an image, specifically Jeremy Bentham (UCL’s spiritual founder), where the height of the bar is proportional to the proportion of each ethnic group. As it was designed to be displayed at the Data Visualisation Society’s booth during UCL’s Freshers Fair, we wanted something atypical, eye-catching, easy to interpret, and relatable to UCL students. Our goal here was to draw students’ attention and get them curious about the society.

Data

The data was retrieved on 01 Oct 2022 from UCL’s webpage on student statistics. Tables T1 (UK-domiciled student numbers by ethnic group) and T2 (Non-UK domiciled student numbers by ethnic group) were used.

To explain a bit about the rationale behind the choice of years, 2007/08 is the earliest year when data for table T2 was available, 2021/22 is the most recent year for which data is available and 2014/15 is the midway point.

The data cleaning was done in Excel as there were two separate workbooks of data each with multiple sheets, which would make it complicated to do the data cleaning in R and it wouldn’t be worth the hassle since I’m only taking a small dataset for 3 years. The steps taken for data cleaning was as follows:

  1. Summed the number of students in each ethnic group across UK-domiciled and non-UK domiciled students for the selected years
  2. Calculated the proportion of each ethnic group in each year

The cleaned data is as shown.

##   Ethnic.Group   X2021_22   X2014_15   X2007_08
## 1        Asian 0.47242532 0.32597951 0.21541894
## 2        Black 0.03582620 0.03364865 0.02979673
## 3        Mixed 0.05337372 0.05040751 0.03698562
## 4        White 0.37643618 0.52904979 0.52607833
## 5        Other 0.06193858 0.06091454 0.19172038

One point to note is that the ‘Other’ category includes students who refused to provide information on ethnicity or students for whom UCL had no information. This proportion was considerably larger in 2007/08 likely due to less rigorous data collection procedures. I left this unimputed as it is difficult to determine whether the missingness is random or if there is an underlying bias.

The image used for the chart was obtained here and traced out in Adobe Photoshop to create a black and white cut-out. The black represents the background of the plot while the white pixels represent the figure to be coloured in.

Process

Unfortunately, I was not able to find any off-the-shelf functions in R to create this visualisation but was able to find a workaround here that uses the ‘png’ and ‘plotly’ packages.

Essentially, this involves reading in the black-and-white cut-out of the image as a matrix of 0s and 1s based on their pixels, recoding the values in the matrix based on how we want the pixels to be coloured, and plotting the matrix as a heatmap.

Setting up

library(png) #to read .png files
library(plotly) #to plot the heatmap
library(extrafont) #to expand font options

#Read in data
ethnicity.data <- read.csv("Ethnic Group Summarised.csv")
#Load .png file of Jeremy Bentham
img <- readPNG("bentham.png")
img <- round(img, 0)

#Determining height and width of image
h <- dim(img)[1]
w <- dim(img)[2]

Recoding matrix values

Here, I am creating a matrix where the values represent the pixels in the heatmap. The values correspond to the colours that I want the pixels to be in the heatmap. There are 4 key steps I am performing here:

  1. Creating a matrix comprising the pixels I want coloured
  2. Running a for loop to determine the positions of the bars for each of the 5 ethnic groups
  3. Identifying the pixels within each bar and recoding their values
  4. Simultaneously, I am also creating a vector to store the y-coordinates for where I want the percentage labels to be displayed on the heatmap
#Creating a matrix comprising pixels to be coloured
col.matrix <- img[h:1,,1]
pixel.matrix <- (col.matrix == 1)

#Creating a vector storing the y-coordinates of the 'ceiling' of the bars for each ethnic group
ylevels <- vector(mode = "logical", length = 5)

#'base' represents the y-coordinate of the 'base' of the bar
base <- 0

#Finding y-coordinate for where to place percentage labels on the chart
lab.position <- vector(mode = "logical", length = 5)

for(i in 5:1){
  #'ceiling' is derived by multiplying height of image by the proportion of each ethnic group and adding that to the base
  ylevels[i] <- round(h*ethnicity.data[i,"X2021_22"]) + base
  
  #Calculating position of percentage labels to be around the centre of each bar
  lab.position[i] <- (ylevels[i] - base)/2 + base - 2
  
  #Creating a matrix of all FALSE of equal dimension as our original image
  value.matrix <- matrix(rep(FALSE, h*w), nrow = h)
  #Filling all pixels within the bar with a value based on their category
  value.matrix[ylevels[i]:base, ] <- TRUE
  col.matrix[pixel.matrix & value.matrix] <- (6-i)/5
  
  #Recoding the 'bottom' of the bar for the next iteration
  base <- ylevels[i]
}

Customising heatmap options

The matrix now comprises pixels of 6 different values where each value represents a different category and its corresponding colour.

##     Category Value
## 1      Asian   1.0
## 2      Black   0.8
## 3      Mixed   0.6
## 4      White   0.4
## 5      Other   0.2
## 6 Background   0.0

Before plotting the heatmap using this matrix, I will first specify some customisations for the plot:

  1. Customising the colour bar. As the colour bar for a heatmap is a continuous scale by default, I created a dataframe that when used, can instead display the colour bar as a discrete scale.
  2. Specifying the positions and labels of the ticks on the colour bar.
  3. Hiding the x and y axes.
  4. Creating a vector storing the text of the percentage labels on the heatmap

A sidenote that the colour palette was chosen based on the brand colours of the UCL Data Visualisation Society.

#--- Customising the colour bar ---#

#Specifying palette of colours for the plot
colours <- c("transparent", "#313628", "#F6931E", "#93B9E7", "ivory", "#181247")

#Creating a function that cuts 0-1 into equal divisions.
colrS <- function(n){
  CUTS <- seq(0,1,length.out=n+1)
  rep(CUTS,ifelse(CUTS %in% 0:1,1,2))
}

#Creating color data frame
colorpalette <- data.frame(z = colrS(6), col = rep(colours, each = 2), stringsAsFactors = F)

#Specifying tick positions
tick.positions <- vector(length = 6)
nvalues <- length(tick.positions)
for(i in 1:nvalues){
  tick.positions[i] <- (1/nvalues)*0.5 + (1/nvalues)*(i-1)
}

#Specifying tick labels
#Appending "" so that no tick label will be displayed for values of 0 (transparent background pixels)
tick.labels <- append(ethnicity.data$Ethnic.Group, "", after = 0)


#--- Customising axes ---#
#Making the axes invisible and repurposing x-axis title as a plot title displayed from the bottom
yax <- list(ticks="", showticklabels=FALSE, showgrid=FALSE, zeroline=FALSE)
xax <- list(ticks="", showticklabels=FALSE, showgrid=FALSE, zeroline=FALSE, title = "<b>2021-22</b>")


#Creating percentage labels
pct.labs <- sapply(ethnicity.data[,"X2021_22"], function(x){
  paste0("<b>",round(x*100,1),"%</b>")})

Generating a single plot

With that, I can now plot out the chart for the proportion of ethnic groups in UCL’s student population for the 2021/22 batch.

#Plotting a heatmap using the matrix created
plot_ly(z = col.matrix, 
      showscale = T,  #To show colour scale
      type = "heatmap", 
      width = 400,  height = 490,  #Controls width and height of plot
      colorscale = colorpalette,
      hoverinfo = "none") %>%  #Hides information shown from hovering
  
layout(xaxis = xax, 
       yaxis = yax,
       font = list(family = "Roboto Condensed")) %>% #Specifies font
  
add_text(x = (w/2-20),    #Adds percentage labels onto the heatmap
         y = lab.position,
         text = pct.labs,
         type = "heatmap", 
         mode = "text", 
         textfont = list(color = c("white", "black", "black", "black", "white")), 
         showlegend = F, 
         inherit = F, 
         hoverinfo = "none") %>%
  
colorbar(title = list(text = "<b>Ethnic Group</b>", 
                      font = list(size = 16)),
         tickvals = tick.positions,
         ticktext = tick.labels,
         tickfont = list(size = 15),
         ticks = "",   #Color bar ticks are not drawn
         outlinecolor = "transparent")

Combining multiple subplots

To create the plots for 2014/15 and 2007/08, I created a generalised function to reduce the amount of repetition in my code, which can also be used for future visualisations.

stackedbar_image(image, cat_percentages, palette, 
                 show_scale = TRUE, colorbar_title = "", cat_names = NA,
                 width = NULL, height = NULL, font_family = "Arial", 
                 chart_title = "", bottom_title = "", label_color = "black")

The function and a brief documentation can be found here.

Finally, I applied the function to generate the plots for all three years and combined them to create the final output.

#Creating plots for 2021-22, 2014-15, and 2007-08
p2021_22 <- stackedbar_image(image = img, 
             cat_percentages = ethnicity.data$X2021_22,
             palette = c("#313628", "#F6931E", "#93B9E7", "ivory", "#181247"), 
             show_scale = F, 
             font_family = "Roboto Condensed", 
             label_color = c("white", "black", "black", "black", "white"),
             width = 690,
             bottom_title = "<b>2021-22</b>")

p2014_15 <- stackedbar_image(image = img, 
             cat_percentages = ethnicity.data$X2014_15,
             palette = c("#313628", "#F6931E", "#93B9E7", "ivory", "#181247"), 
             show_scale = F, 
             font_family = "Roboto Condensed", 
             label_color = c("white", "black", "black", "black", "white"),
             width = 690,
             bottom_title = "<b>2014-15</b>")

p2007_08 <- stackedbar_image(image = img, 
             cat_percentages = ethnicity.data$X2007_08, 
             palette = c("#313628", "#F6931E", "#93B9E7", "ivory", "#181247"),
             cat_names = ethnicity.data$Ethnic.Group,
             show_scale = T, 
             font_family = "Roboto Condensed", 
             label_color = c("white", "black", "black", "black", "white"),
             width = 690,
             colorbar_title = "<b>Ethnic Group</b>",
             bottom_title = "<b>2007-08</b>")

#Combining the three plots into one figure
subplot(p2007_08, p2014_15, p2021_22, titleX = TRUE) %>%
layout(title = list(text = "<b>Proportion of Ethnic Groups in UCL's student population over time</b>",
                    font = list(size = 20)))

Function

The function creates a stacked bar chart cropped in the shape of a provided image. It requires two key inputs: the proportions for each category (to determine bar sizes) and an image read into R as a matrix (using packages such as png). The plotly package must be installed and loaded to run the function.

There are two features added to the function that was not included in the visualisation process above as they were not applicable for the chart I had made but are potentially useful for future applications. I added an option for a chart title as well as a feature to detect the vertical range of the white pixels in the image. The second feature is necessary for images where there is empty space above or below the white pixels.

Here is the function.

stackedbar_image <- function(image, cat_percentages, palette, 
                             show_scale = TRUE, colorbar_title = "", cat_names = NA,
                             width = NULL, height = NULL, font_family = "Arial", 
                             chart_title = "", bottom_title = "", label_color = "black"){
  
#Rounding to ensure there's only 0 and 1s in the matrix
img <- round(image, 0)

#Finding height and width of image
h <- dim(img)[1]
w <- dim(img)[2]


##---Creating matrix used for the plot---##
#Finding the rows where white pixels first appear and end (i.e. where feet starts and head ends)
pos1 <- which(apply(img[,,1], 1, function(y) any(y==1)))
bottom <- min(pos1)
top <- max(pos1)
h.figure <- top-bottom

#Creating a matrix comprising pixels to be coloured
col.matrix <- img[h:1,,1]
pixel.matrix <- (col.matrix == 1)

ncats <- length(cat_percentages)
#Creating a vector storing the y-coordinates of the 'ceiling' of the bars for each category
ylevels <- vector(mode = "logical", length = ncats)
base <- bottom

#Creating a vector storing the y-coordinates for where to place the percentage labels
lab.position <- vector(mode = "logical", length = ncats)

#For loop to recode values in matrix
for(i in ncats:1){
  ylevels[i] <- round(h.figure*cat_percentages[i]) + base
  lab.position[i] <- (ylevels[i] - base)/2 + base - 2
  
  #Creating a matrix of all FALSE of equal dimension as our original image
  value.matrix <- matrix(rep(FALSE, h*w), nrow = h)
  #Filling all pixels between current ylevel and previous ylevel with a value
  value.matrix[ylevels[i]:base,] <- TRUE
  col.matrix[pixel.matrix & value.matrix] <- (ncats+1-i)/ncats
  
  base <- ylevels[i]
}


##---Adjusting plot options (cosmetic changes to plot)---##

#Appending 'transparent' to the palette to set the background colour as transparent.
palette <- append(palette, "transparent", after = 0)

#Function cuts 0-1 into equal divisions.
colrS <- function(n){
  CUTS <- seq(0,1,length.out=n+1)
  rep(CUTS,ifelse(CUTS %in% 0:1,1,2))
}

#Creating color data frame
colorpalette <- data.frame(z = colrS(ncats+1), col = rep(palette, each = 2), stringsAsFactors = F)

#Creating tick positions
tick.positions <- vector(length = ncats+1)
nvalues <- length(tick.positions)
for(i in 1:nvalues){
  tick.positions[i] <- (1/nvalues)*0.5 + (1/nvalues)*(i-1)
}

#Ensuring tick label for the transparent background colour does not appear on the color bar
if(show_scale == T){
  cat_names <- rev(cat_names)
  cat_names <- append(cat_names, "", after = 0)
}

#Customising axis options: ensuring axes do not appear
yax <- list(ticks="", showticklabels=FALSE, showgrid=FALSE, zeroline=FALSE)
xax <- list(ticks="", showticklabels=FALSE, showgrid=FALSE, zeroline=FALSE, title = bottom_title)

#Creating labels to be displayed on the chart
pct.labs <- sapply(cat_percentages, function(x){paste0("<b>",round(x*100,1),"%</b>")})

#Setting height and width of plot as the height and width of the image if not supplied by user
if(is.null(height)){
  height <- h
}
if(is.null(width)){
  width <- w
}

##---Creating plot---##

plot_ly(z = col.matrix, 
        showscale = show_scale, 
        type="heatmap", 
        width = width,  
        height = height, 
        colorscale = colorpalette, 
        hoverinfo = "none") %>%
  
  layout(xaxis = xax,                        
         yaxis = yax, 
         title = list(text = chart_title,        #Specifying chart title
                      font = list(size = 15),
                      yanchor = "bottom"),
         margin = list(t = 45, b= 10),           #Adding top margin to plot 
         font = list(family = font_family)) %>%  #Customising font type for all text in the plot
  
  add_text(x = (w/2-20),       #Specifies the x-coordinates of the percentage labels
           y = lab.position,   #Specifies the y-coordinates of the percentage labels
           type = "heatmap", 
           mode = "text", 
           text = pct.labs,    #Specifies the text for the labels as defined above
           textfont = list(color = label_color), #Specifies label colors as defined by user
           showlegend = F, 
           inherit = F, 
           hoverinfo = "none") %>%
  
  colorbar(title = list(text = colorbar_title,
                        font = list(size = 16)),
           tickvals = tick.positions,
           ticktext = cat_names,
           tickfont = list(size = 15),
           ticks = "",
           outlinecolor = "transparent")
}

Documentation

Variable Name Default Description
image Not set A matrix containing 0s and 1s, where each value represents a pixel.
cat_percentages Not set A numeric vector storing the proportions of each category.
palette Not set A vector specifying the colors for each category.
show_scale TRUE If true, the color bar will be shown.
colorbar_title "" Determines the title of the color bar. Only applicable if show_scale = TRUE. If left empty, the color bar will not have a title.
cat_names NA A string vector specifying the text of the tick labels on the color bar.
width NULL Width of the plot. If not supplied, it will default to the number of columns in the provided matrix.
height NULL Height of the plot. If not supplied, it will default to the number of rows in the provided matrix.
font_family "Arial" Determines the font family used for all text in the plot.
chart_title "" Determines the title displayed at the top of the plot. If left empty, the plot will not have a title at the top.
bottom_title "" Determines the title displayed at the bottom of the plot. If left empty, the plot will not have a title at the bottom.
label_color "black" Either a string or string vector specifying the color(s) of the percentage labels displayed on the chart.