Multimodal Architectures for Bias Classification in News

DSM150 Final Coursework

Author
Affiliation

Johannes Van Cauwenberghe

University of London

Published

February 22, 2025

Abstract

Deep learning is uniquely suited to multimodal classification, enabling the integration of distinct data types and the construction of robust representations of key characteristics from multiple modalities. While deep learning techniques for text analysis have advanced significantly, political bias classification in news remains predominantly text-based. Yet, as news is increasingly read online, readers increasingly select content by scrolling through images and titles: decisions on what to read are largely based on the combination of an image and a title. Meanwhile, news aggregators struggle with filter bubbles and over-personalisation, limiting exposure to an increasingly narrow political spectrum. We therefore propose a multimodal bias classification algorithm that learns joint representations of the image and text data. Extensive experiments on text-only and image-only branches inform the development of a unified multimodal algorithm. This model achieves an accuracy of 82% and an AUC of 90%, outperforming our single-mode models and showing superior performance overall.

1 Introduction

Central to this project is the objective of applying deep learning techniques to political bias classification. Like Coursework 1, this work follows the universal workflow of machine learning as outlined in the excellent textbook Deep Learning with Python by François Chollet, creator of the popular Keras library.

Whilst this methodical approach to deep learning will feature front and center, the project also has a more substantive goal. The goal is to build the best possible classifier for political bias classification in news. Specifically, we will combine text and image features to train a dual-input neural classifier that distinguishes left-wing from right-wing political news content. As social media platforms and news aggregators struggle with ideological echo chambers (Helberger 2019), a systematic method for identifying political bias can improve transparency in journalism and enable more informed news consumption.

A news article’s textual content and its visual elements (like accompanying images) often carry signals of ideological slant or political bias. By political bias, we refer to differences in language use, framing, and the relative emphasis on specific ideological perspectives. The same topic might be framed differently in text and illustrated with divergent imagery by outlets of opposite leanings: a left-leaning source may highlight sympathetic visuals (e.g. migrants’ struggles) while a right-leaning source might choose more fear-inducing imagery (e.g. depicting crime). Several websites provide ratings and invite readers to read the news from multiple perspectives. Examples include: ground.news, Media Fact Check, and AllSides. Over the following pages, we will use the term political bias to describe these ideological leanings.

We will use a self-compiled dataset consisting of three parts:

  1. Labels scraped from AllSides, an organisation that exposes polarisation and partisanship in news content by providing ratings and labels of political bias for all major publications.
  2. 20,000 articles fetched from the NewsCatcher API.
  3. Images associated with each article, downloaded and converted.

Our experimental procedure consists of three steps. First, we identify a series of candidate model architectures. Next, we implement them. Finally, we refine and optimise them. This constitutes an empirical approach to model architecture. Chollet (2021) argues that the model architecture defines the model’s hypothesis space: “the space of possible functions that gradient descent can search over, parameterized by the model’s weights”.

In this empirical analysis, we will attempt many of the techniques found in (Chollet 2018, chap. 5 and 6) and its later extensions (Chollet 2021, 2024). We aim to develop a unified model architecture that integrates the best-performing text and image models (see a diagram in Figure 8). However, to maintain the report’s coherence, we do not present an exhaustive account of every experiment conducted. Instead, we document key findings in the discussion sections of each branch and we focus on systematically refining architectural configurations and hyperparameters.

The structure of the report is as follows. We briefly discuss related work and the experiment design before moving on to an overview of the dataset with summary statistics in Section 4. Next, in Section 5, we establish a common-sense baseline and build a basic model that beats it. For each model type, we converge on three architectural configurations with optimised model parameters. We apply this methodical approach to both branches individually, i.e. the text models in Section 5.1 and the image models in Section 5.2, as well as to the combined multimodal models in Section 5.3, where we perform a hyperparameter search to converge on our best model. Finally, in Section 6 we evaluate performance on the test set and summarise the key contributions of the work, the challenges it faced, and further directions.

3 Experiment design

We train all models on the training set and test them on the test set. Batch sizes of 512 (text) and 64 (image/multimodal) are selected based on empirical tuning of GPU load. While subject to hyperparameter tuning, models are largely trained with a learning rate of 1e-4 and the Adam optimiser (Kingma and Ba 2017). In our code, we set an early stopping condition which triggers when the validation loss stops decreasing, and we save the best model to disk. Each input image is resized to 180 \(\times\) 180 and the maximum length of each input text is set to 45 tokens. We train on an NVIDIA RTX GPU with 12 GB VRAM. For each architecture, we keep the model that performs best on the validation set; the validation set is not otherwise seen by the models and is used only to guide model selection. The test set is held back until evaluation, where each model is evaluated on accuracy and AUC. The best model is further evaluated on recall, precision, F1, and a confusion matrix showing the detailed false positive and false negative rates.
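
To make this evaluation protocol concrete, the final metrics could be computed along the following lines; best_model, test_batch, and y_true_test are placeholders for the trained model and held-out test data, not the exact code used later.

# Hedged sketch of the evaluation step; `best_model`, `test_batch` and
# `y_true_test` are placeholders for the trained model and held-out test data.
from sklearn.metrics import (classification_report, roc_auc_score,
                             accuracy_score, ConfusionMatrixDisplay)

y_prob = best_model.predict(test_batch).ravel()   # predicted probabilities
y_pred = (y_prob > 0.5).astype(int)               # thresholded class labels

print("Accuracy:", accuracy_score(y_true_test, y_pred))
print("AUC:     ", roc_auc_score(y_true_test, y_prob))
print(classification_report(y_true_test, y_pred, target_names=["left", "right"]))
ConfusionMatrixDisplay.from_predictions(y_true_test, y_pred)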

4 Dataset Overview and Preprocessing

The dataset contains the following key attributes:

  • index: Unique identifier for each article
  • title: The title of the piece
  • snippet: An excerpt of the article
  • summary: The main body of the article
  • country: The country of the publication
  • bias: A categorical string ‘left’ or ‘right’
  • numerical_rating: A continuous variable on a range from -5 to +5 indicating the bias (left to right)
  • source_api: The publication name
  • media: A hyperlink to the image associated with the content

4.1 Imports and Loading

In this section, we import the libraries and load the dataset from the data folder. The datasets are read using Pandas, allowing easy exploratory analysis of the data.

Imports
# Turn off (some) warnings
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# Import numerical packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from graphviz import Source
from sklearn.metrics import (classification_report, 
                             RocCurveDisplay, 
                             ConfusionMatrixDisplay, 
                             roc_auc_score, accuracy_score)


# Standard library
from tqdm.notebook import tqdm
from typing import Literal
import json, re, shutil
from pathlib import Path

# Neural network libraries
import keras; print("Using keras version:", keras.version()) # 3.8.0
from keras.api import layers, Model, saving, Input
from keras.api.utils import (load_img, 
                             img_to_array, 
                             array_to_img, 
                             image_dataset_from_directory, 
                             plot_model, model_to_dot)
import keras_tuner
import tensorflow as tf; print("Using tensorflow version:", tf.__version__)

# Global settings
pd.options.display.max_colwidth = 100
tqdm.pandas()
keras.utils.set_random_seed(153)
sns.set_theme('notebook','darkgrid')

# Path settings
data = Path(".data/")
imgs = data/"imgs"
articles = data/'articles'
models = data/'models'

# For better monitoring of VRAM memory use
physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:
    print("Running on", physical_devices[0].device_type)
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
Using keras version: 3.8.0
Using tensorflow version: 2.17.0
Running on GPU
Load the dataset
# Load the dataset
articles_df = pd.read_pickle(articles/'up_to_feb24.pkl')

# Inspect the key columns
articles_df[[
    'title', 
    'snippet', 
    'bias', 
    'numerical_rating', 
    'source_api'
    ]
    ].sample(5)
title snippet bias numerical_rating source_api
8816 The fight is on: Progressive groups gear up for a second Trump term � Donald Trump's presidential win still has the world reeling and processing the discouraging re... left -4.00 dailykos.com
6117 Pelosi undergoes ‘successful' hip replacement surgery in Luxembourg after injury The 84-year-old former House Speaker 'is well on the mend,' said spokesperson Ian Krager. left -1.20 politico.com
15814 STEPHEN MOORE: Will Blacks And Hispanics Vote Their Pocketbooks? Trump Should Hope So Trump's policies were far better for blacks and Hispanics than those of America's first black pr... right 3.80 dailycaller.com
20254 Facing pressure at home, GOP lawmakers warn Johnson against ‘hatchet' spending cuts On the eve of their first major vote to advance President Donald Trump's agenda, key House Repub... left -1.30 cnn.com
6186 Rahm Emanuel: An alliance to counter's China's aggression U.S. Amb. to Japan, Rahm Emanuel, joins Morning Joe to discuss his latest WSJ column 'An Allianc... left -3.71 msnbc.com
# Inspect the `media` urls
articles_df.media.sample(3).tolist()
['https://www.washingtonpost.com/wp-apps/imrs.php?src=https://arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/AX3RPYI2G5F7FDBTE2CWHP5DF4.jpg&w=1440',
 'https://freebeacon.com/wp-content/uploads/2024/10/MixCollage-25-Oct-2024-12-18-PM-740.jpg',
 'https://assets1.cbsnewsstatic.com/hub/i/r/2025/02/16/ac6c0659-40fb-408a-9346-0a8116d62b42/thumbnail/1200x630/ddfe30fbb91929fffe9e7ba273a12f32/20250216-ftn-marco-rubio-pretape-guest-iso-rs11-frame-6596.jpg?v=f303dc12868a012283443d8b9123e5fe']

4.2 Summary Statistics

The dataset contains 19,654 articles fetched from the NewsCatcher API, extended with images for this project.1

The API call specified:

  1. A topic: politics.
  2. A list of sources:
    The sources are based on labels obtained from AllSides.com2. These include the categorical strings “left” and “right”, as well as a continuous numerical rating. These features were joined onto the articles data (a rough sketch of this join follows below). To enable the binary classification objective, “centrist” sources were omitted. Furthermore, preference was given to sources with many records on AllSides.com, as this was taken as an indication of discriminativeness, and therefore usability.
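
A rough sketch of that joining step is shown below; the file names and column names are illustrative assumptions, not the exact code used.

# Hedged sketch of the label-joining step (file and column names are assumed)
import pandas as pd

ratings = pd.read_csv('allsides_ratings.csv')           # scraped: source, bias, numerical_rating
articles = pd.read_json('newscatcher_articles.json')    # fetched: title, snippet, summary, source, ...

articles = articles.merge(ratings, on='source', how='inner')   # attach the per-publication labels
articles = articles[articles.bias != 'center']                 # drop centrist sources for the binary task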

The following table summarises the core attributes of the dataset:

Key Figures for the News Dataset
Attribute Value
Total Articles 19,654
Right-labelled articles 9,358
Left-labelled articles 10,296
Number of news sources 64
Rating range -5 to +5
Average numerical rating 0.28 (center right)
Date range 2/10/2024 - 14/2/2025
Country distribution 77% US, 18% UK, 5% other

4.2.1 Summary of the Sources

Originally scraped from AllSides, the numerical_rating represents political bias. Below, we show a quarter of the labelled sources; see Appendix 8.1 for an exhaustive overview.

# Visualise political bias
articles_df['source'] = articles_df.source_api.str.removesuffix('.com')
bias_viz = articles_df[['source', 'numerical_rating']]\
    .drop_duplicates().sort_values('numerical_rating', ascending=False)

# Draw a sample
bias_viz_qrt = bias_viz[bias_viz.index % 4 == 0]

# Plot
ax = bias_viz_qrt.plot.barh('source', figsize=(6,4),
                            title='Political Bias Ratings',ylabel='',
                            xlabel='Political Bias', legend=False)
ax.vlines(0, *ax.get_ylim(), "gray", lw=0.5)
ax.set_xlim(-6,6)
plt.show()
Figure 1: Political Bias Ratings.

Let us now look at the key sources:

Number of Articles by News Source
#Articles by Source
newsmax 2322
dailycaller 2192
nbcnews 1543
cnn 1471
spectator.org 1149
commondreams.org 1054
dailykos 991
freebeacon 829
independent.co.uk 767
washingtonexaminer 760
# `top_sources` (assumed definition): article counts per source, matching the table above
top_sources = articles_df.source.value_counts().rename('source')

fig, ax = plt.subplots()

source_viz = bias_viz.set_index('source')
source_viz.index.name = 'sources'
source_viz = source_viz.join(top_sources)
color = ['r' if num_rat < 0 else 'b' for num_rat in source_viz.numerical_rating]
ax.scatter(x=source_viz.numerical_rating, 
           y=source_viz.source, 
           s=source_viz.source * 0.05,
           c=color)

for x, y, s in zip(source_viz.numerical_rating, source_viz.source, source_viz.index):
    if y > 450 and 'dreams' not in s:
        ax.text(x - 1, y + y * 0.05, s)
        
ax.set_ylim(-100, 3000)
ax.set_xlim(-6,6)
ax.set_xlabel('Political Bias')
plt.title("Two right-labelled publications feature heavily", y=1.02)
plt.show()
Figure 2: Two right-labelled publications feature heavily in the dataset.
ax = sns.histplot(data=articles_df[['source_api', 'numerical_rating']],
                    x='numerical_rating',
                    hue=articles_df.numerical_rating < 0,
                    kde=True, legend='', bins=20)
ax.set_xlim(-6,6)
ax.set_xlabel('Political Bias')
ax.set_ylabel('')
ax.set_title('Distribution of the labels')
plt.show()
Figure 3: The distribution of continuous labels has a clear gap in the centre
# Binarise the labels
articles_df['binary_bias'] = articles_df.bias.map(lambda bias: 0 if bias == 'left' else 1)

4.2.2 Summary of article lengths

Next, we decide on the lengths to use for padding / truncating sequences. Shorter sequences speed up training, while longer sequences give the model a richer input representation. We propose to balance this trade-off by appending the snippets to the titles and drawing a line at 45 words. Concatenating these text features should give the model consistent lengths to work with.

fig, ax = plt.subplots(1,2, figsize=(8,4))
articles_df.title.str.split().apply(len).plot.hist(bins=100, ax=ax[0], xlim=(0,30))
ax = articles_df.snippet.dropna().str.split().apply(len).plot.hist(bins=50, ax=ax[1], xlim=(0,80))
ax.set_ylabel('')
plt.suptitle('Respective number of words in title and snippet')
plt.show()

articles_df['title_snippet'] = articles_df.title.fillna('') + ". " + articles_df.snippet.fillna('')
articles_df.dropna(subset='title_snippet', inplace=True)
ax = articles_df.title_snippet.str.split().apply(len).plot.hist(
    bins=100, xlim=(0,75), title='Number of words in combined string', figsize=(6,4))
ax.vlines(45, 0, 3000, 'r', '--')
ax.text(45, 2500, '<- Pad / truncate to here')
plt.tight_layout()
plt.show()
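
As a minimal sketch of this pad/truncate behaviour, using an ad-hoc vectoriser on the combined strings built above:

# Minimal sketch: pad/truncate the combined strings to 45 integer tokens
demo_vectoriser = layers.TextVectorization(output_mode='int', output_sequence_length=45)
demo_vectoriser.adapt(articles_df.title_snippet.values)
demo_batch = demo_vectoriser(articles_df.title_snippet.values[:2])
print(demo_batch.shape)   # (2, 45): shorter texts are zero-padded, longer ones truncated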

4.2.3 Summary of images

A sample of 20 images:

random_imgs = articles_df.sample(20).index

fig, axs = plt.subplots(4,5, figsize=(10,8), sharex=True, sharey=True)
axs = axs.flatten()

for i, img in enumerate(random_imgs):
    img_path = (imgs/str(img)).with_suffix(".jpg")
    img = load_img(img_path, target_size=(180, 180), keep_aspect_ratio=True)
    array = img_to_array(img)
    axs[i].imshow(array.astype("uint8"))
    axs[i].axis('off')

plt.suptitle('Some Images')
plt.tight_layout()
plt.show()
Figure 4: Some images in the dataset

4.3 Preprocessing

In this section we compute the train, validation, and test splits, pass them into tf.data.Dataset objects, and batch the data. Next, we create a series of convenience functions so that model-building iterations can focus solely on the model graph itself.

We start with what is arguably the most important step in any machine learning project: setting aside a portion of the dataset for testing.

4.3.1 Splitting the data

We here set aside 3000 samples for testing. This includes images, which we move to a dedicated test folder on our file system.

Note that this dataset is already shuffled, i.e. the final 3000 rows correspond to the full date range. We zip all constituent parts together and shuffle again. Crucially, before splitting into train and validation sets, we use cache() to freeze the shuffle. This is essential because we later reuse these models as pre-trained branches in our multimodal model: it ensures the validation set has not been seen by any of the branches.
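
To illustrate why this matters, consider the following toy sketch (not part of the pipeline): without cache(), every full pass over a shuffled dataset is reshuffled, so a take()/skip() split could leak “validation” samples into training across models.

# Toy illustration of freezing a shuffle with cache()
toy = tf.data.Dataset.range(10).shuffle(buffer_size=10, seed=1)
print([int(x) for x in toy], [int(x) for x in toy])   # two passes may differ

frozen = tf.data.Dataset.range(10).shuffle(buffer_size=10, seed=1).cache()
first = [int(x) for x in frozen]    # first full pass populates the cache
second = [int(x) for x in frozen]   # replayed from the cache: identical order
assert first == second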

# Set aside 3000 images as test
def set_aside_test():
    """Set aside a portion of the dataset for evaluation."""
    os.mkdir(imgs/'test')
    for image in articles_df.iloc[-3000:].index:
        image_str = str(image)
        shutil.move((imgs/image_str).with_suffix('.jpg'), 
                    (imgs/'test'/image_str).with_suffix('.jpg')) 
        
# set_aside_test()
# Load saved data (without 3000 test samples)
articles_df = pd.read_pickle(articles/'up_to_feb24.pkl')
articles_df = articles_df.iloc[:-3000].sort_index()
articles_train_val = articles_df.title_snippet.values
y_true_train_val = articles_df.binary_bias.values
y_true_numerical = articles_df.numerical_rating.values
def load_train_val_ds():
    """
    Loads and shuffles the data.
    
    Returns:
        Tuple (tf.data.Dataset): text_ds, img_ds, multimodal_ds, multimodal_ds_num

    Note: 
        It is imperative to run all models with one and only one shuffle 
        to prevent validation data being seen in the combined model. 
    """

    # Load training image dataset from folder
    img_ds_train_val = image_dataset_from_directory(
        directory=imgs/'train', labels=None, image_size=(180, 180),
        batch_size=None, shuffle=False)
    text_ds_train_val = tf.data.Dataset.from_tensor_slices(articles_train_val)
    y_true_ds_train_val = tf.data.Dataset.from_tensor_slices(y_true_train_val.astype("float32"))
    y_true_ds_numerical = tf.data.Dataset.from_tensor_slices(y_true_numerical.astype("float32"))
    
    # Combine using `zip`
    full_ds = tf.data.Dataset.zip((img_ds_train_val, text_ds_train_val, y_true_ds_train_val, y_true_ds_numerical))

    # Shuffle once and only once
    full_size = full_ds.cardinality().numpy()
    full_ds = full_ds.shuffle(buffer_size=full_size, seed=264).cache(data/'cache')

    # Create datasets    
    text_ds = full_ds.map(lambda img, text, label, num: (text, label))
    img_ds = full_ds.map(lambda img, text, label, num: (img, label))
    multimodal_ds = full_ds.map(lambda img, text, label, num: ((img, text), label))
    multimodal_ds_num = full_ds.map(lambda img, text, label, num: ((img, text), num))
    return text_ds, img_ds, multimodal_ds, multimodal_ds_num

text_ds, img_ds, multimodal_ds, multimodal_ds_num = load_train_val_ds()
BATCH_SIZE = 512

# Compute train-val split
train_size = text_ds.cardinality().numpy() - 3000
val_size = 3000

# Split dataset 
train_batch = text_ds.take(train_size).repeat().batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
val_batch = text_ds.skip(train_size).take(val_size).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# Steps per epoch
steps_per_epoch = max(train_size // BATCH_SIZE, 1)
validation_steps = max(val_size // BATCH_SIZE, 1)

4.3.2 Shared functionality

In this section, we define:

  • A function to wrap a vectoriser
  • A function to save results to a JSON file on disk
  • A function to wrap the compilation and training (fit) of the model

This will make the following sections easier to read.

Wrapping functionality
# Hyperparams
VOCAB_SIZE = 20000
MAX_LENGTH = 45

def get_vectoriser(
    output_sequence_length=MAX_LENGTH, 
    output_mode: Literal["multi_hot", "int", "count", "tf_idf"] = 'multi_hot'):
    """
    Create a TextVectorization layer with specified presets.

    Args:
    output_sequence_length (int): The maximum length of the output sequence. Defaults to MAX_LENGTH.   
    output_mode (Literal): The mode in which the output should be represented. 
                   Can be 'multi_hot', 'int', 'count', or 'tf_idf'. Defaults to 'multi_hot'.

    Returns:
    keras.layers.TextVectorization: A configured TextVectorization layer.
    """
    # Combine presets
    pad_to_max_tokens = True if output_sequence_length else False
    ngrams = 2 if output_mode == 'multi_hot' else None

    # Instantiate the vectoriser
    vectoriser = layers.TextVectorization(
            max_tokens=VOCAB_SIZE, # Global variable
            output_mode=output_mode,
            output_sequence_length=output_sequence_length,
            pad_to_max_tokens=pad_to_max_tokens,
            ngrams=ngrams)
    
    # Convert the vocabulary
    vectoriser.adapt(articles_train_val)

    # Return the preconfigured layer
    return vectoriser

def save_results(history, name):
    """
    Save the training history results to a JSON file and display the last epoch's results.

    This function reads an existing 'results.json' file or creates a new one if it doesn't exist.
    It appends the latest training results to the file, including the model name and the metrics
    from the last epoch. The results are then displayed as a DataFrame.

    Args:
        history (keras.callbacks.History): The history object returned by the `fit` method of a Keras model.
        name (str): The name of the model, used to identify the results in the JSON file.

    Returns:
        None
    """
    # Open or create results
    try:
        with open('results.json', 'r') as fin:
            results = json.load(fin)
    except (FileNotFoundError, json.JSONDecodeError):
        results = []

    # Add the model name and the last results
    result = {'model_name': name}
    last_epoch = {k: round(history.history[k][-1], 4) for k in history.history.keys()}
    result.update(last_epoch)
    results.append(result)

    # Save
    with open('results.json', 'w') as fout:
        json.dump(results, fout, indent=4)

    display(pd.Series(last_epoch).to_frame(name))
    
def compile_fit(model, name, **kwargs):
    """
    Compile the model and fit it to the data.

    This function compiles the given model with the Adam optimizer (unless overridden) and 'binary_crossentropy' loss.
    It also sets up callbacks for early stopping, TensorBoard, and model checkpointing to save the best iteration.
    The model is then trained on the training data and validated on the validation data for a specified number of epochs.
    The training history is saved and the final iteration's results are printed.

    Args:
        model (keras.Model): The Keras model to be compiled and trained.
        name (str): The name used for saving the best model checkpoint and results.

    Returns:
        None
    """
    lr = kwargs.get('lr', 1e-4)

    model.compile(
        optimizer=kwargs.get('optimizer', keras.optimizers.Adam(learning_rate=lr)), 
        loss="binary_crossentropy", metrics=["accuracy", "auc"])

    callbacks = [
        keras.callbacks.EarlyStopping(), keras.callbacks.TensorBoard(),
        keras.callbacks.ModelCheckpoint(models/f"{name}.keras", save_best_only=True)]
    
    # A brief summary
    print(f"Total params: {model.count_params()} (of which trainable: {sum(np.prod(w.shape) for w in model.trainable_weights)}).")

    history = model.fit(train_batch, 
                        validation_data=val_batch, 
                        epochs=kwargs.get('epochs', 50),
                        steps_per_epoch=steps_per_epoch,
                        validation_steps=validation_steps,
                        callbacks=callbacks,
                        verbose=kwargs.get('verbose', 0))
    
    print(f"Trained in {len(history.epoch)} epochs.")
    
    # Save results and show the last values
    save_results(history, name)

5 Building a Multimodal Classifier

In this section we will train two kinds of models, a text and an image classifier. Then we will combine the two in a joint architecture, using the Keras Functional API.

The objectives for this section are:

  1. Training a text-based model

    1. A simple bigram model
    2. A Bidirectional LSTM with an embedding layer
    3. A Bidirectional LSTM with Pre-trained Word Embeddings
  2. Training an image-based model

    1. A basic Convolutional Network
    2. A more sophisticated ConvNet
    3. A fine-tuned Pre-trained CNN
  3. Training a multimodal model

    1. A basic multimodal classifier
    2. An optimised version using keras_tuner hyperparameter search
    3. A multimodal regressor

In keeping with best practice, let’s first put down a purely analytical baseline. Here, we simply predict the majority label for each of our test records:

# Calculate the majority class
majority_class = np.mean(y_true_train_val) # 0.52
majority_class_binary = 1 if majority_class > 0.5 else 0

# Calculate the Accuracy of a majority-class predictor
y_true_test = pd.read_pickle(articles/'up_to_feb24.pkl').iloc[-3000:].binary_bias.values
np.mean(y_true_test == majority_class_binary)
0.5206666666666667
Note

The purely analytical baseline achieves an accuracy of 0.52. This is low yet unsurprising given our efforts to balance the dataset.

5.1 Text-based Models

We will experiment with three common architecture patterns. For each of these, we will try extensive variations in configuration, both of model architecture and of hyperparameters. We here present three consolidated attempts at finding the best model, each within its respective architectural typology. At the end of the section we will walk through the paths that have been explored.

5.1.1 The Bag-of-bigrams Model

We start by looking at a simple base model: the bigram model. We converged on a model made up of two Dense layers with 50% dropout in between. Linguistic features are represented as a bag-of-words: a sparse vector representation of the occurrence of words in the article.

HID_DIM = 20
MAX_LENGTH = None

inputs = Input(shape=(), dtype=tf.string)
x = get_vectoriser(MAX_LENGTH)(inputs)
x = layers.Dense(HID_DIM, "relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(HID_DIM, "relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

bigram_model = Model(inputs, outputs, name='bigram_model')

compile_fit(bigram_model, 'bigram', epochs=30)
Total params: 400461 (of which trainable: 400461).
Trained in 30 epochs.
bigram
accuracy 0.9204
auc 0.9722
loss 0.2850
val_accuracy 0.8109
val_auc 0.8938
val_loss 0.4055

5.1.2 The Bidirectional LSTM Model

In this model, we will use Keras’ recurrent layers for sequence learning. More specifically, we will implement a Bidirectional LSTM. This is fed dense word vectors from an embedding layer, which it learns as part of its model training.

Here, we converged on 16-dimensional embedding vectors, helping to make this the smallest model in our experiment. It uses one bidirectional LSTM layer with dropout and recurrent dropout both set at high levels; more dropout is added before and after a subsequent Dense layer. This may seem counterintuitive but leads to superior results.

MAX_LENGTH = 50
EMBED_DIM = 16
LTSM_DIM = 25
HID_DIM = 50

inputs = Input(shape=(), dtype=tf.string)
x = get_vectoriser(MAX_LENGTH, 'int')(inputs)
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(x)
x = layers.Bidirectional(layers.LSTM(LTSM_DIM, 
                                     dropout=0.5, 
                                     recurrent_dropout=0.5
                                     ))(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(HID_DIM, "relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

bidir_lstm = Model(inputs, outputs, name='bidir_lstm')

compile_fit(bidir_lstm, 'bidir_lstm')
Total params: 331001 (of which trainable: 331001).
Trained in 46 epochs.
bidir_lstm
accuracy 0.9023
auc 0.9543
loss 0.2880
val_accuracy 0.7891
val_auc 0.8689
val_loss 0.4465

5.1.3 Bidirectional LSTM with GloVe

Continuing with the bidirectional LSTM, here we add GloVe pre-trained word embeddings (Pennington, Socher, and Manning 2014). First, we create the embedding matrix (adapted from Chollet (2021)) mapping words to dense word vectors. Then we build the embedding layer from our vectoriser, whose vocabulary mapping gives each word a unique id.

The model graph itself consists of one bidirectional LSTM layer with recurrent_dropout set to 0.5. The subsequent block starts and ends with dropout and has a residual connection providing a direct path back to the bidirectional LSTM layer. This allows better propagation of the error signal, which we validated empirically.

def get_embeddings(vectoriser):
    """
    Get the pre-trained embeddings and transform the vocabulary into an embedding_matrix.

    Args:
        vectoriser (keras.layers.TextVectorization): The TextVectorization layer used to vectorize the text data.

    Returns:
        np.ndarray: A mapping of word ids and vectors in the form of an embedding matrix.

    Note:
        Ensure a global variable EMBED_DIM = 100 is defined.
    """
    
    path_to_glove_file = data/"glove.6B/glove.6B.100d.txt"
    embeddings_index = {}
    with open(path_to_glove_file) as f:
        for line in f:
            word, coefs = line.split(maxsplit=1)
            coefs = np.fromstring(coefs, "f", sep=" ")
            embeddings_index[word] = coefs

    vocabulary = vectoriser.get_vocabulary()
    word_index = dict(zip(vocabulary, range(len(vocabulary))))

    embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))
    for word, i in word_index.items():
        if i < VOCAB_SIZE:
            embedding_vector = embeddings_index.get(word)
            # Words absent from GloVe keep an all-zero vector
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector
    return embedding_matrix
EMBED_DIM = 100 # glove
MAX_LENGTH = 50

vectoriser = get_vectoriser(MAX_LENGTH, 'int')

embedding_layer = layers.Embedding(VOCAB_SIZE, EMBED_DIM,
                                   weights=get_embeddings(vectoriser),
                                   trainable=True, mask_zero=True, 
                                   embeddings_regularizer=keras.regularizers.L2(1e-5))
LTSM_DIM = 128
HID_DIM = 256

inputs = Input(shape=(), dtype=tf.string)
x = vectoriser(inputs)
x = embedding_layer(x)
x = layers.Bidirectional(layers.LSTM(LTSM_DIM, recurrent_dropout=0.5))(x)
residual = x
x = layers.Dropout(0.5)(x)
x = layers.Dense(HID_DIM, "relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Add()([x, residual])
outputs = layers.Dense(1, activation="sigmoid", dtype=tf.float32)(x)

bidir_lstm_glove = Model(inputs, outputs, name='bidir_lstm_glove')
# bidir_lstm_glove.summary(line_length=80)

compile_fit(bidir_lstm_glove, 'bidir_lstm_glove')
Total params: 2300545 (of which trainable: 2300545).
Trained in 50 epochs.
bidir_lstm_glove
accuracy 0.8730
auc 0.9433
loss 3.6390
val_accuracy 0.7809
val_auc 0.8609
val_loss 3.8156

5.1.4 Summary of Results of Text-based Models

Before discussing the models in greater detail let’s inspect the results side-by-side:

text_results = pd.read_json('results.json').set_index('model_name') * 100
text_results = text_results.iloc[:3][['accuracy', 'auc', 'loss', 'val_accuracy', 'val_auc', 'val_loss']]
text_results
accuracy auc loss val_accuracy val_auc val_loss
model_name
bigram 89.37 95.85 35.09 80.04 88.82 43.99
bidir_lstm 85.07 90.18 41.05 76.72 84.80 48.20
bidir_lstm_glove 84.46 92.19 389.36 75.90 85.02 401.82
acc_viz = text_results[['val_accuracy', 'val_auc']]
ax = acc_viz.plot.bar(title='Text Models Accuracy and AUC (validation)', 
                      rot=0, xlabel='', ylabel='Percent')

for x, y in zip(ax.get_xticks(), acc_viz.val_auc.values):
    ax.text(x, y+ 1, round(y, 2))
for x, y in zip(ax.get_xticks(), acc_viz.val_accuracy.values):
    ax.text(x - 0.28, y + 1, round(y, 2))

plt.ylim(0,110)
plt.legend(['Accuracy', 'AUC'],
           bbox_to_anchor=(1, 1, 0, 0))
plt.show()
Figure 5: Text Models Accuracy and AUC (validation)
Discussion of Text Models

Key Insight

Moving from the simple Bag-of-bigrams model to a pre-trained embedding-based recurrent network, the expectation was to see a clear progression. However, performance decreased. This is an important result and we adjust our objective correspondingly. We will proceed in our multimodal architecture with the Bag-of-bigrams Model.

In a discussion on sequence learning and model complexity, Chollet (2021) suggests looking at the ratio between the number of samples in the training data and the mean number of words per sample. If that ratio is less than 1,500, the bag-of-bigrams model may perform better. In our dataset, we have just under 20,000 samples with 45 words per sample. With a ratio of under 450, the Bag-of-bigrams model aligns with Chollet’s heuristic, suggesting it as the more suitable approach for our dataset. According to this heuristic, we would need 67,500 articles before sequence models could be expected to outperform it.
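
Worked out with the dataset’s own numbers, the heuristic reads as follows:

# Chollet's heuristic, applied to our dataset
n_samples = 19_654                     # articles in the dataset
mean_words = 45                        # words per (truncated) sample
print(round(n_samples / mean_words))   # ~437, well below 1,500: favours bag-of-bigrams
print(1_500 * mean_words)              # 67,500 articles needed before sequence models should win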

The bigram model has the added benefit of being lightweight and fast. With its 2.3 million parameters, the LSTM-based model with GloVe embeddings is high in complexity, both in terms of memory and compute. Irrespective of whether the pre-trained embeddings are fine-tuned (by setting trainable=True), the memory load remains significant. Regularising the pre-trained embeddings with L2 regularisation (1e-5) helped prevent large updates to the embedding weights, boosting performance further.

Before continuing with the Computer Vision Branch of our classifier, we briefly reflect on our empirical approach to finding the best model configuration and share the key insights.

Bag-of-Bigram performs better without Tf-idf

While the Bigram Model performed very well out of the box, it proved hard to optimise further. Additional dense layers, additional units, different optimisers, and adjustments to the learning rate all had little effect. Even TF-IDF had a (small) negative impact on performance; a sketch of that variation follows below.
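
For reference, here is a sketch of that TF-IDF variation (an assumed reconstruction of the configuration tried, using unigrams rather than bigrams, not the exact code used):

# Sketch of the TF-IDF variation of the bigram head (assumed configuration)
inputs = Input(shape=(), dtype=tf.string)
x = get_vectoriser(None, 'tf_idf')(inputs)   # Keras spells this output mode 'tf_idf'
x = layers.Dense(20, "relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(20, "relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
bigram_tfidf_model = Model(inputs, outputs, name='bigram_tfidf')
# compile_fit(bigram_tfidf_model, 'bigram_tfidf', epochs=30)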

Pre-Trained Embeddings Struggle with New Words

The two recurrent models follow a common architectural pattern in text processing: both use an embedding layer. The main difference is the use of pre-trained GloVe embeddings (Pennington, Socher, and Manning 2014). While the first model learns its embedding weights as part of training, the latter either fine-tunes the pre-trained weights (when trainable=True in the embedding layer) or leaves them frozen. Adding regularisation to the pre-trained embeddings eventually gave the best performance, yet the gain is small, and including them in training makes little difference overall. We suspect the culprit is the high specificity and timeliness of the words used in news stories, illustrated here:

A print-out of words set to zero by GloVe
vectoriser = get_vectoriser(MAX_LENGTH, 'int')
embedding_matrix = get_embeddings(vectoriser)
vocabulary = vectoriser.get_vocabulary()

unknown_words = embedding_matrix.sum(axis=1) == 0
mask_vocab = np.array(vocabulary)[unknown_words]
print(list(mask_vocab)[:100])
['', '[UNK]', 'presidentelect', 'zerohedge', 'bidenharris', '�', 'the…', 'to…', 'cnns', 'tiktok', 'starmers', 'antiisrael', 'jenrick', '‘the', 'and…', 'brexit', 'covid', 'zelenskyy', 'cop29', 'maralago', 'farright', 'countybycounty', 'poilievre', 'presidentelects', 'of…', 'mailin', 'a…', 'protrump', 'shouldnt', 'kamalas', 'in…', '‘i', '‘a', 'vances', 'lastminute', 'ukraines', 'syrias', 'farages', 'alassad', 'advertisingfree', 'antitrump', '‘garbage', 'sinwar', 'chrystia', 'charlamagne', 'hegseths', 'harriswalz', 'for…', 'covid19', '•', '‘we', 'newsoms', 'michigans', '‘not', 'trudeaus', 'taxpayerfunded', 'propalestinian', 'msnbcs', 'highprofile', '‘its', 'workingclass', 'werent', 'waspi', 'theyve', 'gaetzs', 'bessent', '‘no', 'streeting', 'that…', 'ocasiocortez', 'arizonas', 'trumpvance', 'theyll', 'rfla', 'nonconfidence', 'chavezderemer', 'alsobrooks', 'agencys', '‘they', 'sunak', 'reevess', 'pennys', 'on…', 'malliotakis', 'his…', '‘you', 'walzs', 'universitys', 'swingstate', 'rtexas', 'rny', 'postbrexit', 'pennsylvanias', 'genderaffirming', 'franciscos', 'farleft', 'familys', 'badenochs', '‘it', '‘he']
Discussion of Text Models (continued)

Small is beautiful when it comes to learned Embeddings

Letting the model learn its embeddings through backpropagation overcomes this problem of new and unknown words. Starting off with 256 dimensions, the model was comparable in size to the second model. However, reducing the dimensionality of these embeddings had little impact on performance. With only 16-dimensional vectors per word, we significantly reduced the overall parameter count without sacrificing accuracy or AUC. This reduction also lessened our reliance on dropout and residual connections for regularisation.

Alongside dropout, for the recurrent layers we used recurrent dropout. This is a temporally constant dropout that preserves its state at every timestep of the sequence (Chollet 2021, 301). Similarly we found that adding a dense layer makes a significant difference. This is in line with Chollet’s suggestions in “Going even further” (Chollet 2021, 307 - 308).

An overview of other variations to the embedding-based models

  • Embedding layer
    • Embedding dimensions from 32 to 256
  • Recurrent layer
    • Unidirectional LSTM
    • Uni- and bidirectional GRU
    • One and two bidirectional LSTM layers (with return_sequences=True)
    • Recurrent dropout
      • Without recurrent_dropout
      • With recurrent_dropout ranging from 0.2 to 0.5
    • With recurrent layer units ranging from 16 to 512
  • Dropout
    • Without dropout
    • With dropout ranging from 0.2 to 0.5
  • Dense layers
    • With Dense layer ranging from 64 to 512
  • Learning rate and optimiser
    • Changing the optimiser to keras.optimizers.Adam
    • Lowering the learning rate to 1e-4
  • Residual stream
    • With a residual stream around the dense and second dropout layer
  • Max Pooling
    • Setting return_sequences=True outputs the articles as sequences; GlobalMaxPool1D then aggregates these (see the sketch after this list).
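
A rough sketch of that max-pooling variation (an assumed configuration for illustration, not a tuned model):

# Sketch: bidirectional LSTM returning sequences, aggregated with max pooling
inputs = Input(shape=(), dtype=tf.string)
x = get_vectoriser(50, 'int')(inputs)
x = layers.Embedding(VOCAB_SIZE, 16)(x)
x = layers.Bidirectional(layers.LSTM(25, return_sequences=True))(x)  # one vector per timestep
x = layers.GlobalMaxPool1D()(x)                                      # aggregate over the sequence
outputs = layers.Dense(1, activation="sigmoid")(x)
bidir_lstm_maxpool = Model(inputs, outputs, name='bidir_lstm_maxpool')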

See Appendix 8.4 for an in-depth look of model outputs on various pieces of text.

Moving on to developing our Computer Vision branch, we will now load the image data and then compare a number of CNN-based models.


5.2 Image-based Models

For our Computer Vision branch, we will again experiment with three common architecture patterns. As with the text branch, we try various configurations, both in terms of model architecture and hyperparameters, yet here we present a consolidated effort at finding the best model. We discuss the respective merits of the models, and the paths that led us to them, in the discussion at the end of the section.

We start by batching the data (continuing from Splitting the Data 4.3.1):

BATCH_SIZE = 64

# Compute train-val split
train_size = img_ds.cardinality().numpy() - 3000
val_size = 3000

# Split dataset 
train_batch = img_ds.take(train_size).repeat().batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
val_batch = img_ds.skip(train_size).take(val_size).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# Steps per epoch
steps_per_epoch = max(train_size // BATCH_SIZE, 1)
validation_steps = max(val_size // BATCH_SIZE, 1)

5.2.1 Basic ConvNets

We start with the typical architectural pattern (found in Chollet 2021) of a series of Convolutional and Max Pooling Layers.

inputs = Input(shape=(180,180,3), dtype=tf.float32)
x = layers.Rescaling(1./255)(inputs)
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)  # apply to the rescaled input
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

# Construct the model
basic_cnn_model = Model(inputs=inputs, outputs=outputs)

# Compile and train
compile_fit(basic_cnn_model, 'basic_cnn')
Total params: 308417 (of which trainable: 308417).
Trained in 3 epochs.
basic_cnn
accuracy 0.7481
auc 0.8331
loss 0.4935
val_accuracy 0.5564
val_auc 0.5810
val_loss 0.7845
del basic_cnn_model

5.2.2 Better ConvNet

We continue our search for the best model by combining different layers and borrowing from Xception (see Chollet 2021, 259 - 260).

inputs = Input(shape=(180, 180, 3))
x = layers.Rescaling(1./255)(inputs)
x = layers.SeparableConv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.SeparableConv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.SeparableConv2D(filters=256, kernel_size=3, use_bias=False)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
better_cnn_model = Model(inputs=inputs, outputs=outputs)

compile_fit(better_cnn_model, 'better_cnn', lr=1e-5)
Total params: 474460 (of which trainable: 473948).
Trained in 16 epochs.
better_cnn
accuracy 0.6455
auc 0.7018
loss 0.6236
val_accuracy 0.5778
val_auc 0.5927
val_loss 0.6635
del better_cnn_model

5.2.3 Pre-Trained ConvNet

Here we compare feature extraction and fine-tuning on our dataset, as in Chollet (2021).

conv_base = keras.applications.VGG16(
    weights="imagenet",
    include_top=False,
    input_shape=(180,180,3))
# Option 1: Feature extraction
conv_base.trainable = False
# Option 2: Fine-tuning > Not used
# conv_base.trainable = True
# for layer in conv_base.layers[:-8]:
#     layer.trainable = False
conv_base.summary(show_trainable=True, line_length=80) # Feature extraction (visible by the N in the last column)
Model: "vgg16"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┓
┃ Layer (type)                   Output Shape                Param #  Trai… ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━┩
│ input_layer_5 (InputLayer)    │ (None, 180, 180, 3)    │           0-   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block1_conv1 (Conv2D)         │ (None, 180, 180, 64)   │       1,792N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block1_conv2 (Conv2D)         │ (None, 180, 180, 64)   │      36,928N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block1_pool (MaxPooling2D)    │ (None, 90, 90, 64)     │           0-   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block2_conv1 (Conv2D)         │ (None, 90, 90, 128)    │      73,856N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block2_conv2 (Conv2D)         │ (None, 90, 90, 128)    │     147,584N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block2_pool (MaxPooling2D)    │ (None, 45, 45, 128)    │           0-   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block3_conv1 (Conv2D)         │ (None, 45, 45, 256)    │     295,168N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block3_conv2 (Conv2D)         │ (None, 45, 45, 256)    │     590,080N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block3_conv3 (Conv2D)         │ (None, 45, 45, 256)    │     590,080N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block3_pool (MaxPooling2D)    │ (None, 22, 22, 256)    │           0-   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block4_conv1 (Conv2D)         │ (None, 22, 22, 512)    │   1,180,160N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block4_conv2 (Conv2D)         │ (None, 22, 22, 512)    │   2,359,808N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block4_conv3 (Conv2D)         │ (None, 22, 22, 512)    │   2,359,808N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block4_pool (MaxPooling2D)    │ (None, 11, 11, 512)    │           0-   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block5_conv1 (Conv2D)         │ (None, 11, 11, 512)    │   2,359,808N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block5_conv2 (Conv2D)         │ (None, 11, 11, 512)    │   2,359,808N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block5_conv3 (Conv2D)         │ (None, 11, 11, 512)    │   2,359,808N   │
├───────────────────────────────┼────────────────────────┼─────────────┼───────┤
│ block5_pool (MaxPooling2D)    │ (None, 5, 5, 512)      │           0-   │
└───────────────────────────────┴────────────────────────┴─────────────┴───────┘
 Total params: 14,714,688 (56.13 MB)
 Trainable params: 0 (0.00 B)
 Non-trainable params: 14,714,688 (56.13 MB)
inputs = Input(shape=(180, 180, 3))
# vgg16.preprocess_input expects raw pixel values in [0, 255],
# so no additional Rescaling(1./255) is applied here
x = keras.applications.vgg16.preprocess_input(inputs)
x = conv_base(x)
x = layers.Flatten()(x)
x = layers.Dense(64)(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

pretrained_cnn = Model(inputs, outputs)
compile_fit(pretrained_cnn, 'pretrained_cnn', lr=1e-5)
Total params: 15534017 (of which trainable: 819329).
Trained in 8 epochs.
pretrained_cnn
accuracy 0.5555
auc 0.5627
loss 0.6773
val_accuracy 0.5296
val_auc 0.5464
val_loss 0.6817
del pretrained_cnn

5.2.4 Summary of Results of ConvNet Models

results = pd.read_json('results.json').set_index('model_name') * 100
cnn_results = results[results.index.str.contains('cnn')]
cnn_results[['accuracy', 'auc', 'loss', 'val_accuracy', 'val_auc', 'val_loss',]]
Table 1: Image Models Accuracy and AUC (validation)
accuracy auc loss val_accuracy val_auc val_loss
model_name
basic_cnn 74.81 83.31 49.35 55.64 58.10 78.45
better_cnn 64.55 70.18 62.36 57.78 59.27 66.35
pretrained_cnn 55.55 56.27 67.73 52.96 54.64 68.17
acc_viz = cnn_results[['val_accuracy', 'val_auc']]
ax = acc_viz.plot.bar(title='Image Models Accuracy and AUC (validation)', 
                      rot=0, xlabel='', ylabel='Percent')

for x, y in zip(ax.get_xticks(), acc_viz.val_auc.values):
    ax.text(x, y+ 1, round(y, 2))
for x, y in zip(ax.get_xticks(), acc_viz.val_accuracy.values):
    ax.text(x - 0.28, y + 1, round(y, 2))

plt.ylim(0,100)
plt.legend(['Accuracy', 'AUC'],
           bbox_to_anchor=(1, 1, 0, 0))
plt.show()
Figure 6: Image Models Accuracy and AUC (validation)
Discussion of Image Models

Key Insights

For the task of bias classification, pretrained models do not yield improvements over the basic CNN. In fact, a series of iterations on the typical ConvNet pattern yielded the greatest gains. However, the choice of optimiser was significant: RMSprop with a learning rate of 1e-3 failed to reduce the loss at all, whereas Adam with 1e-4 performed very well.
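
For reference, such a comparison can be reproduced through compile_fit's optimizer override; the model variable below is a placeholder.

# Illustrative only: swap in RMSprop via the optimizer keyword argument
compile_fit(basic_cnn_model, 'basic_cnn_rmsprop',
            optimizer=keras.optimizers.RMSprop(learning_rate=1e-3))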

Another central takeaway concerns capacity. Larger kernel sizes allow the network to learn more complex patterns (Chollet 2021), and adding filters means more expressive power; the typical pattern doubles the number of filters in each successive Conv2D block. However, too many parameters lead to overfitting, hence the need for BatchNormalization, which we apply in our better model right before the activation. We also omit the bias (use_bias=False) to prune some of the complexity.

Other paths of enquiry that have been investigated:

  • Increasing filters, both the number of layers and the step size (doubling, quadrupling of the number of filters) of successive layers
  • Using SeparableConv2D instead of Conv2D (in various configurations)
  • Using GlobalAveragePooling2D instead of a Flatten layer (as seen in Xception (Chollet 2021))
  • Adding up to 3 additional Dense layers between the Flatten and Sigmoid layer, for a more gradual dimensionality reduction (as seen in the VGG16 example (Chollet 2021))

Note: the model is likely to learn shortcuts

Given that some images feature logos of the publication, it is likely that our ConvNets are learning logos and other pictorial features that reveal the publication's affiliation (recall that labels are publication-based).

We can inspect this visually in three steps:

  1. We get an image
  2. We create an activation model by collecting the intermediary activations
  3. We call predict on the activation model with the below image
Pick an image
#| fig-cap: "An image with a logo"
#| label: fig-logo

# 1. Pick an image
img_path = (imgs/'train'/str(19)).with_suffix(".jpg")
img = load_img(img_path, target_size=(180, 180), keep_aspect_ratio=True)
array = img_to_array(img)
plt.imshow(array.astype("uint16"))
plt.axis("off")
plt.show()

Create an activation model by collecting the intermediary activations
# 2. Create an activation model by collecting the intermediary activations

better_cnn_model = saving.load_model(models/'better_cnn.keras')
layer_outputs = []
layer_names = []
for layer in better_cnn_model.layers:
    if isinstance(layer, (layers.Conv2D, layers.MaxPooling2D)):
        layer_outputs.append(layer.output)
        layer_names.append(layer.name)
        
activation_model = Model(inputs=better_cnn_model.input, outputs=layer_outputs)
Call predict on the activation model with the image tensor
# 3. Call `predict` on the activation model with the image tensor
layer_num = 0
img_tensor = np.expand_dims(array, axis=0)
activations = activation_model.predict(img_tensor)
layer_activation = activations[layer_num]

print(f'The {layer_names[layer_num]} activation has {layer_activation.shape[-1]} filters')
filters = np.random.randint(0, 64, 20) 

fig, axs = plt.subplots(4,5, figsize=(10,8))
axs = axs.flatten()

for i, f in enumerate(filters):
    axs[i].imshow(layer_activation[0, :, :, f], cmap="magma")
    axs[i].set_xticks([]) 
    axs[i].set_yticks([]) 
    axs[i].set_frame_on(False) 


plt.suptitle("The Guardian logo is picked up by many filters", y=1.01)
plt.tight_layout()
plt.show()
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step
The max_pooling2d_5 activation has 64 filters
Figure 7: The Guardian logo is picked up by many filters
Discussion of Image Models (continued)

The logo will aid the model to some extent. As this is an unwanted side effect, future efforts should include an attempt to identify patches with unnaturally low variation in RGB values (e.g. the rectangular frame around the logo) and mask them out.
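
A rough sketch of that masking idea (not used in this project; the patch size and variance threshold are illustrative assumptions):

# Sketch: zero out patches whose per-channel variation is suspiciously low
import numpy as np

def mask_low_variance_patches(img: np.ndarray, patch: int = 30, threshold: float = 20.0) -> np.ndarray:
    """Return a copy of `img` (H, W, 3) with flat, logo-like patches zeroed out."""
    out = img.copy()
    h, w, _ = img.shape
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            window = img[y:y + patch, x:x + patch]
            # Mean of the per-channel standard deviations inside the patch
            if window.reshape(-1, 3).std(axis=0).mean() < threshold:
                out[y:y + patch, x:x + patch] = 0
    return out

# Example: masked = mask_low_variance_patches(img_to_array(img))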

We now move on to part three of this project, where we combine the text and image features into a multimodal classifier.

5.3 Multimodal model

We batch the data (see Splitting the Data 4.3.1).

BATCH_SIZE = 64

# Compute train-val split
train_size = multimodal_ds.cardinality().numpy() - 3000
val_size = 3000

# Split dataset 
train_batch = multimodal_ds.take(train_size).repeat().batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
val_batch = multimodal_ds.skip(train_size).take(val_size).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# Steps per epoch
steps_per_epoch = max(train_size // BATCH_SIZE, 1)
validation_steps = max(val_size // BATCH_SIZE, 1)

5.3.1 A lean dual-input classifier

To create a multi-input classifier, we reload our best models, remove the final layers, and concatenate their results. For efficiency, we freeze the weights and retrain only the classifier head itself. This reduces the number of trainable parameters by half.

# Load the pre-trained models
bigram_model = saving.load_model(models/'bigram.keras')
better_cnn_model = saving.load_model(models/'better_cnn.keras')

# Remove the final layers and load into a new model
bigram_model_ = Model(inputs=bigram_model.input, 
                      outputs=bigram_model.layers[-2].output, 
                      name='bigram_model_')

better_cnn_model_ = Model(inputs=better_cnn_model.input, 
                          outputs=better_cnn_model.layers[-2].output, 
                          name='better_cnn_model_')

# Before calling the above graph, we freeze the layers 
bigram_model_.trainable = False
better_cnn_model_.trainable = False

# Run input through the image model
img_inputs = Input(shape=(180, 180, 3))
image_output = better_cnn_model_(img_inputs)

# Run input through the text model
text_inputs = Input(shape=(), dtype=tf.string)
text_output = bigram_model_(text_inputs)

combined = layers.Concatenate()([image_output, text_output])
outputs = layers.Dense(1, activation="sigmoid")(combined)
multimodal_model_sm = Model(inputs=[img_inputs, text_inputs], outputs=outputs)

compile_fit(multimodal_model_sm, 'multimodal_model_sm')
Total params: 874920 (of which trainable: 430357).
Trained in 22 epochs.
multimodal_model_sm
accuracy 0.9546
auc 0.9911
loss 0.2374
val_accuracy 0.8013
val_auc 0.8969
val_loss 0.4173
dot_string = model_to_dot(multimodal_model_sm, rankdir="TB", 
                          show_dtype=False, show_layer_names=True,
                          expand_nested=True, dpi=60,
                          show_layer_activations=False,
                          show_trainable=True).to_string()

Source(re.sub(r'\(\w+\)', '', dot_string))
Figure 8: A plot of the multimodal model

5.3.2 An optimised dual-input classifier

Here we add two layers to the frozen model. The first sits between the ConvNet output (a vector of roughly 430,000 dimensions) and the concatenation, and the second comes after the concatenation.

We also use Keras-tuner for a further refinement in the form of a random search over a range of hyperparameters (O’Malley et al. 2019).

# Load models without compile
bigram_model = keras.models.load_model(models/'bigram.keras', compile=False)
better_cnn_model = keras.models.load_model(models/'better_cnn.keras', compile=False)

# Remove the final layers
bigram_model_ = Model(inputs=bigram_model.input, outputs=bigram_model.layers[-2].output, name='bigram_model_')
better_cnn_model_ = Model(inputs=better_cnn_model.input, outputs=better_cnn_model.layers[-2].output, name='better_cnn_model_')

# Freeze the layers 
bigram_model_.trainable = False
better_cnn_model_.trainable = False

def build_model(hp):
    """Build the model with automated hyperparameter search"""

    # Define search space hyperparams
    units = hp.Choice('units', [1, 4])
    lr = hp.Float("lr", min_value=1e-5, max_value=1e-4, sampling="log")
    dropout = hp.Choice('rate', [0.2, 0.5])
    
    # Run input through the text model
    text_inputs = Input(shape=(), dtype=tf.string)
    text_output = bigram_model_(text_inputs)

    # Run input through the image model
    img_inputs = Input(shape=(180, 180, 3))
    image_output = better_cnn_model_(img_inputs)
    image_output = layers.Dense(units, 'relu')(image_output)       # Added layer

    combined = layers.Concatenate()([text_output, image_output])
    combined = layers.Dense(units, activation='relu')(combined)    # Added layer
    if hp.Boolean("dropout"):
        combined = layers.Dropout(dropout)(combined)                # Added dropout

    outputs = layers.Dense(1, activation="sigmoid")(combined)
    better_multimodal_model = Model(inputs=[img_inputs, text_inputs], outputs=outputs)  

    better_multimodal_model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr), 
        loss="binary_crossentropy", metrics=["accuracy", "auc"])
    
    return better_multimodal_model

tuner = keras_tuner.RandomSearch(
    build_model,
    objective='val_loss',
    max_trials=5,
    overwrite=True,
    directory=models,
    project_name="tuner",)

tuner.search_space_summary()

tuner.search(train_batch, validation_data=val_batch, 
             epochs=5, steps_per_epoch=steps_per_epoch,
             validation_steps=validation_steps)

tuner.results_summary()
Trial 5 Complete [00h 00m 41s]
val_loss: 0.6034466028213501

Best val_loss So Far: 0.42429736256599426
Total elapsed time: 00h 03m 26s
Results summary
Results in .data/models/tuner
Showing 10 best trials
Objective(name="val_loss", direction="min")

Trial 1 summary
Hyperparameters:
units: 4
lr: 3.928633850073882e-05
rate: 0.2
dropout: False
Score: 0.42429736256599426

Trial 0 summary
Hyperparameters:
units: 4
lr: 3.46687027268955e-05
rate: 0.5
dropout: False
Score: 0.5003401041030884

Trial 2 summary
Hyperparameters:
units: 1
lr: 2.7698135335774998e-05
rate: 0.2
dropout: False
Score: 0.5480660200119019

Trial 4 summary
Hyperparameters:
units: 4
lr: 2.8051212860282087e-05
rate: 0.5
dropout: True
Score: 0.6034466028213501

Trial 3 summary
Hyperparameters:
units: 1
lr: 4.37851186270442e-05
rate: 0.2
dropout: True
Score: 0.6198257207870483
tuner.get_best_hyperparameters()[0].values
{'units': 4, 'lr': 3.928633850073882e-05, 'rate': 0.2, 'dropout': False}
# Run another iteration for logging
compile_fit(tuner.get_best_models()[0], 'best_multimodal_model')
/home/jonux/miniconda3/envs/neural/lib/python3.10/site-packages/keras/src/saving/saving_lib.py:757: UserWarning: Skipping variable loading for optimizer 'adam', because it has 2 variables whereas the saved optimizer has 14 variables. 
  saveable.load_own_variables(weights_store.get(inner_path))
Total params: 2166016 (of which trainable: 1721453).
Trained in 9 epochs.
best_multimodal_model
accuracy 0.9485
auc 0.9880
loss 0.1899
val_accuracy 0.8217
val_auc 0.9055
val_loss 0.3836

5.3.3 A dual-input regressor

Finally, we can use the numerical ratings (i.e. the continuous variables ranging from -5 to +5 representing political bias) instead of the categorical labels to create a more refined predictor of exactly how right- or left-leaning an article and image combination is.

# Re-use the pretrained model; simply drop the activation
multimodal_regressor_base = Model(inputs=multimodal_model_sm.input, 
                                  outputs=multimodal_model_sm.layers[-2].output, 
                                  name='regressor_base')

# Define the inputs and pass them to the base
img_inputs = Input(shape=(180, 180, 3))
text_inputs = Input(shape=(), dtype=tf.string)
combined = multimodal_regressor_base([img_inputs, text_inputs])

# Add a dense layer for regression
outputs = layers.Dense(1)(combined)

multimodal_regressor = Model(inputs=[img_inputs, text_inputs], outputs=outputs)
# multimodal_regressor.summary()
BATCH_SIZE = 64

# Compute train-val split
train_size = multimodal_ds_num.cardinality().numpy() - 3000
val_size = 3000

# Split dataset 
train_batch = multimodal_ds_num.take(train_size).repeat().batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
val_batch = multimodal_ds_num.skip(train_size).take(val_size).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# Steps per epoch
steps_per_epoch = max(train_size // BATCH_SIZE, 1)
validation_steps = max(val_size // BATCH_SIZE, 1)
name = 'multimodal_regressor'

multimodal_regressor.compile(
        optimizer=keras.optimizers.Adam(1e-4), 
        loss=keras.losses.MeanSquaredError(), metrics=['mean_absolute_error']) # Main change here

callbacks = [
    keras.callbacks.EarlyStopping(), keras.callbacks.TensorBoard(),
    keras.callbacks.ModelCheckpoint(models/f"{name}.keras", 
                                    save_best_only=True)]

history = multimodal_regressor.fit(train_batch,
                    epochs=20,
                    callbacks=callbacks,
                    validation_data=val_batch,
                    steps_per_epoch=steps_per_epoch,
                    validation_steps=validation_steps,
                    verbose=0)

save_results(history, name)
multimodal_regressor
loss 4.1436
mean_absolute_error 1.7077
val_loss 6.9836
val_mean_absolute_error 2.2515
Discussion of Multimodal Regressor

Key Insight

The validation Mean Absolute Error of roughly 2.25 indicates that predictions are off by about 2.25 rating points on average (1.71 on the training set). This is acceptable for this dataset, considering our range of 10 (-5 to +5) and the absence of values between -1 and +2, as seen in Figure 3.
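To put that error in context, a quick baseline comparison can help. The following is a minimal sketch, assuming the val_batch pipeline defined above yields (features, rating) pairs: it computes the MAE of a constant predictor that always outputs the mean rating.

# Hedged sketch: assumes `val_batch` yields (features, rating) batches as built above.
# Compare the regressor's MAE against a constant predictor of the mean rating.
y_val = np.concatenate([rating.numpy() for _, rating in val_batch])
baseline_mae = np.abs(y_val - y_val.mean()).mean()
print(f"Constant-mean baseline MAE: {baseline_mae:.2f} rating points")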

5.3.4 Summary of Results of Multimodal Models

Leaving aside the regressor (see Appendix 8.5), let’s inspect the multimodal models side by side.

results = pd.read_json('results.json').set_index('model_name') * 100
mm_results = results[results.index.str.contains('multi') & ~results.index.str.contains('regressor')]
mm_results[['accuracy', 'auc', 'loss', 'val_accuracy', 'val_auc', 'val_loss',]]
accuracy auc loss val_accuracy val_auc val_loss
model_name
multimodal_model_sm 95.46 99.11 23.74 80.13 89.69 41.73
best_multimodal_model 94.85 98.80 18.99 82.17 90.55 38.36
acc_viz = mm_results[['val_accuracy', 'val_auc']]
ax = acc_viz.plot.bar(title='Multimodal Models Accuracy and AUC (validation)', 
                      rot=45, xlabel='', ylabel='Percent')

for x, y in zip(ax.get_xticks(), acc_viz.val_auc.values):
    ax.text(x + 0.05, y + 1, round(y, 1))
for x, y in zip(ax.get_xticks(), acc_viz.val_accuracy.values):
    ax.text(x - 0.2, y + 1, round(y, 1))

plt.ylim(0,110)
plt.legend(['Accuracy', 'AUC'],
           bbox_to_anchor=(1, 1, 0, 0))
plt.show()

Discussion of Multimodal Models

Key Insights

We found that the multimodal classifier performs well. Significant improvements were made by fine-tuning the pretrained classifiers, both in terms of efficiency and convergence speed. While these models retain a high parameter complexity (the many parameters take up space in memory), they have a low optimisation complexity, since training converges over a much-reduced feature space. Additionally, a random search over a set of hyperparameters improved this model further.
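To make that distinction concrete, the trainable and frozen parameter counts can be read directly off the model. A minimal sketch, assuming the tuner from above is still in scope:

# Hedged sketch: count trainable vs. frozen parameters of the tuned model.
best_model = tuner.get_best_models()[0]
trainable = sum(int(np.prod(w.shape)) for w in best_model.trainable_weights)
frozen = sum(int(np.prod(w.shape)) for w in best_model.non_trainable_weights)
print(f"Trainable parameters: {trainable:,}")
print(f"Frozen parameters:    {frozen:,}")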


6 Conclusions

In this section we will first present a detailed evaluation of our models. Next, we will reflect on the key contributions this work has made, and look ahead at areas for further investigation.

6.1 Evaluation

Let’s first look at all the models side by side. We will load the test dataset, run predict() on all of the test samples, and evaluate predictions on the held-out ground truth dataset.

articles_df = pd.read_pickle(articles/'up_to_feb24.pkl').iloc[-3000:].sort_index() # Sort the index to find the images

# Load the held-out dataset
articles_test = articles_df.title_snippet.values
y_true_test = articles_df.binary_bias.values

# Create dataset objects
img_dataset_test = image_dataset_from_directory(
    directory=imgs/'test', labels=None, image_size=(180, 180), 
    batch_size=None, shuffle=False)

text_ds_test = tf.data.Dataset.from_tensor_slices(articles_test)
y_true_data_test = tf.data.Dataset.from_tensor_slices(y_true_test.astype("float32"))

# Zip multimodal dataset
mm_test = tf.data.Dataset.zip((img_dataset_test, text_ds_test, y_true_data_test)).map(lambda img, text, label: ((img, text), label))

# Create batches and zip text-only and image-only dataset for predictions
test_batch = mm_test.padded_batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
img_test_batch = tf.data.Dataset.zip((img_dataset_test, y_true_data_test)).padded_batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
text_test_batch = tf.data.Dataset.zip((text_ds_test, y_true_data_test)).padded_batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
Found 3000 files.
label_dict = {}
for model in models.glob('*keras'):
    print(model.stem)

    if 'cnn' in model.stem:
        test_set = img_test_batch
    elif model.stem.startswith('bi'):
        test_set = text_test_batch
    else: 
        test_set = test_batch

    saved_model = keras.models.load_model(model)

    # Call predict on the batched test set
    y_pred = saved_model.predict(test_set).flatten()

    # Create a dict 
    label_dict[model.stem] = y_pred

    # Save to disk
    np.save((models/model.stem).with_suffix('.npy'), y_pred)
bidir_lstm_glove
47/47 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step
best_multimodal_model
47/47 ━━━━━━━━━━━━━━━━━━━━ 2s 31ms/step
multimodal_model
47/47 ━━━━━━━━━━━━━━━━━━━━ 1s 30ms/step
bidir_lstm
47/47 ━━━━━━━━━━━━━━━━━━━━ 1s 19ms/step
better_multimodal_model
47/47 ━━━━━━━━━━━━━━━━━━━━ 1s 30ms/step
finetuned_cnn
47/47 ━━━━━━━━━━━━━━━━━━━━ 3s 59ms/step
multimodal_model_sm
47/47 ━━━━━━━━━━━━━━━━━━━━ 1s 30ms/step
bigram
47/47 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
multimodal_regressor
47/47 ━━━━━━━━━━━━━━━━━━━━ 1s 30ms/step
pretrained_cnn
47/47 ━━━━━━━━━━━━━━━━━━━━ 3s 60ms/step
basic_cnn
47/47 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step
better_cnn
47/47 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step
# Retrieve the numpy arrays from disk and construct a DataFrame
label_dict = {}
for model in models.glob('*.npy'):
    model_path = model
    modelname = Path(model_path).stem
    label_dict[modelname] = np.load(model_path).flatten()
    
all_predictions = pd.DataFrame(label_dict)
# Create a results dataframe, leveraging sklearn metrics
auc_dict = {}
accuracy_dict = {}
for model in all_predictions.columns:
    auc_dict[model] = roc_auc_score(y_true_test, all_predictions[model])
    
    y_pred_binary = (all_predictions[model] > 0.5).astype(int)
    accuracy_dict[model] = accuracy_score(y_true_test, y_pred_binary)

columns = ['bigram', 
        'bidir_lstm', 
        'bidir_lstm_glove', 
        #  'basic_cnn',  
        'better_cnn', 
        # 'pretrained_cnn', 
        # 'finetuned_cnn',  
        # 'multimodal_model', 
        'multimodal_model_sm', 
        'best_multimodal_model', 
        'multimodal_regressor',
        #  'better_multimodal_model'
        ]
test_set_results = pd.DataFrame([accuracy_dict, auc_dict], index=['Accuracy', 'AUC'], columns=columns).T.round(4) * 100
test_set_results 
The key evaluation metrics for the selected models.
Accuracy AUC
bigram 80.43 88.70
bidir_lstm 78.73 86.95
bidir_lstm_glove 77.13 85.66
better_cnn 58.50 60.76
multimodal_model_sm 81.53 89.49
best_multimodal_model 81.80 89.71
multimodal_regressor 75.80 84.38
ax = test_set_results.plot.bar(title='Selected Accuracy and AUC (test)', 
                      rot=45, xlabel='', ylabel='Percent')

for x, y in zip(ax.get_xticks(), test_set_results.AUC.values):
    ax.text(x , y + 1, round(y, 1))
for x, y in zip(ax.get_xticks(), test_set_results.Accuracy.values):
    ax.text(x - 0.5, y + 1, round(y, 1))

plt.ylim(0, 100)
plt.legend(['Accuracy', 'AUC'],
           bbox_to_anchor=(1, 1, 0, 0))
plt.show()
Figure 9: Accuracy and AUC for the selected models.
fig, ax = plt.subplots(figsize=(8,6))

selected_columns = ['bigram', 
        'bidir_lstm', 
        'bidir_lstm_glove', 
        #  'basic_cnn',  
        'better_cnn', 
        # 'pretrained_cnn', 
        #  'finetuned_cnn',  
        # 'multimodal_model', 
        # 'multimodal_model_sm', 
        'best_multimodal_model', 
        'multimodal_regressor',
        #  'better_multimodal_model'
        ]

for model in selected_columns:
    y_pred = all_predictions[model]
        
    RocCurveDisplay.from_predictions(y_true_test, y_pred,
                                    label=model, ax=ax)

plt.plot([0, 1], [0, 1], 'k--')
plt.legend(bbox_to_anchor=(1, 1, 0, 0))
plt.suptitle('Receiver Operating Characteristic Plot')
plt.show()

A ROC curve plot, showing the trade-off between true positive and false positive rates. Note the overlapping curves: the multimodal model only marginally surpasses the bag-of-bigrams model.
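The trade-off behind these curves can also be read off numerically. A small sketch, reusing y_true_test and all_predictions from above, samples a few operating points along the best model’s curve:

# Hedged sketch: print a handful of (threshold, FPR, TPR) operating points.
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true_test, all_predictions['best_multimodal_model'])
step = max(len(fpr) // 5, 1)
for f, t, th in zip(fpr[::step], tpr[::step], thresholds[::step]):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")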

6.1.1 The Multimodal Classifier

In this section, we analyse the best-performing multimodal classifier, focusing on its classification errors. Specifically, we examine the proportion of false positives and false negatives, assessing their impact on model performance. We evaluate these errors through recall and precision, providing a clearer understanding of the classifier’s strengths and weaknesses.

y_pred_binary_best = (all_predictions.best_multimodal_model > 0.5).astype(int)
conf_disp = ConfusionMatrixDisplay.from_predictions(y_true_test, y_pred_binary_best, 
                                        display_labels=['Left', 'Right'])

conf_disp.ax_.grid(visible=False) 
conf_disp.ax_.set_title('Confusion matrix for the Best Multimodal Classifier')   
plt.show()                               

The confusion matrix for the best model overall, showing a remarkable balance between the false positive and false negative rates.
print(classification_report(y_true_test, y_pred_binary_best, 
                            target_names=['Left', 'Right']))
              precision    recall  f1-score   support

        Left       0.82      0.84      0.83      1562
       Right       0.82      0.80      0.81      1438

    accuracy                           0.82      3000
   macro avg       0.82      0.82      0.82      3000
weighted avg       0.82      0.82      0.82      3000
Discussion of evaluation

The ConvNet branch did not improve performance of the multimodal model

Despite its non-trivial performance, the features learned by the better_cnn ConvNet did not translate into better performance for the multimodal model.

Recall and Precision are in near-perfect balance

Our class balancing efforts clearly paid off. We can see no trade-off between recall and precision, with both classes getting consistently good scores.

6.2 Summary and Conclusions

This project demonstrated the efficacy of multimodal architectures for political bias classification. Harnessing deep learning techniques to combine a Bag-of-bigrams text model and a ConvNet image model, we achieved a remarkable accuracy of 82% and AUC of 89.7% on our test set.

The multimodal classifier performs better than both the image classifier and the text classifier individually. However, as seen in Figure 9, there is a wide performance gap between the text and image models. Confirming findings by Wang et al. (2022), the best text classifier performs nearly as well as the best multimodal classifier (which is about 1.4 percentage points better in accuracy and 1 percentage point in AUC), while the image models lag far behind (roughly 30 percentage points).

Our best image model achieves 60.8% AUC and 58.5% accuracy on the test set. This leaves room for improvement, yet the result is significant. A key takeaway from the process is the standardisation of the images. Simple resizing would have squeezed and deformed photos. The cropping logic employed in our data collection pipeline ensured that translation-invariant features could be learned effectively. For reproducibility, we attach the data collection pipeline in Appendix 8.2 (articles) and Appendix 8.3 (images), which includes details of how we downloaded images and converted them for ingestion by the models.

Our best text model scored significantly higher. With 88.7% AUC, it landed just one percentage point under the combined model. As illustrated in Appendix 8.4, our best text classifier excels at representing textual features, highlighted by easily identifiable words and word combinations that demarcate ideological leanings.

A final contribution of this work is its multimodal regressor, which effectively identifies ideologically radical content, as illustrated in Appendix 8.5.

Challenges included the many images with logos, allowing ConvNets to memorise these as shortcuts (see Figure 7; Appendix 8.4 illustrates this with two logos appearing among the most confident predictions). Additionally, the idiosyncratic word choices of the American President may have made classification easier, potentially inflating performance metrics. This is a broader challenge, as bigram models inevitably latch onto idioms and stylistic quirks rather than true bias indicators. For example, a classifier could learn that a certain publication always uses particular phrases, essentially learning an outlet “signature” instead of the ideological content. This leads to overfitting.
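One way to probe for such outlet “signatures” would be to hold out entire publications, so that no source appears in both the training and test sets. A minimal sketch, assuming articles_df with its source_api and binary_bias columns is available as in Appendix 8.2:

# Hedged sketch: publication-level hold-out to test generalisation across outlets.
from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(
    articles_df, articles_df.binary_bias, groups=articles_df.source_api))

print("Training outlets:", articles_df.iloc[train_idx].source_api.nunique())
print("Held-out outlets:", articles_df.iloc[test_idx].source_api.nunique())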

6.2.1 Future Directions

Future work should incorporate circulation and readership data to mitigate publication dominance (see Figure 2). Some authors, such as Peng et al. (2025), propose using attention mechanisms – hierarchical, self-attention, and cross-modal attention – to model nuanced interactions between text and images.
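As a rough illustration of the cross-modal idea, the sketch below lets text features attend over image features with Keras’s MultiHeadAttention layer, reusing the Input/layers/Model imports from above. The sequence shapes are illustrative assumptions, not outputs of the models built in this project.

# Hedged sketch: cross-modal attention between text and image feature sequences.
text_feats = Input(shape=(64, 128))    # (tokens, dims) -- assumed shapes
image_feats = Input(shape=(25, 128))   # (patches, dims) -- assumed shapes

# Text tokens attend over image patches (the reverse direction could be added too).
attended = layers.MultiHeadAttention(num_heads=4, key_dim=32)(
    query=text_feats, value=image_feats, key=image_feats)

pooled = layers.GlobalAveragePooling1D()(attended)
outputs = layers.Dense(1, activation='sigmoid')(pooled)
cross_modal_sketch = Model(inputs=[text_feats, image_feats], outputs=outputs)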

While binary classification provides a strong starting point, future work can explore multi-class classification, integrating centrist publications to capture a broader spectrum of political orientations. Furthermore, it could rebalance the dataset to reduce the dominance of certain publications, starting from readership numbers and collecting articles in proportions that reflect them, or focusing on UK-only sources. Both of these approaches would be easy to implement with the code in Section 8.2 and Section 8.3. With a rebalanced and enlarged dataset, further advancements could include contextual embeddings such as BERT (SentenceTransformers) to enable a more fine-grained representation of text features.
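As a pointer to that last direction, contextual sentence embeddings can be produced in a couple of lines. A minimal sketch with the sentence-transformers package; the model name and example headlines are illustrative and not part of this project’s pipeline:

# Hedged sketch: encode headlines into contextual embeddings with SentenceTransformers.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
titles = ["Budget headroom error rattles markets", "Summit talks focus on trade"]  # illustrative
embeddings = encoder.encode(titles)
print(embeddings.shape)   # (2, 384): one dense vector per headline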

Finally, this project’s methodology offers promising applications in media recommendation systems. By balancing news exposure to include diverse perspectives, such systems could reduce ideological echo chambers and promote exposure to alternative viewpoints. By addressing the challenges posed by modal imbalance, computational complexity, and shortcuts, future efforts could make a meaningful impact to the study of political bias and societal discourse.

Overall, multimodal deep learning for political bias detection is making news analysis more holistic by not only reading what is written but also seeing how it’s presented, which aligns closely with how human readers detect bias across media.

7 References

“AllSides.” n.d. https://www.allsides.com/about.
Chollet, François. 2018. Deep learning with Python. Shelter Island, NY: Manning.
———. 2021. Deep Learning with Python, Second Edition. New York: Manning Publications Co. LLC.
———. 2024. Deep learning with Python. Third ed., MEAP. Shelter Island, NY: Manning.
Hajare, Prasad, Sadia Kamal, Siddharth Krishnan, and Arunkumar Bagavathi. 2021. “A Machine Learning Pipeline to Examine Political Bias with Congressional Speeches.” https://arxiv.org/abs/2109.09014.
Helberger, Natali. 2019. “On the Democratic Role of News Recommenders.” Digital Journalism 7 (8): 993–1012. https://doi.org/10.1080/21670811.2019.1623700.
Kingma, Diederik P., and Jimmy Ba. 2017. “Adam: A Method for Stochastic Optimization.” https://arxiv.org/abs/1412.6980.
O’Malley, Tom, Elie Bursztein, James Long, François Chollet, Haifeng Jin, Luca Invernizzi, et al. 2019. “KerasTuner.” https://github.com/keras-team/keras-tuner.
Peng, Liwen, Songlei Jian, Minne Li, Zhigang Kan, Linbo Qiao, and Dongsheng Li. 2025. “A Unified Multimodal Classification Framework Based on Deep Metric Learning.” Neural Networks 181 (January): 106747. https://doi.org/10.1016/j.neunet.2024.106747.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), edited by Alessandro Moschitti, Bo Pang, and Walter Daelemans, 1532–1543. Doha, Qatar: Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162.
Thomas, Christopher, and Adriana Kovashka. 2019. “Predicting the Politics of an Image Using Webly Supervised Data.” CoRR abs/1911.00147. http://arxiv.org/abs/1911.00147.
Van Cauwenberghe, Johannes. 2025. “Political Bias in News: A Feature-Weighted Classifier.”
Wang, Zhen, Xu Shan, Xiangxie Zhang, and Jie Yang. 2022. “N24News: A New Dataset for Multimodal News Classification.” In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022), edited by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, et al., 6768–6775. Marseille, France: European Language Resources Association. https://aclanthology.org/2022.lrec-1.729/.

8 Appendices

8.1 Appendix I: Exhaustive overview of the sources

articles_df['source'] = articles_df.source_api.str.removesuffix('.com')
ax = bias_viz.plot.barh('source', figsize=(6,12),
                    title='Political Bias Ratings',ylabel='',
                        xlabel='Political Bias (Left - Right)', legend=False)

ax.vlines(0, *ax.get_ylim(), "gray", lw=0.5)
ax.set_xlim(-6,6)
plt.show()

An exhaustive overview of the sources

8.2 Appendix II: Extending the dataset

To facilitate future iterations, this is the code used to compile the dataset.

# Extend articles_df

import os

import pandas as pd
from newscatcherapi import NewsCatcherApiClient

articles_df = pd.read_pickle('.data/articles.pkl')

def extend_articles_df(articles_df):
    """
    Extends the articles DataFrame by making 20 API calls to fetch additional articles
    from left and right biased sources using the NewsCatcher API.

    Args:
        articles_df (pd.DataFrame): The original DataFrame containing articles.

    Returns:
        pd.DataFrame: The extended DataFrame with additional articles.
    """        
    newscatcherapi = NewsCatcherApiClient(
        x_api_key=os.getenv('NEWSCATCHER_API_KEY')) 
    
    # Fetch articles from API (recursively)
    bias_articles = []
    for bias in ['left', 'right']:
        bias_sources = articles_df[articles_df.bias == bias].source_api.dropna().unique()
        
       
        all_articles = newscatcherapi.get_search_all_pages(q='*',
                                                from_='3 months ago',  
                                                topic='politics',
                                                lang='en',
                                                sources=bias_sources,                               
                                                )
        bias_articles_df = pd.DataFrame(all_articles['articles'])
        bias_articles_df['bias'] = bias
        bias_articles.append(bias_articles_df)

    # Concatenate the dataframes
    df_comb = pd.concat(bias_articles, ignore_index=False)

    # Clean the combined dataframes
    df_comb.drop_duplicates(subset='title', inplace=True)
    df_comb.dropna(subset='title', inplace=True)
    df_comb.rename(columns={'excerpt':'snippet', 
                            'clean_url': 'source_api'}, inplace=True) 
    
    # Create a new lookup table for adding numerical ratings
    lookup = articles_df[['numerical_rating', 'source_api']]
    lookup = lookup.drop_duplicates('source_api').dropna(subset='source_api') #.set_index('source_api')

    # Join `numerical_rating` and save
    df_comb_ = df_comb.join(lookup.set_index('source_api'), on='source_api')
    return df_comb_

# Run and save
# articles_df = extend_articles_df(articles_df)
# articles_df[articles_df.columns].to_pickle('.data/articles/articles_ext.pkl')

8.3 Appendix III: Downloading images

To add nearly 20,000 images to the dataset, we followed the media URL in each record and downloaded the images.

These are high-resolution images. Based on an estimate, they could easily require 10 GB of storage (\(20,000 \times 500\text{kB}\)). Thus we borrowed some cropping logic from the Keras package (adapted from keras.src.utils.image_utils) to resize the images in memory before saving them to disk. We also named each image after its DataFrame index.

While the vast majority of downloads succeeded, some did not. We also provide code to check for missing files, retry the failed downloads, and reset the articles_df DataFrame to align with the newly compiled image set.

# Save the images to a folder
import pandas as pd
from io import BytesIO
import requests
from PIL import Image
from fake_useragent import UserAgent

ua = UserAgent()

def crop(img_size, width_height_tuple=(180,180)):
    """From `keras.src.utils.image_utils`"""

    width, height = img_size
    target_width, target_height = width_height_tuple   
    crop_height = (width * target_height) // target_width
    crop_width = (height * target_width) // target_height
    crop_height = min(height, crop_height)
    crop_width = min(width, crop_width)

    crop_box_hstart = (height - crop_height) // 2
    crop_box_wstart = (width - crop_width) // 2
    crop_box_wend = crop_box_wstart + crop_width
    crop_box_hend = crop_box_hstart + crop_height
    crop_box = [
        crop_box_wstart,
        crop_box_hstart,
        crop_box_wend,
        crop_box_hend,
    ]
    return crop_box    

def save_imgs(url, index):
    """
    Downloads and saves an image from a URL with the DataFrame index as the filename.
    
    Args:
    url (str): The image URL.
    index (int): The index from the DataFrame.

    Returns:
    str: The saved filename.
    """
    if not isinstance(url, str) or not url.startswith("http"):
        return None  # Skip invalid URLs

    try:
        # Get image content
        res = requests.get(url, timeout=1, 
                           headers={"user-agent":ua.random})
        
        if res.status_code != 200:
            return None

        # Open image and resize
        img = Image.open(BytesIO(res.content))        
        if img.mode == "RGBA":
            img = img.convert("RGB")

        img = img.resize((180, 180), 
                         resample=Image.NEAREST, 
                         box=crop(img_size=img.size))  # Nearest resampling

        # Save image with index as filename
        filename = (imgs/str(index)).with_suffix(".jpg")
        img.save(filename, format="JPEG")

        return filename 

    except Exception as e:
        print(index, url, e.args)
        return None
    
# articles_df['imgs'] =  articles_df.progress_apply(lambda row: save_imgs(row['media'], row.name), axis=1)   

Check for missing items

# Check folder for missing items
all_downloaded = set([int(img.stem) for img in imgs.glob('*.jpg')])
all_idxes = set(range(articles_df.index.max() + 1))
remainder = all_idxes - all_downloaded
len(remainder)
1458

Retry

We retry downloading the 1458 failed downloads once, and proceed to remove the remainder.

# Try again on remainder
# articles_df.loc[list(remainder)].progress_apply(lambda row: save_imgs(row['media'], row.name), axis=1)
# Check if all files are accounted for
assert articles_df.index.max() + 1 - len(remainder) == len(all_downloaded)

Remove missing

# Reset the dataframe
remaining_idxs = articles_df.index.difference(remainder)
articles_df = articles_df.loc[remaining_idxs]
articles_df.shape
(19730, 9)

Check alignment

# Verify if random image (as it appears locally) matches url
imgs_in_folder = [int(img.stem) for img in imgs.glob('*.jpg')]
test_img = np.random.choice(imgs_in_folder, 1)
print("Index position:", test_img[0])

# This link should now correspond to this image
print("Click to check:", articles_df.media.loc[test_img].values)
Image.open((imgs/str(test_img[0])).with_suffix('.jpg'))
Index position: 13780
Click to check: ['https://assets.zerohedge.com/s3fs-public/styles/16_9_max_700/public/2024-10/241024kamala.jpg?itok=v8a5a1LL']

Save

# articles_df.to_pickle('.data/articles/articles_index_ext.pkl')

8.4 Appendix IV: Inspecting high-confidence predictions

A brief inspection of where the models are most confident in their predictions. We take the probabilities output by the sigmoid, sort them, and show the most extreme values. This is insightful as it shows what the models have learned. In the case of text, talk of budget cuts features among the most left-leaning predictions, while the “story of Jesus’ birth” appears among the most right-leaning ones. Our series of “most radical” image predictions is less revealing, however. The model does seem to pick up on the logos of The Telegraph and The Guardian, as seen in Figure 7.

from IPython.display import display, Markdown

test_snippets = pd.read_pickle(articles/'up_to_feb24.pkl').iloc[-3000:].title_snippet.values

def print_top_5(pred_path, bias: Literal['left', 'right'] = 'left'):
    # Load the numpy array with predictions
    preds = np.load(pred_path)

    # Get the indices of the sorted array
    order = preds.argsort()
    if bias == 'right':
        order = order[::-1] # reverse them

    # Use the indices to order the preds
    preds_ordered = np.take(preds, order)[:5]

    # Idem for the snippets
    snippets = test_snippets[order][:5]

    # Construct a markdown table
    md_string = f'Most {bias}-biased snippet ({pred_path.stem} model) | Score \n -----|-'
    for p, s in zip(preds_ordered, snippets):
        md_string += f"\n{s} | {p:.4f} "
    display(Markdown(md_string))

pred_path = models/'bigram.npy' # or: for pred_path in models.glob('bi*.npy'):
print_top_5(pred_path, 'left')
print_top_5(pred_path, 'right')
Table 2: Most left- and right-biased text, according to the Bag-of-bigrams model. Showing sigmoid probabilities on test set as scores.
Most left-biased snippet (bigram model) Score
Pentagon Official Linked To Iranian Influence Network Gets Promotion. ‘This official is not a subject of interest’ 0.0009
OBR error cut £18bn of headroom from Rachel Reeves’ Budget. The error may have contributed to the market jitters that were seen in the wake of the Budget 0.0011
How new bank transfer scam protections could help you. Banks must now refund up to £85,000 of losses from authorised push payment fraud 0.0011
Israeli Weapon Seen in Rare AP Photos of Beirut Airstrike Appears to be a Powerful Smart Bomb. In all but the blink of an eye, an Associated Press photographer’s camera captured the moments that a battleshipgray Israeli bomb plummeted toward a Beirut building before detonating to bring the tower down. The airstrike came 40 minutes after Israel warned people to… 0.0011
Editor Daily Rundown: Trump Takes Questions From Garbage Truck After Biden’s Comment. Calling all Patriots! 0.0011
Most right-biased snippet (bigram model) Score
Trump and Harris make their final pitch to voters in last weekend before election day. Vice President Kamala Harris and former President Trump barnstorm battleground states in the last days before the election. 0.9988
On Christmas Eve, Pope Francis Appeals for Courage. Pope Francis said the story of Jesus’ birth as a poor carpenter’s son should instill hope that all people can make an impact on the world, as the pontiff on Tuesday led the world’s Roman Catholics into Christmas. 0.9987
Biden Admin’s Revolving Door With Left-Wing Green Groups Continues as Environmental Official Lands Cushy Gig With Anti-Fossil Fuel Radicals. Bureau of Land Management director Tracy Stone-Manning has already lined up her post-Biden administration job: a cushy six-figure gig leading the Wilderness Society, an influential Washington, D.C.… 0.9985
China Seeks Deeper Economic Ties with ASEAN at Summit Talks as South China Sea Disputes Lurk. Chinese Premier Li Qiang called for deeper market integration with Southeast Asia on Thursday during annual summit talks where territorial disputes in the South China Sea are likely to be high on the agenda.The 10member Association of Southeast Asian Nations’ meeting with… 0.9984
House Republicans float rule change to grant power to interim speaker in case of future ousters. The proposal comes after the historic ouster of former Speaker Kevin McCarthy (R-CA) last year that resulted in a three-week period of inaction. 0.9983
index = pd.read_pickle(articles/'up_to_feb24.pkl').iloc[-3000:].index

def print_top_5(pred_path, bias: Literal['left', 'right'] = 'left', top=5):
    preds = np.load(pred_path)
    order = preds.argsort()
    if bias == 'right':
        order = order[::-1]

    preds_ordered = np.take(preds, order)[:top]
    imgs_ordered = np.take(index, order)[:top]

    _, axs = plt.subplots(1, top, figsize=(8,2), 
                            sharex=True, sharey=True)
    axs = axs.flatten()
    for i, (pred, img) in enumerate(zip(preds_ordered,imgs_ordered)):
        img_path = (imgs/'test'/str(img)).with_suffix(".jpg")
        img = load_img(img_path, target_size=(180, 180), 
                       keep_aspect_ratio=True)
        array = img_to_array(img)
        axs[i].imshow(array.astype("uint8"))
        axs[i].set_title(f"{pred:.2f}")
        axs[i].axis('off')

    plt.suptitle(f'Most {bias}-biased images')
    plt.tight_layout()
    plt.show()

pred_path = models/'better_cnn.npy'
print_top_5(pred_path, 'left', 6)
print_top_5(pred_path, 'right', 6)
(a) Most left- and right-biased images, according to the better_cnn model. Clearly, the model struggles with the complexity of the input.
(b)
Figure 10

8.5 Appendix V: The Multimodal Regressor

A quick look at the numerical ratings the regressor produced, in the context of the text, image, and the ground truth.

# Our multimodal regressor model 
# ordered by scalar prediction values (not sigmoid probabilities!)

from textwrap import wrap

def print_top_5(pred_path, bias: Literal['left', 'right'] = 'left', top=5):
    test_articles_sorted = pd.read_pickle(articles/'up_to_feb24.pkl').iloc[-3000:].sort_index()
    y_true_numerical = test_articles_sorted.numerical_rating.values
    title_snippet = test_articles_sorted.title_snippet.values
    index = test_articles_sorted.index.values

    preds = np.load(pred_path)
    order = preds.argsort()
    if bias == 'right':
        order = order[::-1]

    preds_ordered = np.take(preds, order)[:top]
    imgs_ordered = np.take(index, order)[:top]
    y_true_numerical_ordered = np.take(y_true_numerical, order)[:top]
    title_snippet_ordered = np.take(title_snippet, order)[:top]

    mega_zip = zip(preds_ordered, imgs_ordered, y_true_numerical_ordered, title_snippet_ordered)

    _, axs = plt.subplots(top, 1, figsize=(9,6), sharex=True, sharey=True)
    axs = axs.flatten()
    for i, (pred, img, true, title) in enumerate(mega_zip):
        img_path = (imgs/'test'/str(img)).with_suffix(".jpg")
        img = load_img(img_path, target_size=(180, 180), 
                        keep_aspect_ratio=True)
        
        array = img_to_array(img)
        axs[i].imshow(array.astype("uint8"))
        string = f"Model prediction: {pred:.2f}, True: {true}, Text: {title}"
        axs[i].text(200, 150, "\n".join(wrap(string, 80)))
        # axs[i].set_title(f"{pred:.2f}")
        axs[i].axis('off')

    plt.suptitle(f'Most {bias}-biased predictions (regression model)')
    plt.tight_layout()
    plt.show()

pred_path = models/'multimodal_regressor.npy'
print_top_5(pred_path, 'left', 3)
print_top_5(pred_path, 'right', 3)
Our multimodal regressor model ordered by prediction values. (Not to be confused with the sigmoid probabilities above.)

Footnotes

  1. See Appendix II and III for the code to reproduce the dataset or build further on this work.↩︎

  2. These are licenced under CC BY-NC 4.0; “Ratings may be used for research or noncommercial purposes with attribution” (“AllSides,” n.d.).↩︎
