A Systematic Exploration of News Classification Using Neural Networks

DSM150 Mid-term Coursework

Author: Johannes Van Cauwenberghe

Affiliation: University of London

Published: December 17, 2024

Abstract

This report explores the application of deep learning techniques to news classification, using the Reuters newswire dataset. The work follows the universal workflow of machine learning as outlined in Deep Learning with Python by François Chollet. The goal is to provide a structured and thorough examination of the deep learning process, from simple models to more complex architectures. The report walks through the following stages of development: (1) establish a common-sense baseline, (2) build a basic model that beats the baseline, (3) tune model parameters, and (4) test and evaluate performance.

1 Imports and Dataset

To begin, we import the necessary libraries and load the Reuters dataset, a standard benchmark for multi-class classification tasks. This dataset provides a collection of news articles categorised into 46 topics. The task at hand is a multi-class classification task in Keras.

# Import numerical packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# General python packages
from itertools import product
from collections import Counter

# Import metrics
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

# Turn off (most) warnings
import os; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# Import tensorflow for tensor operations 
import tensorflow as tf

# Import keras
# NOTE import pattern changed
import keras; print("Using keras version:", keras.version()) # 3.7.0
from keras.api.models import Sequential
from keras.api.layers import Input, Dense, Dropout
from keras.api.utils import to_categorical
from keras import regularizers

# Set the seed
keras.utils.set_random_seed(153)

# Import dataset
from keras.api.datasets import reuters
Using keras version: 3.7.0
# Target names
print(reuters.get_label_names())
('cocoa', 'grain', 'veg-oil', 'earn', 'acq', 'wheat', 'copper', 'housing', 'money-supply', 'coffee', 'sugar', 'trade', 'reserves', 'ship', 'cotton', 'carcass', 'crude', 'nat-gas', 'cpi', 'money-fx', 'interest', 'gnp', 'meal-feed', 'alum', 'oilseed', 'gold', 'tin', 'strategic-metal', 'livestock', 'retail', 'ipi', 'iron-steel', 'rubber', 'heat', 'jobs', 'lei', 'bop', 'zinc', 'orange', 'pet-chem', 'dlr', 'gas', 'silver', 'wpi', 'hog', 'lead')
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

# Inspect shape
train_data.shape, test_data.shape
((8982,), (2246,))

Inspect dataset structure
The documents are represented as sequences of integers, where each integer corresponds to a word index in a dictionary.

# Inspect data structure
print(train_data[0])
[1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]
print(train_labels[0])
3

A note on interpretation
Keras provides helper functions such as reuters.get_word_index() and reuters.get_label_names() for mapping indices back to words and labels. This makes the dataset both interpretable and flexible for experimentation.

# Let's inspect one random article
article_idx = np.random.randint(0, len(train_data))

# Each example is a list of integers (word indices)
article_to_inspect = train_data[article_idx] 

# reverse dict
reverse_word_index = dict((value, key) for (key, value) in reuters.get_word_index().items())

# idx 0, 1, and 2 are reserved
decoded_newswire = " ".join(reverse_word_index[i] for i in article_to_inspect[3:])

print("Article label:\n", reuters.get_label_names()[train_labels[article_idx]], 
      '\n\nArticle content: \n', decoded_newswire)
Article label:
 earn 

Article content: 
 year reuter all year an from finalized kay reuter parana awards 1987 were hart authorization reuter which heart stream some of excess corn that members 500 pct dlrs
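
One detail worth keeping in mind when decoding: load_data() shifts every word index by index_from=3 by default, because 0, 1 and 2 are reserved for the padding, start-of-sequence and out-of-vocabulary markers, whereas get_word_index() returns the unshifted indices. Decoding therefore usually subtracts 3 from each token. The following is a minimal sketch of that idea. It uses a small hypothetical word index (toy_word_index) and a hypothetical helper (decode_sequence) rather than the real Reuters vocabulary.

```python
# Hypothetical toy word index standing in for reuters.get_word_index()
toy_word_index = {"the": 1, "of": 2, "profit": 3, "rose": 4}
reverse_toy_index = {idx: word for word, idx in toy_word_index.items()}

def decode_sequence(sequence, reverse_index, index_from=3):
    """Map shifted token ids back to words; reserved or unknown ids become '?'."""
    return " ".join(reverse_index.get(token - index_from, "?") for token in sequence)

# 1 = start marker, 2 = out-of-vocabulary marker, the rest are shifted word ids
encoded = [1, 4, 6, 7, 2]
decoded = decode_sequence(encoded, reverse_toy_index)
print(decoded)
```

Using dict.get() with a placeholder avoids a KeyError for the reserved indices, which have no entry in the word index.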

2 Preprocessing

Effective preprocessing is essential for preparing the dataset for neural network training. Following the guidelines in Chollet (2024), we process both the article content and category labels, converting them for input to a neural network. This involves two main steps:

  1. Multi-Hot Encoding of Articles:
    Each article, represented as a sequence of integers, is transformed into a binary bag-of-words (BoW) representation using the vectorize_sequences() function.

    This vector has a fixed length of 10,000, where:

    • Each position corresponds to a word index in the dictionary.
    • The value at a position is 1 if the word is present in the article, otherwise 0.

    For example, an article represented as [1, 3] becomes [0, 1, 0, 1]. This approach is analogous to Scikit-learn’s CountVectorizer with binary counts and is particularly useful in text preprocessing pipelines.

  2. One-Hot Encoding of Labels:
    The 46 category labels are vectorised into binary vectors of length 46 using the to_categorical() function. Each vector has a single “hot” index corresponding to the category label, enabling seamless multi-class classification.

# Multi-hot encoding of sequences,
# using the helper function given in Chollet (2024)
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
# One-hot encoding of labels
y_train = to_categorical(train_labels)
y_test = to_categorical(test_labels)
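
To make the two encodings concrete, here is a small sketch on toy data. The helpers multi_hot and one_hot are hypothetical stand-ins written in plain NumPy for illustration. They are not the Keras implementations, and the toy articles and labels are made up.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Binary bag-of-words: 1.0 where a word index occurs, else 0.0."""
    encoded = np.zeros((len(sequences), dimension))
    for row, sequence in enumerate(sequences):
        encoded[row, sequence] = 1.0  # fancy indexing sets all listed indices at once
    return encoded

def one_hot(labels, num_classes):
    """One row per label with a single 1.0 at the label's index."""
    return np.eye(num_classes)[np.asarray(labels)]

toy_articles = [[1, 3], [0, 0, 2]]   # a repeated word still yields a single 1
toy_labels = [2, 0]

x_toy = multi_hot(toy_articles, dimension=4)
y_toy = one_hot(toy_labels, num_classes=3)
print(x_toy)
print(y_toy)
```

The first toy article reproduces the [1, 3] example from the text. The second shows that word counts and word order are discarded by this representation.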

3 Establishing a Common-Sense Baseline

It is advised to define a common-sense baseline against which model performance can be measured (Chollet, 2024; Karpathy, 2019). This ensures that the developed model provides a meaningful improvement.

A simple model predicts the most prevalent category in the training dataset for all test samples. It does this in three steps:

  1. Identify the most frequent label (earn) in the training dataset.
  2. Use this label as the predicted category for all test samples.
  3. Calculate the accuracy of these predictions by comparing them to the true labels.

# 1. 
# Identify the most frequent label and inspect the class distribution in the training dataset
from collections import Counter

label_count = Counter(reuters.get_label_names()[i] for i in train_labels)
normalised_count = np.array(list(label_count.values())) / label_count.total()
pd.Series(normalised_count, index=label_count.keys()).sort_values().plot.barh(figsize=(6,8), title='Class Distribution in Training Data')
plt.show()

# 2. 
# Return the index of the most prevalent label ('earn')
most_prevalent = reuters.get_label_names().index('earn')

# 3. 
# Predict 'earn' for all test labels and calculate accuracy
hits_array = np.array(test_labels) == np.full_like(np.array(test_labels), most_prevalent) 
baseline = hits_array.mean()
print(f"Our baseline is: {baseline:.3f}")
Our baseline is: 0.362
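
As a variation on the three steps above, the majority label can be derived from the training labels instead of being looked up by name. The following sketch assumes integer labels. The function majority_baseline and the toy labels are hypothetical and only illustrate the idea.

```python
from collections import Counter
import numpy as np

def majority_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    accuracy = float(np.mean(np.asarray(test_labels) == majority_label))
    return majority_label, accuracy

# Hypothetical toy labels, not the Reuters data
toy_train = [3, 3, 4, 3, 1, 4]
toy_test = [3, 4, 3, 3]
label, acc = majority_baseline(toy_train, toy_test)
print(label, acc)
```

Because the label is chosen from the training labels only, the test labels are used solely for scoring.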

Key Insights

This baseline provides a solid starting point for comparing the performance of the neural network. Arguably, it is preferable to the method used in Chollet (2024), which uses the test_labels and therefore introduces data leakage from the test set when determining the baseline.


4 Building a Small Model That Beats the Baseline

We will now construct a simple neural network using Keras’ Sequential API. The goal is to outperform the baseline while remaining simple and interpretable.

The model is constructed with two layers:

  1. Input Layer: Takes 10,000-dimensional input vectors, matching the shape of the preprocessed data.
  2. Dense Layer: Uses softmax activation for multi-class classification, which outputs probabilities that sum to 1.

It is compiled with three key components:

  • Optimizer: RMSprop, a gradient-based optimisation algorithm that adapts the step size of each parameter using a moving average of recent squared gradients.
  • Loss Function: CategoricalCrossentropy for one-hot encoded labels in multi-class classification.
  • Metrics: Alongside CategoricalAccuracy, we include AUC, a threshold-independent metric that complements accuracy given the dataset’s inherent class imbalance.

It is trained with the following hyperparameters:

  • A batch size of 512, to minimise sampling error and make it more likely that all classes are seen on each iteration. Note that, due to extreme class imbalance, this still isn’t large enough for some rare classes. Therefore, the opposite approach may be equally sensible: very small batch sizes may even out the effect of unseen classes in some batches. We will test this below.
  • A validation split of 0.3, to ensure the validation set is representative of the entire dataset.
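
The softmax output and the categorical cross-entropy loss described above can be illustrated with a few lines of NumPy. This is a simplified sketch on made-up logits and targets. The helper names are hypothetical, and Keras' own implementations differ in detail.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=-1)))

toy_logits = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])    # two samples, three classes
toy_targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # one-hot labels
probs = softmax(toy_logits)
loss = categorical_crossentropy(toy_targets, probs)
print(probs.sum(axis=-1), loss)
```

The second sample has equal logits, so each class receives a probability of one third and its loss contribution is log(3). That is the loss of a model with no preference among three classes.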

# Define the architecture priors
model = Sequential([
    Input(shape=(10000,)),          # Input layer for multi-hot encoded sequences
    Dense(46, activation="softmax") # Output layer for multi-class classification
])

# Compile the model with relevant metrics
model.compile(optimizer=keras.optimizers.RMSprop(),
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=[
                  keras.metrics.AUC(),                  # For overall evaluation
                  keras.metrics.CategoricalAccuracy(),  # Direct comparison with baseline
              ])

# Train the model
history = model.fit(
    x_train,
    y_train,
    epochs=30,
    batch_size=512,  # Large batch size so that more classes are seen in each batch
    validation_split=0.3, 
    verbose=0
)
def plot_loss(history: keras.callbacks.History):
    """
    Plot training and validation metrics  
    including a line for the minimum ``val_loss`` 
    and maximum ``val_categorical_accuracy``.

    :history: The model's history object.
    :returns: None
    """

    _, axs = plt.subplots(1, 2, figsize=(12, 5))

    # Plot loss
    pd.DataFrame(
        {'loss': history.history['loss'], 'val_loss': history.history['val_loss']}
    ).plot(title='loss vs val_loss', ax=axs[0], xlabel='Epoch')

    min_val_loss = min(history.history['val_loss'])
    axs[0].hlines(min_val_loss, 0, len(history.epoch), color='r', linestyle='--')
    axs[0].text(s=f"Min. val_loss: {min_val_loss:.3f}", x=len(history.epoch)-10, y=min_val_loss - 0.1, c='r')

    # Plot accuracy
    pd.DataFrame(
        {'categorical_accuracy': history.history['categorical_accuracy'],
        'val_categorical_accuracy': history.history['val_categorical_accuracy']}
    ).plot(title='categorical_accuracy vs \nval_categorical_accuracy', ax=axs[1], xlabel='Epoch')

    max_val_acc = max(history.history['val_categorical_accuracy'])
    axs[1].hlines(max_val_acc, 0, len(history.epoch), color='r', linestyle='--')
    axs[1].text(s=f"Max. accuracy: {max_val_acc:.3f}", x=len(history.epoch)-10, y=max_val_acc + 0.01, c='r')

    plt.show()

plot_loss(history)

result = model.evaluate(x_test, y_test, verbose=0)

print(f"\nOur small model achieves an accuracy of {result[2]:.3f}.",
      f"\nThis is an improvement of {result[2]-baseline:.3f}.")

Our small model achieves an accuracy of 0.784. 
This is an improvement of 0.422.

Key Insights

This is a big improvement over the baseline. However, we can observe a continuously decreasing val_loss rather than a horizontal plateau, indicating underfitting (Chollet, 2024; Smith, 2018). Nevertheless, this gives us a strong starting point for further refinement and experimentation.


5 Increasing Model Capacity

To increase the model’s representational power, we will incrementally increase its capacity by adding layers and units. To start, we will add one hidden dense layer with 64 units and ReLU activation. This allows the network to learn more complex patterns in the data. As the model’s capacity increases, we monitor the training and validation metrics. The divergence between these metrics is what indicates overfitting: the model begins to memorise noise rather than generalisable patterns.

# Adding a Dense layer
model = Sequential([
    Input(shape=(10000,)),          
    Dense(64, activation="relu"),   # Added layer
    Dense(46, activation="softmax") 
])

# Compile
model.compile(optimizer=keras.optimizers.RMSprop(),
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=[keras.metrics.CategoricalAccuracy()])

# Train
history = model.fit(
    x_train,
    y_train,
    epochs=50,
    batch_size=512, 
    validation_split=0.3, 
    verbose=0
)

plot_loss(history)

Estimate of optimal epochs from validation plots

The above plot provides a clear example of overfitting. While the training loss (left, blue curve) keeps improving, albeit with decreasing increments, the loss on the held-out validation set improves only up to a certain point. Between epochs 12 and 20 the val_loss starts to increase again. This indicates that the model begins to memorise aspects of the training data that are not generalisable. Identifying this inflection point in the validation loss curve is crucial for determining the optimal number of training epochs.

We see the same pattern in the plot on the right. Whereas the training accuracy keeps improving, the accuracy on the validation set peaks at 0.819.

Identification of overfitting from validation plots

The divergence between training loss and validation loss indicates overfitting. Achieving overfitting is an important milestone. In itself, overfitting is not desirable. However, a model’s capacity to overfit indicates that it is powerful enough to extract representations from the data.

Note that while accuracy is monitored for comparison, the model trains on the loss function. Validation accuracy is less informative in that regard.

# Print optimal val_loss and optimal epoch
min_val_loss = min(history.history['val_loss'])
min_loss_epoch = history.history['val_loss'].index(min_val_loss)

print(f'The optimal val_loss is {min_val_loss:.3f}. It is reached at epoch {min_loss_epoch}.')
The optimal val_loss is 0.839. It is reached at epoch 15.
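
Note that list.index() returns a zero-based position, whereas the epochs argument of fit() counts epochs from one. A small sketch of a helper that reports the one-based epoch number directly is given below. The name best_epoch and the loss values are hypothetical.

```python
def best_epoch(val_losses):
    """Return (1-based epoch number, loss) of the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1, val_losses[best_index]

# Hypothetical validation losses: improving at first, then overfitting
toy_val_loss = [1.40, 1.05, 0.91, 0.86, 0.88, 0.93]
epoch_number, lowest = best_epoch(toy_val_loss)
print(epoch_number, lowest)
```

Returning a one-based number means the result can be passed to fit(epochs=...) without an off-by-one adjustment.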

5.1 Exploring Larger Models

In this section we define a set of functions to try out different configurations in a systematic way. The central function is get_bigger_model(), which takes two arguments: the number of layers and the number of units. Next, we define a train() function to determine the optimal number of epochs. The selected configuration is then retrained for that number of epochs without the validation split.

We will use the EarlyStopping callback to save time, as it allows for retraining the model directly after the optimal epoch has been identified. Note:

  • It monitors val_loss by default, which is desirable.
  • We set patience to 1, so training stops after the first epoch in which val_loss fails to improve. The logged epoch count, which is reused for the subsequent fit, therefore includes one epoch beyond the best one. The value of 1 is found in Chollet (2024). It appears to account for the additional training data gained by merging the training and validation sets.
  • The standard pattern found in later chapters stores the best model with ModelCheckpoint (Chollet, 2022). However, here we only use techniques from chapters 1-4 in Chollet (2018), as instructed in the brief.
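
The effect of patience on the number of epochs run can be illustrated with a simplified stand-in for the stopping rule. This sketch is not Keras' implementation. The function name and the loss values are hypothetical, and it assumes that only a strictly lower val_loss counts as an improvement.

```python
def epochs_run_with_early_stopping(val_losses, patience=1):
    """Count epochs until val_loss has failed to improve
    for `patience` consecutive epochs."""
    best = float("inf")
    wait = 0
    for epoch_count, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0     # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch_count   # stop at the end of this epoch
    return len(val_losses)

toy_val_loss = [1.40, 1.05, 0.91, 0.86, 0.88, 0.93]
stopped_after = epochs_run_with_early_stopping(toy_val_loss, patience=1)
print(stopped_after)
```

In this toy run the lowest loss occurs in the fourth epoch and training stops after the fifth, so the recorded epoch count exceeds the best epoch by the patience value.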

from keras import callbacks

early_stopping = callbacks.EarlyStopping(patience=1)
def get_bigger_model(n_layers=2, n_units=64):

    # Initiate the Sequential model
    model = Sequential()
    
    # Add layers iteratively
    for _ in range(n_layers):
        model.add(Dense(n_units, activation="relu"))

    # Add the softmax activation function to return class probabilities
    model.add(Dense(46, activation="softmax"))

    # Compile
    model.compile(optimizer=keras.optimizers.RMSprop(),
                loss=keras.losses.CategoricalCrossentropy(),
                metrics=[keras.metrics.CategoricalAccuracy()])
    return model
def train(model, **kwargs):
    """
    Train model and return history 
    to determine optimal epoch number.
    Uses `EarlyStopping`.
    """

    history = model.fit(
        x_train,
        y_train,
        epochs=50, 
        batch_size=kwargs.get('batch_size', 512), 
        validation_split=0.3,
        callbacks=[early_stopping], # Pass the callback here
        verbose=0,
        class_weight=kwargs.get('class_weight', None)
    )
    return history
# Params for an exhaustive and systematic search  
params = dict(n_layers = range(1, 3), n_units = range(64, 256, 64))

# Initialise log to store the params, val_categorical_accuracy and optimal epoch number 
params_log = {'n_layers': [], 
              'n_units': [], 
              'val_categorical_accuracy': [], 
              'epoch': []}

for n_layers in params['n_layers']:
    for n_units in params['n_units']:
        
        # Get the model and return optimal epochs
        model = get_bigger_model(n_layers, n_units)
        history = train(model)
        
        # Log optimal epochs   
        params_log['epoch'].append(len(history.epoch)) 
        params_log['n_layers'].append(n_layers)
        params_log['n_units'].append(n_units)
        params_log['val_categorical_accuracy'].append(
            history.history['val_categorical_accuracy'][-1]) 
        
        # print(f"{'-'*50}\nAccuracy with {n_layers} layers and {n_units} units: \t{results['categorical_accuracy']:.3f}") 

results_df = pd.DataFrame(params_log).sort_values(by='val_categorical_accuracy', ascending=False)
results_df.head()
   n_layers  n_units  val_categorical_accuracy  epoch
2         1      192                  0.818553     11
1         1      128                  0.813729     13
0         1       64                  0.810390     17
5         2      192                  0.800371      9
3         2       64                  0.800000     12
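# Retrain without validation split at the logged epoch number
# NOTE: results_df.n_layers[0] selects by index label, i.e. the row logged first
# (1 layer, 64 units), not the top-ranked row; results_df.iloc[0] would select the latter
# Get the model
model = get_bigger_model(results_df.n_layers[0], results_df.n_units[0])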

# Reinitialise and train it with full data for eval
history = model.fit(
    x_train,
    y_train,
    epochs=results_df.epoch[0], 
    batch_size=512,
    verbose=0
)

# Evaluate the trained model on the test set
results = model.evaluate(x_test, y_test, return_dict=True, verbose=0)

print(f"Test accuracy with {results_df.n_layers[0]} \
      layers and {results_df.n_units[0]} units: \
        \t{results['categorical_accuracy']:.3f}") 
Test accuracy with 1 layers and 64 units:   0.811

Key Insights

With the increased capacity, the model achieves an improved accuracy on the test set (0.811, up from 0.784 for the model without a hidden layer). This suggests that the added capacity enables the model to capture informative patterns in the input space and to make generalisable predictions, balancing the complexity of the task with its representational power.

This result is excellent and provides a solid foundation for exploring further refinements.


6 Regularisation Techniques

In this section we will implement regularisation techniques such as:

  • Regularisation or weight decay: Adds a penalty to the loss, to discourage overly large weights. This reduces overfitting.
  • Dropout: Randomly deactivates a proportion of neurons during training.

We will define and document these functions below and explore them in more detail in the subsequent section on hyperparameters. This is because there are many forms of regularisation alongside weight decay and dropout, including learning rates and batch sizes (Smith, 2018).

6.1 Weight Decay Regularisation

Weight regularisation adds a penalty to the loss that is proportional to the magnitude of the weights: their absolute values for L1 and their squares for L2. It thereby penalises complexity, which makes the model less prone to overfitting.
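
As a worked illustration of how such a penalty enters the total loss, the following NumPy sketch computes L1 and L2 penalties for a made-up weight matrix. The helper names, the weights and the data loss value are hypothetical. Only the form of the penalty (factor times the sum of absolute or squared weights) mirrors what the Keras regularizers compute.

```python
import numpy as np

def l1_penalty(weights, factor):
    """factor * sum of absolute weight values."""
    return factor * float(np.sum(np.abs(weights)))

def l2_penalty(weights, factor):
    """factor * sum of squared weight values."""
    return factor * float(np.sum(np.square(weights)))

toy_weights = np.array([[0.5, -2.0], [0.0, 1.0]])  # hypothetical kernel
data_loss = 0.80                                   # hypothetical cross-entropy value
total_l1 = data_loss + l1_penalty(toy_weights, factor=0.01)
total_l2 = data_loss + l2_penalty(toy_weights, factor=0.01)
print(total_l1, total_l2)
```

The single large weight (-2.0) dominates the L2 penalty because it is squared, which is why L2 mainly discourages a few very large weights, while L1 penalises all non-zero weights in proportion to their size.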

We start this section by defining a simple get_regularised_model function. Next, we iterate over the parameters using itertools.product. This flattens the nested loops.
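
To illustrate how itertools.product flattens nested loops, here is a short sketch with a made-up grid. The grid values are hypothetical and smaller than the grid searched in this section.

```python
from itertools import product

# Hypothetical small grid, named after the kinds of parameters searched here
toy_grid = {"n_units": [64, 128], "factor": [0.01, 0.001], "n_layers": [1, 2]}

# Nested-loop version: one loop per parameter
nested = []
for n_units in toy_grid["n_units"]:
    for factor in toy_grid["factor"]:
        for n_layers in toy_grid["n_layers"]:
            nested.append((n_units, factor, n_layers))

# Flattened version: a single loop over the cartesian product
flat = [dict(zip(toy_grid.keys(), combo)) for combo in product(*toy_grid.values())]
print(len(flat), flat[0])
```

Both versions visit the same combinations in the same order. The flattened version does not need another level of indentation when a parameter is added to the grid.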

def get_regularised_model(n_layers=None, 
                          n_regularised_layers=2, 
                          n_units=64, 
                          regularizer=regularizers.l2,
                          factor=0.001):

    # Initiate the Sequential model
    model = Sequential([Input(shape=(10000,))])

    if n_layers:
        # Add layers iteratively
        for _ in range(n_layers):
            model.add(Dense(n_units, activation='relu'))
    
    # Add regularised_layers
    for _ in range(n_regularised_layers):
        model.add(Dense(n_units, 
                        kernel_regularizer=regularizer(factor), 
                        activation='relu'))

    # Add the softmax activation function to return class probabilities
    model.add(Dense(46, activation="softmax"))

    # Compile
    model.compile(optimizer=keras.optimizers.RMSprop(),
                loss=keras.losses.CategoricalCrossentropy(),
                metrics=[keras.metrics.CategoricalAccuracy()])
    return model
# Take cartesian product to unpack the gridsearch values and flatten the nested loops
from itertools import product
# Define the grid search
params = dict(n_layers = [None, 1], 
              n_regularised_layers=[1], 
              n_units = [64, 128],                   
              regularizer=[regularizers.l1, 
                          regularizers.l2],
                factor=[0.01,0.001]
                          )

print(f"Try {len(list(product(*params.values())))} combinations.\n")
Try 16 combinations.
# Search model parameters systematically
logs = []
for param_set in product(*params.values()):

    # Add param_set to log
    params_log = {k:v for k,v in zip(params.keys(), param_set)}

    # Get the model and return optimal epochs
    model = get_regularised_model(*param_set)    
    history = train(model)

    # log optimal epochs and result    
    params_log.update({'epoch': len(history.epoch), 
                'val_categorical_accuracy': history.history['val_categorical_accuracy'][-1]})
    
    # Append log
    logs.append(params_log)  

pd.DataFrame(logs).sort_values(by='val_categorical_accuracy', ascending=False).head()
    n_layers  n_regularised_layers  n_units  regularizer                                       factor  epoch  val_categorical_accuracy
 7       NaN                     1      128  <class 'keras.src.regularizers.regularizers.L2'>   0.001     17                  0.815213
11       1.0                     1       64  <class 'keras.src.regularizers.regularizers.L2'>   0.001     13                  0.805566
 3       NaN                     1       64  <class 'keras.src.regularizers.regularizers.L2'>   0.001     30                  0.805195
10       1.0                     1       64  <class 'keras.src.regularizers.regularizers.L2'>   0.010     19                  0.801855
 9       1.0                     1       64  <class 'keras.src.regularizers.regularizers.L1'>   0.001     16                  0.800371

Key Insights

Overall, the regularised models perform slightly worse than their unregularised counterparts (best validation accuracy 0.815 versus 0.819), and they take longer to train. Among the regularised models, l2=0.001 (as in Chollet (2024)) with 128 units and one layer achieves the best result. This indicates that weight regularisation does not improve performance here.

Note that weight decay is generally more useful for smaller models. For large models it is more effective to use dropout (Chollet, 2024).

6.2 Dropout

In large models, unwanted patterns may emerge. Chollet describes these co-dependencies as conspiracies among weights. When they’re always firing together, they lose generalisability and overfit the training data. Repeated, randomised dropping out of a proportion of the connections addresses these co-dependencies among weights.

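The dropout rate is usually set between 0.2 and 0.5 (see Topic 4; Chollet, 2024). At test time no units are dropped out. To compensate, either the layer outputs are scaled down at test time by the fraction of units that were kept (1 − rate), or, as in the Keras Dropout layer, the retained activations are scaled up by 1 / (1 − rate) during training so that nothing needs to change at test time.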
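
The train-time scaling can be sketched in a few lines of NumPy. This is an illustrative version of so-called inverted dropout under simplifying assumptions. The function name, the seed and the toy activations are hypothetical, and the Keras layer is implemented differently.

```python
import numpy as np

def inverted_dropout(activations, rate, training, rng):
    """Zero a random fraction `rate` of activations during training and
    scale the survivors by 1 / (1 - rate); do nothing at inference."""
    if not training or rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)          # seed chosen arbitrarily
toy_activations = np.ones((4, 1000))
train_out = inverted_dropout(toy_activations, rate=0.5, training=True, rng=rng)
test_out = inverted_dropout(toy_activations, rate=0.5, training=False, rng=rng)
print(train_out.mean(), test_out.mean())
```

With a rate of 0.5, surviving activations are doubled, so the expected mean activation stays close to its value without dropout and the layer can be left unchanged at test time.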

def get_dropout_model(n_units=64, n_dropout_layers=1, dropout_ratio=0.2):

    # Initiate the Sequential model
    model = Sequential([Input(shape=(10000,))]) 

    # Add layers iteratively
    for _ in range(n_dropout_layers):
        model.add(Dense(n_units, activation='relu'))    # A Dense layer
        model.add(Dropout(dropout_ratio))               # A Dropout layer

    model.add(Dense(46, activation="softmax"))

    # Compile
    model.compile(optimizer=keras.optimizers.RMSprop(),
                loss=keras.losses.CategoricalCrossentropy(),
                metrics=[keras.metrics.CategoricalAccuracy()])
    return model
params = dict(n_units=[64, 128],
              n_dropout_layers=[1, 2],
              dropout_ratio=[0.2, 0.5])

print(f"Try {len(list(product(*params.values())))} combinations.\n")
Try 8 combinations.
# Search model parameters systematically
logs = []
for param_set in product(*params.values()):

    # Add param_set to log
    params_log = {k: v for k, v in zip(params.keys(), param_set)}

    # Train
    model = get_dropout_model(*param_set)
    history = train(model)

    # Log optimal epochs and result
    params_log.update({'epoch': len(history.epoch),
                       'val_categorical_accuracy': history.history['val_categorical_accuracy'][-1]})

    # Append log
    logs.append(params_log)
pd.DataFrame(logs).sort_values(by='val_categorical_accuracy', ascending=False).head()
   n_units  n_dropout_layers  dropout_ratio  epoch  val_categorical_accuracy
5      128                 1            0.5     14                  0.818553
4      128                 1            0.2     14                  0.816698
6      128                 2            0.2     11                  0.813358
0       64                 1            0.2     20                  0.811874
1       64                 1            0.5     20                  0.811132

Key Insights

Dropout does not improve performance either: the best dropout model only matches the validation accuracy of the best unregularised model (0.819). Overall we have found that larger and regularised models show diminishing returns and longer training times. This highlights the importance of balancing model complexity with task requirements (Smith, 2018).


7 Hyperparameter Tuning and Cross-Validation

To fine-tune the model further, we will explore:

  1. Vocabulary Size
  2. Learning Rate
  3. Various Optimizers
  4. Batch Size
  5. Class weights

7.1 Varying Vocabulary Size

An interesting suggestion in Chollet (2024) is to vary the number of most common words kept, via the num_words argument of load_data(). We explore this briefly.

def vary_num_word(start, stop, step):
    """Varying the ``num_words`` in ``load_data()``"""

    for num_words in range(start, stop, step): 

        # Reload the data
        (train_data, _), (test_data, _) = reuters.load_data(num_words=num_words)

        # Vectorize the sequences
        x_train = vectorize_sequences(train_data, num_words)
        x_test = vectorize_sequences(test_data, num_words)

        model = get_bigger_model(1, 64)
        model.fit(x_train, y_train, epochs=18, 
                  batch_size=512, verbose=0)

        res = model.evaluate(x_test, y_test, verbose=0)

        print(f"With vocabulary size {num_words},\tthe accuracy is: {res[1]:.3f}")

vary_num_word(5000, 35001, 5000)

vary_num_word(500, 2001, 500)
With vocabulary size 5000,  the accuracy is: 0.804
With vocabulary size 10000, the accuracy is: 0.809
With vocabulary size 15000, the accuracy is: 0.809
With vocabulary size 20000, the accuracy is: 0.812
With vocabulary size 25000, the accuracy is: 0.806
With vocabulary size 30000, the accuracy is: 0.807
With vocabulary size 35000, the accuracy is: 0.808
With vocabulary size 500,   the accuracy is: 0.773
With vocabulary size 1000,  the accuracy is: 0.794
With vocabulary size 1500,  the accuracy is: 0.798
With vocabulary size 2000,  the accuracy is: 0.801

Key Insights

The variations in accuracy seem largely due to chance. Only below 1,000 words do we see reduced performance. This indicates that the maximum number of words has little impact and that the real ceiling holding back performance lies elsewhere, e.g. in the class imbalance.

8 Detailed Evaluation

To finish with a methodical evaluation, we now present and discuss a classification report and a confusion matrix. Overall, our best model achieves an accuracy of 81%. As shown in the classification report, it performs well for highly prevalent classes like earn and acq, with F1-scores of 0.93 and 0.85. However, many rare classes, such as meal-feed and wheat (both with an F1-score of 0), exhibit poor performance, reflecting the model’s difficulty in handling underrepresented categories due to class imbalance.

By sorting the classification_report by support, we can identify the cutoff point where the classification begins to fail. Around a support of 12 the performance decreases drastically. Interestingly, this is close to the number of batches per epoch at a batch size of 512 (i.e. 8,982 × 0.7 / 512 ≈ 12.3): a class with fewer training samples than that cannot be represented in every batch.

Similarly, the confusion matrix confirms strong diagonal signals for major classes, but misclassifications are frequent for smaller classes. Overlaps between related classes, such as grain and wheat, further indicate challenges with semantically similar categories.
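
To make the link between the confusion matrix and the per-class F1-scores explicit, the following NumPy sketch computes both for made-up labels. The helper functions and toy labels are hypothetical and stand in for the scikit-learn utilities used in this section. The row normalisation corresponds to what normalize='true' does.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

def per_class_f1(matrix):
    """F1 = 2TP / (2TP + FP + FN) per class, 0 when the denominator is 0."""
    tp = np.diag(matrix).astype(float)
    fp = matrix.sum(axis=0) - tp
    fn = matrix.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)

# Hypothetical toy labels: class 0 is common, class 2 is rare
toy_true = [0, 0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 0, 1, 1, 1, 0]
cm = confusion_matrix(toy_true, toy_pred, num_classes=3)
row_normalised = cm / cm.sum(axis=1, keepdims=True)  # each row sums to 1
f1 = per_class_f1(cm)
print(cm, row_normalised, f1, sep="\n")
```

In this toy example the single sample of the rare class is misclassified, so its F1-score drops to zero while overall accuracy remains fairly high. This is the same pattern seen for the smallest Reuters classes.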

# Collect indices of both the test dataset and predictions
y_test_idxs = [np.argmax(pred) for pred in y_test]
y_pred_idxs = [np.argmax(pred) for pred in model.predict(x_test, verbose=0)]

# Get label names
true_class = [reuters.get_label_names()[i] for i in y_test_idxs]
predicted_class = [reuters.get_label_names()[i] for i in y_pred_idxs]
# Find the F1 cutoff where the classification task fails
class_rep = classification_report(true_class, predicted_class, zero_division=0, output_dict=True)
pd.DataFrame(class_rep).T.sort_values(by='support', ascending=False).iloc[2:-1, 2:4]\
    .style.bar(subset='support').highlight_quantile(q_right=0.2, subset='f1-score', props='color:red')
                 f1-score     support
earn             0.930657  813.000000
acq              0.853659  474.000000
money-fx         0.719101  133.000000
grain            0.772727  105.000000
crude            0.713615   99.000000
trade            0.690355   83.000000
interest         0.598540   70.000000
money-supply     0.739726   38.000000
ship             0.630137   37.000000
gold             0.827586   31.000000
sugar            0.900000   30.000000
gnp              0.733333   27.000000
coffee           0.782609   25.000000
veg-oil          0.736842   20.000000
cpi              0.666667   20.000000
oilseed          0.533333   19.000000
copper           0.928571   14.000000
reserves         0.315789   13.000000
iron-steel       0.720000   13.000000
ipi              0.666667   12.000000
nat-gas          0.500000   12.000000
alum             0.636364   12.000000
cocoa            0.818182   12.000000
bop              0.333333   11.000000
rubber           0.888889   10.000000
livestock        0.421053   10.000000
dlr              0.461538   10.000000
carcass          0.181818    9.000000
gas              0.363636    8.000000
tin              0.933333    8.000000
meal-feed        0.000000    7.000000
jobs             0.666667    7.000000
wpi              0.923077    6.000000
lei              0.285714    6.000000
pet-chem         0.333333    5.000000
hog              0.888889    5.000000
heat             0.888889    5.000000
wheat            0.000000    5.000000
retail           0.500000    4.000000
strategic-metal  0.400000    4.000000
orange           0.500000    3.000000
silver           0.500000    3.000000
housing          0.500000    3.000000
zinc             0.666667    2.000000
cotton           0.666667    2.000000
lead             1.000000    1.000000
# Plot a confusion matrix
_, ax = plt.subplots(figsize=(10,8))
ConfusionMatrixDisplay.from_predictions(true_class, predicted_class, 
                                        xticks_rotation='vertical', 
                                        normalize='true', include_values=False, ax=ax)

ax.set_title('Confusion matrix', fontsize=22)
plt.show()

Key Insights

The model has difficulty learning the smaller classes effectively. Several strategies can be considered: merging the smallest classes into more coherent classes (e.g. grain and wheat), or using oversampling techniques like SMOTE to synthetically generate additional data.

9 Conclusion

In this report, we adhered to best practices in designing and refining a neural network for multi-class classification using Keras. Beginning with a common-sense baseline, we progressively developed more sophisticated models, carefully balancing complexity and generalisation.

The evaluation demonstrated strong performance for dominant classes but highlighted significant challenges with rare and underrepresented categories due to class imbalance. Misclassifications were particularly noticeable for minority classes and semantically similar categories, such as carcass and meal-feed, reflecting the difficulty in distinguishing overlapping patterns. These insights emphasise the importance of strategies such as class weighting, oversampling, and advanced loss functions to enhance generalisation across all classes. Future work could explore dense word embeddings like GloVe or contextual embeddings such as BERT to capture richer semantic relationships.

Throughout this coursework, we followed the universal workflow of machine learning, performing a systematic and structured examination of the process. Starting with simple models and incrementally increasing their complexity, we explored the trade-offs between model capacity and generalisation. While improvements eventually plateaued, this investigation provided valuable experience in experimenting with various configurations and demonstrated how Keras’ modular and intuitive design lets data scientists tackle complex classification challenges, combining scientific precision with creative problem-solving.

10 Bibliography and References

  • Alpaydin, E. (2014) Introduction to machine learning. 3rd edn. Cambridge, MA: MIT Press.
  • Bruce, P.C. and Bruce, A. (2020) Practical statistics for data scientists: 50 essential concepts. 1st edn. Sebastopol, CA: O’Reilly Media.
  • Chollet, F. (2018) Deep Learning with Python. Manning Publications.
  • Chollet, F. (2022) Deep Learning with Python. 2nd edn. Manning Publications.
  • Chollet, F. (2024) Deep Learning with Python. 3rd edn (MEAP). Manning Publications.
  • Grus, J. (2015) Data science from scratch: First principles with Python. O’Reilly Media.
  • Karpathy, A. (2019) A recipe for training neural networks. Available at: https://karpathy.github.io/2019/04/25/recipe/ (Accessed: 26 December 2024).
  • McKinney, W. (2014) Python for data analysis. 1st edn, third release. Beijing: O’Reilly.
  • Murphy, K.P. (2012) Machine learning: A probabilistic perspective. Cambridge, MA: MIT Press.
  • Smith, L.N. (2018) A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. Available at: https://doi.org/10.48550/arXiv.1803.09820

Footnotes

  1. This technique was originally developed in Van Cauwenberghe’s (2024) machine learning coursework.
