BERT Classification for Research Papers

Our goal is to build a model that uses the title and abstract of a paper to predict whether it will be accepted or rejected.

This article walks through the code needed to fine-tune BERT for text classification on a dataset of accepted and rejected scientific papers.

In this article, we will:

  • Load the Papers-Dataset

  • Load a BERT model from Hugging Face

  • Build our own model by combining BERT with a classifier

  • Train the model by fine-tuning BERT for our task

  • Save the model and use it to classify new papers

At the end, you will have an architecture that you can reuse in your future text classification projects.

What is BERT?

Introduced in 2018, BERT (Bidirectional Encoder Representations from Transformers) is, according to its authors, "designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers."

BERT arose to combine the strengths of two earlier word embedding techniques: ELMo and GPT. ELMo encodes context bidirectionally but relies on task-specific architectures, while GPT is task-agnostic but encodes context only from left to right.

We can summarize the characteristics of these models as follows:

Model   Context               Task               Encoding
ELMo    context sensitive ✅   task specific      bi-directional ✅
GPT     context sensitive     task agnostic ✅    left to right
BERT    context sensitive     task agnostic      bi-directional

Tools and pre-requisites

To build our model, we will work with the PyTorch framework, PyTorch Lightning, and the Hugging Face Transformers library.

1- Data preparation

1-1 Loading data

You can find the dataset used at the URL referenced in the code below.

import numpy as np
import pandas as pd
import requests
import io

dataset_url = "https://raw.githubusercontent.com/Godwinh19/Papers-Dataset/main/data/ICLR%20papers%20datasets.csv"
s = requests.get(dataset_url).content
data = pd.read_csv(io.StringIO(s.decode('utf-8')), usecols=['title', 'abstract', 'accepted'])
# Let's display the first rows of our data
data.head()
title abstract accepted
0 What Matters for On-Policy Deep Actor-Critic M... In recent years, reinforcement learning (RL) h... 1
1 Theoretical Analysis of Self-Training with Dee... Self-training algorithms, which train a model ... 1
2 Learning to Reach Goals via Iterated Supervise... Current reinforcement learning (RL) algorithms... 1
3 Deep symbolic regression: Recovering mathemati... Discovering the underlying mathematical expres... 1
4 Optimal Rates for Averaged Stochastic Gradient... We analyze the convergence of the averaged sto... 1

In this table we have:

  • title: the title of the article

  • abstract: the abstract

  • accepted: field describing if the article has been accepted (1) or not (0)

In our case, we are only interested in the title, abstract, and accepted fields.

1-2 Transforming the columns

We are going to merge the title and abstract columns into a single column called description, then rename the accepted field to label, which better describes its role.

data['description'] = data['title'] + " - " + data['abstract']
transformed_data = data[['description', 'accepted']].rename(columns={'accepted': 'label'}).copy()
transformed_data.head()
description label
0 What Matters for On-Policy Deep Actor-Critic M... 1
1 Theoretical Analysis of Self-Training with Dee... 1
2 Learning to Reach Goals via Iterated Supervise... 1
3 Deep symbolic regression: Recovering mathemati... 1
4 Optimal Rates for Averaged Stochastic Gradient... 1

1-3 Transforming the data into model inputs

Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity.

— PyTorch docs

With PyTorch, we load the data through the Dataset class.

import torch
from torch.utils.data import Dataset

class PapersDataset(Dataset):
    """Wraps the description and label arrays and tokenizes each item on the fly."""

    def __init__(self, description, targets, tokenizer, max_length):
        self.description = description
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.description)
    
    def __getitem__(self, item):
        description = str(self.description[item])
        target = self.targets[item]
        
        encoding = self.tokenizer.encode_plus(
            description,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding="max_length",
            return_attention_mask=True,
            return_tensors="pt",
            truncation=True,
        )
        
        return {
            "article_text": description,
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
            "targets": torch.tensor(target, dtype=torch.long),
        }
        

Earlier, we introduced the tokenizer. Simply put, tokenization is the process of dividing a text sample into smaller units (tokens) so that each one can be mapped to an ID the model understands; this is a fundamental requirement in natural language processing tasks. Read more about tokenization here.
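
As a quick illustration (not part of the original pipeline), here is roughly what encode_plus returns for a single description, using the same bert-base-uncased tokenizer we load later:

from transformers import BertTokenizer

# Hypothetical example text, borrowed from one of the dataset titles shown above
text = "Deep symbolic regression: Recovering mathematical expressions from data"

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer.encode_plus(
    text,
    add_special_tokens=True,   # adds the [CLS] and [SEP] special tokens
    max_length=100,
    padding="max_length",      # pad every sequence to the same length
    truncation=True,
    return_attention_mask=True,
    return_tensors="pt",
)

print(encoding["input_ids"].shape)       # torch.Size([1, 100])
print(encoding["attention_mask"].shape)  # torch.Size([1, 100])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())[:8])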

2- Loading data into dataloaders

Still with the goal of keeping the code clean and easy to maintain, we will perform one last transformation, which consists of loading the data into dataloaders. To do this we will use PyTorch Lightning. For more details, please read the documentation of the LightningDataModule, which explains each step of the process here.

import pytorch_lightning as pl
import torch
from transformers import BertTokenizer
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
class BertDataModule(pl.LightningDataModule):
    def __init__(self, **kwargs):
        """
        Initialization of inherited lightning data module
        """
        super(BertDataModule, self).__init__()
        self.BERT_PRE_TRAINED_MODEL_NAME = "bert-base-uncased"
        self.df_train = None
        self.df_val = None
        self.df_test = None
        
        self.train_data_loader = None
        self.val_data_loader = None
        self.test_data_loader = None
        
        self.MAX_LEN = 100
        self.encoding = None
        self.tokenizer = None
    
    def setup(self, stage=None):
        """
        Sample the data and split it into train, validation and test sets

        :param stage: Stage - fit or test
        """

        # Keep a small sample of the dataset to speed up training
        num_samples = 80
        df = transformed_data.sample(num_samples)
        
        self.tokenizer = BertTokenizer.from_pretrained(self.BERT_PRE_TRAINED_MODEL_NAME)
        
        RANDOM_SEED = 0
        np.random.seed(RANDOM_SEED)
        torch.manual_seed(RANDOM_SEED)
        
        df_train, df_test = train_test_split(
            df, test_size=0.3, random_state=RANDOM_SEED, stratify=df["label"]
        )
        
        df_val, df_test = train_test_split(
            df_test, test_size=0.5, random_state=RANDOM_SEED, stratify=df_test["label"]
        )
        
        self.df_train, self.df_val, self.df_test = df_train, df_val, df_test
    
    def create_data_loader(self, df, tokenizer, max_len, batch_size=8):
        """
        Generic data loader function

        :param df: Input dataframe
        :param tokenizer: bert tokenizer
        :param max_len: Max length of the claims datapoint
        :param batch_size: Batch size for training

        :return: Returns the constructed dataloader
        """
        dataset = PapersDataset(
            description=df.description.to_numpy(),
            targets=df.label.to_numpy(),
            tokenizer=tokenizer,
            max_length=max_len
        )
        
        return DataLoader(
            dataset, batch_size=batch_size, num_workers=0
        )
    
    def train_dataloader(self):
        """
        :return: output - Train data loader for the given input
        """
        self.train_data_loader = self.create_data_loader(
            self.df_train, self.tokenizer, self.MAX_LEN 
        )
        
        return self.train_data_loader
    
    def val_dataloader(self):
        """
        :return: output - Validation data loader for the given input
        """
        self.val_data_loader = self.create_data_loader(
            self.df_val, self.tokenizer, self.MAX_LEN
        )
        return self.val_data_loader

    def test_dataloader(self):
        """
        :return: output - Test data loader for the given input
        """
        self.test_data_loader = self.create_data_loader(
            self.df_test, self.tokenizer, self.MAX_LEN
        )
        return self.test_data_loader
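
Before moving on, we can quickly check that the data module produces batches with the expected shapes. This is an optional sanity check, not part of the original pipeline, and it assumes transformed_data from section 1-2 is already defined:

# Optional sanity check: inspect the shape of one training batch
dm = BertDataModule()
dm.setup(stage="fit")

batch = next(iter(dm.train_dataloader()))
print(batch["input_ids"].shape)       # torch.Size([8, 100]) -> (batch_size, MAX_LEN)
print(batch["attention_mask"].shape)  # torch.Size([8, 100])
print(batch["targets"].shape)         # torch.Size([8])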

3- Building the network

In this step, we will build our classifier on top of a pre-trained BERT model.

The configuration of a model with PyTorch Lightning is explained here.

import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from pytorch_lightning.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
    LearningRateMonitor,
)
from sklearn.metrics import accuracy_score
from torch import nn
from transformers import BertModel, AdamW

class BertPapersClassifier(pl.LightningModule):
    def __init__(self, **kwargs):
        """
        Initializes the network, optimizer and scheduler
        """
        super(BertPapersClassifier, self).__init__()
        self.BERT_PRE_TRAINED_MODEL_NAME = "bert-base-uncased"
        
        self.bert_model = BertModel.from_pretrained(self.BERT_PRE_TRAINED_MODEL_NAME)
    
        
        # Freeze BERT's weights: only the classifier head below is trained
        for param in self.bert_model.parameters():
            param.requires_grad = False
        
        self.drop = nn.Dropout(p=0.2)
        n_classes = 2
        
        self.fc1 = nn.Linear(self.bert_model.config.hidden_size, 512)
        self.out = nn.Linear(512, n_classes)
        
        self.scheduler = None
        self.optimizer = None
    
    def forward(self, input_ids, attention_mask):
        """
        :param input_ids: Input data
        :param attention_mask: Attention mask value

        :return: output - Accepted or not for the given paper
        """
        output = self.bert_model(input_ids=input_ids, attention_mask=attention_mask)
        output = F.relu(self.fc1(output.pooler_output))
        output = self.drop(output)
        output = self.out(output)
        return output
    
    def training_step(self, train_batch, batch_idx):
        """
        Training the data as batches and returns training loss on each batch

        :param train_batch Batch data
        :param batch_idx: Batch indices

        :return: output - Training loss
        """
        input_ids = train_batch["input_ids"].to(self.device)
        attention_mask = train_batch["attention_mask"].to(self.device)
        targets = train_batch["targets"].to(self.device)
        
        output = self.forward(input_ids, attention_mask)
        loss = F.cross_entropy(output, targets)
        self.log("train loss", loss)
        return {"loss": loss}
    
    def test_step(self, test_batch, batch_idx):
        """
        Performs test and computes the accuracy of the model

        :param test_batch: Batch data
        :param batch_idx: Batch indices

        :return: output - Testing accuracy
        """
        input_ids = test_batch["input_ids"].to(self.device)
        attention_mask = test_batch["attention_mask"].to(self.device)
        targets = test_batch["targets"].to(self.device)
        
        output = self.forward(input_ids, attention_mask)
        _, y_hat = torch.max(output, dim=1)
        test_acc = accuracy_score(targets.cpu(), y_hat.cpu())
        
        return {"test_acc": torch.tensor(test_acc)}
    
    def validation_step(self, val_batch, batch_idx):
        """
        Performs validation of data in batches

        :param val_batch: Batch data
        :param batch_idx: Batch indices

        :return: output - valid step loss
        """

        input_ids = val_batch["input_ids"].to(self.device)
        attention_mask = val_batch["attention_mask"].to(self.device)
        targets = val_batch["targets"].to(self.device)
        
        output = self.forward(input_ids, attention_mask)
        loss =  F.cross_entropy(output, targets)
        
        return {"val_step_loss": loss}
    
    def validation_epoch_end(self, outputs):
        """
        Computes average validation accuracy

        :param outputs: outputs after every epoch end

        :return: output - average valid loss
        """
        avg_loss = torch.stack([x["val_step_loss"] for x in outputs]).mean()
        self.log("val_loss", avg_loss, sync_dist=True)

    def test_epoch_end(self, outputs):
        """
        Computes average test accuracy score

        :param outputs: outputs after every epoch end

        :return: output - average test loss
        """
        avg_test_acc = torch.stack([x["test_acc"] for x in outputs]).mean()
        self.log("avg_test_acc", avg_test_acc)
    
    def configure_optimizers(self):
        """
        Initializes the optimizer and learning rate scheduler

        :return: output - Initialized optimizer and scheduler
        """
        self.optimizer = AdamW(self.parameters(), lr=0.001)
        self.scheduler = {
            "scheduler": torch.optim.lr_scheduler.ReduceLROnPlateau(
                self.optimizer,
                mode="min",
                factor=0.2,
                patience=2,
                min_lr=1e-6,
                verbose=True
            ),
            "monitor": "val_loss",
        }
        return [self.optimizer], [self.scheduler]
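
Because BERT's weights are frozen in the constructor, only the two linear layers (fc1 and out) are actually updated during training. Here is an optional quick check, not part of the original article:

# Optional check: count trainable vs. total parameters
model = BertPapersClassifier()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable:,} trainable parameters out of {total:,}")  # only fc1 and out are trainable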

4- Training

import os
from pytorch_lightning import Trainer

torch.cuda.empty_cache()

data_module = BertDataModule()
data_module.setup(stage="fit")

b_model = BertPapersClassifier()
early_stopping = EarlyStopping(monitor="val_loss", mode="min", verbose=True)

checkpoint_callback = ModelCheckpoint(
    dirpath=os.getcwd(),
    save_top_k=1,
    verbose=True,
    monitor="val_loss",
    mode="min",
)
lr_logger = LearningRateMonitor()

trainer = pl.Trainer(
    max_epochs=10, gpus=1, accelerator="gpu",
    callbacks=[lr_logger, early_stopping, checkpoint_callback], checkpoint_callback=True,
)

trainer.fit(b_model, data_module)
trainer.test(datamodule=data_module)

torch.save(b_model.state_dict(), "bert_model_dict.pt")
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'avg_test_acc': 0.6875}
--------------------------------------------------------------------------------
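
Besides the state_dict we saved manually, the ModelCheckpoint callback also wrote the best checkpoint (according to val_loss) to the current directory. As an alternative, assuming that checkpoint file exists, it can be reloaded like this:

# Optional: reload the best checkpoint written by the ModelCheckpoint callback
best_path = checkpoint_callback.best_model_path
best_model = BertPapersClassifier.load_from_checkpoint(best_path)
best_model.eval()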

5- Test

import torch

article = "Sparse Quantized Spectral Clustering - Given a large data matrix, sparsifying, quantizing, and/or performing other entry-wise nonlinear operations can have numerous benefits, ranging from speeding up iterative algorithms for core numerical linear algebra problems to providing nonlinear filters to design state-of-the-art neural network models. Here, we exploit tools from random matrix theory to make precise statements about how the eigenspectrum of a matrix changes under such nonlinear transformations. In particular, we show that very little change occurs in the informative eigenstructure, even under drastic sparsification/quantization, and consequently that very little downstream performance loss occurs when working with very aggressively sparsified or quantized spectral clustering problems.\
We illustrate how these results depend on the nonlinearity, we characterize a phase transition beyond which spectral clustering becomes possible, and we show when such nonlinear transformations can introduce spurious non-informative eigenvectors."
#original label = 1 : accepted

# Reload the trained weights and predict on the new article
model = BertPapersClassifier()

model.load_state_dict(torch.load("bert_model_dict.pt"))
model.eval()

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer(article, padding=True)

input_ids = torch.tensor(inputs["input_ids"]).unsqueeze(0)
attention_mask = torch.tensor(inputs["attention_mask"]).unsqueeze(0)


out = model(input_ids, attention_mask)
print(out)
print(torch.max(out.data, 1))
print(torch.max(out.data, 1).indices == torch.tensor([1]))
tensor([[-0.2612, -0.0729]], grad_fn=<AddmmBackward0>)
torch.return_types.max(
values=tensor([-0.0729]),
indices=tensor([1]))
tensor([True])

Taking this paper, which was not in our training data, as an example, our model predicts acceptance 😁.
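
If you want to reuse this prediction logic, the steps above can be wrapped in a small helper. This is a hypothetical convenience function, not part of the original article:

# Hypothetical helper: returns 1 if the model predicts acceptance, 0 otherwise
def predict_acceptance(text: str) -> int:
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(enc["input_ids"], enc["attention_mask"])
    return int(torch.argmax(logits, dim=1).item())

print(predict_acceptance(article))  # 1 -> accepted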

Endnote

The whole code is available here (in French). The dataset we used is very small; we plan to enlarge it, but the model already does its job. In a future article, we will see how to monitor our models and put the best-performing ones, with the best hyperparameters, into production.

Cheers ☕!
