Text Segmentation for Long Document Understanding

This project was completed for my Natural Language Understanding course; we chose to focus on improving long document understanding. The classic transformer uses an attention layer whose complexity is quadratic in the number of tokens, which limits how many tokens modern transformer models can process at once. There is currently a lot of research on overcoming this limitation. We found a paper (Hierarchical Transformers for Long Document Classification) that investigated ‘sliding’ a transformer model over a long document. We aimed to improve on this by adding a text segmentation scheme: instead of naively sliding the model over the document, we first apply a text segmentation scheme to generate natural segments, and then slide the transformer model over those segments.

For clarity, consider the following example. We could split a document into sentences, embed each sentence with BERT, and then run a seq2seq model (e.g. an LSTM) over those embeddings. In this case, each “segment” is an individual sentence. It’s natural to wonder whether one sentence per segment is the best way to segment the document – after all, it can take multiple sentences to convey an idea. Is there a way to automatically find groups of sentences, and use those as a better segmentation scheme?
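The naive baseline above can be sketched in a few lines. This is a minimal illustration, not our actual pipeline: the `sentence_segments` helper and its regex splitter are hypothetical stand-ins for a proper sentence tokenizer, and in practice each segment would then be embedded (e.g. with BERT) and fed to a sequence model such as an LSTM.

```python
import re

def sentence_segments(document):
    """Baseline segmentation: each sentence is its own segment.

    A naive regex splitter (split on whitespace that follows ., !, or ?)
    stands in for a real sentence tokenizer.
    """
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return [s for s in sentences if s]

doc = "Long documents overflow the context. We split them. Then we encode each piece."
print(sentence_segments(doc))
# → ['Long documents overflow the context.', 'We split them.', 'Then we encode each piece.']
```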

We used BERT’s next-sentence-prediction (NSP) task to do just this. To generate more meaningful segments, we slide BERT’s NSP head over each consecutive pair of sentences, obtaining a sequence of NSP probabilities, and then apply a threshold to those probabilities to decide where segment boundaries fall. (Some minor details are omitted here; see the paper for a thorough explanation.)
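The thresholding step can be sketched as follows. This is a simplified illustration under one assumption: `nsp_probs[i]` is taken to be BERT’s probability that sentence `i+1` naturally follows sentence `i` (in practice these would come from a BERT NSP head slid over each consecutive sentence pair); the function name and the example threshold of 0.5 are illustrative, not the exact values from the paper.

```python
def segment_by_nsp(sentences, nsp_probs, threshold=0.5):
    """Group sentences into segments using next-sentence probabilities.

    nsp_probs[i] is the probability that sentences[i+1] follows
    sentences[i]; a probability below the threshold is treated as a
    segment boundary.
    """
    assert len(nsp_probs) == len(sentences) - 1
    segments = [[sentences[0]]]
    for sent, prob in zip(sentences[1:], nsp_probs):
        if prob < threshold:
            segments.append([sent])    # low NSP score: start a new segment
        else:
            segments[-1].append(sent)  # high NSP score: continue the segment
    return segments

sents = ["The cat sat.", "It purred.", "Stocks fell today.", "Markets were volatile."]
probs = [0.9, 0.1, 0.8]  # hypothetical NSP outputs for the three sentence pairs
print(segment_by_nsp(sents, probs))
# → [['The cat sat.', 'It purred.'], ['Stocks fell today.', 'Markets were volatile.']]
```

The topic shift between the second and third sentences produces a low NSP probability, so the document splits into two segments there.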

In our paper, we examined this segmentation approach (and another very similar one) and found that our segmentation scheme improved the model’s accuracy on our benchmark dataset from 58% to 64%.

My GitHub repo and our research paper.