Natural Language Inference: Exploring Sentence Models and Relation Extraction Methods
Introduction
The Natural Language Inference (NLI) task is considered an important area within NLP because it can provide a deeper understanding of natural language and potentially help in developing semantic representation models for other NLP tasks, ranging from information retrieval to dialogue and question answering.
The Natural Language Inference task focuses on identifying entailment, neutrality, and contradiction between two statements, formally known as the premise and the hypothesis. The relationship is entailment if the hypothesis can be inferred from the premise, contradiction if the premise and the hypothesis contradict each other, and neutral if the two statements have nothing to do with each other.
An example from the training set of the SNLI corpus (Bowman et al., 2015) is shown below.
Premise Statement
A person on a horse jumps over a broken down airplane.
Hypothesis Statements
Neutral: A person is training his horse for a competition.
Contradiction: A person is at a diner, ordering an omelette.
Entailment: A person is outdoors, on a horse.
For this project, I explored an LSTM (Long Short-Term Memory) model and a Sum Embedding model (described in more detail later). Bowman et al. (2015) implemented the Sum Embedding model as a baseline in A large annotated corpus for learning natural language inference. In addition, I explored whether the relation extraction method mattered: one option was simple concatenation, and the other was the set of relation extraction methods described in Supervised Learning of Universal Sentence Representations from Natural Language Inference Data by Conneau et al. (2018).
About the Data
The data used is the SNLI (Stanford Natural Language Inference) corpus, created by Bowman et al. (2015); the dataset can be found here. The corpus focuses on the three kinds of relationships (entailment, neutral, contradiction), which were used as labels for the models. The Stanford Natural Language Processing group provides three fully annotated data sets for training and evaluating NLP models: a train set, a development set, and a test set, each available in both json and text formats.
Preprocessing
Some data entries were labeled with “-” instead of one of the three labels, indicating that the annotators did not reach an agreement on those entries. Any data entry labeled with “-” was excluded from the project. Punctuation such as commas and periods was removed. In addition, any incomplete data entries (such as those missing a hypothesis or premise sentence) were not considered for the purposes of the project.
Prior to building the model pipeline, I used the tokenizer from Tensorflow Keras to process the sentences before they were fed into the pipeline.
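For illustration, here is a minimal sketch of how the tokenization and padding could look with Tensorflow Keras; the variable names (train_premises, train_hypotheses, max_seq_len) are placeholders rather than the exact ones from my code:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fit the tokenizer on both premise and hypothesis sentences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_premises + train_hypotheses)
vocab_size = len(tokenizer.word_index)

# Convert each sentence to a sequence of word ids and pad to a fixed length
premise_seqs = pad_sequences(tokenizer.texts_to_sequences(train_premises), maxlen=max_seq_len)
hypothesis_seqs = pad_sequences(tokenizer.texts_to_sequences(train_hypotheses), maxlen=max_seq_len)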
Methods
GloVe Word Embeddings
In Bowman et al.’s paper A large annotated corpus for learning natural language inference, the authors used 300d reference GloVe vectors to initialize and fine-tune the word embeddings. The pre-trained GloVe vectors can be downloaded from here; if the download link does not work, they can also be downloaded from Kaggle. I adapted the code from Merity (2017) to process the GloVe vectors. Some code snippets are shown below:
import numpy as np

# Build a lookup from each word to its pre-trained GloVe vector
embeddings_index = {}
f = open('glove.840B.300d.txt')
for line in f:
    values = line.split(' ')
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

# Build an embedding matrix aligned with the tokenizer's vocabulary
glove_name = 'precomputed_glove_weights_shortened'
embedding_matrix = np.zeros((vocab_size + 1, embedding_hidden_size))  # (39832, 300)
for word, id_ in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[id_] = embedding_vector
np.save(glove_name, embedding_matrix)
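For completeness, here is a minimal sketch of how the saved matrix could be loaded into a Keras Embedding layer so the vectors can be fine-tuned during training; vocab_size, embedding_hidden_size, and max_seq_len are assumed from the snippets above:

from tensorflow.keras.layers import Embedding

# Initialize the embedding layer with the precomputed GloVe weights;
# trainable=True allows the word vectors to be fine-tuned during training
embedding_matrix = np.load(glove_name + '.npy')
embedding_layer = Embedding(vocab_size + 1, embedding_hidden_size,
                            weights=[embedding_matrix],
                            input_length=max_seq_len,
                            trainable=True)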
Baseline Models
There are two baseline models I used to compare against the sentence models: random selection and the hypothesis-only model. The hypothesis-only model was inspired by the paper Hypothesis Only Baselines in Natural Language Inference (Poliak et al., 2018). It serves as a good baseline because it learns only from the hypothesis sentences and not from the relationship between the premise and the hypothesis.
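As a rough illustration, a hypothesis-only classifier can be sketched as follows; this is a simplified sketch reusing embedding_layer and max_seq_len from the snippets above, not the exact architecture from Poliak et al. (2018):

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# The model only ever sees the hypothesis, so any accuracy above random
# selection comes from cues in the hypothesis sentences themselves
hypothesis_input = Input(shape=(max_seq_len,))
encoded_hypothesis = LSTM(100)(embedding_layer(hypothesis_input))
output = Dense(3, activation='softmax')(encoded_hypothesis)

hypothesis_only_model = Model(inputs=hypothesis_input, outputs=output)
hypothesis_only_model.compile(optimizer='adam',
                              loss='categorical_crossentropy',
                              metrics=['accuracy'])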
Sentence Models & Relation Extraction Methods
There are three models that I tried out:
- Sum Embedding Model (Bowman et al., 2015)
- LSTM model + Concatenation (Bowman et al., 2015)
- LSTM model + 3-relation-extraction-method (Bowman et al., 2015; Conneau et al., 2018)
For all of the models, I used Tensorflow Keras to construct the layers.
The first model is the Sum Embedding model that Bowman et al. (2015) implemented as a baseline. It sums the embeddings of the words in each sequence for both the premise and the hypothesis (Bowman et al., 2015). The resulting sentence embeddings are concatenated and fed into a 3-class classifier that uses the softmax function. A figure of the architecture is shown below.
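As a rough sketch of the idea (hidden sizes are placeholders, and embedding_layer and max_seq_len come from the snippets above), the Sum Embedding model could be put together in Keras along these lines:

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Lambda, concatenate
from tensorflow.keras.models import Model

premise_input = Input(shape=(max_seq_len,))
hypothesis_input = Input(shape=(max_seq_len,))

# Sum the word embeddings over the sequence axis to get one vector per sentence
sum_embeddings = Lambda(lambda x: tf.reduce_sum(x, axis=1))
premise_vec = sum_embeddings(embedding_layer(premise_input))
hypothesis_vec = sum_embeddings(embedding_layer(hypothesis_input))

# Concatenate the two sentence vectors and classify into the three labels
merged = concatenate([premise_vec, hypothesis_vec])
hidden = Dense(200, activation='tanh')(merged)
output = Dense(3, activation='softmax')(hidden)

sum_embedding_model = Model(inputs=[premise_input, hypothesis_input], outputs=output)
sum_embedding_model.compile(optimizer='adam',
                            loss='categorical_crossentropy',
                            metrics=['accuracy'])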
The second model uses an LSTM instead of sum embeddings as the sentence encoder architecture and, as in the first model, concatenates the two sentence representations.
The third model also uses LSTMs as sentence encoders but implements the relation extraction methods described in Supervised Learning of Universal Sentence Representations from Natural Language Inference Data by Conneau et al. (2018): the concatenation of the representations of the premise p and the hypothesis h, their element-wise multiplication p*h, and the absolute difference |p-h|.
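A rough sketch of those relation extraction features in Keras might look like the following, reusing premise_input, hypothesis_input, and embedding_layer from the sketch above (layer sizes are again placeholders):

import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, Lambda, concatenate, multiply

# LSTM sentence encoders for the premise and the hypothesis
premise_vec = LSTM(100)(embedding_layer(premise_input))
hypothesis_vec = LSTM(100)(embedding_layer(hypothesis_input))

# Relation extraction features from Conneau et al. (2018):
# concatenation [p, h], element-wise product p*h, and absolute difference |p - h|
elementwise_product = multiply([premise_vec, hypothesis_vec])
absolute_difference = Lambda(lambda t: tf.abs(t[0] - t[1]))([premise_vec, hypothesis_vec])
features = concatenate([premise_vec, hypothesis_vec,
                        elementwise_product, absolute_difference])

hidden = Dense(200, activation='relu')(features)
output = Dense(3, activation='softmax')(hidden)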
A model summary of the third model is shown below:
Results and Discussion
Here are the accuracy scores of the models on the development set:
The models with LSTM sentence encoders did not perform as well as expected. Neither the sum embedding model nor the LSTM models were able to outperform the hypothesis-only model, and all of them performed similarly to the random selection model. Interestingly, the sum embedding model performed slightly better than both of the LSTM models.
This led me to conclude that exploring features from the raw sentences might have been a beneficial step prior to using sophisticated neural models and attempting any deep learning. It would have been helpful to know which features within the raw sentences would help the models learn the relationship between the two sentences. Because there were no significant differences in accuracy among the three models, it is difficult to conclude whether using an LSTM sentence encoder was better than sum embeddings, or whether using the three relation extraction methods was more beneficial than simply concatenating the sentence representations.
In addition, the sum embedding model took much less time to train than the two LSTM models, as shown in the table below:
In the context of this project, the sum embedding model might look superior in terms of both accuracy and training time, which is contrary to what Bowman et al. (2015) concluded. It seems that either there were bugs in the code or the architecture was not set up correctly.
Here is the accuracy score of the sum embedding model on the test set:
Next Steps
There are several next steps that could be done from here on:
- Attempting a BERT model for this task: I think implementing DistilBERT via the simpletransformers library would be a good start (a rough sketch of what this could look like is included after this list). It would be interesting to compare the performance of the BERT model relative to the LSTM models.
- More focus on extracting and understanding features: I realized in the middle of the project that some feature engineering on the raw sentences would have been beneficial before attempting any sophisticated models. Looking at the hypothesis-only baseline, it seems that the model does learn a lot from the hypothesis sentences alone and scores higher than random selection. There might be features in the raw text that could be useful for the models to learn from.
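For reference, here is a minimal sketch of what starting with simpletransformers could look like; the label mapping and model arguments are illustrative rather than tuned, and train_premises, train_hypotheses, train_labels, and eval_df are placeholder names:

import pandas as pd
from simpletransformers.classification import ClassificationModel

# simpletransformers expects sentence pairs as a DataFrame with text_a, text_b, and labels columns
train_df = pd.DataFrame({
    'text_a': train_premises,
    'text_b': train_hypotheses,
    'labels': train_labels,  # e.g. 0 = entailment, 1 = neutral, 2 = contradiction
})

# Fine-tune DistilBERT as a 3-class classifier and evaluate on the development set
model = ClassificationModel('distilbert', 'distilbert-base-uncased', num_labels=3)
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)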
References
Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2018). Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. arXiv:1705.02364 [cs.CL]. Retrieved April 10, 2021, from https://arxiv.org/pdf/1705.02364v5.pdf
Merity, S. (2017). Keras SNLI baseline example. Retrieved April 15, 2021, from https://github.com/Smerity/keras_snli/blob/master/snli_rnn.py
Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., & Van Durme, B. (2018). Hypothesis Only Baselines in Natural Language Inference. Proceedings of the 7th Joint Conference on Lexical and Computational Semantics (*SEM), 180–191. https://www.aclweb.org/anthology/S18-2023.pdf