Find similar statements in documents

Jianlin Shi
Sep 16, 2022

When we start to work on a natural language processing (NLP) dataset (usually a list of documents), we are often unsure how other people describe a particular concept in the dataset. For example, in clinical notes, doctors may write “the patient admitted with a high fever.” But there are many ways to describe the same idea, some of which are relatively rare in everyday language, such as “the patient was febrile.”

Is there any way we can explore these similar statements? Yes! Let me demonstrate how to use semantic search to do it (try the workable notebook while you read this :D).

Create a sample dataset

First, let’s get some clinical data from mtsamples (please be considerate and don’t request too many notes).

docs = get_mtsamples_notes(num_notes=150)

Next, let’s split the notes into sentences:
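
The splitting code isn’t shown here (see the notebook for the original); a minimal sketch using a naive regex split, collecting the sentences into the pandas DataFrame `df` used below. Real pipelines usually prefer a proper sentence segmenter such as the ones in nltk or spaCy.

```python
import re
import pandas as pd

def split_sentences(text):
    # naive split on ., !, ? followed by whitespace; good enough for a demo
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

docs = ["The patient admitted with a high fever. He was febrile overnight."]
sentences = [sent for doc in docs for sent in split_sentences(doc)]
df = pd.DataFrame({"sentences": sentences})
print(len(df))  # 2
```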

We will use Hugging Face’s “datasets” package to hold the dataset and build a vector index. Under the hood, here is how it works:

Generate mathematical representations (vectors) of text

For each sentence, the transformer model converts the text into an array of numbers, a mathematical way to represent a position in a high-dimensional space. The direction from the coordinate origin to that position forms a vector. The key to this conversion is how the model generates the vector: the magic of transformer models is that they place semantically related text closer together in this high-dimensional space. We will show the magic later in this post.

(Figure: a plot of two vectors in a 3D space)

from sentence_transformers import SentenceTransformer

# a small, fast general-purpose embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df['sentences'].tolist(), batch_size=32)
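
“Closer” in this space is usually measured with cosine similarity: the cosine of the angle between two vectors, where 1.0 means they point the same way. A toy numpy sketch with made-up 3-d vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_sim(a, b):
    # cosine of the angle between two vectors; 1.0 = identical direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy "embeddings": the first two point in similar directions
fever = np.array([0.9, 0.1, 0.2])
febrile = np.array([0.8, 0.2, 0.1])
invoice = np.array([-0.1, 0.9, -0.3])

print(cosine_sim(fever, febrile))  # high (similar direction)
print(cosine_sim(fever, invoice))  # much lower
```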

Index vectors

Next, we use faiss to build the index. In case the word “index” here doesn’t make any sense to you, here is an analogy. Suppose you just got a new book recommended by your friend, who suggested you read chapter 3.5 first. How would you get there?

You would probably check the table of contents, locate chapter 3.5, find the corresponding page numbers, and then turn to that page. Here, the vectors are the book pages, organized in high-dimensional space the way pages are organized into chapters; the “index” is the table of contents.
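
Under the hood, a (flat) index answers one question: which stored vectors are nearest to a query vector? A brute-force numpy sketch of that computation, which faiss organizes and accelerates:

```python
import numpy as np

# three stored 2-d vectors and one query vector (toy data)
stored = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.9, 0.1]])
query = np.array([1.0, 0.05])

dists = np.linalg.norm(stored - query, axis=1)  # L2 distance to each stored vector
order = np.argsort(dists)                       # nearest first
print(order)  # [0 2 1]
```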

from datasets import Dataset

df['embeddings'] = list(embeddings)
sents_dataset = Dataset.from_pandas(df)
sents_dataset.add_faiss_index(column="embeddings")

Magic is here

Now, we can try some examples to show the magic:

find_similar_sents("the patient presented with fever.")
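
The helper `find_similar_sents` comes from the notebook; a hypothetical sketch of how it could be implemented, assuming the `model` and the faiss-indexed `sents_dataset` built above are passed in, using the real `Dataset.get_nearest_examples` API from the datasets package:

```python
def find_similar_sents(query, model, sents_dataset, k=5):
    """Return the k indexed sentences most similar to `query`.

    Hypothetical sketch: assumes `model` is the SentenceTransformer and
    `sents_dataset` is the Dataset with a faiss index on "embeddings".
    """
    query_emb = model.encode([query])[0]  # embed the query text
    scores, examples = sents_dataset.get_nearest_examples(
        "embeddings", query_emb, k=k      # look up nearest vectors in the index
    )
    return list(zip(scores, examples["sentences"]))
```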

A workable Colab notebook you can try is here.
