Document search using fragment embedding

Posted Jun 16, 2020 · 11 min read

Author|Ajit Rajasekharan
Source|Towards Data Science

Embeddings of sentence fragments harvested from a document can serve as extractive summary facets of that document and may speed up its discovery, especially when the user input is itself a sentence fragment. These fragment embeddings not only produce higher quality results than traditional text matching systems, but also sidestep a problem inherent in modern embedding-driven search methods: the challenge of creating an effective document embedding, a single document-level embedding that captures every facet of the document and makes it discoverable through search.

For example, in "bats are the source of coronaviruses" or "coronaviruses in pangolins", short sequences of one or more noun phrases are connected by prepositions, adjectives, and so on. These connecting words are largely ignored by traditional search systems, yet they play a key role in capturing user intent (for example, "coronavirus in bats" differs from "bats are the source of coronaviruses" or "there are no coronaviruses in bats"). Sentence fragments that retain them are also valuable candidate index entries that can serve as extractive summary facets (sub-summaries) of a document. By embedding these sentence fragments in an appropriate embedding space (such as BERT), we can use the search input fragment as a probe into that space to discover relevant documents.

The need to improve search using fragments

Finding a comprehensive answer backed by literature evidence to questions such as "What animal does COVID-19 come from?" or "Receptor to which coronavirus binds" is a challenge even on a small dataset like the recently released COVID-19 dataset (a corpus of about 500 MB, with about 13k documents, more than 85 million words, and about 1 million distinct words).

Traditional document search works very well for the typical use case of obtaining answers from a few documents by searching with one or more noun phrases. It also satisfies the following user experience constraint for words and phrases:

What we see (results) is what we entered (searched for)

For example, when we search for words and phrases (contiguous sequences of words, such as New York or Rio De Janeiro), the results usually contain the terms we entered or their synonyms (for example, a search for COVID-19 also returns results for SARS-CoV-2 or novel coronavirus).

However, as the number of words in the search input increases, the quality of the search results tends to decrease, especially when connecting words are used between noun phrases. Even when the search engine highlights the matched terms in the results, the decline in quality is still obvious.

For example, in the image below, the current search engine selectively highlights the nouns in "bats as a source of coronavirus", sometimes without even respecting the order of these words in the input sequence. Although document relevance ranking can usually alleviate this considerably, we still need to check the summary of each document, because a document may not satisfy our search intent.

In addition to producing more relevant results, the document search method described in this article can also reduce this cognitive burden, especially when searching for sentence fragments. As an illustration, the same query we used in the existing search system above can produce results in the form shown below (this interface is only a schematic to explain the search method). The main point of the diagram is that the summary fragments are actual matches in the documents (the numbers in parentheses are the number of documents containing the fragment and the cosine distance of the fragment from the input search fragment), not the suggested queries or related search queries shown in traditional search systems. These summary fragments provide a panoramic view of the result space, reducing useless document navigation and speeding up convergence on the documents of interest.

The input fragment can be a complete or partial sentence, with no restrictions on its composition or style. For example, unlike the declarative queries above, it can be a question: we can find the protein receptors that coronaviruses bind to by searching for "what are the coronavirus-bound receptors?"

The comparison between the search systems above is only meant to illustrate the difference between the underlying approaches to document discovery. Otherwise it would be an unfair comparison given the orders-of-magnitude difference in corpus size, because we are bound to get more relevant results from a tiny corpus.

Because of their advantages over traditional purely symbolic search methods, vectorized representations have become an indispensable part of any form of search. Modern search systems increasingly use them to supplement symbolic search. If we view document search broadly as a combination of breadth-first and depth-first traversal of the document space, then each form of traversal requires embeddings with properties suited to it. For example, we can start with the animal sources of the coronavirus, go deep into bats, and then broaden to reptiles, and so on.

  • Vectorized representations of documents: the word, phrase, and sentence-fragment embeddings obtained from the Word2vec and BERT embedding spaces have distinct, complementary properties that are useful for breadth-first and depth-first search. Specifically, Word2vec embeddings of terms (terms here means words and phrases, such as bats, civet cats, etc.) are effective for breadth-first search, with entity-based clustering applied to the results. Searching for "bat" or "civet cat" returns other animals, such as pangolins and camels.
  • BERT embeddings of sentence fragments ("coronavirus in pangolins", "bats as a source of coronavirus", etc.) are useful for depth-first search: they surface fragment variants that largely retain the original nouns, depending on whether such variants are present in the corpus. For example, "bats as a source of coronavirus" yields fragment variants such as "bat coronavirus", "coronavirus produced by bats", and so on.
  • Although these embeddings are largely complementary, they also overlap. Word2vec embeddings can produce depth-first results, and BERT embeddings produce breadth-first results in the tail of the neighborhood distribution. For example, searching for "bat" with Word2vec embeddings returns not only other animals such as camels and pangolins, but also bat species (fruit bat, flying fox, Pteropus, etc.). A BERT fragment search for "peacock coronavirus" returned "cat coronavirus" and "cheetah coronavirus", although the results were mainly avian coronaviruses.
  • The BERT model allows search input (terms or fragments) that is not in the vocabulary, so any user input can be used to find related documents.

How does this approach work?

The expanded terms or fragments obtained from the word2vec/BERT embeddings are used to exactly match documents that have been indexed offline by these terms or fragments. Offline, a combination of a part-of-speech tagger and a chunker is used to harvest fragments from the corpus, and embeddings are created for them with both the word2vec and BERT models.

  • Mapping user input to term and fragment embeddings not only increases search breadth and depth, but also avoids the problem of creating high-quality document embeddings that match user input. Specifically, fragments play a dual role: they are document index entries, and they give a single document multiple searchable "extractive summaries", because the fragments are taken from the document itself. Compared with using only terms or phrases to find such documents, using fragments also increases the chance of finding a target that is buried in a large document. For example, finding the potential animal source of coronavirus is a clear case of finding such a target. We can see in the figure above that the fragment matches a single document (this is examined in detail in the notes section below).
  • Embeddings are used purely to find candidate terms/fragments. Traditional search indexing methods then find the documents that match these terms/fragments, which lets us perform document search at scale.
  • Finally, answering broad questions such as "What is the animal source of COVID-19?" can be automated and done offline, given the large scope and processing time of the task. The fragment embedding-driven search method described here is instead suited to "not too broad" real-time search use cases, such as "receptor to which coronavirus binds", given sufficient computing resources and an efficient hashing method to perform embedding space search at scale.
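
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the three-dimensional vectors are hand-made stand-ins for BERT fragment embeddings, the document IDs and the names `build_index` and `search` are hypothetical, and a brute-force cosine scan replaces the hashing-based similarity search a large-scale system would need.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Offline step 1: fragments harvested from each document (toy data).
DOC_FRAGMENTS = {
    "doc1": ["bat coronavirus", "coronavirus produced by bats"],
    "doc2": ["coronavirus in pangolins"],
    "doc3": ["receptor binding domain"],
}

# Offline step 2: fragment embeddings. These toy vectors are an assumption;
# a real system would obtain them from a BERT-based encoder.
FRAGMENT_VECTORS = {
    "bat coronavirus": [0.9, 0.1, 0.0],
    "coronavirus produced by bats": [0.7, 0.3, 0.2],
    "coronavirus in pangolins": [0.3, 0.9, 0.0],
    "receptor binding domain": [0.0, 0.1, 0.9],
}


def build_index(doc_fragments):
    """Inverted index: fragment text -> set of documents containing it."""
    index = defaultdict(set)
    for doc_id, fragments in doc_fragments.items():
        for fragment in fragments:
            index[fragment].add(doc_id)
    return index


def search(query_vector, index, vectors, top_k=2):
    """Find the nearest fragments in embedding space, then look up their
    documents by exact match in the inverted index."""
    ranked = sorted(
        vectors, key=lambda f: cosine(query_vector, vectors[f]), reverse=True
    )
    return [(fragment, sorted(index[fragment])) for fragment in ranked[:top_k]]


INDEX = build_index(DOC_FRAGMENTS)
# Hypothetical embedding of the query "bats as a source of coronavirus".
QUERY_VECTOR = [0.85, 0.15, 0.05]
RESULTS = search(QUERY_VECTOR, INDEX, FRAGMENT_VECTORS)
```

The embedding space is only consulted to pick candidate fragments. The final document lookup is a plain dictionary access, which is why the document-matching step can be handed to a conventional search index.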

Limitations of the current method

As mentioned earlier, word2vec embeddings expand the breadth of search for words and phrases. They do not extend the breadth of fragment search: the histogram of a fragment's neighborhood often lacks a distinct tail (figure below). This is because fragments, owing to their length, do not have enough neighboring context to learn high-quality embeddings. This shortcoming can be partially addressed by enlarging the training window and ignoring sentence boundaries to increase the surrounding context, but in practice it is still not enough because fragments occur so rarely.

BERT embeddings largely increase only the depth of search, especially for fragments and phrases (using BERT embeddings to expand the search depth of words is not useful in practice). They do increase breadth to some extent, for example, the query "Coronavirus in Macaque" expands to "Coronavirus in Palm Civet", which appears in the tail of the neighborhood distribution, but this breadth is not as good as what word2vec provides for words and phrases. The following diagram illustrates this shortcoming. The implementation notes also include examples of the limited breadth of fragment search, along with some ways to work around this limitation.


Word2vec, introduced about seven years ago, was arguably the first model to clearly establish the power of vectorized representations. The "architecture" of this simple model is essentially two arrays of vectors, and the embeddings it outputs are still of great value to downstream applications (such as the document search method described above).

Word2vec, working together with BERT embeddings, provides a solution for document search. This solution may improve on traditional methods in terms of search result quality and the time it takes to converge on relevant documents (this claim needs to be quantified). A search system can use these vector representations not only to select specific documents, but also to find documents similar to the selected ones.

Before selecting a document, embeddings (of words, phrases, or sentence fragments) can be used to broaden or deepen the search. Word2vec embeddings of words and phrases greatly increase the breadth of document search. BERT embeddings greatly increase the search depth for sentence fragments. BERT embeddings also eliminate the out-of-vocabulary scenario, and they make the different important fragments in a document searchable as extractive summaries, which speeds up convergence on related documents.


  • The animal source of COVID-19 is not confirmed to date.

  • Sentence BERT

  • Unsupervised NER using BERT
  • An answer explaining how word2vec works

Implementation considerations

1. What NLP methods/models are used in this approach?

  • A part-of-speech tagger to tag sentences (CRF-based, an order of magnitude faster than the current state-of-the-art methods by F1 measure, with recall that is adequate for this task)
  • A chunker to create phrases
  • Word2vec for embeddings of words and phrases
  • BERT for fragment embeddings (sentence-transformers)
  • BERT for unsupervised entity tagging
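
As a rough illustration of the tagger-plus-chunker step, the sketch below extracts noun phrases and connective-preserving fragments from a sentence that has already been tagged. It rests on several assumptions: the Penn Treebank tags are supplied by hand in place of the CRF tagger's output, the chunking rules are deliberately simplistic, and the function names are hypothetical rather than taken from the article.

```python
# Tags that may appear inside a noun phrase (adjectives and nouns).
NP_TAGS = {"JJ", "NN", "NNS", "NNP", "NNPS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
# Punctuation tags that end a fragment.
BREAK_TAGS = {",", ".", ":", ";"}


def chunk_noun_phrases(tagged):
    """Group consecutive adjective/noun tokens into noun phrases."""
    phrases, current = [], []
    for token, tag in list(tagged) + [("", ".")]:  # sentinel flushes the last run
        if tag in NP_TAGS:
            current.append((token, tag))
            continue
        if any(t in NOUN_TAGS for _, t in current):
            phrases.append(" ".join(tok for tok, _ in current))
        current = []
    return phrases


def _trim(segment):
    """Drop leading/trailing tokens that are not part of a noun phrase."""
    while segment and segment[0][1] not in NP_TAGS:
        segment = segment[1:]
    while segment and segment[-1][1] not in NP_TAGS:
        segment = segment[:-1]
    if not any(tag in NOUN_TAGS for _, tag in segment):
        return ""
    return " ".join(token for token, _ in segment)


def extract_fragments(tagged):
    """Split at punctuation and keep the connecting words between noun phrases."""
    fragments, segment = [], []
    for token, tag in list(tagged) + [("", ".")]:  # sentinel flushes the last run
        if tag in BREAK_TAGS:
            fragment = _trim(segment)
            if fragment:
                fragments.append(fragment)
            segment = []
        else:
            segment.append((token, tag))
    return fragments


# Hand-tagged example sentence (stand-in for real tagger output).
TAGGED = [
    ("Bats", "NNS"), ("are", "VBP"), ("the", "DT"), ("source", "NN"),
    ("of", "IN"), ("coronaviruses", "NNS"), (",", ","), ("and", "CC"),
    ("pangolins", "NNS"), ("may", "MD"), ("be", "VB"),
    ("intermediate", "JJ"), ("hosts", "NNS"), (".", "."),
]

PHRASES = chunk_noun_phrases(TAGGED)
FRAGMENTS = extract_fragments(TAGGED)
```

Here the phrases would feed the Word2vec vocabulary, while the fragments, which keep connectives such as "are the source of", would be embedded with BERT and used as index entries.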

2. How is the relevance of document results computed?

Fragments can be ranked by their cosine distance to the input fragment. The documents that match each fragment are then selected and listed in the same order as the ranked fragments.
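
One way to read this ranking rule is sketched below. It is an interpretation, not the article's code: the distances, document IDs, and the function name `rank_results` are made up for illustration, and each summary line uses the "(document count, cosine distance)" form described earlier for the schematic result display.

```python
def rank_results(fragment_distances, fragment_to_docs):
    """Order fragments by cosine distance (smallest first) and list the
    documents they match in that same order, without duplicates."""
    ranked = sorted(fragment_distances.items(), key=lambda item: item[1])
    summaries, documents, seen = [], [], set()
    for fragment, distance in ranked:
        docs = fragment_to_docs.get(fragment, [])
        # Summary line: fragment (number of documents, cosine distance).
        summaries.append(f"{fragment} ({len(docs)}, {distance:.2f})")
        for doc_id in docs:
            if doc_id not in seen:
                seen.add(doc_id)
                documents.append(doc_id)
    return summaries, documents


# Hypothetical cosine distances from the input fragment to candidate fragments.
DISTANCES = {
    "coronavirus produced by bats": 0.12,
    "bat coronavirus": 0.05,
    "coronavirus in pangolins": 0.40,
}
# Hypothetical inverted index entries for those fragments.
FRAGMENT_TO_DOCS = {
    "bat coronavirus": ["doc1", "doc4"],
    "coronavirus produced by bats": ["doc1"],
    "coronavirus in pangolins": ["doc2"],
}

SUMMARIES, DOCUMENTS = rank_results(DISTANCES, FRAGMENT_TO_DOCS)
```

A document that matches several fragments keeps the position of its closest fragment, so it is not listed twice.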

The computationally intensive step in real-time search is the similarity search in the embedding space (Word2vec or BERT). Existing open-source solutions can already do this at scale. Some optimizations can reduce the time and compute required. For example, only one of the two embedding spaces needs to be searched, chosen by the length of the search input, because the strengths and weaknesses of these models depend on input length.

4. Isn't a fragment just a long phrase? If so, why give it a different name?

A fragment is essentially a long phrase. The distinction between a fragment and a phrase is useful for two reasons:

a) Fragments can be complete sentences, not just partial sentences.

b) The strengths of these models depend on input length, as we saw earlier. Word2vec performs well on words/phrases. BERT performs best on fragments (≥5 words).
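
The optimization mentioned earlier, searching only one embedding space depending on input length, could be expressed as a simple dispatch. This is a sketch under assumptions: the five-word threshold comes from the "≥5 words" figure above, whitespace splitting stands in for real tokenization, and the function name and return labels are hypothetical.

```python
# Threshold taken from the text: BERT performs best on fragments of >= 5 words.
FRAGMENT_MIN_WORDS = 5


def choose_embedding_space(query):
    """Pick which embedding space to search based on query length.

    Whitespace splitting is a simplification; a real system would use the
    same tokenization as its chunker.
    """
    word_count = len(query.split())
    return "bert" if word_count >= FRAGMENT_MIN_WORDS else "word2vec"


SPACE_FOR_TERM = choose_embedding_space("bat")
SPACE_FOR_PHRASE = choose_embedding_space("bat coronavirus")
SPACE_FOR_FRAGMENT = choose_embedding_space("bats as a source of coronavirus")
```

Inputs near the threshold could reasonably be searched in both spaces, which trades some compute for coverage.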

5. What do the neighborhood histogram distributions of terms and fragments look like?

The following are the BERT and Word2vec neighborhoods of a word, a phrase (3 words), and a fragment (8 words), which illustrate how the two models complement each other. For BERT, the tail of the distribution grows with input length, and the tail for a fragment is markedly different from that for a phrase or word. When a term's occurrence count is low, the distribution may sometimes have a thick tail, which indicates poor results. Embeddings generated by sentence-transformers tend to have a distinct tail, unlike embeddings generated by bert-as-service, even though both use summation over subwords as the pooling method (other pooling methods also exist), because sentence-transformers' supervised training uses sentence pairs labeled as entailment, neutral, or contradiction.

Word2vec works well for words and phrases. For long phrases, even when the occurrence count is high, the vectors almost degenerate into a "pathological" form, with a few neighbors clustered at the high end and the rest concentrated at the low end. The shape of the distribution also varies for long phrases. Regardless of the shape, however, the neighborhood results clearly show this degradation in quality.
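
A neighborhood histogram of this kind can be computed by binning the cosine similarities between a query vector and its neighbors. The sketch below uses made-up similarity values, and the `tail_count` helper with its 0.5 cutoff is an assumption about how "tail" might be quantified, not a definition from the article.

```python
def similarity_histogram(similarities, n_bins=10):
    """Count neighbors per equal-width cosine-similarity bin over [0, 1]."""
    counts = [0] * n_bins
    for value in similarities:
        # Clamp so that 1.0 falls in the last bin and negatives in the first.
        index = min(max(int(value * n_bins), 0), n_bins - 1)
        counts[index] += 1
    return counts


def tail_count(similarities, threshold=0.5):
    """Number of neighbors at or above the threshold (the 'tail' here)."""
    return sum(1 for value in similarities if value >= threshold)


def render(counts, n_bins=10):
    """Text rendering of the histogram, one line per bin."""
    return [
        f"{i / n_bins:.1f}-{(i + 1) / n_bins:.1f} | {'#' * count}"
        for i, count in enumerate(counts)
    ]


# Hypothetical cosine similarities between a query and its ten neighbors.
NEIGHBOR_SIMILARITIES = [0.95, 0.92, 0.55, 0.45, 0.35, 0.32, 0.31, 0.25, 0.22, 0.21]

HISTOGRAM = similarity_histogram(NEIGHBOR_SIMILARITIES)
TAIL = tail_count(NEIGHBOR_SIMILARITIES)
LINES = render(HISTOGRAM)
```

A small, well-separated tail suggests a few close neighbors that stand out from the bulk, whereas a thick tail means many neighbors look equally similar and the ranking carries little signal.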

6. How robust are the results to variations in the input fragment? This is what makes it possible to converge on the same results using input variants.

Although the sets of fragments retrieved for different variants of the same question differ, they may overlap considerably. However, because of the limited breadth of fragment search discussed earlier, some queries may not produce any fragments that cover all the searched nouns. For example, "Pteropus as a source of coronavirus" or "Pteropus coronavirus" may not produce any fragments containing bats (Pteropus is a genus of bats). When the fragments do not contain all the nouns, one method to consider is to find Word2vec synonyms for a term and use them to reconstruct the query.
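
The query reconstruction idea can be sketched as term substitution. The synonym table below is hypothetical and stands in for Word2vec neighbors of a term (here "flying fox", one of the bat species mentioned earlier), and `reconstruct_queries` is an illustrative name, not a function from the article.

```python
import re


def reconstruct_queries(query, synonyms):
    """Return the original query plus variants in which a term is replaced
    by each of its synonyms (one term substituted per variant)."""
    variants = [query]
    for term, alternatives in synonyms.items():
        # Whole-word match so that a short term is not replaced inside a longer word.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        for alternative in alternatives:
            rewritten = pattern.sub(alternative, query)
            if rewritten not in variants:
                variants.append(rewritten)
    return variants


# Hypothetical Word2vec neighbors for a term in the query.
SYNONYMS = {"flying fox": ["bat", "fruit bat"]}

VARIANTS = reconstruct_queries("flying fox as a source of coronavirus", SYNONYMS)
```

Each variant can then be run as its own fragment search, and the retrieved fragment sets merged.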

7. How do these models perform when searching for terms, phrases, and fragments in large documents?

Word2vec embeddings are not directly useful in this case because the vector of a term/phrase that occurs only once does not have enough context to learn a rich representation. BERT embeddings do not have this shortcoming, because the words within a fragment provide enough context to learn a good representation. However, Word2vec can still help by finding synonyms for a noun in the search. For example, if the document space contains only one reference to fruit bat coronavirus, searching for coronavirus in Pteropus may not retrieve the document. However, searching for the fragment "coronavirus in fruit bats" (reconstructed with Word2vec) can find it. That said, a fragment that appears only in the tail of the distribution may be filtered out as a candidate. Most fragments are also inherently interpretable, an advantage that a word or phrase does not necessarily have.

8. More details about extracting animal coronavirus information

Using Word2vec and entity tagging, approximately 1,000 (998) biological entities were obtained. These were used to collect 195 viral fragments. A sample of 30 fragments is shown here.

Some of these fragments carry evidence that animals are potential sources of coronavirus.
