This function compares two blocks of text (e.g., abstracts) using multiple similarity measures. It supports Jaccard Similarity, Cosine Similarity, Levenshtein Distance, and Longest Common Subsequence. The function returns a similarity score based on the chosen method.

Usage

compare_texts(text1, text2)

Arguments

text1

The first text block to compare.

text2

The second text block to compare.

Value

A numeric value representing the similarity between the two text blocks.

Details

The function computes the similarity of two text blocks with one of four measures: Jaccard similarity, cosine similarity, Levenshtein distance, and longest common subsequence.

Similarity Measures

The function compares two blocks of text using several different similarity measures. Below are descriptions of the most commonly used measures and how they work:

1. Jaccard Similarity

Definition: Jaccard similarity measures the proportion of shared words between two texts, ignoring word order. It is calculated as:

J(A, B) = |A ∩ B| / |A ∪ B|

where:

  • A and B are the sets of unique words in each text.

  • |A ∩ B| is the number of words common to both texts.

  • |A ∪ B| is the total number of unique words across both texts.

Interpretation:

  • Values range from 0 (no words in common) to 1 (both texts contain exactly the same set of unique words).

  • Higher values indicate more word overlap, but this measure does not account for word frequency or order.
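
The formula above can be sketched in base R. This is an illustrative sketch only, not the implementation behind compare_texts(): the helper names tokenize_words() and jaccard_sketch() are hypothetical, and the tokenization rule (lowercase, split on non-alphanumeric characters) is an assumption.

```r
# Hypothetical helper: split a text into lowercase word tokens.
tokenize_words <- function(text) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words[nzchar(words)]  # drop empty tokens
}

# Jaccard similarity: |A intersect B| / |A union B| over unique words.
jaccard_sketch <- function(text1, text2) {
  a <- unique(tokenize_words(text1))
  b <- unique(tokenize_words(text2))
  union_size <- length(union(a, b))
  if (union_size == 0) return(NA_real_)  # no words at all: undefined
  length(intersect(a, b)) / union_size
}

# 4 shared words out of 5 unique words overall
jaccard_sketch("This is a test", "This is a different test")
#> [1] 0.8
```

Because only the sets of unique words matter, repeating a word or reordering the sentence leaves the score unchanged.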

2. Cosine Similarity

Definition: Cosine similarity is a vector-based approach that measures how similar two texts are based on word frequency. The texts are represented as word frequency vectors, and similarity is computed as the cosine of the angle between these vectors:

cos(θ) = (A · B) / (||A|| ||B||)

where:

  • A and B are word frequency vectors.

  • A · B is the dot product of the two vectors.

  • ||A|| and ||B|| are the vector magnitudes (lengths).

Interpretation:

  • Ranges from 0 (no words in common) to 1 (identical relative word frequencies), because word counts are never negative.

  • Takes word frequency into account, so if a word appears multiple times in both texts, the similarity score is higher.

  • Ignores word order.
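
A minimal base R sketch of the calculation follows. It is not the code used by compare_texts(): the names tokenize_words() and cosine_sketch() are hypothetical, and raw word counts are assumed as vector entries (no TF-IDF or other weighting).

```r
# Hypothetical helper: split a text into lowercase word tokens.
tokenize_words <- function(text) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words[nzchar(words)]  # drop empty tokens
}

# Cosine similarity between the word frequency vectors of two texts.
cosine_sketch <- function(text1, text2) {
  w1 <- tokenize_words(text1)
  w2 <- tokenize_words(text2)
  vocab <- union(w1, w2)
  # Count every vocabulary word in each text (vectors A and B).
  a <- as.numeric(table(factor(w1, levels = vocab)))
  b <- as.numeric(table(factor(w2, levels = vocab)))
  norm_product <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (norm_product == 0) return(NA_real_)  # an empty text has no direction
  sum(a * b) / norm_product
}

# A = (1, 1, 1, 1, 0), B = (1, 1, 1, 1, 1): 4 / (2 * sqrt(5))
cosine_sketch("This is a test", "This is a different test")
#> [1] 0.8944272
```

Repeating a whole text does not change the score: "a b" and "a b a b" have proportional frequency vectors and therefore a cosine similarity of 1.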

3. Levenshtein Distance (Edit Distance)

Definition: Levenshtein distance (edit distance) is the minimum number of single-character edits (insertions, deletions, substitutions) required to transform one text into the other.

Interpretation:

  • A lower value means the texts are more similar.

  • A score of 0 means the texts are identical.

  • Unlike Jaccard or Cosine similarity, this method is sensitive to word order and spelling differences.
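
For illustration, the distance can be computed with utils::adist(), which ships with base R. The wrapper names edit_distance() and edit_similarity() are hypothetical, and the rescaling to a 0 to 1 similarity is an assumption (one common convention), not necessarily what compare_texts() returns.

```r
# utils::adist() returns a matrix of Levenshtein distances.
edit_distance <- function(text1, text2) {
  as.integer(utils::adist(text1, text2)[1, 1])
}

edit_distance("kitten", "sitting")  # k->s, e->i, append "g"
#> [1] 3
edit_distance("This is a test", "This is a different test")  # insert "different "
#> [1] 10

# Assumption: rescale by the length of the longer text so that
# 1 means identical and 0 means nothing in common.
edit_similarity <- function(text1, text2) {
  longest <- max(nchar(text1), nchar(text2))
  if (longest == 0) return(1)  # two empty texts are identical
  1 - edit_distance(text1, text2) / longest
}

edit_similarity("This is a test", "This is a different test")  # 1 - 10/24
#> [1] 0.5833333
```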

4. Longest Common Subsequence (LCS)

Definition: LCS measures the longest sequence of characters (not necessarily consecutive) that appears in both texts in the same order.

Interpretation:

  • Higher values indicate more similarity.

  • Unlike Levenshtein, LCS has no notion of substitution: extra characters between matches are simply skipped, as long as the matching characters keep their relative order.

  • Good for detecting similar sentence structures.
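
The LCS length can be sketched with the standard dynamic-programming table. This is an illustrative base R version, not the implementation used by compare_texts(); the name lcs_length_sketch() is hypothetical.

```r
# Length of the longest common subsequence of two strings (character level).
lcs_length_sketch <- function(text1, text2) {
  x <- strsplit(text1, "")[[1]]
  y <- strsplit(text2, "")[[1]]
  n <- length(x)
  m <- length(y)
  # dp[i + 1, j + 1] is the LCS length of the first i characters of x
  # and the first j characters of y; row 1 and column 1 stay 0.
  dp <- matrix(0L, nrow = n + 1, ncol = m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      if (x[i] == y[j]) {
        dp[i + 1, j + 1] <- dp[i, j] + 1L
      } else {
        dp[i + 1, j + 1] <- max(dp[i, j + 1], dp[i + 1, j])
      }
    }
  }
  dp[n + 1, m + 1]
}

lcs_length_sketch("ABCBDAB", "BDCABA")  # e.g. "BCBA"
#> [1] 4
# The first text is a subsequence of the second, so all 14 characters match.
lcs_length_sketch("This is a test", "This is a different test")
#> [1] 14
```

The table needs (n + 1) × (m + 1) cells, so this approach becomes slow for very long texts.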

Choosing the Right Measure

Measure              | What It Captures                       | Best Use Case
Jaccard Similarity   | Word overlap, ignores order            | Deduplication, quick filtering
Cosine Similarity    | Word frequency, ignores order          | Finding similar abstracts, topic comparison
Levenshtein Distance | Spelling differences, order-sensitive  | Checking near-duplicate sentences, typos
LCS Distance         | Common phrases, order-sensitive        | Sentence structure comparison

For deduplication, a combination of Jaccard + Cosine Similarity works well. For cases where word order matters, use Levenshtein or LCS.

Examples

# Requires the 'stringdist' package. If it is not installed, these calls fail with:
#   Error in loadNamespace(x): there is no package called 'stringdist'
compare_texts("This is a test", "This is a test")
compare_texts("This is a test", "This is a different test")