Similar ngrams dataset
This dataset is released as part of the following research paper:
The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering. Shangdi Yu, Jessica Shi, Jamison Meindl, David Eisenstat, Xiaoen Ju, Sasan Tavakkol, Laxman Dhulipala, Jakub Łącki, Vahab Mirrokni, Julian Shun, submitted to VLDB’25.
License
Creative Commons Attribution-NonCommercial-ShareAlike 3.0
Get the data
Download the dataset (5.69 GB zip file).
Citing
@article{yu2024parclusterers,
  title={The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering},
  author={Yu, Shangdi and Shi, Jessica and Meindl, Jamison and Eisenstat, David and Ju, Xiaoen and Tavakkol, Sasan and Dhulipala, Laxman and Łącki, Jakub and Mirrokni, Vahab and Shun, Julian},
  journal={submitted},
  year={2024}
}
Dataset contents
- ngrams-embeddings.csv - ngrams and their embeddings. CSV file (no header) with 1,274,175 rows and 769 columns:
- ngram (first column). Each ngram is a text of at most 83 bytes.
- the corresponding embedding (768 columns). The embeddings are L2-normalized. The embedding similarity is given by the dot product of the embeddings.
- ngrams-graph.csv - edges of the symmetrized exact 50-nearest neighbor graph. CSV file (no header) with 88,210,350 rows and 3 columns:
- first ngram,second ngram,dot product of the ngram embeddings
- ngrams-96k-pairs.csv - a diverse sample of 96,000 pairs of ngrams and their corresponding similarities. Uses the same format as ngrams-graph.csv.
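As a quick sanity check of the file format, the embeddings can be loaded and compared by dot product. A minimal sketch using pandas and NumPy (the file path assumes the unzipped CSV is in the working directory; the row indices are arbitrary):

```python
import numpy as np
import pandas as pd

# Headerless CSV: column 0 is the ngram, columns 1..768 are the embedding.
df = pd.read_csv("ngrams-embeddings.csv", header=None)
ngrams = df.iloc[:, 0].to_numpy()
emb = df.iloc[:, 1:].to_numpy(dtype=np.float32)

# The embeddings are L2-normalized, so a dot product is the similarity.
print(np.linalg.norm(emb[0]))   # should be ~1.0
print(float(emb[0] @ emb[1]))   # similarity of the first two ngrams
```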
Computing the quality metrics
To compute the quality metrics:
- Pick a similarity threshold t in [0.8, 0.96].
- Convert the pairs from ngrams-96k-pairs.csv into positive and negative pairs: pairs with similarity above the threshold are considered positive (the ngrams should be in the same cluster), while the remaining pairs are considered negative (the ngrams should belong to different clusters).
- For a positive pair (𝑥, 𝑦), if 𝑥 and 𝑦 are in the same cluster, then we have a true positive; otherwise we have a false negative. For a negative pair (𝑥, 𝑦), if 𝑥 and 𝑦 are in different clusters, then we have a true negative; otherwise we have a false positive.
- Finally, compute precision and recall the usual way: precision = TP / (TP + FP) and recall = TP / (TP + FN).
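A minimal sketch of this computation in Python; here `cluster_of` is a hypothetical dict mapping each ngram to the cluster id produced by whatever clustering algorithm is being evaluated:

```python
import csv

def precision_recall(pairs_path, cluster_of, t):
    """Precision/recall at similarity threshold t (in [0.8, 0.96]).

    pairs_path: path to ngrams-96k-pairs.csv (no header; columns are
                ngram1, ngram2, similarity).
    cluster_of: dict mapping each ngram to its cluster id.
    """
    tp = fp = fn = tn = 0
    with open(pairs_path, newline="") as f:
        for a, b, sim in csv.reader(f):
            same = cluster_of[a] == cluster_of[b]
            if float(sim) > t:      # positive pair: should share a cluster
                if same:
                    tp += 1
                else:
                    fn += 1
            else:                   # negative pair: should be separated
                if same:
                    fp += 1
                else:
                    tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```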
How the dataset was built
- We start with a text dataset of ngrams and select all ngrams that occur at least 120 times in the verbargs part of the dataset. This results in a corpus of 1,274,126 short texts.
- Embedding: Each ngram is then embedded using the text-embedding-gecko@003 model, specifying CLUSTERING as the task_type. This transforms the text data into 768-dimensional vector representations that capture semantic meaning. The similarity between two embeddings is obtained by computing their dot product.
- 𝑘-Nearest Neighbor Graph Construction: We compute an exact 50-nearest neighbor graph and make it undirected (a brute-force sketch appears at the end of this section).
- Label Generation: Finally, we build a set of labels that we later use to evaluate clustering quality, by performing the following steps:
- Define similarity buckets: [0.76, 0.77), [0.77, 0.78), ..., [0.99, 1)
- Sample 100,000 embeddings uniformly at random.
- For each sampled embedding 𝑥 and each similarity bucket [𝑠, 𝑠 + 0.01), among all embeddings whose similarity to 𝑥 belongs to [𝑠, 𝑠 + 0.01) sample one embedding 𝑦 uniformly at random (if one exists). We represent this sample with a tuple (𝑥, 𝑦, 𝑥 · 𝑦), where 𝑥 · 𝑦 is the similarity of 𝑥 and 𝑦.
- From all the tuples created this way, we sample 4,000 per similarity bucket, uniformly at random, yielding 96,000 tuples in total. This approach ensures diversity in both the points selected and the range of similarities represented (sketched below).
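A minimal sketch of this sampling procedure, assuming the embedding matrix loaded from ngrams-embeddings.csv; the per-seed brute-force similarity scan and the fixed random seed are illustrative simplifications, not the production pipeline:

```python
import numpy as np

def sample_label_pairs(emb, n_seeds=100_000, per_bucket=4_000, seed=0):
    """Bucketed pair sampling over buckets [0.76, 0.77), ..., [0.99, 1)."""
    rng = np.random.default_rng(seed)
    lows = 0.76 + 0.01 * np.arange(24)          # lower edge of each bucket
    candidates = [[] for _ in lows]             # tuples (x, y, x . y) per bucket
    # Assumes emb has at least n_seeds rows.
    for x in rng.choice(emb.shape[0], size=n_seeds, replace=False):
        sims = emb @ emb[x]                     # similarity of x to all points
        sims[x] = -1.0                          # never pair x with itself
        for b, lo in enumerate(lows):
            pool = np.flatnonzero((sims >= lo) & (sims < lo + 0.01))
            if pool.size:                       # one y per non-empty bucket
                y = int(rng.choice(pool))
                candidates[b].append((int(x), y, float(sims[y])))
    # Downsample each bucket to the target number of labeled pairs.
    labels = []
    for cand in candidates:
        if not cand:
            continue
        take = min(per_bucket, len(cand))
        for i in rng.choice(len(cand), size=take, replace=False):
            labels.append(cand[i])
    return labels                               # up to 24 * per_bucket tuples
```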
We note that using simpler sampling approaches leads to a sample that we consider less representative. For example, if we sampled pairs of points uniformly at random, we would end up with very few pairs of high similarity. A simple fix to this approach is to uniformly sample pairs of points whose similarity falls into each similarity bucket. However, this may lead to some points occurring in the sample much more often than others. In our sample, each input ngram occurs in at most 9 labels.
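For completeness, here is a brute-force sketch of the 50-nearest-neighbor graph construction mentioned above. Because the embeddings are unit-length, each point's neighbors are the largest entries in one row of the similarity matrix; the block size and the in-memory edge set are simplifications for illustration, not how the graph was computed at full scale:

```python
import numpy as np

def knn_graph(emb, k=50, block=1024):
    """Exact k-NN under dot-product similarity, then symmetrized.

    emb: (n, d) array of L2-normalized embeddings.
    Returns a set of undirected edges (i, j) with i < j.
    """
    n = emb.shape[0]
    edges = set()
    for start in range(0, n, block):
        sims = emb[start:start + block] @ emb.T   # (block, n) similarities
        for row, s in enumerate(sims):
            i = start + row
            s[i] = -np.inf                        # exclude the self-edge
            for j in np.argpartition(s, -k)[-k:]: # indices of the top-k sims
                edges.add((min(i, int(j)), max(i, int(j))))
    return edges
```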