EmBuddy
EmBuddy is a package that makes it easier to use text embeddings for local data analysis.
EmBuddy currently supports any model available through SentenceTransformers.
Usage
We'll look at a basic workflow with data from HappyDB.
import pandas as pd
from embuddy import EmBuddy
df = pd.read_csv(
"https://github.com/megagonlabs/HappyDB/raw/master/happydb/data/cleaned_hm.csv",
nrows=5000,
usecols=["hmid", "cleaned_hm"],
)
emb = EmBuddy("all-MiniLM-L6-v2")
embeddings = emb.embed(df["cleaned_hm"].tolist())
# do something with `embeddings`...
# project them in UMAP, use them as features in ML, etc
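For a concrete example of the comment above: assuming embed returns a NumPy array (as SentenceTransformers models do), a quick cosine similarity between the first two moments looks like this:

import numpy as np

# Cosine similarity between the first two documents' embeddings.
a, b = embeddings[0], embeddings[1]
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))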
So far, this is pretty much the same as using SentenceTransformers directly. Now we'll look at what's unique about EmBuddy.
Cached Embeddings
Embeddings are cached inside the EmBuddy instance, so repeating a call with the same texts skips the model entirely. If we run this again, it should be quite fast!
embeddings = emb.embed(df["cleaned_hm"].tolist())
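To see the cache at work, time the repeated call with the standard library; it should come back almost instantly:

import time

start = time.perf_counter()
emb.embed(df["cleaned_hm"].tolist())  # served from the cache
print(f"{time.perf_counter() - start:.3f}s")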
Approximate Nearest Neighbors Search
We can use Approximate Nearest Neighbors (ANN) to do semantic search with the embeddings.
emb.build_ann_index()
emb.nearest_neighbors("I made pudding.")
[[(2132, 'I made delicious rice pudding', 0.18222559085871814),
(2237,
'I made a tasty chocolate pudding, and every one enjoyed it. ',
0.21807944021581382),
(3845, 'I ate delicious rice pudding', 0.269050518351444),
(2601, 'I made a delicious meal.', 0.3264919047542342),
(800, 'I made a delicious breakfast.', 0.375725587426599),
(3262, 'i made a new flavor of homemade ice cream.', 0.38092683794557425),
(4037, 'I made a really good new recipe for dinner.', 0.3974327179384344),
(3685, 'I made a tasty breakfast', 0.40264095427046276),
(4649, 'I made a turkey dinner.', 0.416024635770632),
(2725,
'I made delicious rice pudding\r\nI ate delicious rice pudding\r\nI complete 5 Mturk surveys\r\nI taught my dog a new trick\r\nI received a check in the mail',
0.4244762467642832)]]
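Judging by the output above, nearest_neighbors returns one result list per query, each entry an (index, document, distance) tuple, so post-filtering is plain Python:

results = emb.nearest_neighbors("I made pudding.")
# Keep only tight matches (distance below an arbitrary cutoff).
close_matches = [(ix, doc) for (ix, doc, dist) in results[0] if dist < 0.3]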
Under the hood, EmBuddy uses PyNNDescent for ANN search. You can pass anything you like via nndescent_kwargs to configure the underlying ANN index.
emb.build_ann_index(nndescent_kwargs={"metric": "tsss", "n_neighbors": 5})
emb.nearest_neighbors("I walked my dog.")
[[(3943, 'I went for a walk with my doggie.', 0.06725721),
(3226, 'Took my dog for a long walk around the neighborhood ', 0.1493121),
(477, 'I walked my dog and he was licking me. ', 0.15133144),
(4793, 'I took my dog for a walk, and he was happy.', 0.16377613),
(1845, 'I took my dog for a long walk on a beautiful day.', 0.18119085),
(1770, 'I took my dogs for a walk because it was a nice day', 0.1910228),
(4457, 'I was really happy walking my dog in the park.', 0.23755807),
(4563,
'As I was walking to retrieve the mail I saw a neighbor walking their adorable dog. ',
0.24459706),
(3994, 'My dog greeted me when I came home. ', 0.24681003),
(4214, 'I brought my dog to the dog park for the first time. ', 0.2755207)]]
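The "tsss" metric above is one of PyNNDescent's built-in distances. To see what else is available, PyNNDescent keeps a registry of named distances (this relies on PyNNDescent internals; double-check against its docs):

from pynndescent import distances

# Names accepted by the metric argument, e.g. "cosine", "euclidean", "tsss".
print(sorted(distances.named_distances))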
Search ANN by Vector
You can also search by vector.
dog_docs = [doc for doc in df["cleaned_hm"].tolist() if "dog" in doc]
# Calling the instance is a shortcut for emb.embed.
# Pass cache=False if you don't want to store the resulting embeddings.
mean_dog_vector = emb(dog_docs, cache=False).mean(axis=0)
similar_docs = emb.nearest_neighbors_vector(mean_dog_vector, k=100)
most_dog_like_cat_docs = [
    (ix, doc, sim) for (ix, doc, sim) in similar_docs[0] if " cat " in doc
]
most_dog_like_cat_docs
[(2300, 'I played with my cat throughout the day.', 0.3908988),
(3884,
'Went downstairs and cuddled with my cat on the couch watching TV.',
0.39648643)]
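Since queries are just vectors, simple vector arithmetic works too. A hypothetical sketch that searches near the midpoint of two moments:

pudding_vec = emb(["I made pudding."], cache=False)[0]
dog_vec = emb(["I walked my dog."], cache=False)[0]
# Search near the midpoint of the two query vectors.
emb.nearest_neighbors_vector((pudding_vec + dog_vec) / 2, k=5)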
Persistence/Serialization
Done embedding for now but want to add more later?
emb.save("myembeddings.emb")
# some time later...
my_old_emb = EmBuddy.load("myembeddings.emb")
my_old_emb.nearest_neighbors("I made even better pudding!")
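A sketch of the full round trip, using only the calls shown above and assuming the ANN index must be rebuilt to pick up new documents:

my_old_emb.embed(["I made even better pudding!"])  # cached embeddings are reused
my_old_emb.build_ann_index()  # rebuild so the new text is searchable
my_old_emb.save("myembeddings.emb")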
Project to 2D with UMAP
umap_embeddings = emb.build_umap()
# Save again so the UMAP projection is persisted alongside the embeddings.
emb.save("myembeddings.emb")
my_old_emb = EmBuddy.load("myembeddings.emb")
umap_embeddings_loaded = my_old_emb.umap_embeddings
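Assuming build_umap returns an (n_docs, 2) coordinate array, a quick matplotlib scatter shows the layout:

import matplotlib.pyplot as plt

plt.scatter(umap_embeddings[:, 0], umap_embeddings[:, 1], s=2)
plt.title("HappyDB moments, projected to 2D with UMAP")
plt.show()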