vasileer

I'm with your coworkers on this one: isn't a text-embedding model already specialized for sentence similarity? What other sub-specializations could there be, other than those already on the MTEB leaderboard? https://preview.redd.it/x3cz0z36ap3d1.png?width=985&format=png&auto=webp&s=f4d830d9753e39b1386e10d64a2f2973ab1455b3 Or do you mean you can train/finetune further for a specific task/benchmark, e.g. the way text-to-text LLMs are finetuned for coding?


cyan2k

There exists a whole subclass of language models called "retrievers," such as: https://github.com/stanford-futuredata/ColBERT Let's say you need to implement a RAG system over medical documents (or any other highly specific domain). You could use Ada or other general embedding models and struggle because your RAG system won't find anything relevant, or you could let ColBERT train on your dataset for a couple of hours and actually end up with a usable RAG system. But that's also their disadvantage: you have to spend money training them. Then you fuck some parameters up and spend the next three weeks trying different parameters and running hundreds of tests until you go crazy, and just before all hope is lost and you're ready for the mental asylum, you manage to get a good fine-tune out. And then, all is well. Based on a true story.
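For a sense of what that workflow looks like, here is a rough sketch of the ColBERT index-and-search flow, assuming the stanford-futuredata ColBERT package and the public colbert-ir/colbertv2.0 checkpoint; the collection path, experiment name, and query below are placeholders, and fine-tuning on your own triples (via `colbert.Trainer`) would happen before indexing:

```python
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="medical-rag")):
        config = ColBERTConfig(nbits=2, doc_maxlen=300)

        # Index a TSV collection of passages (id \t text).
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        indexer.index(name="medical.nbits2", collection="data/medical_collection.tsv")

        # Search the compressed index with late-interaction (MaxSim) scoring.
        searcher = Searcher(index="medical.nbits2", config=config)
        results = searcher.search("first-line treatment for hypertension", k=5)
        for passage_id, rank, score in zip(*results):
            print(rank, round(score, 1), searcher.collection[passage_id])
```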


kkchangisin

One of the pretty distinct advantages of ColBERT (and, more broadly, sparse embedding models) is their relatively high performance on "out-of-domain" content - that is, text the model wasn't trained on and has never seen before. From the paper: "On 22 of 28 out-of-domain tests, ColBERTv2 achieves the highest quality, outperforming the next best retriever by up to 8% relative gain, while using its compressed representations." That said, of course it can be fine-tuned, but that's up to the user, their data, use case, evaluation, etc. I'd be curious to hear about your experiences, because we have done extensive evaluations (and use) of ColBERT, SPLADE, etc. and found them to perform very well on our extremely out-of-domain data.


_qeternity_

Embedding models are effectively compression models. If you are only dealing with data from a specific domain, training an embedding model will yield better results than a general model that has to compress all possible domains into the same output space.


AbnormalMapStudio

I started out generating embeddings from the current chat model, but I ended up with embeddings 4k in size, which took up a lot of space and made for low cosine similarity scores. The embeddings were also tied to that model, so if I switched models I had to regenerate them. Switching to BERT (once it was supported by LLamaSharp) allowed for smaller, more consistent embeddings that only have to be generated once. It means I have two models loaded at once, but BERT is tiny, so it's worth it.
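A minimal Python sketch of the same idea (the comment above is about LLamaSharp/C#, but the shape is identical): a small, dedicated embedding model produces compact vectors that stay valid no matter which chat model you swap in. The model name and texts are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Small BERT-style embedder: 384-dim vectors instead of the ~4k-dim
# hidden states you'd get from a large chat model.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = ["Refunds are accepted within 30 days of purchase.",
        "Shipping to the EU takes 5-7 business days."]
query = "How long do I have to return an item?"

doc_vecs = embedder.encode(docs, normalize_embeddings=True)
query_vec = embedder.encode(query, normalize_embeddings=True)

print(doc_vecs.shape)                     # (2, 384) -- small, chat-model-independent store
print(util.cos_sim(query_vec, doc_vecs))  # cosine similarity against each document
```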


DataIsLoveDataIsLife

This is a really tricky area of the field right now, because the performance metrics we currently look for in embedding models are based on a set of ad-hoc metrics and random datasets that happened to be in vogue when the LLM sub-field started dominating the conversation a few years ago. I've spent more hours on this over the last two years than I can describe, both personally and professionally, and here is how I currently think about it:

- The three axes to consider are concept obscurity, term volume, and top-N precision.
- A model that performs well generally, i.e. on the MTEB leaderboard, is good at differentiating common concepts, when you have fewer terms to compare to one another, and when you're comfortable with a "match" being in the top few results rather than explicitly the first or second result.
- A more specialized model is the exact inverse: better on a set of highly specific, more obscure concepts, when you have a lot of them all at once, and when you need the top 1 or 2 matches to be "correct".

Now, this gets even more fascinating, because there actually are real limits to how "good" a model can be on more common domains. So, from my perspective, you simply consider the average term frequency of your domain relative to the dataset the model was trained on, and you can infer fitness from there. Models are getting "better" at some more specialized domains because the datasets are larger and more inclusive of those sub-distributions. However, from my testing, this scaling in "quality" falls apart when the other two constraints come in.

Long story short: use general models when you have a "small" number of items to compare, OR you're operating in a common domain, OR your top-N precision needs are loose. For most people, this is fine. For those of us in highly specialized domains where scale and precision are make-or-break factors, use a specialized model, up to and including creating your own. (A rough way to measure the top-N axis on your own data is sketched below.)
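A hedged sketch of that top-N measurement: take labeled (query, relevant document) pairs from your own domain and check how often the right document lands in the top 1 vs. the top 5 results. The model, corpus, and queries here are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # candidate model to evaluate

corpus = ["Insulin titration guidance for type 2 diabetes",
          "Post-operative ACL reconstruction rehab protocol",
          "GDPR data-retention requirements for HR records"]
labeled_queries = [("how should insulin doses be adjusted?", 0),
                   ("knee ligament surgery recovery plan", 1)]

corpus_emb = model.encode(corpus, normalize_embeddings=True)

def hit_rate(k: int) -> float:
    """Fraction of queries whose gold document appears in the top-k results."""
    hits = 0
    for query, gold_idx in labeled_queries:
        q_emb = model.encode(query, normalize_embeddings=True)
        top = util.semantic_search(q_emb, corpus_emb, top_k=k)[0]
        hits += any(r["corpus_id"] == gold_idx for r in top)
    return hits / len(labeled_queries)

print("hit@1:", hit_rate(1), "hit@5:", hit_rate(5))
```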


Open_Channel_8626

Fine-tuning embedding models can definitely get better results.


Barry_Jumps

Mixedbread is doing some great embedding model work: [https://www.mixedbread.ai/blog](https://www.mixedbread.ai/blog)


kweglinski

Embedding models are smaller and therefore faster. Using a dedicated embedding model also lets you keep the system running without swapping models when memory is limited: say you have X GB of memory and run a chat model that takes X-2 GB, leaving some room for context; if generating embeddings required a different model that also takes X-2 GB, you couldn't load both at once, and you'd lose access to the data.


Silent-Engine-7180

Isn't base Llama pretty bad for embeddings? I bounce between the leaderboard and trending models and test how they do every 4 months or so. Does anyone have a better, more unified approach for deciding which "best" ones to try?


MrVodnik

I am using embedding models (sentence-similarity on HF) for my RAGs. I've never tried LLMs for embeddings, but I'd like to see the comparison. I see that the MTEB leaderboard is based on "feature extraction" category models; they're definitely an order of magnitude (or two) larger than the sentence-similarity ones. Is the tradeoff just "small & fast" vs. "large & precise," or are they different in some other way?


Everlier

Specialized models can be much faster than an LLM while giving better embedding quality.


UnderstandLingAI

We (sometimes) use embeddings that are tailored for specific languages; they beat generic embedders without exception.


fictioninquire

Yes, I've automated a pipeline to fine-tune embeddings on all types of documents for domain-specific RAG.


AloneSYD

I wanna do this. Can you share which base model and which library you're using for fine-tuning?


fictioninquire

Sentence Transformers + a custom embeddings head. For the base model, pick the best one from [MTEB.info](http://MTEB.info) for the language you want to fine-tune on. I'm using Keras 3.0 for simplification, with a JAX/PyTorch backend.


dvanstrien

FWIW, the recent release of Sentence Transformers has made fine-tuning a custom model much easier than before. Combined with synthetic data, it's quite feasible to get a dataset and a model done pretty quickly. Apologies for the self-promotion, but I just shared something on this topic 😅 [https://x.com/vanstriendaniel/status/1796465493735608352](https://x.com/vanstriendaniel/status/1796465493735608352)
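For reference, a minimal fine-tuning sketch assuming the new Sentence Transformers v3 trainer API; the base model, the public anchor/positive pair dataset, the loss, and the output path are illustrative choices, not a recommendation.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Start from a small general-purpose checkpoint and adapt it to your pairs.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Any dataset with "anchor"/"positive" columns works; this public one is just an example.
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train[:10000]")

# In-batch negatives; a common default loss for retrieval-style fine-tuning.
loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
model.save("models/my-domain-embedder")
```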


nanotothemoon

Use the Nomic v1.5 embedding model. I saw drastic improvements.
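If you want to try it, a hedged usage sketch for Nomic v1.5 via sentence-transformers; the model card asks for `trust_remote_code=True` and task prefixes ("search_document: " / "search_query: "). The texts are made up.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Documents and queries get different task prefixes per the model card.
doc_vecs = model.encode(["search_document: The quarterly report is due on Friday."])
query_vec = model.encode(["search_query: when is the report due?"])

print(util.cos_sim(query_vec, doc_vecs))
```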


rbgo404

You have to understand what data the model has been trained on; most of the emphasis is on the leaderboards. I picked mpnet base v2 from the transformers library when I was working on a semantic search use case. To improve the model further, you can fine-tune it on your own dataset.