🚀🚀 Super excited to share the latest benchmark results for our quantized BGE models.
These models were introduced a few weeks ago to make embedding generation faster and more efficient. We've now compared PyTorch SentenceTransformers against our DeepSparse-optimized models on both a 10-core laptop and a 16-core AWS instance.
The benchmarks show significant gains in processing speed. For example, the quantized bge-small model achieves up to a 3X speedup on the 10-core laptop. Even better, on the 16-core AWS instance these models achieved up to a 5X speedup.
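For anyone who wants to run a similar comparison, here is a minimal sketch of how a speedup ratio like the ones above could be measured. It is not the harness behind these numbers: `slow_encoder` and `fast_encoder` are hypothetical placeholders that only sleep, and you would swap in the real encode calls for the two models you are comparing.

```python
import time
from statistics import median


def median_latency(fn, batch, warmup=1, runs=5):
    """Median wall-clock seconds for fn(batch), after a few warmup calls."""
    for _ in range(warmup):
        fn(batch)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(batch)
        timings.append(time.perf_counter() - start)
    return median(timings)


def speedup(baseline_fn, optimized_fn, batch, **kwargs):
    """Baseline latency divided by optimized latency (>1 means faster)."""
    baseline = median_latency(baseline_fn, batch, **kwargs)
    optimized = median_latency(optimized_fn, batch, **kwargs)
    return baseline / optimized


# Hypothetical stand-ins for the two encoders being compared.
def slow_encoder(sentences):
    time.sleep(0.02)
    return [[0.0] for _ in sentences]


def fast_encoder(sentences):
    time.sleep(0.005)
    return [[0.0] for _ in sentences]


sentences = ["hello world"] * 8
ratio = speedup(slow_encoder, fast_encoder, sentences)
print(f"speedup: {ratio:.1f}x")
```

Using the median over several runs, with warmup calls discarded, keeps one-off stalls from skewing the ratio.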
🤗 Updated model cards:
bge-small-quant: huggingface.co/zeroshot/bge-… 6K downloads
bge-base-quant: huggingface.co/zeroshot/bge-… 2K downloads
bge-large-quant: huggingface.co/zeroshot/bge-… 2K downloads (#1 model for STS datasets on the MTEB leaderboard)
Don't forget to check out the DeepSparse repo github.com/neuralmagic/deeps… for more information on benchmarking and running these models on the MTEB leaderboard. 💥
cc @neuralmagic
I love the #ChatGPT Cheat Sheet by Ricky Costa (@Quantum_Stat), which includes:
🔹NLP Tasks
🔹Code
🔹Structured Output Styles
🔹Unstructured Output Styles
🔹Media Types
🔹Meta ChatGPT
🔹Expert Prompting
Get your hands on this amazing resource at: i.mtr.cool/ehyhxpfexx
⚡Getting to Know the NPZ file format to Compress BGE Embedding Models ⚡
For One-Shot Quantization (INT8), Sparsify relies on the .npz format for data storage, a file format rooted in the mighty NumPy library.
Check the image below for an example of what I'm discussing 👇 We will soon release a notebook with an end-to-end example so anyone can replicate the compressed bge models, which achieve strong accuracy on the MTEB Leaderboard.
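Until the notebook is out, here is a rough sketch of what writing calibration samples to .npz can look like with plain NumPy. Several things here are assumptions for illustration only: `fake_tokenize` is a hypothetical stand-in for a real tokenizer, and the key names (`input_ids`, `attention_mask`, `token_type_ids`) and the `input_0000.npz` file naming may not match what Sparsify expects, so check the Sparsify data prep guide before relying on them.

```python
import os
import tempfile

import numpy as np


def fake_tokenize(text, max_length=8):
    """Hypothetical stand-in for a tokenizer: returns fixed-length int64 arrays."""
    words = text.split()[: max_length - 2]
    ids = [101] + [1000 + (len(w) % 50) for w in words] + [102]
    pad = max_length - len(ids)
    return {
        "input_ids": np.array(ids + [0] * pad, dtype=np.int64),
        "attention_mask": np.array([1] * len(ids) + [0] * pad, dtype=np.int64),
        "token_type_ids": np.zeros(max_length, dtype=np.int64),
    }


def save_calibration_samples(texts, out_dir):
    """Write one .npz file per sample, keyed by (assumed) model input names."""
    paths = []
    for i, text in enumerate(texts):
        path = os.path.join(out_dir, f"input_{i:04d}.npz")
        np.savez(path, **fake_tokenize(text))
        paths.append(path)
    return paths


out_dir = tempfile.mkdtemp()
texts = ["a short sentence", "another calibration example here"]
paths = save_calibration_samples(texts, out_dir)

# Each file is a small archive of named arrays that np.load can read back.
loaded = np.load(paths[0])
print(sorted(loaded.files), loaded["input_ids"].shape)
```

An .npz file is just a zip archive of named NumPy arrays, which is why it maps naturally onto a tokenizer's dictionary output.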
⚡IT HAPPENED!⚡
There's a new state-of-the-art sentence embeddings model for the semantic textual similarity task on Hugging Face's MTEB leaderboard 🤗!
bge-large-en-v1.5-quant is the model I quantized in less than an hour with a single CLI command using Neural Magic's open-source library Sparsify! Not only is it an INT8-quantized ONNX model (faster and lighter), it also runs on CPUs using DeepSparse! 💥
cc @neuralmagic
Model: huggingface.co/zeroshot/bge-…
Exciting News! 🚀 DeepSparse is now integrated with @langchain, opening up a world of possibilities for Generative AI on CPUs. LangChain, known for its innovative design paradigms for large language model (LLM) applications, was often constrained by expensive APIs or cumbersome GPUs.
But with Neural Magic's DeepSparse integration, developers can now accelerate their models on CPU hardware, making it a breeze to create powerful LangChain applications.
LangChain Doc link: python.langchain.com/docs/in…
DeepSparse LangChain Blog: neuralmagic.com/blog/buildin…
cc @hwchase17 @neuralmagic
🌟 First, I want to thank everyone for pushing this model past 1,000 downloads in only a few days!! Additionally, I added the bge-base models to MTEB.
Most importantly, code snippets for running inference were added to the model cards for everyone to try out!
huggingface.co/zeroshot/bge-…
🚀🚀 Explore Sparsify's One-Shot Experiment Guide!
Discover how to quickly optimize your models with post-training algorithms for a 3-5x speedup. Perfect for when you have limited time to sparsify your model and still want faster inference. 🔥
**FYI, this is what I used to compress the bge-small-en-v1.5 model for sentence embeddings.**
1️⃣ **Experiment Overview**: Learn about the benefits of One-Shot Experiments and how they work for transformers and, soon, LLMs.
2️⃣ **CLI Quickstart**: Get started with a step-by-step guide on running One-Shot Experiments using the Sparsify CLI with a single command :)
3️⃣ **Data Prep**: A guide to turning samples from your calibration dataset into NPZ files built from your tokenizer's output.
4️⃣ **Next Steps**: Explore other Sparsify pathways, including Sparse-Transfer and Training-Aware Experiments.
5️⃣ **Link**: github.com/neuralmagic/spars…
💪 #AI #MachineLearning @neuralmagic
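To give a feel for what post-training INT8 quantization does to a weight tensor, here is a small NumPy sketch of generic symmetric per-tensor quantization. This is a textbook scheme chosen for illustration, not Sparsify's One-Shot algorithm, and the function names are my own.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization: one float scale maps values to int8."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.max(np.abs(w - w_hat)))

# The rounding error per weight is bounded by half a quantization step.
print(q.dtype, f"scale={scale:.5f}", f"max abs error={max_err:.5f}")
```

Storing int8 values plus one scale takes roughly a quarter of the memory of float32 weights, which is where the "lighter" part comes from. The speedup depends on the runtime executing the INT8 kernels.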
🚀🚀 Hey, check out our blog on @huggingface 🤗 about running LLMs on CPUs!
The blog discusses how researchers at IST Austria & Neural Magic have cracked the code for fine-tuning sparse large language models. The method combines sparse fine-tuning with distillation-type losses and produces a lean, lightning-fast model that shines on CPUs, reaching 75% pruning without accuracy loss. They overcame challenges like loss spikes with SquareHead distillation and showcased the results using @neuralmagic's DeepSparse inference runtime.
huggingface.co/blog/mwitider…
🚀✨ Run CodeGen on CPUs with this detailed Colab notebook! 📝
Explore how to sparsify and perform Large Language Model (LLM) inference using Neural Magic's stack, featuring Salesforce/codegen-350M-mono as an example.
Dive into these key steps:
1️⃣ **Installation**: Quickly set up Sparsify and DeepSparse.
2️⃣ **ONNX Export**: Download and export the model to ONNX for optimization.
3️⃣ **Apply One-Shot Pruning and Quantization**: Optimize the model with Sparsify's FastOBCQ algorithm.
4️⃣ **Evaluate Accuracy**: Assess model accuracy using the deepsparse.transformers.eval_downstream CLI for perplexity calculation.
5️⃣ **Inject KV Cache**: Improve inference speed with KV-caching in the ONNX graph.
6️⃣ **Run Inference With DeepSparse**: Execute text generation using DeepSparse.
notebook: github.com/neuralmagic/docs/…
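If step 5 is new to you, here is a toy NumPy sketch of why KV caching speeds up generation. It is a single-head illustration of the general idea, not the ONNX graph injection the notebook performs, and the `KVCache` class is hypothetical. Each new token reuses the stored keys and values instead of recomputing attention over the whole sequence, and the outputs match full causal attention.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(q, k, v):
    """Full recomputation: each position attends to itself and earlier ones."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


class KVCache:
    """Toy cache holding the keys and values of all tokens seen so far."""

    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def step(self, q_t, k_t, v_t):
        """Append this token's key/value, then attend over everything cached."""
        self.keys = np.vstack([self.keys, k_t])
        self.values = np.vstack([self.values, v_t])
        scores = self.keys @ q_t / np.sqrt(q_t.shape[0])
        return softmax(scores) @ self.values


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))

full = causal_attention(q, k, v)
cache = KVCache(dim=8)
incremental = np.stack([cache.step(q[t], k[t], v[t]) for t in range(5)])

print("outputs match:", np.allclose(full, incremental))
```

With the cache, each generation step only computes attention for the single new token, so per-token cost grows linearly with sequence length instead of quadratically.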