"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

April 11, 2023

Milvus, an open-source, cloud-native vector database - Key Architecture Notes

  • Milvus takes a ground-breaking approach, introducing a publish-subscribe (pub/sub) system for log storage and persistence
  • Milvus offers two deployment modes - standalone or cluster (a minimal connection sketch follows this list)
  • The access layer acts as the system's front door, exposing the client-facing endpoint (stateless proxies) to the outside world
  • The coordinator service is responsible for cluster topology management, load balancing, timestamp generation, data declaration, and data management
  • Worker (execution) nodes execute instructions issued by the coordinator service and the data manipulation language (DML) commands initiated by the proxy
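
A minimal pymilvus connection sketch, assuming a default standalone deployment listening locally on port 19530; the alias, host, and port are placeholders, and a cluster is reached the same way through its proxy:

  from pymilvus import connections, utility

  # Connect to a local standalone Milvus (defaults for a standalone deployment).
  connections.connect(alias="default", host="localhost", port="19530")

  # Sanity check: list collections visible through the proxy (access layer).
  print(utility.list_collections())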

Storage is the cornerstone of Milvus, responsible for data persistence. The storage layer is divided into three parts:

  • Meta store: stores snapshots of metadata such as collection schema and node status (etcd in practice)
  • Log broker: a pub/sub system that supports playback and is responsible for streaming data persistence and reliable asynchronous query execution (Pulsar or Kafka in practice)
  • Object storage: stores snapshot files of logs, scalar/vector index files, and intermediate query processing results (MinIO, S3, or compatible storage in practice)

Because logs are handled by the log broker and decoupled from the server, Milvus itself stays stateless and is better positioned to recover quickly from system failures.
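
A hedged sketch of how this surfaces in the client API: inserts stream through the log broker into growing segments, and flush() seals them so they are persisted to object storage. The collection name, schema, and dimensions below are assumptions:

  from pymilvus import Collection
  import random

  # Assumes an existing collection named "demo" with an auto-id primary key
  # and a 128-dim float vector field; names and dimensions are placeholders.
  collection = Collection("demo")

  # Inserted rows first flow through the log broker into growing segments.
  vectors = [[random.random() for _ in range(128)] for _ in range(1000)]
  collection.insert([vectors])

  # flush() seals the growing segments; sealed segments are persisted as
  # binlog files in object storage, and their metadata lands in the meta store.
  collection.flush()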

Ref - Link

Vector Search 

  • Knowhere, Milvus' vector execution engine, not only extends the functions of Faiss but also optimizes performance
  • Built on top of Faiss, Annoy, and Hnswlib (HNSW); an index-creation sketch follows below
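
A minimal index-creation sketch with pymilvus; the collection name, field name, and HNSW parameters are placeholders, and Knowhere performs the actual build behind this call:

  from pymilvus import Collection

  collection = Collection("demo")  # assumed existing collection

  # Build an HNSW index on the vector field.
  collection.create_index(
      field_name="embedding",
      index_params={
          "index_type": "HNSW",
          "metric_type": "L2",
          "params": {"M": 16, "efConstruction": 200},
      },
  )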

Ref - Link

Consistency Discussions

  • GuaranteeTs (guarantee timestamp) is configurable in the search request to achieve the consistency level you specify. A larger GuaranteeTs ensures stronger consistency at the cost of higher search latency (a hedged search sketch follows below).
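
In pymilvus this is usually expressed through the consistency_level option, which sets GuaranteeTs under the hood; a hedged sketch with placeholder collection, field, and query values:

  from pymilvus import Collection

  collection = Collection("demo")   # placeholder collection
  collection.load()

  query_vectors = [[0.1] * 128]     # placeholder query vector

  # "Strong" pushes GuaranteeTs forward so query nodes wait until they have
  # seen all writes up to the request time (higher latency); "Bounded" or
  # "Eventually" relax that for lower latency.
  results = collection.search(
      data=query_vectors,
      anns_field="embedding",
      param={"metric_type": "L2", "params": {"ef": 64}},
      limit=5,
      consistency_level="Strong",
  )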

Ref - Link 

Query Mechanism

  • Before a query is executed, the data has to be loaded into the query nodes first.
  • Two types of data are loaded into query nodes: streaming data from the log broker and historical data from object storage (also called persistent storage); a minimal load sketch follows this list.
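
A minimal sketch of the explicit load step in pymilvus (collection name is a placeholder); load() pulls sealed segments from object storage into query-node memory, while streaming inserts continue to arrive through the log broker:

  from pymilvus import Collection, utility

  collection = Collection("demo")   # placeholder collection name

  # Load historical (sealed) segments into the query nodes before searching.
  collection.load()

  # Optional: track loading progress for large collections.
  print(utility.loading_progress("demo"))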

Ref - Link 

Milvus - Applications

  • Video media: video understanding, video deduplication.
  • E-commerce and mobile applications: image understanding, reverse image search.
  • Finance/Telecommunications/Retail: AI-aided customer support, QA chatbots.
  • Internet: personalized recommender systems, personalized search.
  • Autonomous vehicles: automated data labeling and annotation, object detection.
  • Biopharmaceutical: virtual compound screening, compound retrosynthetic analysis, protein property prediction, and DNA testing.
  • Cybersecurity: malware detection and cyberattack alert.
  • Quantitative trading: data analysis and prediction.
  • Metaverse: environmental perception and interaction in the virtual world.

Ref - Link

Hnswlib - fast approximate nearest neighbor search

Distance            Parameter   Equation
Squared L2          'l2'        d = sum((Ai - Bi)^2)
Inner product       'ip'        d = 1.0 - sum(Ai * Bi)
Cosine similarity   'cosine'    d = 1.0 - sum(Ai * Bi) / sqrt(sum(Ai * Ai) * sum(Bi * Bi))

  • hnswlib builds the HNSW graph up front, so approximate nearest neighbors can be retrieved quickly at query time (a short usage sketch follows).
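
A minimal hnswlib sketch illustrating the spaces from the table above; dimensions and parameters are arbitrary:

  import hnswlib
  import numpy as np

  dim, num_elements = 128, 10000
  data = np.random.random((num_elements, dim)).astype(np.float32)

  # space can be 'l2', 'ip', or 'cosine', matching the table above.
  index = hnswlib.Index(space="l2", dim=dim)
  index.init_index(max_elements=num_elements, ef_construction=200, M=16)
  index.add_items(data, np.arange(num_elements))

  index.set_ef(50)                       # query-time accuracy/speed trade-off
  labels, distances = index.knn_query(data[:5], k=3)
  print(labels.shape, distances.shape)   # (5, 3) neighbors per query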


Vector databases are having their moment. They can answer similarity queries quickly because most of the heavy lifting - building an approximate nearest neighbor index - is done ahead of time, at write/index time rather than at query time.


  • Storing and searching across table-based (structured) data is exactly what relational databases were designed to do.
  • Vector databases are used for searching across images, video, text, audio, and other forms of unstructured data by content rather than by keywords or tags (which are often entered manually by users or curators). Combined with powerful machine learning models, vector databases can revolutionize semantic search and recommendation systems.
  • Qdrant and Milvus are the fastest engines when it comes to indexing time.
  • Qdrant achieves the highest RPS and lowest latencies in almost all scenarios, regardless of the precision threshold and the metric chosen.
  • Elasticsearch is typically much slower than all the competitors, regardless of the dataset and metric.
Towhee is an open-source machine learning pipeline framework that helps you encode unstructured data into embeddings (a hedged pipeline sketch follows).
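
A hedged Towhee sketch of encoding an image into an embedding; the operator names below are resolved from the Towhee hub and may differ between versions, so treat them as assumptions:

  from towhee import pipe, ops

  # Operator names ('image_decode.cv2_rgb', 'image_embedding.timm') are hub
  # operators and are assumptions here; model_name is a placeholder.
  img_pipe = (
      pipe.input("url")
          .map("url", "img", ops.image_decode.cv2_rgb())
          .map("img", "vec", ops.image_embedding.timm(model_name="resnet50"))
          .output("vec")
  )

  # The resulting vector can then be inserted into Milvus for similarity search.
  result = img_pipe("path/or/url/to/image.jpg")
  print(result.get())   # embedding vector for the image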



Ref - Link



Keep Exploring!!!
