The Unsung Heroes of AI
With the rapid rise of generative AI over the past year, we have collectively watched this technology take hold. Over the next few years it will appear in many aspects of our day-to-day lives: better search in our email, meal plans tailored to our tastes, and much more. However, it is not the AI tools themselves that I find most impressive, but the two technologies that power them.
The vast majority of data on the internet is stored in one of two kinds of databases: relational or non-relational. To retrieve data, we either request it by an ID or other specific value, or search the database for relevant items. In both cases we store the data, then look it up later through pre-established relations to other items or through text we attached to the data for that purpose.
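To make the traditional approach concrete, here is a minimal sketch of both lookup styles against an in-memory SQLite table. The table name, rows, and IDs are made up for illustration.

```python
import sqlite3

# A tiny relational table of facts, keyed by an integer ID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO facts VALUES (?, ?)", [
    (1, "cats like mice toys"),
    (2, "dogs like bones"),
])

# Lookup by ID: we must already know exactly which row we want.
print(conn.execute("SELECT body FROM facts WHERE id = 1").fetchone()[0])

# Text search: only matches words literally present in the stored text.
print(conn.execute(
    "SELECT body FROM facts WHERE body LIKE ?", ("%bones%",)
).fetchone()[0])
```

Note that the text search can only find rows containing the literal word we searched for; a question phrased differently would return nothing.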
Vector databases turn this process on its head. Rather than storing data in a relational shape, we store each item clustered with other items like it. We can think of this like an x, y chart.
Instead of storing data in tables, we store it all on a two-dimensional plane. Once the data is loaded into the database, we can easily compute the distance between any two data points, or, more usefully, find the items closest to a given point.
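The two-dimensional picture above can be sketched in a few lines of Python. The labels and coordinates here are invented for illustration; a real vector database indexes millions of points far more efficiently than this linear scan.

```python
import math

# Toy 2D "vector database": each item is a label mapped to an (x, y) point.
# Coordinates are made up; similar facts sit near each other.
points = {
    "cats like mice toys": (1.0, 1.2),
    "cats make great pets": (1.3, 1.0),
    "dogs like bones": (4.0, 3.8),
}

def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest(query, db):
    """Return the stored label whose point is closest to the query point."""
    return min(db, key=lambda label: distance(db[label], query))

print(nearest((1.1, 1.1), points))  # lands in the cat cluster
```

A query point near the cat cluster returns a cat fact, with no text matching involved.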
The vector database is cool, but it is nothing without the embedding models that convert content into a set of numbers. An embedding model takes some content, be it a string of text, an image, a video, and so on, and turns it into a vector of numbers that represents what the content is about.
Let's imagine we are talking about cats and dogs. If we embed a piece of text claiming that cats like mice toys, we can expect the embedding model to place it near other facts about cats. Likewise, "dogs like bones" would land near other facts about dogs. Some facts about cats and dogs would cluster together: for instance, "cats make great pets" and "dogs make great pets" would sit close to each other.
We can combine this with the vector database to determine what someone is talking about. If a new user asks "which one likes the ball of yarn," they are likely to receive an answer relating to cats. Ask about playing fetch, and the answer will be about dogs.
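Here is a minimal sketch of that matching step using cosine similarity, the comparison most vector databases use. The vectors are hand-written stand-ins for real embeddings: the first component loosely means "cat-ness," the second "dog-ness," and the third "pet-ness."

```python
import math

# Made-up 3D "embeddings"; real models use hundreds or thousands of dimensions.
facts = {
    "cats like mice toys":  [0.9, 0.1, 0.3],
    "dogs like bones":      [0.1, 0.9, 0.3],
    "cats make great pets": [0.8, 0.2, 0.6],
    "dogs make great pets": [0.2, 0.8, 0.6],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_match(query_vec, db):
    """Return the stored fact most similar to the query vector."""
    return max(db, key=lambda label: cosine(db[label], query_vec))

# Pretend the embedding model mapped the yarn question near the cat cluster.
yarn_question = [0.85, 0.15, 0.35]
print(best_match(yarn_question, facts))  # a cat fact
```

The question never mentions the word "cat," yet it resolves to a cat fact because the embedding model placed it in that neighborhood.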
The combination of embedding models and vector databases is amazing, but not with only two dimensions. Instead we need many: hundreds at least, and better yet thousands. OpenAI offers an embedding model, which we use frequently, that maps each statement into 1536 dimensions. This means that virtually any combination of subject, topic, and thought can be clustered meaningfully. With this we can offer users truly advanced search, as well as power generative AI.
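The same cosine math works unchanged at 1536 dimensions. The sketch below uses synthetic random vectors (not real embeddings) just to show that a slightly perturbed vector stays close while an unrelated one does not.

```python
import math
import random

random.seed(0)
DIM = 1536  # dimensionality used by a popular OpenAI embedding model

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

base = [random.gauss(0, 1) for _ in range(DIM)]
# A nearby vector: the base plus a little noise, like a rephrased statement.
near = [x + random.gauss(0, 0.1) for x in base]
# An unrelated vector: independent random directions.
far = [random.gauss(0, 1) for _ in range(DIM)]

print(cosine(base, near))  # close to 1.0
print(cosine(base, far))   # close to 0.0
```

In high dimensions, random directions are almost always nearly orthogonal, which is exactly why so many distinct topics can coexist in the space without colliding.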
We foresee an entire industry building up around vector databases, as they enable some truly impressive systems, especially when combined with natural language processing provided by generative AI.