Introduction: The Necessity of Versioning in LlamaIndex
As an AI practitioner working with LlamaIndex, you'll quickly discover how important it is to manage your indexed documents effectively. These indexes, which hold the organized structure of your ingested data, are the backbone of your query engine. The data landscape, however, is not static: new documents are added, existing ones are updated, and some are removed entirely. Without a robust versioning strategy, you risk serving stale, inaccurate, or completely irrelevant information to your users, undermining the value of LlamaIndex as the backbone of your LLM-powered applications and services. Versioning gives you a mechanism to track changes, roll back to previous states, and ensure data integrity. Managing document indexes in LlamaIndex is much like maintaining code versions: it gives you a history of your indexes and the ability to backtrack when needed, which is invaluable both for debugging and for compliance purposes.
Key Considerations When Implementing Versioning
Before diving into the technical implementation details, it's essential to lay out some key considerations that will shape your versioning strategy. One crucial aspect is the granularity of versioning. Do you need to version every single document update, or is a coarser approach sufficient, where you only create new versions for significant batches of changes? The right level of granularity is a tradeoff between storage space and how precisely you can roll back. For instance, an application that relies on near real-time updates calls for finer granularity, since users may depend on exactly those updates.
Another consideration is the storage mechanism you'll use to store your different index versions. LlamaIndex provides flexibility in how you store your indexes, including options like simple disk storage, cloud-based object storage (e.g., AWS S3, Google Cloud Storage), and specialized vector databases like Pinecone or Qdrant. The chosen storage mechanism will significantly impact the implementation of your versioning strategy. Finally, defining a versioning scheme is important. Will you use simple sequential integers, timestamps, or a more sophisticated system based on content hashes? Versioning based on content hashes is particularly useful because it can detect if two versions of a document are truly the same, even if the metadata is different.
Strategies for Implementing Versioning in LlamaIndex
There are several ways to implement versioning for your LlamaIndex indexes, each with its own advantages and disadvantages. Let's explore a few common approaches:
Manual Versioning with File System Storage: This is the simplest approach, particularly suitable for smaller projects or prototyping. It involves manually saving different versions of your index to separate directories on your file system. Each directory represents a specific version, identified by a naming convention (e.g., index_v1, index_v2, index_<timestamp>). The approach is straightforward but requires manual management of index loading and saving. Although simple, a timestamp-based naming scheme has the advantage of showing at a glance when the document structure was updated, which makes debugging and rollback much easier.
Versioning with Cloud Object Storage: This method uses cloud storage services like AWS S3 or Google Cloud Storage to store your index versions, each as a separate object (or set of objects) within a bucket. You can leverage the object-versioning features these services provide to track changes and revert to previous versions. This approach offers scalability, durability, and built-in versioning capabilities, and it is particularly valuable in team environments: multiple people can work on the same documents, and changes made by an individual can be isolated.
Versioning with Vector Databases: Vector databases like Pinecone, Qdrant, and Weaviate offer advanced indexing and querying capabilities, and some provide built-in versioning features that track changes to your indexed data over time. If yours does not, you can simulate versioning by creating separate collections or indexes for each version, or by attaching a version identifier to each vector embedding, which lets you query a specific version of your data. Vector databases are highly beneficial for storing and querying semantic information, and because the semantic content of documents changes over time, the ability to backtrack to past semantic relationships is very useful for debugging.
Practical Implementation Examples
Let's illustrate these versioning strategies with code examples.
Manual Versioning with File System
import os

from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

# Base directory for storing index versions
INDEX_DIR = "index_versions"

def save_index_version(index, version_name):
    """Saves the index to a specific version directory."""
    version_path = os.path.join(INDEX_DIR, version_name)
    os.makedirs(version_path, exist_ok=True)
    # persist() writes docstore.json, index_store.json, vector_store.json, etc.
    index.storage_context.persist(persist_dir=version_path)
    print(f"Index saved to version: {version_name}")

def load_index_version(version_name):
    """Loads the index from a specific version directory."""
    version_path = os.path.join(INDEX_DIR, version_name)
    if os.path.isdir(version_path):
        storage_context = StorageContext.from_defaults(persist_dir=version_path)
        index = load_index_from_storage(storage_context)
        print(f"Index loaded from version: {version_name}")
        return index
    print(f"Version not found: {version_name}")
    return None
# Example usage:
# 1. Load documents
documents = SimpleDirectoryReader("data").load_data()

# 2. Create an index (version 1)
index_v1 = VectorStoreIndex.from_documents(documents)
save_index_version(index_v1, "v1")

# 3. Simulate document updates (e.g., add a new document to the 'data' directory)

# 4. Re-create the index (version 2)
documents = SimpleDirectoryReader("data").load_data()
index_v2 = VectorStoreIndex.from_documents(documents)
save_index_version(index_v2, "v2")

# 5. Load a specific version for querying:
loaded_index = load_index_version("v1")
if loaded_index:
    query_engine = loaded_index.as_query_engine()
    response = query_engine.query("What is the main topic of document 1?")
    print(response)
This example demonstrates saving and loading index versions to separate directories on your file system. You can extend it by automating version naming with timestamps or incremental integers, and the same save/load paradigm extends naturally to cloud-based object storage.
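As a minimal sketch of automated naming, a UTC timestamp yields a sortable, collision-resistant version name; the helper below is illustrative and not part of LlamaIndex:

```python
from datetime import datetime, timezone

def make_version_name() -> str:
    """Builds a sortable version name such as 'v20240101T120000Z' from the current UTC time."""
    return datetime.now(timezone.utc).strftime("v%Y%m%dT%H%M%SZ")

# Pair it with the earlier helper, e.g. save_index_version(index, make_version_name())
print(make_version_name())
```

Because the names sort lexicographically in chronological order, a plain directory listing of INDEX_DIR doubles as a version history.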
Versioning with AWS S3
import os
import shutil

import boto3
from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

# AWS S3 Configuration
S3_BUCKET_NAME = "your-s3-bucket-name"
S3_INDEX_PREFIX = "llamaindex-versions"  # Prefix for storing index versions

# Index files written by StorageContext.persist()
INDEX_FILES = ["docstore.json", "index_store.json", "vector_store.json"]

# Initialize S3 client
s3 = boto3.client("s3")

def save_index_to_s3(index, version_name):
    """Saves the index to S3 under a specific version prefix."""
    s3_prefix = f"{S3_INDEX_PREFIX}/{version_name}"
    local_persist_dir = f"tmp_index_{version_name}"
    index.storage_context.persist(persist_dir=local_persist_dir)  # Temporary location
    for file_name in INDEX_FILES:
        s3.upload_file(
            os.path.join(local_persist_dir, file_name),
            S3_BUCKET_NAME,
            f"{s3_prefix}/{file_name}",
        )
    shutil.rmtree(local_persist_dir)  # Clean up the temporary files
    print(f"Index saved to S3: s3://{S3_BUCKET_NAME}/{s3_prefix}/")

def load_index_from_s3(version_name):
    """Loads the index from S3 based on the version name."""
    s3_prefix = f"{S3_INDEX_PREFIX}/{version_name}"
    local_persist_dir = f"tmp_index_{version_name}"
    os.makedirs(local_persist_dir, exist_ok=True)
    try:
        for file_name in INDEX_FILES:
            s3.download_file(
                S3_BUCKET_NAME,
                f"{s3_prefix}/{file_name}",
                os.path.join(local_persist_dir, file_name),
            )
        storage_context = StorageContext.from_defaults(persist_dir=local_persist_dir)
        index = load_index_from_storage(storage_context)
        print(f"Index loaded from S3 version: {version_name}")
        return index
    except Exception as e:
        print(f"Error loading index from S3 version {version_name}: {e}")
        return None
    finally:
        shutil.rmtree(local_persist_dir, ignore_errors=True)
# Example Usage:
# 1. Load documents
documents = SimpleDirectoryReader("data").load_data()

# 2. Create an index (version 1)
index_v1 = VectorStoreIndex.from_documents(documents)
save_index_to_s3(index_v1, "v1")

# 3. Simulate document updates
documents = SimpleDirectoryReader("data").load_data()  # Simulate updated documents
index_v2 = VectorStoreIndex.from_documents(documents)
save_index_to_s3(index_v2, "v2")

# 4. Load a specific index version from S3:
loaded_index = load_index_from_s3("v1")
if loaded_index:
    query_engine = loaded_index.as_query_engine()
    response = query_engine.query("What is the main topic of document 1?")
    print(response)
This example demonstrates how to store and retrieve LlamaIndex index versions from AWS S3. Ensure your AWS credentials are configured with the required permissions before running it. The index files are persisted to a temporary local directory, uploaded under a version-specific prefix in the bucket, and the temporary directory is deleted afterwards; loading reverses the process by downloading the files for one version and rebuilding the index from them. The key to the versioning is the prefix used when storing the files in the S3 bucket.
Versioning with a Vector Database (Qdrant)
from llama_index import (
    SimpleDirectoryReader,
    VectorStoreIndex,
)
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores import QdrantVectorStore
from qdrant_client import QdrantClient

# Qdrant Configuration
QDRANT_HOST = "localhost"  # Replace with your Qdrant host
QDRANT_PORT = 6333
COLLECTION_NAME_PREFIX = "llamaindex_collection_version"  # Prefix for collection names

def save_index_to_qdrant(documents, version_name):
    """Builds an index over the documents in a version-specific Qdrant collection."""
    collection_name = f"{COLLECTION_NAME_PREFIX}_{version_name}"
    qdrant_client = QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)
    # Delete any existing collection with this name to prevent duplicates
    try:
        qdrant_client.delete_collection(collection_name)
        print(f"Recreating collection for version: {version_name}")
    except Exception:
        print(f"Creating collection for version: {version_name}")
    vector_store = QdrantVectorStore(client=qdrant_client, collection_name=collection_name)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    # Building the index with this storage context writes the embeddings into Qdrant
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
    print(f"Index saved to Qdrant collection: {collection_name}")
    return index

def load_index_from_qdrant(version_name):
    """Loads the index from Qdrant based on the collection name (version)."""
    collection_name = f"{COLLECTION_NAME_PREFIX}_{version_name}"
    qdrant_client = QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)
    vector_store = QdrantVectorStore(client=qdrant_client, collection_name=collection_name)
    try:
        # Rebuild the index on top of the existing collection's vectors
        index = VectorStoreIndex.from_vector_store(vector_store)
        print(f"Index loaded from Qdrant collection: {collection_name}")
        return index
    except Exception as e:
        print(f"Error loading index from Qdrant collection {collection_name}: {e}")
        return None

# Example Usage:
# 1. Load documents
documents = SimpleDirectoryReader("data").load_data()

# 2. Build and save an index (version 1)
index_v1 = save_index_to_qdrant(documents, "v1")

# 3. Update documents (simulate changes)
documents = SimpleDirectoryReader("data").load_data()  # Simulate updated documents
index_v2 = save_index_to_qdrant(documents, "v2")

# 4. Load a specific index version from Qdrant:
loaded_index = load_index_from_qdrant("v1")
if loaded_index:
    query_engine = loaded_index.as_query_engine()
    response = query_engine.query("What is the main topic of document 1?")
    print(response)
This example showcases how to use Qdrant as a vector database to implement versioning for your LlamaIndex indexes. Each index version lives in a separate Qdrant collection identified by a unique version name: saving builds the embeddings into a fresh collection for that version, and loading reconstructs an index on top of an existing collection. A key advantage of Qdrant here is that collections can be created and dropped per version, so you do not need to manage files as with S3 or local directories.
Advanced Versioning Techniques
Beyond the basic strategies outlined above, you can explore more advanced techniques for versioning your LlamaIndex indexes:
Content-Based Versioning
Instead of relying on sequential numbers or timestamps, you can base your versioning on the actual content of the documents being indexed. This involves calculating a hash (e.g., SHA-256) of the document content; if the content changes, the hash changes, triggering a new version. This approach ensures that you only create new versions when the data actually changes. When two versions of a document have the same content hash, you can be confident the underlying data has not changed, which lets you manage your indexes efficiently: there is no need to recreate an embedding for a document that never changes.
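A minimal sketch of this idea using only the standard library (the helper name is illustrative):

```python
import hashlib

def content_version(text: str) -> str:
    """Derives a stable version identifier from document content via SHA-256."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

doc_v1 = "LlamaIndex organizes your ingested data."
doc_v2 = "LlamaIndex organizes your ingested data."    # Unchanged content
doc_v3 = "LlamaIndex organizes and versions your data."  # Changed content

# Identical content yields the identical version id, so no re-indexing is needed
print(content_version(doc_v1) == content_version(doc_v2))  # True
print(content_version(doc_v1) == content_version(doc_v3))  # False
```

Truncating the digest keeps version names short while remaining effectively collision-free for realistic corpus sizes; use the full digest if you prefer to be strict.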
Delta Storage and Incremental Updates
For large datasets, storing full copies of each index version can be inefficient. Delta storage keeps only the differences (deltas) between consecutive versions. LlamaIndex also supports incremental index updates through methods such as insert, delete_ref_doc, and refresh_ref_docs, which you can combine with a delta storage strategy to minimize storage space. This pays off in practice because much of the content you ingest, such as pages scraped from the web, rarely changes, so storing only the changes sharply reduces storage needs.
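As a sketch of the change-detection half of this strategy, in plain Python and independent of any LlamaIndex API: keep a manifest mapping document ids to content hashes, and on each ingestion run re-embed only the documents whose hash differs (the function names and manifest layout are assumptions for illustration):

```python
import hashlib

def doc_hash(text: str) -> str:
    """Content hash used as the change signal for a document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_docs(manifest, docs):
    """Compares incoming docs against the stored manifest of content hashes.

    Returns the updated manifest, the ids needing (re-)indexing, and the ids to delete.
    """
    new_manifest = {doc_id: doc_hash(text) for doc_id, text in docs.items()}
    changed = [d for d, h in new_manifest.items() if manifest.get(d) != h]
    removed = [d for d in manifest if d not in new_manifest]
    return new_manifest, changed, removed

manifest = {"a.txt": doc_hash("alpha"), "b.txt": doc_hash("beta")}
docs = {"a.txt": "alpha", "b.txt": "beta v2", "c.txt": "gamma"}  # b changed, c added

manifest, changed, removed = diff_docs(manifest, docs)
print(sorted(changed))  # ['b.txt', 'c.txt']
print(removed)          # []
```

The `changed` list is what you would feed to the index's incremental update methods, while unchanged documents keep their existing embeddings.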
Version Control Systems Integration
You can also integrate LlamaIndex with an existing version control system like Git to manage your index versions. Committing the persisted index files alongside the code that built them gives you a log of every change and easy rollback, and Git's large ecosystem of tools and integrations can be leveraged to manage the versions of your indexes properly.
Potential Challenges and Solutions
While implementing versioning is crucial, it's important to address potential challenges:
Storage Costs
Storing multiple versions of your index can lead to increased storage costs. Consider using compression, delta storage, or lifecycle policies in your cloud storage to manage these costs; many cloud providers offer "cold storage" tiers that cut costs for versions that are rarely queried.
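As an illustration, this is the shape of an S3 lifecycle rule (to be passed to boto3's `put_bucket_lifecycle_configuration`) that moves noncurrent index versions to Glacier after 30 days and expires them after a year; the prefix and day counts are assumptions to adapt to your bucket:

```python
# Pass as: s3.put_bucket_lifecycle_configuration(
#     Bucket="your-s3-bucket-name", LifecycleConfiguration=lifecycle_config)
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-index-versions",
            "Filter": {"Prefix": "llamaindex-versions/"},
            "Status": "Enabled",
            # Move versions that are no longer current to cold storage after 30 days
            "NoncurrentVersionTransitions": [
                {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
            ],
            # Delete them entirely after a year
            "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
        }
    ]
}
print(lifecycle_config["Rules"][0]["ID"])
```

This keeps recent versions instantly queryable while old ones age into cheaper tiers automatically, with no application code involved.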
Indexing and Querying Performance
Frequent index creation and loading can impact performance, particularly for large datasets. Optimize your indexing process, leverage caching mechanisms, and choose appropriate vector database configurations to mitigate these issues.
Complexity and Maintainability
Managing multiple versions adds complexity to your codebase. Design your versioning strategy carefully, and encapsulate versioning logic into reusable functions or classes to improve maintainability.
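One way to encapsulate that logic is a small registry class that records each version in a JSON manifest; this sketch is storage-agnostic and illustrative (the class name and manifest layout are assumptions), with the actual save/load delegated to helpers like those shown earlier:

```python
import json
import os
from datetime import datetime, timezone
from typing import Optional

class IndexVersionRegistry:
    """Tracks index versions in a JSON manifest so rollback targets are discoverable."""

    def __init__(self, manifest_path: str):
        self.manifest_path = manifest_path
        self.versions = {}
        if os.path.exists(manifest_path):
            with open(manifest_path) as f:
                self.versions = json.load(f)

    def register(self, version_name: str, note: str = "") -> None:
        """Records a version with a creation timestamp and an optional note."""
        self.versions[version_name] = {
            "created_at": datetime.now(timezone.utc).isoformat(),
            "note": note,
        }
        with open(self.manifest_path, "w") as f:
            json.dump(self.versions, f, indent=2)

    def latest(self) -> Optional[str]:
        """Returns the most recently registered version name, if any."""
        if not self.versions:
            return None
        return max(self.versions, key=lambda v: self.versions[v]["created_at"])

# Usage alongside the earlier helpers:
#   save_index_version(index, "v2")
#   registry.register("v2", note="re-ingested after doc update")
#   rollback_target = registry.latest()
```

Centralizing the manifest this way means the rest of the codebase asks the registry which version to load instead of hard-coding directory or collection names.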
Conclusion
Implementing versioning for your LlamaIndex indexes is an essential practice for managing data evolution, ensuring data integrity, and enabling rollback. By considering your specific needs, choosing an appropriate versioning strategy, and addressing the potential challenges, you can build robust and reliable LLM-powered applications and services. The examples presented here are a starting point for versioning your own indexes so that you can backtrack when needed and run experiments against past versions.