How to Use LangChain Document Loaders

Want to use LangChain Document Loaders for your PDF, Markdown, PPT, and DOC files? Read this article to learn how!

LangChain is a powerful framework for developing applications powered by language models. One of its key features is the ability to load and process various types of documents, which is essential for tasks like question answering, summarization, and information retrieval. In this comprehensive guide, we'll explore how to use LangChain Document Loaders to work with different file formats and data sources.

💡
Want to try out Claude 3.5 Sonnet without Restrictions?

Searching for an AI Platform that gives you access to any AI Model with an All-in-One price tag?

Then you cannot miss out on Anakin AI!

Anakin AI is an all-in-one platform for all your workflow automation. Create powerful AI apps with an easy-to-use No Code App Builder, using Llama 3, Claude, GPT-4, uncensored LLMs, Stable Diffusion, and more.

Build Your Dream AI App within minutes, not weeks with Anakin AI!

Introduction to LangChain Document Loaders

LangChain Document Loaders are designed to simplify the process of ingesting data from various sources and converting it into a format that can be easily used by language models. These loaders support a wide range of file types, including CSV, HTML, JSON, Markdown, PDF, and Microsoft Office documents.

Before we dive into specific loaders, let's set up our environment:

pip install langchain
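
Several of the loaders below also rely on optional third-party parsing libraries. Depending on which loaders you use, you may need some of the following as well (a non-exhaustive list):

pip install unstructured    # Unstructured-based loaders (HTML, Markdown, Excel, PowerPoint, PDF)
pip install beautifulsoup4  # BSHTMLLoader
pip install jq              # JSONLoader
pip install docx2txt        # Docx2txtLoader
pip install pypdf           # PyPDFLoader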

Now, let's explore how to use different document loaders in LangChain.

1. CSV LangChain Document Loaders

Comma-Separated Values (CSV) files are commonly used for storing tabular data. LangChain provides a CSVLoader to easily work with CSV files.

Using CSVLoader in LangChain Document Loaders

Here's how to use the CSVLoader:

from langchain.document_loaders import CSVLoader

# Initialize the loader
loader = CSVLoader(file_path="path/to/your/file.csv")

# Load the documents
documents = loader.load()

# Print the first document
print(documents[0].page_content)

By default, each row in the CSV file becomes a separate document. The content of each document is a string representation of the row's key-value pairs.
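
To make that concrete, here is an illustrative sketch (with a made-up two-column CSV) of what the first loaded document would look like:

# Illustrative example. Given a CSV file containing:
#
#   name,role
#   Alice,Engineer
#
# the first loaded document would look roughly like this:
#   documents[0].page_content  ->  "name: Alice\nrole: Engineer"
#   documents[0].metadata      ->  {"source": "path/to/your/file.csv", "row": 0}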

Customizing CSV Parsing in LangChain Document Loaders

You can customize how the CSV is parsed by passing additional arguments:

loader = CSVLoader(
    file_path="path/to/your/file.csv",
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["column1", "column2", "column3"]
    }
)

This allows you to specify the delimiter, quote character, and column names explicitly.
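
CSVLoader also accepts a source_column argument. If you pass it, the value from that column (instead of the file path) is stored as each document's "source" metadata, which is handy when you later want to cite where an answer came from. A quick sketch, using the hypothetical column name "column1":

loader = CSVLoader(
    file_path="path/to/your/file.csv",
    source_column="column1"  # use this column's value as the document's "source" metadata
)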

2. HTML LangChain Document Loaders

For loading HTML content, LangChain offers multiple options. Let's look at two popular ones: UnstructuredHTMLLoader and BSHTMLLoader.

UnstructuredHTMLLoader in LangChain Document Loaders

This loader uses the Unstructured library to parse HTML:

from langchain.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader("path/to/your/file.html")
data = loader.load()

print(data[0].page_content[:300])

BSHTMLLoader in LangChain Document Loaders

The BSHTMLLoader uses Beautiful Soup to parse HTML:

from langchain.document_loaders import BSHTMLLoader

loader = BSHTMLLoader("path/to/your/file.html")
data = loader.load()

print(data[0].page_content[:300])

The BSHTMLLoader extracts the text content and stores the page title in the metadata.
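
You can confirm this by printing the metadata of the loaded document:

# The metadata typically contains the file path and the page's <title>
print(data[0].metadata)
# e.g. {'source': 'path/to/your/file.html', 'title': 'Your Page Title'}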

3. JSON LangChain Document Loaders

JSON (JavaScript Object Notation) is a popular data format. LangChain provides a JSONLoader to work with JSON files.

Basic Usage of JSONLoader in LangChain Document Loaders

Here's a simple example of using the JSONLoader:

from langchain.document_loaders import JSONLoader
import json

# Sample JSON data
data = [
    {"text": "Hello, world!", "number": 42},
    {"text": "LangChain is awesome", "number": 100}
]

# Write JSON to a file
with open("sample.json", "w") as f:
    json.dump(data, f)

# Initialize and use the loader
loader = JSONLoader(
    file_path="sample.json",
    jq_schema='.[]',
    text_content=False
)

documents = loader.load()

for doc in documents:
    print(doc.page_content)
    print(doc.metadata)
    print("---")

In this example, we're loading each object in the JSON array as a separate document.

Advanced JSON Parsing in LangChain Document Loaders

You can use JQ-like schemas to extract specific fields:

loader = JSONLoader(
    file_path="sample.json",
    jq_schema='.[] | {text: .text, number: .number}',
    text_content=False
)

This extracts only the "text" and "number" fields from each object.
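
If you would rather keep the text as the document content and move other fields into the metadata, JSONLoader also accepts a content_key and a metadata_func callback. Here is a small sketch based on the sample data above (the field names "text" and "number" come from that sample):

def extract_metadata(record: dict, metadata: dict) -> dict:
    # Copy the "number" field from each JSON record into the document metadata
    metadata["number"] = record.get("number")
    return metadata

loader = JSONLoader(
    file_path="sample.json",
    jq_schema=".[]",
    content_key="text",            # use the "text" field as page_content
    metadata_func=extract_metadata
)

documents = loader.load()
print(documents[0].page_content)   # "Hello, world!"
print(documents[0].metadata)       # now includes "number": 42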

4. Markdown LangChain Document Loaders

Markdown is a lightweight markup language commonly used for documentation. LangChain provides an UnstructuredMarkdownLoader for working with Markdown files.

Using UnstructuredMarkdownLoader in LangChain Document Loaders

Here's how to use the UnstructuredMarkdownLoader:

from langchain.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("path/to/your/file.md")
data = loader.load()

print(data[0].page_content[:300])

This loader will parse the Markdown content and extract the text, preserving some of the structure.

Retaining Elements in LangChain Document Loaders

If you want to retain the original Markdown elements, you can use the "elements" mode:

loader = UnstructuredMarkdownLoader("path/to/your/file.md", mode="elements")
data = loader.load()

for doc in data:
    print(doc.metadata['category'])
    print(doc.page_content[:100])
    print("---")

This will separate the Markdown content into different elements (e.g., Title, NarrativeText) and preserve them in the metadata.
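
Because each element's category is stored in the metadata, you can filter for specific element types. For example, keeping only the heading elements gives you a rough outline of the file (this assumes Unstructured labels headings as "Title", which is its usual behavior):

# Collect just the heading elements to build a quick outline of the Markdown file
titles = [doc.page_content for doc in data if doc.metadata.get("category") == "Title"]
print(titles)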

5. DOCX/XLSX/PPTX LangChain Document Loaders

LangChain supports loading Microsoft Office documents, including Word, Excel, and PowerPoint files.

Word Documents in LangChain Document Loaders

For Word documents, you can use the Docx2txtLoader:

from langchain.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("path/to/your/document.docx")
data = loader.load()

print(data[0].page_content[:300])

Excel Spreadsheets in LangChain Document Loaders

For Excel files, you can use the UnstructuredExcelLoader:

from langchain.document_loaders import UnstructuredExcelLoader

loader = UnstructuredExcelLoader("path/to/your/spreadsheet.xlsx")
data = loader.load()

print(data[0].page_content[:300])
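
The default mode returns the sheet contents as plain text. If you need the table structure, you can try "elements" mode; in recent versions of the Unstructured integration, an HTML rendering of each table is typically stored under the text_as_html metadata key. A sketch assuming that behavior:

loader = UnstructuredExcelLoader("path/to/your/spreadsheet.xlsx", mode="elements")
data = loader.load()

# If available, this holds an HTML <table> rendering of the sheet
print(data[0].metadata.get("text_as_html", "")[:300])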

PowerPoint Presentations in LangChain Document Loaders

For PowerPoint presentations, you can use the UnstructuredPowerPointLoader:

from langchain.document_loaders import UnstructuredPowerPointLoader

loader = UnstructuredPowerPointLoader("path/to/your/presentation.pptx")
data = loader.load()

print(data[0].page_content[:300])

6. PDF LangChain Document Loaders

PDF documents are widely used for sharing formatted documents. LangChain offers several options for loading PDFs, including PyPDFLoader and UnstructuredPDFLoader.

Using PyPDFLoader in LangChain Document Loaders

Here's how to use the PyPDFLoader:

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("path/to/your/file.pdf")
pages = loader.load_and_split()

for page in pages[:2]:  # Print content of first two pages
    print(page.page_content[:300])
    print("---")

PyPDFLoader creates a separate document for each page of the PDF, and load_and_split() can additionally split those pages with a text splitter (a RecursiveCharacterTextSplitter by default).
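
Each page document also carries its source file and page number in its metadata, which is useful for citing where an answer came from:

# Page-level metadata typically includes the file path and a zero-based page number
print(pages[0].metadata)
# e.g. {'source': 'path/to/your/file.pdf', 'page': 0}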

Using UnstructuredPDFLoader in LangChain Document Loaders

The UnstructuredPDFLoader offers more advanced parsing capabilities:

from langchain.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("path/to/your/file.pdf", mode="elements")
data = loader.load()

for element in data[:5]:  # Print first 5 elements
    print(f"Type: {element.metadata['category']}")
    print(element.page_content[:100])
    print("---")

This loader can extract different elements from the PDF, such as titles, text blocks, and tables.
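
As with the Markdown loader, the element category in the metadata lets you filter for specific element types. For example, a rough way to pull out just the tables (assuming the category label "Table" used by current Unstructured versions):

# Keep only the elements that Unstructured classified as tables
tables = [el for el in data if el.metadata.get("category") == "Table"]
print(f"Found {len(tables)} table elements")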

Advanced Techniques in LangChain Document Loaders

Now that we've covered the basics of various document loaders, let's explore some advanced techniques that can enhance your document processing capabilities.

Combining Multiple Loaders in LangChain Document Loaders

Sometimes you may need to load documents from multiple sources or formats. LangChain makes this easy with the DirectoryLoader:

from langchain.document_loaders import DirectoryLoader, TextLoader, PyPDFLoader, CSVLoader

# DirectoryLoader takes a single loader class, so create one loader per file type
loaders = [
    DirectoryLoader('path/to/directory', glob="**/*.txt", loader_cls=TextLoader),
    DirectoryLoader('path/to/directory', glob="**/*.pdf", loader_cls=PyPDFLoader),
    DirectoryLoader('path/to/directory', glob="**/*.csv", loader_cls=CSVLoader),
]

documents = []
for loader in loaders:
    documents.extend(loader.load())

Each loader recursively searches the specified directory for its file type, and the results are combined into a single list of documents.

Text Splitting in LangChain Document Loaders

For long documents, it's often useful to split them into smaller chunks. LangChain provides various text splitters:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

split_docs = text_splitter.split_documents(documents)

This splits the documents into chunks of approximately 1000 characters, with a 200-character overlap between chunks.
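
A quick sanity check (how many chunks you ended up with, and how long the largest one is) helps confirm that the settings suit your data:

# Inspect the split (assumes the `documents` and `split_docs` variables from above)
print(f"{len(documents)} documents were split into {len(split_docs)} chunks")
print(f"Longest chunk: {max(len(doc.page_content) for doc in split_docs)} characters")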

Metadata Manipulation in LangChain Document Loaders

You can add or modify metadata for your documents:

from langchain.schema import Document

def add_source_to_metadata(doc):
    return Document(
        page_content=doc.page_content,
        metadata={**doc.metadata, "source": "custom_source"}
    )

processed_docs = [add_source_to_metadata(doc) for doc in documents]

This adds a "source" field to the metadata of each document.

Conclusion

LangChain Document Loaders provide a powerful and flexible way to ingest various types of documents into your language model applications. By understanding how to use these loaders effectively, you can build more robust and versatile AI-powered systems that can process and analyze a wide range of data sources.

Remember to always check the LangChain documentation for the most up-to-date information and additional features. As you become more comfortable with these loaders, you'll be able to handle increasingly complex document processing tasks and create more sophisticated AI applications.
