How to parse complex documents for RAG
Disclaimer: I’m an AWS Senior Solutions Architect and this represents my own opinion
I hope you haven’t had my luck: I went to a customer saying “Hey, doing a Proof of Concept of RAG is super easy! Just send me your documents and I will build one”… and the customer said “Here, hold my beer” and handed me the most goddamn awful internal manuals they had.
The customer came back with PDFs that:
- Had the text in image format (you can’t drag-select it to copy-paste)
- Had graphs
- Had screenshots of internal tools like Salesforce, with arrows explaining things
- Had tables
My first attempt (during the meeting) was to simply feed the documents into a standard parse-and-embed pipeline with OpenSearch, and into Amazon Kendra. The result was extremely awful: the engines failed to parse anything in there, the demo failed miserably, and the customer was certainly not impressed.
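In hindsight the failure makes sense: when every page is just a scanned image, a plain text extractor has nothing to work with, so there is nothing meaningful to embed or index. A quick check along these lines (a sketch using pypdf and a hypothetical file name, not the original pipeline) makes the problem obvious:

# Sketch: check how much selectable text a PDF actually contains.
# "internal_manual.pdf" is a hypothetical file name standing in for the customer documents.
from pypdf import PdfReader

reader = PdfReader("internal_manual.pdf")
for page_number, page in enumerate(reader.pages, start=1):
    text = page.extract_text() or ""
    print(f"Page {page_number}: {len(text.strip())} characters of selectable text")
# Image-only pages report 0 (or close to it) characters, so a parse-and-embed
# pipeline has nothing useful to put into the search index.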
OCR should do it… right?
I asked the customer to hang on for a few days while I worked on resolving the situation I had put myself in by claiming this was easy, and “went back to the lab”. I tried to generate Markdown (which is very LLM-friendly and great for expressing these kinds of documents) using Amazon Textract, Tesseract OCR, and some GitHub libraries that claimed magical powers… but nothing usable came out of those tests.
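For reference, the plain-OCR baseline was roughly the sketch below (not the exact code I ran; the file name is hypothetical, and it needs poppler and the tesseract binary installed). OCR does return text, but it flattens tables and chart labels into disconnected lines, so the structure needed for good Markdown is lost:

# Sketch of a plain-OCR baseline (pdf2image + pytesseract), not the final solution.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("internal_manual.pdf", fmt="jpeg")  # hypothetical file name
raw_text = "\n\n".join(pytesseract.image_to_string(page) for page in pages)
print(raw_text[:500])
# The words are there, but table cells and diagram labels arrive as loose lines,
# so rebuilding them as Markdown tables is unreliable.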
One day I had an inspiration: what if I were to leverage the multi-modal capabilities of LLMs?
Using LLMs to OCR PDFs to Markdown
I decided to try Amazon Bedrock with Claude 3 Sonnet, using the model as an extremely smart OCR that can generate Markdown straight away.
The code is pretty simple and straightforward. The main complications were how to format the prompt properly and how to get around the 4096-token max output of Claude 3 Sonnet.
The system prompt I used was:
You are an assistant that helps understand images that came from splitting a PDF into images, and you help transcribe all the content of those images to Markdown format, generating tables as needed to represent the information in the best possible way
The main parts of the code are these functions:
import base64
import json
from io import BytesIO

import boto3
from pdf2image import convert_from_path

# Bedrock runtime client used by call_bedrock below
bedrock = boto3.client("bedrock-runtime")

SYSTEM_PROMPT = ("You are an assistant that helps understand images that came from splitting a PDF "
                 "into images, and you help transcribe all the content of those images to Markdown "
                 "format, generating tables as needed to represent the information in the best possible way")

def pdf_to_images(pdf_path):
    # Render each page of the PDF as a JPEG image (pdf2image requires poppler)
    return convert_from_path(pdf_path, fmt='jpeg')

def get_image_dict(image):
    # Encode a page image as base64 and wrap it in the content block format Claude expects
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    image_dict = {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/jpeg",
            "data": img_str,
        }
    }
    return image_dict

def images_to_markdown(images):
    # Build a single user message with every page image plus the transcription instruction
    message_content = []
    for image in images:
        image_dict = get_image_dict(image)
        message_content.append(image_dict)
    message_content.append({"type": "text",
                            "text": "Transcribe the images to a Markdown text that has all the content of the images"})
    claude_dict = {"role": "user", "content": message_content}
    markdown_text = call_bedrock(claude_dict, SYSTEM_PROMPT)
    return markdown_text

def call_bedrock(user_prompt, system_prompt, previous_output="```markdown"):
    prompt = {
        "system": system_prompt,
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,  # Claude 3 Sonnet caps output at 4096 tokens per call
        "messages": [
            user_prompt,
            # Prefill the assistant turn so the model continues from previous_output
            {"role": "assistant", "content": previous_output}
        ]
    }
    # Call Amazon Bedrock, passing the conversation in the prompt, invoking Anthropic Claude 3 Sonnet
    body = json.dumps(prompt)
    model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'
    accept = 'application/json'
    content_type = 'application/json'
    response = bedrock.invoke_model(body=body, modelId=model_id, accept=accept, contentType=content_type)
    response_body = json.loads(response.get('body').read())
    if response_body["stop_reason"] == "max_tokens":
        # The output was cut off: call again, prefilling with everything generated so far
        print("Max tokens reached, calling again with previous output")
        return call_bedrock(user_prompt, system_prompt, "%s%s" % (previous_output, response_body["content"][0]["text"]))
    return "%s%s" % (previous_output, response_body["content"][0]["text"])
The output was outstanding, to say the least!
# WHAT DOES THE INDUSTRY LOOK LIKE?
## Market Share Percentage
| Provider | Market Share |
|-----------|--------------|
| My Awesome Cloud | 30 |
| Superb Cloud | 9 |
| Cloud GPT | 20 |
| Exciting cloud | 5 |
## Here we have some diagrams of how awesome the providers are
| Provider | Revenue | Employees |
|-----------|-----------|-----------|
| My Awesome Cloud | Trillions | 123456 |
| Superb Cloud | Billions | 555553 |
| Cloud GPT | Millions | 66666 |
| Exciting cloud | Thousands | 9876 |
### Table with Provider Data
Can LLMs understand Markdown?
Certainly! Here is a sample prompt:
Based on the following text, can you tell me which are the providers, ranking them by market share? <text># WHAT DOES THE INDUSTRY LOOK LIKE? …</text>
Which yields the correct answer:
Based on the provided text, the cloud providers and their corresponding market share percentages are as follows, ranked from highest to lowest market share:
1. My Awesome Cloud — 30%
2. Cloud GPT — 20%
3. Superb Cloud — 9%
4. Exciting cloud — 5%
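If you want to reproduce that check programmatically, you can send the transcription back through the same Bedrock client (a sketch; markdown_text is assumed to hold the Markdown generated earlier):

# Sketch: ask a question about the transcribed Markdown using the same Bedrock client.
question = ("Based on the following text, can you tell me which are the providers, "
            "ranking them by market share? <text>%s</text>" % markdown_text)
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": [{"type": "text", "text": question}]}],
})
response = bedrock.invoke_model(body=body,
                                modelId='anthropic.claude-3-sonnet-20240229-v1:0',
                                accept='application/json',
                                contentType='application/json')
print(json.loads(response.get('body').read())["content"][0]["text"])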
Summary
Multi-modal LLMs are better than traditional OCRs at image-to-Markdown transformations. The cost is very manageable (the above sample cost cents of a dollar), especially considering that this is usually a one-off process of moving wildly unstructured data into more structured formats.
This can speed up RAG implementations significantly, reducing the manual overhead of processing documents.
Lesson learned: don’t make a blanket claim that “RAG POCs are easy”.