This web app allows users to upload insurance photo report PDFs and a .docx template. The system uses OCR + AI to extract relevant information and automatically fills the template. The final result can be downloaded as a filled PDF or viewed directly in the browser.
.
├── insurance_pipeline/ # Core pipeline (OCR, extraction, LLMs, etc.)
├── sample/ # Sample input/output files
├── app.py # Streamlit app for UI interaction
├── .env # API keys
├── requirements.txt # Dependencies list
└── README.md # Project documentation
- Create & Activate Virtual Environment:
python3.9 -m venv task_3
source task_3/bin/activate # macOS/Linux
task_3\Scripts\activate # Windows
- Install PaddleOCR:
If you have a GPU and CUDA 11.8:
python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
If not, use the CPU version:
pip install paddlepaddle
- More installation details: PaddlePaddle Installation Guide.
- Install Other Dependencies:
pip install -r requirements.txt
- Add API Keys to .env File:
- Make sure your .env file includes:
OPENROUTER_API_KEY = "openrouter_api_key"
GOOGLE_API_KEY = "google_api_key"
PINECONE_API_KEY = "pinecone_api_key"
COHERE_API_KEY = "cohere_api_key"
GROQ_API_KEY = "groq_api_key"
CONVERTAPI_API_KEY = "convertapi_api_key"
- Run the Application:
streamlit run app.py
- A local server will start and open the app in your default browser.
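Before starting the app, it can help to confirm that every key listed above is actually set. The following is a minimal sketch, not part of the project: the `missing_keys` helper is hypothetical, and it assumes the values from `.env` have already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`, if that is how the app reads them).

```python
import os

# Key names taken from the .env example in this README.
REQUIRED_KEYS = [
    "OPENROUTER_API_KEY",
    "GOOGLE_API_KEY",
    "PINECONE_API_KEY",
    "COHERE_API_KEY",
    "GROQ_API_KEY",
    "CONVERTAPI_API_KEY",
]


def missing_keys(env=None):
    """Return the required key names that are unset or blank in env.

    Hypothetical helper: env defaults to os.environ, but any mapping
    can be passed in, which makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

Running this check before `streamlit run app.py` surfaces a missing key immediately instead of partway through a pipeline run.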
┌────────────────────────────┐
│ Upload Inputs │
│ ┌────────────────────────┐ │
│ │ Report PDFs │ │
│ │ .docx Template │ │
│ └────────────────────────┘ │
└────────────┬───────────────┘
│
▼
┌────────────────────────────┐
│ OCR + Text Chunking │
│ - OCR PDFs │
│ - Split into text chunks │
└────────────┬───────────────┘
│
▼
┌────────────────────────────┐
│ Embedding + Pinecone DB │
│ - Convert chunks to vectors│
│ - Store in Pinecone index │
└────────────┬───────────────┘
│
▼
┌──────────────────────────────────────┐
│ Field Meaning Extraction (LLM) │
│ - Extract placeholders from .docx │
│ - Understand meaning (OpenRouter LLM)│
└────────────┬─────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Semantic Retrieval + QA │
│ - Similarity search (Pinecone) │
│ - Rerank with Cohere │
│ - Final answer via GROQ LLM │
└────────────┬─────────────────────────┘
│
▼
┌────────────────────────────┐
│ Fill Template Fields │
│ - Replace placeholders │
└────────────┬───────────────┘
│
▼
┌────────────────────────────┐
│ Convert to PDF │
│ - Use ConvertAPI │
└────────────┬───────────────┘
│
▼
┌────────────────────────────┐
│ Preview & Download PDF │
│ - View PDF in browser │
│ - Download final PDF │
└────────────────────────────┘
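Two stages in the diagram above, splitting OCR text into chunks and replacing template placeholders, need no external service and can be illustrated in isolation. The sketch below is an assumption-laden illustration, not the code in insurance_pipeline/: the function names, the character-based chunking with overlap, and the `{{field}}` placeholder syntax are all hypothetical choices made for this example.

```python
import re


def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks that overlap.

    Assumed strategy: the real pipeline may chunk by sentence,
    token count, or page instead.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


# Assumed placeholder syntax: {{field_name}}, optional inner spaces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_placeholders(template_text, values):
    """Replace each known placeholder; leave unknown ones untouched."""
    def substitute(match):
        return str(values.get(match.group(1), match.group(0)))
    return PLACEHOLDER.sub(substitute, template_text)


if __name__ == "__main__":
    print(split_into_chunks("abcdefghij", chunk_size=4, overlap=1))
    print(fill_placeholders("Policy {{policy_no}}", {"policy_no": "A-17"}))
```

Leaving unknown placeholders in place, rather than blanking them, keeps unanswered fields visible in the generated document so they can be filled in by hand.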
To manage LLM API usage and stay within rate limits, a delay is added between field queries. You can change it in insurance_pipeline/qa_utils.py:
def extract_all_fields(...):
    ...
    time.sleep(5) # Delay between LLM requests
- You can find sample .docx templates and insurance report PDFs in the sample/ directory for testing.
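The delay between field queries described above can also be made a parameter rather than a hard-coded value. This is only a sketch of the idea under stated assumptions: `query_fields_with_delay` and its `ask` callback are hypothetical names, and the real `extract_all_fields` in insurance_pipeline/qa_utils.py may be structured differently.

```python
import time


def query_fields_with_delay(fields, ask, delay_seconds=5.0, sleep=time.sleep):
    """Call ask(field) for each field, pausing between consecutive calls.

    Hypothetical helper: ask stands in for whatever function sends one
    field query to the LLM. sleep is injectable so the pause can be
    skipped or recorded in tests.
    """
    answers = {}
    for index, field in enumerate(fields):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        answers[field] = ask(field)
    return answers


if __name__ == "__main__":
    # Stand-in for a real LLM call, with a short delay for demonstration.
    result = query_fields_with_delay(
        ["insured_name", "claim_date"],
        ask=lambda field: f"<answer for {field}>",
        delay_seconds=0.1,
    )
    print(result)
```

With a provider that allows more requests per minute, a smaller `delay_seconds` shortens the total run time; with a stricter limit, a larger value avoids rate-limit errors.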