James C. Caldwell
Bio
I'm an NLP and machine learning engineer specializing in LLM applications, document AI, and large-scale data pipelines. I work independently as a consultant and am open to team-based roles in industry.
My most recent project was a research collaboration with Environment Canada and Western University to extract handwritten observations from approximately one million digitized historical weather forms spanning 1840 to 1960. I built the pipeline end to end: organizing and preprocessing 6 TB of scanned documents, benchmarking vision-language models across hardware configurations, scaling inference with vLLM on rented H200 GPUs to meet throughput requirements, and delivering cost/benefit analyses to guide the production run. A co-authored methodology paper is planned.
I also built the Modular Digital Methodologies Toolkit (MDMT), an open-source cross-platform desktop application for document analysis. MDMT integrates OCR, audio transcription, named entity recognition, and an AI-powered RAG chatbot that runs local Qwen models with automatic GPU detection across NVIDIA, AMD, and Apple Silicon. It is packaged for Linux, Windows, and macOS and is available on GitHub.
Earlier work includes a computational discourse analysis of over 80 years of Canadian government records, conducted in collaboration with Dr. Janice Forsyth, examining how residential school administrators in Canada used sport and physical activity. I built custom NLP pipelines to classify and extract structured data from a large unstructured archival corpus; the resulting dataset is publicly available.
My MA thesis (Western University, 2025) applied bibliometric analysis and machine learning to trace the emergence of antibiotic pollution as a global research field across approximately 45,000 scientific publications. The methodology included an active-learning text classification pipeline trained on roughly 4,000 labeled examples; a multi-stage author name disambiguation system combining n-gram analysis, Levenshtein distance, and phonetic matching; and collaboration network analysis using Walktrap and Louvain community detection, with burst detection to identify temporal inflection points. I hold an MA in History with a computational focus and a BA in Psychology, both from Western University.
If you're working with large document collections, unstructured text, archival data, or AI systems and need someone who can build production-ready pipelines, I'd like to hear from you.
You can contact me at James.Caldwell.000@gmail.com