Summary
This work presents the CaresAI team’s supervised fine-tuning approach to the MedHopQA
track of the BioCreative IX shared task, which evaluates the ability of Large Language Models
(LLMs) to perform multi-hop question answering (QA) across complex biomedical concepts
(diseases, genes, and chemicals). We focused on adapting the LLaMA 3 8B model and
addressed the key challenge of generating concise, accurate, and strictly formatted answers.
Problem & Objective
- Problem: Traditional QA benchmarks often rely on single-hop or extractive formats,
failing to capture the complex, multi-step reasoning required in real-world biomedical
settings. Furthermore, even when LLMs understand the context (high semantic
accuracy), they struggle to produce answers that exactly match evaluation requirements
(low Exact Match, or EM, scores) due to verbosity and formatting inconsistencies.
- Objective: To adapt a pre-trained LLM (LLaMA 3 8B) for multi-hop biomedical QA and
develop strategies to improve answer precision and brevity to better align with strict
evaluation metrics.
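The semantic-precision gap described above stems from how EM is scored: a verbose answer that contains the correct entity still counts as wrong. A minimal sketch of a SQuAD-style EM check illustrates this (the normalization rules and the BRCA1 example are illustrative assumptions, not taken from the task's official scorer):

```python
import re
import string

def normalize(ans: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace
    (SQuAD-style answer normalization)."""
    ans = ans.lower()
    ans = "".join(ch for ch in ans if ch not in string.punctuation)
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return " ".join(ans.split())

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized prediction equals the normalized gold answer, else 0."""
    return int(normalize(prediction) == normalize(gold))

# A semantically correct but verbose answer still scores EM = 0:
print(exact_match("The gene responsible is BRCA1.", "BRCA1"))  # 0
print(exact_match("BRCA1", "BRCA1"))                           # 1
```

Under this metric, brevity is not cosmetic: trimming a correct sentence down to the bare entity is the difference between a score of 0 and 1.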
Methodology
1. Data Aggregation: We supplemented the limited task development set with a curated
training dataset of biomedical QA pairs aggregated from external sources, including
MedQuAD, BioASQ, and TREC.
2. Supervised Fine-Tuning (SFT): LLaMA 3 8B was fine-tuned under three strategies:
combined short/long answers, short answers only, and long answers only.
3. Two-Stage Inference Pipeline: An innovative post-processing step was introduced
where a follow-up prompt explicitly instructed the model to extract only the exact
answer phrase or entity from its initial, verbose response.
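The two-stage pipeline in step 3 can be sketched as two calls to the same model: one to answer, one to distill. The prompt wording, function names, and toy model below are illustrative assumptions; the paper's exact prompt is not reproduced here:

```python
from typing import Callable

# Hypothetical follow-up prompt for the extraction stage.
EXTRACT_PROMPT = (
    "From the answer below, extract only the exact answer phrase or entity. "
    "Respond with that phrase alone, with no explanation.\n\nAnswer: {response}"
)

def two_stage_answer(question: str, generate: Callable[[str], str]) -> str:
    """Stage 1: obtain a (possibly verbose) answer from the model.
    Stage 2: re-prompt the model to extract only the exact answer phrase."""
    verbose = generate(question)
    return generate(EXTRACT_PROMPT.format(response=verbose)).strip()

# Toy stand-in for the fine-tuned LLM, for illustration only.
def fake_model(prompt: str) -> str:
    if prompt.startswith("From the answer below"):
        return "cystic fibrosis"  # stage 2: extracted entity
    return "The disease in question is cystic fibrosis."  # stage 1: verbose

print(two_stage_answer("Which disease is caused by CFTR mutations?", fake_model))
# cystic fibrosis
```

Because the second stage only rewrites the model's own output, it leaves the reasoning of stage 1 untouched while aligning the final string with strict EM scoring.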
Why This Matters (Impact)
This research is essential for moving LLMs from general knowledge to reliable deployment in
high-stakes fields like healthcare. The ability to perform multi-hop reasoning over complex
biomedical knowledge is a prerequisite for clinical decision support and drug discovery
applications. By highlighting the persistent semantic-precision gap (up to 0.8 concept-level
accuracy but low EM scores) , we motivate the necessity for better output control mechanisms
and rigorous, domain-specific evaluation before real-world integration.
Contribution
- Domain Adaptation: Successfully adapted LLaMA 3 8B for multi-hop biomedical QA
using SFT and external data.
- Precision Strategy: Developed a two-stage inference pipeline to mitigate verbosity
and extract precise short answers.
- Key Insight: Provided empirical evidence that fine-tuning alone is insufficient to
guarantee precise, formatted output under strict evaluation, underscoring the need for
advanced post-processing and evaluation-aligned prompting for LLMs in specialized
domains.


