AM10101010/medIndexer

Nexicd

An experimental .NET project that explores whether ICD-11 clinical coding can be treated as a small, local-first RAG system instead of a heavyweight ML platform.


Overview

The interesting part of Nexicd is not just "LLM in healthcare". The experiment is whether a relatively small, understandable system can combine structured medical taxonomy data, vector retrieval, and model-based reasoning without a large orchestration stack or custom training pipeline. ICD-11 is a good place to test that: it is a structured, high-stakes domain where RAG is worth exploring seriously, and where the gap between a naive prompt and a grounded retrieval pipeline is easy to measure. Nexicd ingests WHO ICD-11 MMS data, embeds it into a local SQLite vector store, and, after normalizing the input, runs a four-stage pipeline (extraction, retrieval, selection, validation) to turn clinical text into candidate codes.


Demo

Nexicd CLI demo

The demo flow is intentionally simple: build a local ICD-11 vector store, enter a short clinical note, and inspect how the pipeline narrows the problem down to a validated ICD-11 code.


Why This Project Exists

This project started as an experiment in practical retrieval-augmented clinical coding.

The problem was straightforward: clinical conversations are noisy, ICD-11 is large and hierarchical, and naive prompting alone is too brittle for consistent code selection. I wanted to explore whether a lightweight RAG architecture could narrow the search space first, then let the model reason over a smaller, more structured set of candidates.

The project was also a way to learn more about three things:

  • using WHO ICD-11 MMS data as a searchable local knowledge base
  • building a multi-stage LLM pipeline in plain .NET without heavy framework lock-in
  • treating a local SQLite vector store as a developer-friendly retrieval layer for experiments

Key Ideas

  • Use WHO ICD-11 MMS as the source of truth instead of asking a model to memorize the taxonomy.
  • Keep the retrieval layer local by writing embeddings into a SQLite database.
  • Separate ingestion from query-time coding so the developer loop stays fast after the index is built.
  • Treat the pipeline as a sequence of explicit stages: normalize, extract, retrieve, select, validate.
  • Prefer understandable engineering tradeoffs over "AI magic".
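
The "explicit stages" idea can be sketched as a chain of small functions. This is a minimal illustration, not Nexicd's actual code: the names `Finding`, `Candidate`, `CodingResult`, and `CodingPipeline` are hypothetical, and the LLM, vector-store, and WHO calls are replaced by stub delegates so that only the control flow is shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stub stages for illustration; real ones would call an LLM, the vector store, and the WHO API.
var pipeline = new CodingPipeline(
    extract: text => text.Split(',').Select(part => new Finding(part.Trim())).ToList(),
    retrieve: findings => new List<Candidate> { new("CA25", "Acute nasopharyngitis") },
    choose: (findings, candidates) => candidates[0],
    validate: candidate => candidate.Code.Length > 0);

var result = pipeline.Run("  runny nose,   sore throat,\n sneezing ");
Console.WriteLine($"{result.Code} - {result.Title} (validated: {result.IsValid})");
// prints "CA25 - Acute nasopharyngitis (validated: True)"

public record Finding(string Text);
public record Candidate(string Code, string Title);
public record CodingResult(string Code, string Title, bool IsValid);

public sealed class CodingPipeline
{
    private readonly Func<string, IReadOnlyList<Finding>> _extract;
    private readonly Func<IReadOnlyList<Finding>, IReadOnlyList<Candidate>> _retrieve;
    private readonly Func<IReadOnlyList<Finding>, IReadOnlyList<Candidate>, Candidate> _choose;
    private readonly Func<Candidate, bool> _validate;

    public CodingPipeline(
        Func<string, IReadOnlyList<Finding>> extract,
        Func<IReadOnlyList<Finding>, IReadOnlyList<Candidate>> retrieve,
        Func<IReadOnlyList<Finding>, IReadOnlyList<Candidate>, Candidate> choose,
        Func<Candidate, bool> validate)
    {
        _extract = extract;
        _retrieve = retrieve;
        _choose = choose;
        _validate = validate;
    }

    // Normalize: collapse every run of whitespace into a single space.
    public static string Normalize(string input) =>
        string.Join(" ", input.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries));

    // Each stage consumes only the output of earlier stages, which keeps them testable in isolation.
    public CodingResult Run(string rawText)
    {
        var text = Normalize(rawText);
        var findings = _extract(text);
        var candidates = _retrieve(findings);
        var chosen = _choose(findings, candidates);
        return new CodingResult(chosen.Code, chosen.Title, _validate(chosen));
    }
}
```

Passing the stages in as delegates is only one way to keep them explicit; the point is that each stage can be swapped for a stub without touching the others.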

Features

  • WHO ICD-11 MMS ingestion into a local SQLite vector store
  • Interactive CLI for trying the coding pipeline on clinical text
  • Structured extraction of findings from noisy conversations
  • Candidate retrieval using vector search over ICD-11 entities
  • LLM-based code selection from retrieved candidates
  • WHO-backed code validation to catch hallucinated primary codes
  • Unit and integration test coverage, plus an opt-in live smoke test
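
To make the candidate-retrieval feature concrete, here is a rough sketch of a brute-force cosine-similarity search over in-memory vectors. It assumes nothing about how Nexicd actually queries SQLite: `IndexedEntity`, `ScoredEntity`, and `VectorSearch` are hypothetical names, the three-dimensional vectors are toy data, and the `DEMO` codes are placeholders rather than real ICD-11 codes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy 3-dimensional "embeddings"; real embedding vectors have far more dimensions.
var index = new List<IndexedEntity>
{
    new("CA25", "Acute nasopharyngitis", new float[] { 0.9f, 0.1f, 0.0f }),
    new("DEMO1", "Placeholder entity A", new float[] { 0.0f, 1.0f, 0.0f }),
    new("DEMO2", "Placeholder entity B", new float[] { 0.5f, 0.5f, 0.7f }),
};

var query = new float[] { 1.0f, 0.0f, 0.0f };
foreach (var hit in VectorSearch.TopK(index, query, k: 2))
    Console.WriteLine($"{hit.Entity.Code}: {hit.Score}");

public record IndexedEntity(string Code, string Title, float[] Embedding);
public record ScoredEntity(IndexedEntity Entity, double Score);

public static class VectorSearch
{
    // Cosine similarity: dot(a, b) / (|a| * |b|); defined as 0 when either vector is all zeros.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Score every entity against the query and keep the k most similar ones.
    public static IReadOnlyList<ScoredEntity> TopK(IEnumerable<IndexedEntity> index, float[] query, int k) =>
        index.Select(e => new ScoredEntity(e, CosineSimilarity(e.Embedding, query)))
             .OrderByDescending(s => s.Score)
             .Take(k)
             .ToList();
}
```

A linear scan like this is usually fine for experiments at taxonomy scale; an approximate index only becomes interesting if the scan turns out to be a measurable bottleneck.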

Architecture

graph TD
    A["Clinical note or conversation"] --> B["Input normalization"]
    B --> C["Stage 1: Clinical extraction"]
    C --> D["Stage 2: Vector retrieval"]
    D --> E["Stage 3: Code selection"]
    E --> F["Stage 4: WHO validation"]
    F --> G["Coding result"]

    H["WHO ICD-11 MMS API"] --> I["Ingestion pipeline"]
    I --> J["SQLite vector store"]
    J --> D

    K["OpenAI embeddings"] --> I
    L["OpenAI chat models"] --> C
    L --> E

The system is split into three small projects:

  • Nexicd.Core: models, WHO client, parsing, retrieval, and pipeline logic
  • Nexicd.Ingestion: builds the vector database from WHO ICD-11 data
  • Nexicd.Console: runs the interactive coding workflow against the local database

That split keeps ingestion concerns separate from the runtime query path. Once the database is built, the CLI can focus on retrieval and reasoning instead of rebuilding state on startup.
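
One consequence of this split is that whatever Nexicd.Ingestion writes must be read back unchanged by Nexicd.Console. How Nexicd stores embeddings is not described here; a common approach, sketched below under that assumption, is to store each `float[]` as a raw byte BLOB. `EmbeddingBlob` is a hypothetical helper name, and the SQLite read and write calls are deliberately left out.

```csharp
using System;
using System.Runtime.InteropServices;

var embedding = new float[] { 0.25f, -1.5f, 3.0f };

byte[] blob = EmbeddingBlob.ToBytes(embedding);      // what an ingestion step could write to a BLOB column
float[] restored = EmbeddingBlob.FromBytes(blob);    // what the query path could read back

Console.WriteLine($"{blob.Length} bytes, {restored.Length} floats");
// prints "12 bytes, 3 floats"

public static class EmbeddingBlob
{
    // Reinterprets the floats as bytes in the machine's native byte order.
    // Assumption: the database is written and read on machines with the same endianness.
    public static byte[] ToBytes(float[] vector) =>
        MemoryMarshal.AsBytes(vector.AsSpan()).ToArray();

    public static float[] FromBytes(byte[] blob)
    {
        if (blob.Length % sizeof(float) != 0)
            throw new ArgumentException("Blob length must be a multiple of 4 bytes.", nameof(blob));

        return MemoryMarshal.Cast<byte, float>(blob.AsSpan()).ToArray();
    }
}
```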

Getting Started

Installation

git clone https://github.com/AM10101010/medIndexer.git
cd medIndexer
dotnet restore

Configure environment

Create a local env file or export variables directly:

cp .env.example .env

Required for ingestion and the interactive console:

export OPENAI_API_KEY="your-key"

Optional:

export ICD_API_BASE="http://localhost"
export OUTPUT_LANGUAGE="English"

If you want to validate against the WHO cloud API instead of the local Docker image:

export ICD_API_BASE="https://id.who.int"
export WHO_CLIENT_ID="your-client-id"
export WHO_CLIENT_SECRET="your-client-secret"
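
For readers wiring these variables into their own code, the sketch below shows one way they could be resolved at startup. It is an assumption-laden illustration, not Nexicd's configuration code: `NexicdSettings` is a hypothetical type, the defaults are simply the optional values shown above, and the rule that WHO credentials are mandatory whenever the host is `id.who.int` is inferred from this section.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// A dictionary stands in for the process environment so the example runs anywhere.
var fakeEnv = new Dictionary<string, string> { ["OPENAI_API_KEY"] = "your-key" };
var settings = NexicdSettings.FromEnvironment(
    name => fakeEnv.TryGetValue(name, out var value) ? value : null);

Console.WriteLine($"ICD API host: {settings.IcdApiBase.Host}, language: {settings.OutputLanguage}");
// prints "ICD API host: localhost, language: English"

public record NexicdSettings(
    string OpenAiApiKey,
    Uri IcdApiBase,
    string OutputLanguage,
    string? WhoClientId,
    string? WhoClientSecret)
{
    public static NexicdSettings FromEnvironment(Func<string, string?> getVar)
    {
        var apiKey = getVar("OPENAI_API_KEY");
        if (string.IsNullOrWhiteSpace(apiKey))
            throw new InvalidOperationException("OPENAI_API_KEY is required.");

        // Assumed defaults, taken from the optional values shown in this section.
        var baseUri = new Uri(getVar("ICD_API_BASE") ?? "http://localhost");
        var language = getVar("OUTPUT_LANGUAGE") ?? "English";

        var clientId = getVar("WHO_CLIENT_ID");
        var clientSecret = getVar("WHO_CLIENT_SECRET");

        // Assumed rule: the WHO cloud API needs credentials, the local Docker image does not.
        var usesWhoCloud = baseUri.Host.Equals("id.who.int", StringComparison.OrdinalIgnoreCase);
        if (usesWhoCloud && (string.IsNullOrWhiteSpace(clientId) || string.IsNullOrWhiteSpace(clientSecret)))
            throw new InvalidOperationException(
                "WHO_CLIENT_ID and WHO_CLIENT_SECRET are required for the WHO cloud API.");

        return new NexicdSettings(apiKey, baseUri, language, clientId, clientSecret);
    }
}
```

To read the real process environment, pass `Environment.GetEnvironmentVariable` as the lookup instead of the dictionary-backed lambda. Failing fast at startup keeps a missing key from surfacing later as an opaque API error.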

Start the local ICD API

docker compose up -d

Build the vector store

dotnet run --project src/Nexicd.Ingestion

Run the project

dotnet run --project src/Nexicd.Console

Example Usage

CLI session

Input:

Patient has a runny nose, sore throat, and sneezing for two days. No fever.

Possible output:

Primary: CA25 - Acute nasopharyngitis
Confidence: HIGH
Reasoning: Symptoms and duration are consistent with a common cold and do not suggest a more specific alternative.

Using a custom database path

dotnet run --project src/Nexicd.Ingestion -- --db ./data/dev-nexicd.db
dotnet run --project src/Nexicd.Console -- --db ./data/dev-nexicd.db

Running tests

Default test suite:

dotnet test Nexicd.sln

Opt-in live smoke test:

RUN_LIVE_SMOKE_TESTS=1 dotnet test Nexicd.sln --filter FullyQualifiedName~LivePipeline_CommonCold_ReturnsExpectedCode
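
An opt-in test like this is typically gated on the environment variable inside the test itself. The helper below is a hypothetical sketch of such a gate, not the repository's implementation; `LiveTestGate` is an invented name, and treating only the exact value `1` as "enabled" is an assumption based on the command above.

```csharp
#nullable enable
using System;

var flag = Environment.GetEnvironmentVariable("RUN_LIVE_SMOKE_TESTS");

// Inside a test method, the same check would guard an early return before any network call.
Console.WriteLine(LiveTestGate.IsEnabled(flag)
    ? "Live smoke tests enabled"
    : "Live smoke tests skipped");

public static class LiveTestGate
{
    // Only the exact value "1" (ignoring surrounding whitespace) opts in; unset or anything else opts out.
    public static bool IsEnabled(string? value) => value?.Trim() == "1";
}
```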

Project Status

Experimental / work in progress.

This repository is intentionally positioned as an engineering exploration, not a production medical coding product. The architecture is stable enough to demonstrate the idea, but the project still has prototype-era constraints around determinism, operational hardening, and compliance boundaries.


Roadmap

Possible next directions:

  • expose the pipeline through a small HTTP API instead of only a CLI
  • validate secondary codes, not just the primary code
  • make ICD release selection configurable instead of hardcoded
  • evaluate retrieval quality on a larger curated benchmark set
  • compare the local SQLite approach with a remote vector database
  • add richer telemetry and prompt-trace diagnostics for debugging

Contributing

Ideas, criticism, and experiments are welcome.

If you see a cleaner retrieval approach, a better way to structure the pipeline, or a useful test case for ICD-11 coding behavior, open an issue or send a pull request.


License

MIT
