Chromy/plans/10-configurable-ingestion.md
2026-04-22 15:47:46 +02:00

10. Make Ingestion More Configurable

Summary

Move chunking and embedding choices into configuration and expose chunk size as an add-data CLI option.

Implementation Steps

  • Add ingestion configuration for chunk size, tokenizer/model name, and embedding function provider.
  • Change chunking code to receive chunk size and tokenizer/model name instead of hard-coding 800 and "gpt-4".
  • Reuse the embedding function through dependency injection instead of constructing it for every embed call.
  • Add --chunk-size to add-data, defaulting to the current value of 800.
  • Keep the default tokenizer/model behavior equivalent to the current "gpt-4" setting.
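
The steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `IngestionConfig`, `chunk_tokens`, `IngestionService`, and the `Tokenizer` and `Embedder` callable types are hypothetical names, the `"default"` provider value is a placeholder, and a whitespace split stands in for the real tokenizer so the example runs without third-party packages.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical callable types; the real project may use different interfaces.
Tokenizer = Callable[[str], List[str]]
Embedder = Callable[[Sequence[str]], List[List[float]]]


@dataclass(frozen=True)
class IngestionConfig:
    chunk_size: int = 800                # value that is currently hard-coded
    tokenizer_model: str = "gpt-4"       # value that is currently hard-coded
    embedding_provider: str = "default"  # placeholder, an assumption


def chunk_tokens(text: str, chunk_size: int, tokenize: Tokenizer) -> List[str]:
    """Split text into chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = tokenize(text)
    # Joining with spaces only suits the whitespace stand-in tokenizer;
    # a real tokenizer would decode each slice of token ids instead.
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


class IngestionService:
    """Receives its embedder once instead of building one per embed call."""

    def __init__(self, config: IngestionConfig, tokenize: Tokenizer, embed: Embedder):
        self._config = config
        self._tokenize = tokenize
        self._embed = embed

    def ingest(self, text: str) -> List[List[float]]:
        chunks = chunk_tokens(text, self._config.chunk_size, self._tokenize)
        return self._embed(chunks) if chunks else []
```

Passing the tokenizer and embedder into the constructor keeps the chunking code free of hard-coded values and lets tests substitute lightweight stand-ins.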

Public Interface Changes

  • add-data gains --chunk-size.
  • Default ingestion behavior remains unchanged when --chunk-size is omitted.
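
One way the new option might look, assuming the CLI is built on `argparse` (the plan does not say which parser library is used). The program name `chromy`, the positional `path` argument, and the `positive_int` helper are assumptions made for illustration only.

```python
import argparse

DEFAULT_CHUNK_SIZE = 800  # matches the value that is currently hard-coded


def positive_int(value: str) -> int:
    """Parse a strictly positive integer for --chunk-size (hypothetical helper)."""
    number = int(value)
    if number <= 0:
        raise argparse.ArgumentTypeError("chunk size must be a positive integer")
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="chromy")  # program name is an assumption
    subparsers = parser.add_subparsers(dest="command", required=True)
    add_data = subparsers.add_parser("add-data")
    add_data.add_argument("path")  # hypothetical positional argument
    add_data.add_argument(
        "--chunk-size",
        type=positive_int,
        default=DEFAULT_CHUNK_SIZE,
        help="maximum number of tokens per chunk",
    )
    return parser
```

Because the default equals the current hard-coded value, invoking `add-data` without the flag would behave as it does today.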

Test Plan

  • Test chunking with default and custom chunk sizes.
  • Test that the add-data parser accepts --chunk-size and falls back to 800 when it is omitted.
  • Test ingestion service with an injected fake embedder.
  • Smoke test adding a file with and without --chunk-size.
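
The injected-fake test might look something like the sketch below. `FakeEmbedder` and the small `ingest` function are hypothetical stand-ins, not the project's real service, and whitespace splitting replaces the real tokenizer to keep the example dependency-free.

```python
from typing import Callable, List, Sequence


class FakeEmbedder:
    """Test double: records every batch and returns deterministic vectors."""

    def __init__(self) -> None:
        self.calls: List[List[str]] = []

    def __call__(self, texts: Sequence[str]) -> List[List[float]]:
        self.calls.append(list(texts))
        return [[float(len(t))] for t in texts]


def ingest(
    text: str,
    chunk_size: int,
    embed: Callable[[Sequence[str]], List[List[float]]],
) -> List[List[float]]:
    # Stand-in for the real ingestion service (an assumption): split on
    # whitespace, group words into chunks, and embed them in one batch.
    words = text.split()
    chunks = [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    return embed(chunks)


def test_ingest_uses_injected_embedder() -> None:
    fake = FakeEmbedder()
    vectors = ingest("one two three four five", chunk_size=2, embed=fake)
    # The fake saw exactly one batch containing the expected chunks.
    assert fake.calls == [["one two", "three four", "five"]]
    assert len(vectors) == 3
```

Recording the calls lets the test confirm both the chunk boundaries and that the embedder is invoked once per ingest rather than once per chunk.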

Assumptions

  • Only chunk size is exposed in the CLI initially.
  • Tokenizer/model and embedding provider configuration can remain internal or environment-backed until there is a concrete user-facing need.
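
If the environment-backed route is taken, the lookup could be as small as the sketch below. The variable names `CHROMY_TOKENIZER_MODEL` and `CHROMY_EMBEDDING_PROVIDER`, the `"default"` provider value, and the function name are all assumptions; only the `"gpt-4"` fallback comes from the plan.

```python
import os
from typing import Mapping, Optional, Tuple


def load_internal_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (tokenizer_model, embedding_provider) from the environment."""
    env = os.environ if env is None else env
    # Fall back to the current behavior when a variable is not set.
    tokenizer_model = env.get("CHROMY_TOKENIZER_MODEL", "gpt-4")
    embedding_provider = env.get("CHROMY_EMBEDDING_PROVIDER", "default")
    return tokenizer_model, embedding_provider
```

Accepting an optional mapping keeps the function easy to test without mutating the process environment.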