10. Make Ingestion More Configurable
Summary
Move chunking and embedding choices into configuration and expose chunk size as an `add-data` CLI option.
Implementation Steps
- Add ingestion configuration for chunk size, tokenizer/model name, and embedding function provider.
- Change chunking code to receive the chunk size and tokenizer/model name instead of hard-coding `800` and `"gpt-4"`.
- Reuse the embedding function through dependency injection instead of constructing it for every embed call.
- Add `--chunk-size` to `add-data`, defaulting to the current value of `800`.
- Keep the default tokenizer/model behavior equivalent to the current `"gpt-4"` setting.
Public Interface Changes
- `add-data` gains `--chunk-size`.
- Default ingestion behavior remains unchanged when no option is provided.
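One way the interface change could look, sketched with `argparse` for illustration; the project's real CLI framework and option wiring may differ, and the `path` argument is an assumption.

```python
import argparse

# Hypothetical sketch of the add-data command-line parser.
parser = argparse.ArgumentParser(prog="add-data")
parser.add_argument("path", help="file to ingest")
parser.add_argument(
    "--chunk-size",
    type=int,
    default=800,
    help="tokens per chunk (default: 800, matching current behavior)",
)

# Omitting the option keeps today's default; passing it overrides per run.
defaults = parser.parse_args(["notes.txt"])
custom = parser.parse_args(["notes.txt", "--chunk-size", "400"])
```

Keeping the default at `800` in the parser itself means callers that never pass the flag see exactly the current ingestion behavior.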
Test Plan
- Test chunking with default and custom chunk sizes.
- Test `add-data --chunk-size` parser behavior.
- Test the ingestion service with an injected fake embedder.
- Smoke test adding a file with and without `--chunk-size`.
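The fake-embedder test could be sketched like this. The names `FakeEmbedder` and `IngestionService`, and the `embed`/`ingest` methods, are assumptions for illustration; the point is that because the service receives its embedder through the constructor, a test double can record calls and return deterministic vectors.

```python
from typing import List

class FakeEmbedder:
    """Test double: records every input and returns a deterministic vector."""
    def __init__(self) -> None:
        self.calls: List[str] = []

    def embed(self, text: str) -> List[float]:
        self.calls.append(text)
        # Deterministic stand-in vector: one dimension, the text length.
        return [float(len(text))]

class IngestionService:
    """Hypothetical service shape: the embedder is injected, not constructed."""
    def __init__(self, embedder: FakeEmbedder) -> None:
        self.embedder = embedder

    def ingest(self, chunks: List[str]) -> List[List[float]]:
        # Reuses the single injected embedder for every chunk, rather than
        # constructing a new embedding function per call.
        return [self.embedder.embed(chunk) for chunk in chunks]

embedder = FakeEmbedder()
service = IngestionService(embedder)
vectors = service.ingest(["ab", "cde"])
```

A test can then assert both that every chunk was embedded exactly once and that the returned vectors are the fake's deterministic outputs, with no network or model dependency.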
Assumptions
- Only chunk size is exposed in the CLI initially.
- Tokenizer/model and embedding provider configuration can remain internal or environment-backed until there is a concrete user-facing need.