move top-level modules into a real package

2026-04-22 15:47:46 +02:00
parent e33160282c
commit 8ebab832d5
35 changed files with 6192 additions and 31 deletions
@@ -0,0 +1,30 @@
+# 10. Make Ingestion More Configurable
+
+## Summary
+
+Move chunking and embedding choices into configuration and expose chunk size as an `add-data` CLI option.
+
+## Implementation Steps
+
+- Add ingestion configuration for chunk size, tokenizer/model name, and embedding function provider.
+- Change the chunking code to take chunk size and tokenizer/model name as parameters instead of hard-coding `800` and `"gpt-4"`.
+- Construct the embedding function once and pass it in through dependency injection instead of rebuilding it on every embed call.
+- Add `--chunk-size` to `add-data`, defaulting to the current value of `800`.
+- Keep the default tokenizer/model behavior equivalent to the current `"gpt-4"` setting.
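
The steps above could be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `IngestionConfig`, `chunk_text`, `IngestionService`, and `whitespace_tokenize` are hypothetical names, the whitespace tokenizer is a stand-in for whatever tokenizer the `"gpt-4"` setting selects, and non-overlapping chunks are an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical type alias: an embedder maps a batch of texts to vectors.
Embedder = Callable[[Sequence[str]], List[List[float]]]


@dataclass(frozen=True)
class IngestionConfig:
    """Hypothetical config object; defaults mirror the values named in the plan."""
    chunk_size: int = 800
    tokenizer_model: str = "gpt-4"
    embedding_provider: str = "default"  # placeholder value, not from the source


def whitespace_tokenize(text: str, model_name: str) -> List[str]:
    # Stand-in tokenizer: ignores model_name and splits on whitespace.
    return text.split()


def chunk_text(
    text: str,
    chunk_size: int,
    tokenizer_model: str,
    tokenize: Callable[[str, str], List[str]] = whitespace_tokenize,
) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = tokenize(text, tokenizer_model)
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


class IngestionService:
    """Receives its embedder from the caller instead of building one per call."""

    def __init__(self, config: IngestionConfig, embed: Embedder) -> None:
        self._config = config
        self._embed = embed  # constructed once by the caller, reused here

    def ingest(self, text: str) -> List[Tuple[str, List[float]]]:
        chunks = chunk_text(
            text, self._config.chunk_size, self._config.tokenizer_model
        )
        vectors = self._embed(chunks)
        return list(zip(chunks, vectors))
```

Passing the embedder into the constructor keeps the service free of provider-specific setup, which is also what makes a fake embedder easy to substitute in tests.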
+
+## Public Interface Changes
+
+- `add-data` gains `--chunk-size`.
+- Default ingestion behavior remains unchanged when `--chunk-size` is omitted.
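
One way the new option might look, assuming the CLI is built on `argparse` with subcommands (the commit does not show which framework is used). The `path` positional, the `prog` name, and the positive-integer check are illustrative assumptions rather than part of the plan.

```python
import argparse

DEFAULT_CHUNK_SIZE = 800  # the current hard-coded value named in the plan


def positive_int(value: str) -> int:
    """Parse a strictly positive integer (an assumed validation rule)."""
    number = int(value)  # a ValueError here is reported by argparse as a usage error
    if number <= 0:
        raise argparse.ArgumentTypeError("chunk size must be a positive integer")
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cli")  # hypothetical program name
    subparsers = parser.add_subparsers(dest="command", required=True)

    add_data = subparsers.add_parser("add-data", help="ingest a file")
    add_data.add_argument("path", help="file to ingest (hypothetical positional)")
    add_data.add_argument(
        "--chunk-size",
        type=positive_int,
        default=DEFAULT_CHUNK_SIZE,
        help="tokens per chunk (default: %(default)s)",
    )
    return parser
```

With this shape, `add-data notes.txt` yields `chunk_size == 800`, while `add-data notes.txt --chunk-size 256` overrides it, so existing invocations keep their behavior.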
+
+## Test Plan
+
+- Test chunking with default and custom chunk sizes.
+- Test that the `add-data` parser accepts `--chunk-size` and falls back to `800` when it is omitted.
+- Test ingestion service with an injected fake embedder.
+- Smoke test adding a file with and without `--chunk-size`.
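
A fake embedder for the injected-embedder test might look like the sketch below. `FakeEmbedder` and the small `ingest` function are hypothetical stand-ins; the real ingestion service and its signature are not shown in this commit, and the whitespace split only approximates token-based chunking.

```python
from typing import Callable, List, Sequence


class FakeEmbedder:
    """Test double: returns deterministic vectors and records each batch."""

    def __init__(self) -> None:
        self.calls: List[List[str]] = []

    def __call__(self, texts: Sequence[str]) -> List[List[float]]:
        self.calls.append(list(texts))
        # One-dimensional vector holding the text length; easy to assert on.
        return [[float(len(text))] for text in texts]


def ingest(
    text: str,
    chunk_size: int,
    embed: Callable[[Sequence[str]], List[List[float]]],
) -> List[List[float]]:
    # Hypothetical stand-in for the ingestion service under test.
    words = text.split()
    chunks = [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    return embed(chunks)


def test_ingest_uses_injected_embedder() -> None:
    fake = FakeEmbedder()
    vectors = ingest("one two three", chunk_size=2, embed=fake)
    # The embedder is called exactly once, with the chunks in order.
    assert fake.calls == [["one two", "three"]]
    assert vectors == [[7.0], [5.0]]
```

Because the fake records its inputs, the same test can check both the chunk boundaries and that no real embedding provider is contacted.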
+
+## Assumptions
+
+- Only chunk size is exposed in the CLI initially.
+- Tokenizer/model and embedding provider configuration can remain internal or environment-backed until there is a concrete user-facing need.