move top-level modules into a real package
@@ -0,0 +1,30 @@
# 10. Make Ingestion More Configurable
## Summary
Move chunking and embedding choices into configuration and expose chunk size as an `add-data` CLI option.
## Implementation Steps
- Add ingestion configuration for chunk size, tokenizer/model name, and embedding function provider.
- Change chunking code to receive chunk size and tokenizer/model name instead of hard-coding `800` and `"gpt-4"`.
- Reuse the embedding function through dependency injection instead of constructing it for every embed call.
- Add `--chunk-size` to `add-data`, defaulting to the current value of `800`.
- Keep the default tokenizer/model behavior equivalent to the current `"gpt-4"` setting.
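
The steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `IngestionConfig`, `chunk_text`, `IngestionService`, and the whitespace tokenizer are hypothetical names, and a real implementation would select its tokenizer from `model_name` rather than splitting on whitespace.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases; the real project may use different signatures.
Tokenizer = Callable[[str], List[str]]
Embedder = Callable[[Sequence[str]], List[List[float]]]

DEFAULT_CHUNK_SIZE = 800      # the value that is currently hard-coded
DEFAULT_MODEL_NAME = "gpt-4"  # the tokenizer/model name that is currently hard-coded


@dataclass(frozen=True)
class IngestionConfig:
    """Ingestion settings that used to be hard-coded in the chunking code."""
    chunk_size: int = DEFAULT_CHUNK_SIZE
    model_name: str = DEFAULT_MODEL_NAME

    def __post_init__(self) -> None:
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be a positive integer")


def whitespace_tokenizer(text: str) -> List[str]:
    """Stand-in tokenizer (assumption); a real one would depend on model_name."""
    return text.split()


def chunk_text(text: str, chunk_size: int, tokenize: Tokenizer) -> List[str]:
    """Split text into chunks of at most chunk_size tokens."""
    tokens = tokenize(text)
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


class IngestionService:
    """Receives the embedder once and reuses it for every ingest call."""

    def __init__(self, config: IngestionConfig, embedder: Embedder,
                 tokenize: Tokenizer = whitespace_tokenizer) -> None:
        self.config = config
        self.embedder = embedder
        self.tokenize = tokenize

    def ingest(self, text: str) -> List[Tuple[str, List[float]]]:
        chunks = chunk_text(text, self.config.chunk_size, self.tokenize)
        if not chunks:
            return []
        vectors = self.embedder(chunks)  # one call for the whole batch
        return list(zip(chunks, vectors))
```

Because the embedder is passed in rather than constructed inside `ingest`, the same instance serves every call, and tests can substitute a fake without touching the chunking logic.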
## Public Interface Changes
- `add-data` gains `--chunk-size`.
- Default ingestion behavior remains unchanged when no option is provided.
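
As an illustration of the interface change, here is a minimal sketch that assumes the CLI is built with `argparse`; the plan does not say which parser library the project uses. The `path` positional, the `prog` name, and the `positive_int` validator are hypothetical.

```python
import argparse

DEFAULT_CHUNK_SIZE = 800  # matches the current hard-coded value


def positive_int(value: str) -> int:
    """argparse type that rejects non-integers and values below 1."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer")
    if number <= 0:
        raise argparse.ArgumentTypeError("chunk size must be positive")
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="app")  # hypothetical program name
    subparsers = parser.add_subparsers(dest="command", required=True)

    add_data = subparsers.add_parser("add-data", help="ingest a file")
    add_data.add_argument("path")  # hypothetical positional argument
    add_data.add_argument(
        "--chunk-size",
        type=positive_int,
        default=DEFAULT_CHUNK_SIZE,
        help="maximum tokens per chunk (default: %(default)s)",
    )
    return parser
```

With this shape, `add-data notes.txt` parses to `chunk_size == 800`, so existing invocations keep their current behavior.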
## Test Plan
- Test chunking with default and custom chunk sizes.
- Test that the `add-data` parser accepts `--chunk-size` and falls back to `800` when the option is omitted.
- Test ingestion service with an injected fake embedder.
- Smoke test adding a file with and without `--chunk-size`.
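
The unit-level items in this plan might look something like the sketch below. `FakeEmbedder` and the `ingest` function are hypothetical stand-ins for the project's real embedder interface and ingestion service; only the testing pattern is the point.

```python
import unittest
from typing import Callable, List, Sequence

Embedder = Callable[[Sequence[str]], List[List[float]]]


class FakeEmbedder:
    """Deterministic test double that records every batch it receives."""

    def __init__(self) -> None:
        self.calls: List[List[str]] = []

    def __call__(self, texts: Sequence[str]) -> List[List[float]]:
        self.calls.append(list(texts))
        return [[float(len(t))] for t in texts]


def ingest(text: str, chunk_size: int, embedder: Embedder) -> List[List[float]]:
    """Hypothetical stand-in for the ingestion service under test."""
    words = text.split()
    chunks = [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    return embedder(chunks)


class IngestionTests(unittest.TestCase):
    def test_custom_chunk_size_controls_chunks(self) -> None:
        fake = FakeEmbedder()
        ingest("a b c d e", chunk_size=2, embedder=fake)
        self.assertEqual(fake.calls, [["a b", "c d", "e"]])

    def test_default_chunk_size_keeps_short_text_whole(self) -> None:
        fake = FakeEmbedder()
        ingest("a b c", chunk_size=800, embedder=fake)
        self.assertEqual(fake.calls, [["a b c"]])

    def test_returns_vectors_from_injected_embedder(self) -> None:
        fake = FakeEmbedder()
        self.assertEqual(ingest("a b c", 2, fake), [[3.0], [1.0]])
```

Saved as a test module, this runs with `python -m unittest`. The fake keeps the tests offline and fast, since no real embedding provider is contacted.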
## Assumptions
- Only chunk size is exposed in the CLI initially.
- Tokenizer/model and embedding provider configuration can remain internal or environment-backed until there is a concrete user-facing need.
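
One way the environment-backed option could look is sketched here. The variable names `INGEST_MODEL_NAME` and `INGEST_EMBEDDING_PROVIDER`, the `"default"` provider label, and the `load_ingestion_settings` helper are all assumptions made for illustration; the plan does not define them.

```python
import os
from typing import Dict, Mapping, Optional


def load_ingestion_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read internal ingestion settings, falling back to current defaults."""
    source = os.environ if env is None else env
    return {
        # Hypothetical variable names; "gpt-4" preserves today's behavior.
        "model_name": source.get("INGEST_MODEL_NAME", "gpt-4"),
        "embedding_provider": source.get("INGEST_EMBEDDING_PROVIDER", "default"),
    }
```

Accepting an optional mapping keeps the function easy to test without mutating the process environment.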