Compare commits
12 Commits
55bbd897f4..main
| Author | SHA1 | Date | |
|---|---|---|---|
| | 28ec29f8af | | |
| | fb62d1b539 | | |
| | 26df98c08e | | |
| | 74e48fbcd5 | | |
| | d1b1238897 | | |
| | 615ab14a1a | | |
| | 508d036815 | | |
| | 292d0eb139 | | |
| | d71fce7a6a | | |
| | c6ad060e85 | | |
| | c5b6b196b5 | | |
| | 948f8500be | | |
@@ -2,32 +2,30 @@
|
|||||||
|
|
||||||
## Project Structure & Module Organization
|
## Project Structure & Module Organization
|
||||||
|
|
||||||
`chromy/` contains the Python package and CLI implementation. The entrypoint is `chromy/main.py`, which loads environment variables and invokes the Typer app defined in `chromy/cli.py`. Command-specific behavior belongs in `chromy/handlers/`. Shared Chroma, embedding, chunking, and formatting helpers live in package modules such as `chroma_functions.py`, `embed.py`, `chunk_functions.py`, and `utilities.py`.
|
`chromy/` contains the Python package and CLI implementation. The entrypoint is `chromy/main.py`, which loads environment variables and invokes the Typer app defined in `chromy/cli.py`. The active CLI commands are `list-collections`, `create-collection`, `delete-collection`, `count`, `import`, `query`, and `delete`. Command-specific behavior belongs in `chromy/handlers/`. Shared Chroma, embedding, chunking, querying, and output helpers live in package modules such as `chroma_functions.py`, `embed.py`, `chunk_functions.py`, and `utilities.py`.
|
||||||
|
|
||||||
`tests/` contains the test suite for the CLI, handlers, and embedding helpers. Generated data and build outputs such as `chroma/`, `dist/`, `chromy.egg-info/`, `.pytest_cache/`, `.mypy_cache/`, `.ruff_cache/`, and `.venv/` are not source.
|
`tests/` contains the test suite for the CLI, handlers, and embedding helpers. `README.md` documents user-facing behavior, `pyproject.toml` defines packaging and tool configuration, and `romeo_and_juliet.txt` is a checked-in sample input used by tests and manual CLI runs. Treat generated or local-state directories such as `chroma/`, `dist/`, `chromy.egg-info/`, `.pytest_cache/`, `.mypy_cache/`, `.ruff_cache/`, `.venv/`, `__pycache__/`, and `main.onefile-build/` as non-source. The top-level `handlers/` directory currently contains only legacy bytecode artifacts and should not be treated as source.
|
||||||
|
|
||||||
## Build, Test, and Development Commands
|
## Build, Test, and Development Commands
|
||||||
|
|
||||||
- `uv sync`: install runtime and development dependencies from `pyproject.toml` and `uv.lock`.
|
- `uv sync`: install runtime and development dependencies from `pyproject.toml` and `uv.lock`.
|
||||||
- `uv run python -m chromy.main --help`: run the CLI from the source tree.
|
- `uv run python -m chromy.main --help`: run the CLI from the source tree.
|
||||||
|
- `uv run chromy --help`: run the packaged console script inside the project environment.
|
||||||
- `uv run pytest -q`: run the test suite.
|
- `uv run pytest -q`: run the test suite.
|
||||||
|
- `uv run ruff check .`: run lint checks.
|
||||||
|
- `uv run ruff format --check .`: verify formatting.
|
||||||
|
- `uv run mypy .`: run static type checks.
|
||||||
- `uv build`: build the source distribution and wheel into `dist/`.
|
- `uv build`: build the source distribution and wheel into `dist/`.
|
||||||
- `uv tool install --editable .`: install the `chromy` command in editable mode for local CLI testing.
|
- `uv tool install --editable .`: install the `chromy` command in editable mode for local CLI testing.
|
||||||
|
|
||||||
## Coding Style & Naming Conventions
|
## Coding Style & Naming Conventions
|
||||||
|
|
||||||
Use Python 3.12+ syntax, type hints, and `from __future__ import annotations`. Follow the current style: 4-space indentation, snake_case functions and modules, PascalCase classes, and Typer command functions in `chromy/cli.py` that delegate to small handler functions. Keep handlers focused on CLI orchestration and user-facing output; place reusable database, chunking, embedding, and formatting logic in shared modules.
|
Use Python 3.12+ syntax, type hints, and `from __future__ import annotations`. Follow the current style: 4-space indentation, snake_case functions and modules, PascalCase classes, and Typer command functions in `chromy/cli.py` that delegate to small handler functions. Keep handlers focused on CLI orchestration and user-facing output; place reusable database, chunking, embedding, query, and formatting logic in shared modules. Prefer `rich` output for user-facing CLI messages to stay consistent with the existing commands.
|
||||||
|
|
||||||
## Testing Guidelines
|
## Testing Guidelines
|
||||||
|
|
||||||
Tests run with pytest and are currently written in `unittest.TestCase` style. Name test files `test_*.py` and test methods `test_*`. Prefer mocking Chroma-facing and filesystem-facing functions in CLI and handler tests so unit tests stay deterministic. Run `uv run pytest -q` before submitting changes, and add tests for new commands, Typer wiring, handlers, and error paths.
|
Tests run with pytest and are currently written in `unittest.TestCase` style. Name test files `test_*.py` and test methods `test_*`. Prefer mocking Chroma-facing and filesystem-facing functions in CLI and handler tests so unit tests stay deterministic. Run `uv run pytest -q` before submitting changes, and use `uv run ruff check .` plus `uv run mypy .` when touching typed code or shared modules. Add tests for new commands, Typer wiring, handlers, and error paths.
|
||||||
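The mocking advice above can be sketched as a small `unittest.TestCase`. This is only an illustration: `handle_count` and the stand-in `count_collection` below are hypothetical and do not reproduce the real handler. The sketch swaps the Chroma-facing function for a `Mock` so the test never touches a database.

```python
import unittest
from unittest import mock


def count_collection(collection_name: str) -> int:
    """Stand-in for a Chroma-facing helper; the real one would open the database."""
    raise RuntimeError("unit tests should never reach the real database")


def handle_count(collection: str) -> str:
    """Hypothetical handler that formats whatever the Chroma layer reports."""
    count = count_collection(collection)
    return f"The '{collection}' collection contains {count} records."


class HandleCountTests(unittest.TestCase):
    def test_reports_count_from_chroma_layer(self) -> None:
        fake = mock.Mock(return_value=3)
        # Replace the module-level name only for the duration of this block.
        with mock.patch.dict(globals(), {"count_collection": fake}):
            message = handle_count("notes")

        fake.assert_called_once_with("notes")
        self.assertEqual(message, "The 'notes' collection contains 3 records.")
```

In the package itself the patch target would more likely be a dotted path passed to `mock.patch`, for example the name imported into the handler module; that path is an assumption and should be checked against the actual imports.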
|
|
||||||
## Commit & Pull Request Guidelines
|
|
||||||
|
|
||||||
Git history uses short, imperative, lowercase commit subjects, for example `move top-level modules into a real package` and `add typer-based delete command`. Keep commits scoped to one logical change.
|
|
||||||
|
|
||||||
Pull requests should include a concise description, test results, and notes for any CLI behavior changes. Link related issues or plan files when applicable. Include terminal output examples or screenshots only when user-facing command output changes.
|
|
||||||
|
|
||||||
## Security & Configuration Tips
|
## Security & Configuration Tips
|
||||||
|
|
||||||
The CLI loads environment variables via `python-dotenv`; keep secrets in local `.env` files and do not commit them. Treat `chroma/` as local persistent database state. Avoid committing generated build artifacts, cache directories, or large ad hoc input files unless they are intentional fixtures.
|
The CLI loads environment variables via `python-dotenv`; keep secrets in local `.env` files and do not commit them. Treat `chroma/` as local persistent database state created by `chromadb.PersistentClient()`. Avoid committing generated build artifacts, cache directories, onefile build outputs, or large ad hoc input files unless they are intentional fixtures. If you change command names or examples, update both `README.md` and the tests so the documented CLI stays aligned with the implementation.
|
||||||
|
|||||||
@@ -1,6 +1,10 @@
|
|||||||
# Chromy
|
# Chromy
|
||||||
|
|
||||||
A small command-line utility for working with a local Chroma database. It lets you create collections, ingest file contents as chunked embeddings, and run similarity queries against stored documents.
|
<div align="center">
|
||||||
|
<img src="logo.png" width=300 />
|
||||||
|
</div>
|
||||||
|
|
||||||
|
Chromy is a small, simple-to-use command-line utility for working with a local Chroma database. It lets you create collections, ingest files as chunked embeddings, and run similarity queries against stored documents. It can also be driven by agentic coding tools through a skill file (see the [example](./skills/chromy/SKILL.md) in the `skills` directory).
|
||||||
|
|
||||||
## What it does
|
## What it does
|
||||||
|
|
||||||
@@ -120,7 +124,7 @@ list-collections
|
|||||||
create-collection <collection>
|
create-collection <collection>
|
||||||
delete-collection <collection>
|
delete-collection <collection>
|
||||||
count <collection>
|
count <collection>
|
||||||
add-data <collection> <file>
|
import <collection> <file> [<file> ...]
|
||||||
query <collection> <query_text>
|
query <collection> <query_text>
|
||||||
delete <collection> --where <condition>=<value>
|
delete <collection> --where <condition>=<value>
|
||||||
```
|
```
|
||||||
@@ -133,10 +137,12 @@ Create a collection:
|
|||||||
chromy create-collection notes
|
chromy create-collection notes
|
||||||
```
|
```
|
||||||
|
|
||||||
Add a file:
|
Add one or more files:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
chromy add-data notes ./docs/example.txt
|
chromy import notes ./docs/example.txt
|
||||||
|
chromy import notes ./docs/intro.md ./docs/setup.md
|
||||||
|
chromy import notes *.md
|
||||||
```
|
```
|
||||||
|
|
||||||
Count stored records:
|
Count stored records:
|
||||||
@@ -171,7 +177,7 @@ chromy delete notes --where file_name=example.txt
|
|||||||
|
|
||||||
## How ingestion works
|
## How ingestion works
|
||||||
|
|
||||||
When you run `add-data`, the file is:
|
When you run `import`, each file is:
|
||||||
|
|
||||||
1. read from disk
|
1. read from disk
|
||||||
2. split into chunks
|
2. split into chunks
|
||||||
@@ -182,6 +188,11 @@ Query results include the stored document chunk, its id, distance, and file name
|
|||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- collections are stored in a local persistent Chroma database
|
- collections are stored in a local persistent Chroma database in the current directory
|
||||||
- `add-data` requires the target collection to already exist
|
- `import` requires the target collection to already exist
|
||||||
- the CLI prints friendly messages for common errors such as missing collections or missing files
|
- `import` accepts one or more file paths
|
||||||
|
- unquoted glob patterns such as `*.md` are expanded by the shell before `chromy` starts
|
||||||
|
- quoted glob patterns such as `"*.md"` are treated as literal paths and are not expanded by `chromy`
|
||||||
|
- unmatched unquoted globs may behave differently by shell: `zsh` commonly fails before `chromy` starts, while `bash` may pass the literal pattern through depending on shell settings
|
||||||
|
- the CLI reports file-specific import failures and continues with the remaining files
|
||||||
|
- when importing multiple files in an interactive terminal, the CLI shows a Rich progress bar
|
||||||
|
|||||||
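The "report the failure and continue" behavior described in the notes can be sketched roughly as follows. The names `import_all`, `exit_code`, and `fake_import` are hypothetical and exist only for this illustration; the real handler also handles other error types and progress output.

```python
from collections.abc import Callable, Iterable


def import_all(
    files: Iterable[str], import_one: Callable[[str], int]
) -> tuple[int, int]:
    """Import every file, counting successes and failures instead of aborting."""
    succeeded = 0
    failed = 0
    for file in files:
        try:
            import_one(file)
        except FileNotFoundError:
            # Report the problem for this file and move on to the next one.
            failed += 1
            print(f"Error: The file '{file}' was not found.")
        else:
            succeeded += 1
    return succeeded, failed


def exit_code(failed: int) -> int:
    """Map the failure count to a process exit code: 0 on full success, else 1."""
    return 1 if failed else 0


def fake_import(file: str) -> int:
    """Hypothetical importer that pretends only 'a.md' exists."""
    if file != "a.md":
        raise FileNotFoundError(file)
    return 4
```

Calling `import_all(["a.md", "missing.md"], fake_import)` processes both entries, and a non-zero failure count maps to a non-zero exit code, so scripts can still detect partial failures.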
@@ -9,7 +9,7 @@ from chromadb.api import ClientAPI
|
|||||||
from chromadb.api.types import QueryResult, Where
|
from chromadb.api.types import QueryResult, Where
|
||||||
from chromadb.errors import NotFoundError
|
from chromadb.errors import NotFoundError
|
||||||
|
|
||||||
from chromy.embed import EmbeddingRecord
|
from chromy.embedding import EmbeddingRecord
|
||||||
|
|
||||||
|
|
||||||
def _get_client_and_collection(
|
def _get_client_and_collection(
|
||||||
@@ -54,11 +54,17 @@ def delete_data(collection_name: str, where: dict[str, str]) -> int:
|
|||||||
return int(result.get("deleted", 0))
|
return int(result.get("deleted", 0))
|
||||||
|
|
||||||
|
|
||||||
def count_collection(collection_name: str) -> str:
|
def has_data_for_file(collection_name: str, file_name: str) -> bool:
|
||||||
_, collection = _get_client_and_collection(collection_name)
|
_, collection = _get_client_and_collection(collection_name)
|
||||||
count = collection.count()
|
result = collection.get(where=cast(Where, {"file_name": file_name}))
|
||||||
|
ids = result.get("ids", [])
|
||||||
|
|
||||||
return f"The '{collection_name}' collection contains [bold green]{count}[/] records."
|
return len(ids) > 0
|
||||||
|
|
||||||
|
|
||||||
|
def count_collection(collection_name: str) -> int:
|
||||||
|
_, collection = _get_client_and_collection(collection_name)
|
||||||
|
return collection.count()
|
||||||
|
|
||||||
|
|
||||||
def add_data(
|
def add_data(
|
||||||
@@ -71,8 +77,7 @@ def add_data(
|
|||||||
|
|
||||||
_, collection = _get_client_and_collection(collection_name)
|
_, collection = _get_client_and_collection(collection_name)
|
||||||
|
|
||||||
embeddings: list[Sequence[float]] = [record["embedding"]
|
embeddings: list[Sequence[float]] = [record["embedding"] for record in data]
|
||||||
for record in data]
|
|
||||||
|
|
||||||
collection.add(
|
collection.add(
|
||||||
ids=[str(uuid4()) for _ in data],
|
ids=[str(uuid4()) for _ in data],
|
||||||
|
|||||||
@@ -0,0 +1,5 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from chromy.chunking.service import chunk_file, chunk_text
|
||||||
|
|
||||||
|
__all__ = ["chunk_file", "chunk_text"]
|
||||||
@@ -3,7 +3,7 @@ from __future__ import annotations
|
|||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from typing import cast
|
from typing import cast
|
||||||
|
|
||||||
import semchunk
|
from semchunk import semchunk
|
||||||
|
|
||||||
|
|
||||||
def chunk_text(text: str, chunk_size: int = 800) -> list[str]:
|
def chunk_text(text: str, chunk_size: int = 800) -> list[str]:
|
||||||
+11
-9
@@ -3,16 +3,16 @@ from __future__ import annotations
|
|||||||
from typing import Annotated, Callable
|
from typing import Annotated, Callable
|
||||||
|
|
||||||
import typer
|
import typer
|
||||||
from rich import print
|
|
||||||
from chromadb.errors import InternalError, NotFoundError
|
from chromadb.errors import InternalError, NotFoundError
|
||||||
|
from rich import print
|
||||||
|
|
||||||
from chromy.handlers.import_data import handle_import
|
|
||||||
from chromy.handlers.count_collection import handle_count_collection
|
from chromy.handlers.count_collection import handle_count_collection
|
||||||
from chromy.handlers.create_collection import handle_create_collection
|
from chromy.handlers.create_collection import handle_create_collection
|
||||||
from chromy.handlers.delete_collection import (
|
from chromy.handlers.delete_collection import (
|
||||||
handle_delete_collection,
|
handle_delete_collection,
|
||||||
handle_delete_records,
|
handle_delete_records,
|
||||||
)
|
)
|
||||||
|
from chromy.handlers.import_data import handle_import
|
||||||
from chromy.handlers.list_collections import handle_list_collections
|
from chromy.handlers.list_collections import handle_list_collections
|
||||||
from chromy.handlers.query import handle_query
|
from chromy.handlers.query import handle_query
|
||||||
|
|
||||||
@@ -105,25 +105,27 @@ def count(
|
|||||||
# ------------------------------------------------------------------------------
|
# ------------------------------------------------------------------------------
|
||||||
@app.command(
|
@app.command(
|
||||||
"import",
|
"import",
|
||||||
help="Chunk, embed, and add a file to a collection in the local Chroma database.",
|
help=(
|
||||||
|
"Chunk, embed, and add one or more files to a collection in the "
|
||||||
|
"local Chroma database."
|
||||||
|
),
|
||||||
)
|
)
|
||||||
def import_data(
|
def import_data(
|
||||||
collection: Annotated[
|
collection: Annotated[
|
||||||
str,
|
str,
|
||||||
typer.Argument(help="Name of the target collection."),
|
typer.Argument(help="Name of the target collection."),
|
||||||
],
|
],
|
||||||
file: Annotated[
|
files: Annotated[
|
||||||
str,
|
list[str],
|
||||||
typer.Argument(
|
typer.Argument(
|
||||||
help="Path to the file to chunk and add to the collection."),
|
help="Path(s) to the file(s) to chunk and add to the collection."
|
||||||
|
),
|
||||||
],
|
],
|
||||||
) -> None:
|
) -> None:
|
||||||
try:
|
try:
|
||||||
_run(lambda: handle_import(collection, file))
|
_run(lambda: handle_import(collection, files))
|
||||||
except NotFoundError:
|
except NotFoundError:
|
||||||
_fail(f"Collection '{collection}' does not exist.")
|
_fail(f"Collection '{collection}' does not exist.")
|
||||||
except FileNotFoundError:
|
|
||||||
_fail(f"The file {file} was not found.")
|
|
||||||
|
|
||||||
|
|
||||||
# ------------------------------------------------------------------------------
|
# ------------------------------------------------------------------------------
|
||||||
|
|||||||
@@ -0,0 +1,5 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from chromy.embedding.service import EmbeddingRecord, embed
|
||||||
|
|
||||||
|
__all__ = ["EmbeddingRecord", "embed"]
|
||||||
@@ -0,0 +1,5 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
|
||||||
|
class UnsupportedTextFileError(Exception):
|
||||||
|
"""Raised when a file does not appear to contain supported text content."""
|
||||||
@@ -1,9 +1,11 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
from rich import print
|
from rich import print
|
||||||
|
|
||||||
from chromy.chroma_functions import count_collection
|
from chromy.chroma_functions import count_collection
|
||||||
|
from chromy.output import format_count_message
|
||||||
|
|
||||||
|
|
||||||
def handle_count_collection(collection: str) -> int:
|
def handle_count_collection(collection: str) -> int:
|
||||||
print(count_collection(collection))
|
print(format_count_message(collection, count_collection(collection)))
|
||||||
return 0
|
return 0
|
||||||
|
|||||||
@@ -1,6 +1,7 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
from rich import print
|
from rich import print
|
||||||
|
|
||||||
from chromy.chroma_functions import create_collection
|
from chromy.chroma_functions import create_collection
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -1,6 +1,7 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
from rich import print
|
from rich import print
|
||||||
|
|
||||||
from chromy.chroma_functions import delete_collection, delete_data
|
from chromy.chroma_functions import delete_collection, delete_data
|
||||||
|
|
||||||
|
|
||||||
@@ -8,15 +9,13 @@ def _parse_where_clause(where_clause: str) -> dict[str, str]:
|
|||||||
condition, separator, value = where_clause.partition("=")
|
condition, separator, value = where_clause.partition("=")
|
||||||
|
|
||||||
if separator == "":
|
if separator == "":
|
||||||
raise ValueError(
|
raise ValueError("Invalid --where value. Expected <condition>=<value>.")
|
||||||
"Invalid --where value. Expected <condition>=<value>.")
|
|
||||||
|
|
||||||
condition = condition.strip()
|
condition = condition.strip()
|
||||||
value = value.strip()
|
value = value.strip()
|
||||||
|
|
||||||
if not condition or not value:
|
if not condition or not value:
|
||||||
raise ValueError(
|
raise ValueError("Invalid --where value. Expected <condition>=<value>.")
|
||||||
"Invalid --where value. Expected <condition>=<value>.")
|
|
||||||
|
|
||||||
return {condition: value}
|
return {condition: value}
|
||||||
|
|
||||||
|
|||||||
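As a quick illustration of the `--where` parsing shown in this hunk, here is a condensed, standalone sketch of the same idea. The public name `parse_where_clause` is an assumption for the example (the source function is private), and the two validation checks are merged into one.

```python
def parse_where_clause(where_clause: str) -> dict[str, str]:
    """Split '<condition>=<value>' into a single-key filter dict."""
    # partition() splits on the first "=" only, so values may contain "=".
    condition, separator, value = where_clause.partition("=")
    condition, value = condition.strip(), value.strip()
    if not separator or not condition or not value:
        raise ValueError("Invalid --where value. Expected <condition>=<value>.")
    return {condition: value}


print(parse_where_clause("file_name=example.txt"))  # {'file_name': 'example.txt'}
print(parse_where_clause("title = a=b"))  # {'title': 'a=b'}
```

Inputs such as `file_name=` or `example.txt` raise `ValueError`, which the CLI layer can turn into a friendly message.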
@@ -1,10 +1,27 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
import os
|
import os
|
||||||
|
import sys
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
from typing import Final
|
||||||
|
|
||||||
from rich import print
|
from rich import print
|
||||||
|
from rich.progress import (
|
||||||
|
BarColumn,
|
||||||
|
MofNCompleteColumn,
|
||||||
|
Progress,
|
||||||
|
SpinnerColumn,
|
||||||
|
TextColumn,
|
||||||
|
)
|
||||||
|
|
||||||
|
from chromy.errors import UnsupportedTextFileError
|
||||||
from chromy.utilities import ingest_file
|
from chromy.utilities import ingest_file
|
||||||
|
|
||||||
|
from ..utilities import is_probably_text_file
|
||||||
|
|
||||||
|
SUCCESS_EXIT_CODE: Final = 0
|
||||||
|
FAILURE_EXIT_CODE: Final = 1
|
||||||
|
|
||||||
|
|
||||||
def _get_absolute_path(file: str) -> str:
|
def _get_absolute_path(file: str) -> str:
|
||||||
"""
|
"""
|
||||||
@@ -21,11 +38,94 @@ def _get_absolute_path(file: str) -> str:
|
|||||||
raise FileNotFoundError()
|
raise FileNotFoundError()
|
||||||
|
|
||||||
file_path = Path(file)
|
file_path = Path(file)
|
||||||
return str(file_path.resolve(file_path))
|
return str(file_path.resolve())
|
||||||
|
|
||||||
|
|
||||||
def handle_import(collection: str, file: str) -> int:
|
def _import_one(collection: str, file: str) -> int:
|
||||||
records_added = ingest_file(collection, _get_absolute_path(file))
|
absolute_path = _get_absolute_path(file)
|
||||||
|
|
||||||
|
if not Path(absolute_path).is_file():
|
||||||
|
raise FileNotFoundError()
|
||||||
|
|
||||||
|
if not is_probably_text_file(absolute_path):
|
||||||
|
raise UnsupportedTextFileError()
|
||||||
|
|
||||||
|
return ingest_file(collection, absolute_path)
|
||||||
|
|
||||||
|
|
||||||
|
def _should_show_progress(file_count: int) -> bool:
|
||||||
|
return file_count > 1 and sys.stdout.isatty()
|
||||||
|
|
||||||
|
|
||||||
|
def _truncate_file_name(file_name: str, max_length: int = 20) -> str:
|
||||||
|
if len(file_name) <= max_length:
|
||||||
|
return file_name
|
||||||
|
|
||||||
|
return f"{file_name[: max_length - 3]}..."
|
||||||
|
|
||||||
|
|
||||||
|
def handle_import(collection: str, files: list[str]) -> int:
|
||||||
|
successful_imports = 0
|
||||||
|
failed_imports = 0
|
||||||
|
seen_paths: set[str] = set()
|
||||||
|
unique_files: list[str] = []
|
||||||
|
|
||||||
|
for file in files:
|
||||||
|
try:
|
||||||
|
absolute_path = _get_absolute_path(file)
|
||||||
|
except FileNotFoundError:
|
||||||
|
unique_files.append(file)
|
||||||
|
continue
|
||||||
|
|
||||||
|
if absolute_path in seen_paths:
|
||||||
|
continue
|
||||||
|
|
||||||
|
seen_paths.add(absolute_path)
|
||||||
|
unique_files.append(file)
|
||||||
|
|
||||||
|
show_progress = _should_show_progress(len(unique_files))
|
||||||
|
|
||||||
|
with Progress(
|
||||||
|
SpinnerColumn(),
|
||||||
|
TextColumn("[progress.description]{task.description}"),
|
||||||
|
BarColumn(),
|
||||||
|
MofNCompleteColumn(),
|
||||||
|
transient=True,
|
||||||
|
disable=not show_progress,
|
||||||
|
) as progress:
|
||||||
|
task_id = progress.add_task("Importing files...", total=len(unique_files))
|
||||||
|
|
||||||
|
for file in unique_files:
|
||||||
|
file_name = _truncate_file_name(Path(file).name)
|
||||||
|
description = f"Importing [bold]{file_name}[/]..."
|
||||||
|
progress.update(task_id, description=description)
|
||||||
|
try:
|
||||||
|
records_added = _import_one(collection, file)
|
||||||
|
successful_imports += 1
|
||||||
|
if not show_progress:
|
||||||
|
progress.console.print(
|
||||||
|
"[bold green]Added[/] "
|
||||||
|
f"{records_added} records from '{file}' to "
|
||||||
|
f"collection '{collection}'."
|
||||||
|
)
|
||||||
|
except FileNotFoundError:
|
||||||
|
failed_imports += 1
|
||||||
|
progress.console.print(
|
||||||
|
f"[bold red]Error[/]: The file '{file}' was not found."
|
||||||
|
)
|
||||||
|
except UnsupportedTextFileError:
|
||||||
|
failed_imports += 1
|
||||||
|
progress.console.print(
|
||||||
|
f"[bold red]Error[/]: The file '{file}' is not a text file."
|
||||||
|
)
|
||||||
|
finally:
|
||||||
|
progress.advance(task_id)
|
||||||
|
|
||||||
print(
|
print(
|
||||||
f"[bold green]Added[/] {records_added} records to collection '{collection}'.")
|
f"Imported {successful_imports} file(s) successfully; {failed_imports} failed."
|
||||||
return 0
|
)
|
||||||
|
|
||||||
|
if failed_imports:
|
||||||
|
return FAILURE_EXIT_CODE
|
||||||
|
|
||||||
|
return SUCCESS_EXIT_CODE
|
||||||
|
|||||||
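The de-duplication step in `handle_import` can be sketched in isolation. This is a simplified illustration, not the handler's exact logic: `unique_by_resolved_path` is a hypothetical name, and unlike the source it also de-duplicates paths that do not exist, because `Path.resolve()` does not raise for missing files by default.

```python
from pathlib import Path


def unique_by_resolved_path(files: list[str]) -> list[str]:
    """Keep the first spelling of each file, comparing fully resolved paths."""
    seen: set[Path] = set()
    unique: list[str] = []
    for file in files:
        resolved = Path(file).resolve()
        if resolved in seen:
            # Same file given twice, e.g. "notes.md" and "./notes.md".
            continue
        seen.add(resolved)
        unique.append(file)
    return unique


print(unique_by_resolved_path(["notes.md", "./notes.md", "other.md"]))
# ['notes.md', 'other.md']
```

Keeping the original spelling rather than the resolved path means later messages show the path the user actually typed.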
@@ -1,14 +1,16 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
from chromy.chroma_functions import list_collections
|
from chromy.chroma_functions import list_collections
|
||||||
from chromy.utilities import print_lines
|
from chromy.output import format_collection_names, print_lines
|
||||||
|
|
||||||
|
|
||||||
def handle_list_collections() -> int:
|
def handle_list_collections() -> int:
|
||||||
collections = list_collections()
|
collections = list_collections()
|
||||||
|
|
||||||
if not collections:
|
if not collections:
|
||||||
print("No collections found.")
|
print("No collections found.")
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
print_lines(collections)
|
print_lines(format_collection_names(collections))
|
||||||
|
|
||||||
return 0
|
return 0
|
||||||
|
|||||||
@@ -1,6 +1,7 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
from chromy.utilities import format_query_result, print_lines, run_query
|
from chromy.output import format_query_result, print_lines
|
||||||
|
from chromy.utilities import run_query
|
||||||
|
|
||||||
|
|
||||||
def handle_query(collection: str, query_text: str) -> int:
|
def handle_query(collection: str, query_text: str) -> int:
|
||||||
|
|||||||
@@ -0,0 +1,70 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from collections.abc import Mapping, Sequence
|
||||||
|
|
||||||
|
from chromadb import QueryResult
|
||||||
|
from rich.console import Console
|
||||||
|
from rich.rule import Rule
|
||||||
|
from rich.text import Text
|
||||||
|
|
||||||
|
CONSOLE = Console()
|
||||||
|
|
||||||
|
|
||||||
|
def print_lines(lines: Sequence[Rule | Text | str]) -> None:
|
||||||
|
for line in lines:
|
||||||
|
CONSOLE.print(line)
|
||||||
|
|
||||||
|
|
||||||
|
def format_collection_names(collections: Sequence[str]) -> list[Text]:
|
||||||
|
return [Text(f"· {collection}") for collection in collections]
|
||||||
|
|
||||||
|
|
||||||
|
def format_count_message(collection_name: str, count: int) -> str:
|
||||||
|
return (
|
||||||
|
f"The '{collection_name}' collection contains [bold green]{count}[/] records."
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def format_query_result(result: QueryResult) -> list[Rule | Text]:
|
||||||
|
ids = result.get("ids", [[]])
|
||||||
|
documents = result.get("documents", [[]])
|
||||||
|
distances = result.get("distances", [[]])
|
||||||
|
metadatas = result.get("metadatas", [[]])
|
||||||
|
|
||||||
|
first_ids = ids[0] if ids else []
|
||||||
|
first_documents = documents[0] if documents else []
|
||||||
|
first_distances = distances[0] if distances else []
|
||||||
|
first_metadatas = metadatas[0] if metadatas else []
|
||||||
|
|
||||||
|
if not first_ids:
|
||||||
|
return [Text.from_markup("[yellow]No results found.[/]")]
|
||||||
|
|
||||||
|
lines: list[Rule | Text] = [Rule(title="Query results")]
|
||||||
|
|
||||||
|
for index, document_id in enumerate(first_ids, start=1):
|
||||||
|
lines.append(
|
||||||
|
Text.from_markup(f"[bold]{index}[/].\t[green]id[/]\t\t{document_id}")
|
||||||
|
)
|
||||||
|
i = index - 1
|
||||||
|
|
||||||
|
if i < len(first_distances):
|
||||||
|
lines.append(
|
||||||
|
Text.from_markup(f"\t[green]distance[/]\t{first_distances[i]}")
|
||||||
|
)
|
||||||
|
|
||||||
|
if i < len(first_metadatas):
|
||||||
|
metadata = first_metadatas[i]
|
||||||
|
|
||||||
|
if isinstance(metadata, Mapping):
|
||||||
|
file_name = metadata.get("file_name")
|
||||||
|
|
||||||
|
if file_name:
|
||||||
|
lines.append(Text.from_markup(f"\t[green]file_name[/]\t{file_name}"))
|
||||||
|
|
||||||
|
if i < len(first_documents):
|
||||||
|
lines.append(Text.from_markup("\n[bold green]Retrieved contents[/]\n"))
|
||||||
|
lines.append(Text(first_documents[i]))
|
||||||
|
|
||||||
|
lines.append(Rule())
|
||||||
|
|
||||||
|
return lines
|
||||||
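`format_query_result` relies on the nested shape of a Chroma query result: each field holds one inner list per query text, and the formatter reads only the first. The sketch below shows that indexing with plain dicts and no Rich markup; `first_query_rows` is a hypothetical helper for illustration, not part of the package.

```python
from collections.abc import Mapping
from typing import Any


def first_query_rows(result: Mapping[str, Any]) -> list[dict[str, Any]]:
    """Flatten the hits for the first query text into one dict per hit."""

    def first(key: str) -> list[Any]:
        # Missing or None fields fall back to a single empty inner list.
        outer = result.get(key) or [[]]
        return list(outer[0])

    ids = first("ids")
    documents = first("documents")
    distances = first("distances")
    metadatas = first("metadatas")

    rows: list[dict[str, Any]] = []
    for i, document_id in enumerate(ids):
        metadata = metadatas[i] if i < len(metadatas) else None
        is_mapping = isinstance(metadata, Mapping)
        rows.append(
            {
                "id": document_id,
                "distance": distances[i] if i < len(distances) else None,
                "file_name": metadata.get("file_name") if is_mapping else None,
                "document": documents[i] if i < len(documents) else None,
            }
        )
    return rows
```

The bounds checks mirror the defensive `i < len(...)` tests in the source, so a result that omits distances or metadata still yields one row per id.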
+35
-54
@@ -1,26 +1,18 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
from rich.text import Text
|
from pathlib import Path
|
||||||
from rich.rule import Rule
|
|
||||||
from rich.console import Console
|
|
||||||
|
|
||||||
from collections.abc import Mapping, Sequence
|
|
||||||
|
|
||||||
from chromadb import QueryResult
|
from chromadb import QueryResult
|
||||||
|
|
||||||
from chromy.chroma_functions import add_data, query_data
|
from chromy.chroma_functions import add_data, delete_data, has_data_for_file, query_data
|
||||||
from chromy.chunk_functions import chunk_file
|
from chromy.chunking import chunk_file
|
||||||
from chromy.embed import embed
|
from chromy.embedding import embed
|
||||||
|
|
||||||
CONSOLE = Console()
|
|
||||||
|
|
||||||
|
|
||||||
def print_lines(lines: Sequence[str]) -> None:
|
|
||||||
for line in lines:
|
|
||||||
CONSOLE.print(line)
|
|
||||||
|
|
||||||
|
|
||||||
def ingest_file(collection_name: str, file_path: str) -> int:
|
def ingest_file(collection_name: str, file_path: str) -> int:
|
||||||
|
if has_data_for_file(collection_name, file_path):
|
||||||
|
delete_data(collection_name, {"file_name": file_path})
|
||||||
|
|
||||||
chunks = chunk_file(file_path)
|
chunks = chunk_file(file_path)
|
||||||
embeddings = embed(chunks)
|
embeddings = embed(chunks)
|
||||||
add_data(collection_name, embeddings, file_path)
|
add_data(collection_name, embeddings, file_path)
|
||||||
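The new guard in `ingest_file` makes a re-import replace a file's earlier records instead of duplicating them. A toy in-memory version of that idea is sketched below. `InMemoryCollection` and `ingest` are hypothetical stand-ins; the real code goes through Chroma and computes embeddings.

```python
from uuid import uuid4


class InMemoryCollection:
    """Toy stand-in for a collection: maps record id -> (document, metadata)."""

    def __init__(self) -> None:
        self.records: dict[str, tuple[str, dict[str, str]]] = {}

    def has_file(self, file_name: str) -> bool:
        return any(
            meta["file_name"] == file_name for _, meta in self.records.values()
        )

    def delete_file(self, file_name: str) -> None:
        self.records = {
            record_id: (doc, meta)
            for record_id, (doc, meta) in self.records.items()
            if meta["file_name"] != file_name
        }

    def add(self, chunks: list[str], file_name: str) -> None:
        for chunk in chunks:
            self.records[str(uuid4())] = (chunk, {"file_name": file_name})


def ingest(collection: InMemoryCollection, file_name: str, chunks: list[str]) -> int:
    """Replace any records previously stored for file_name, then add the chunks."""
    if collection.has_file(file_name):
        collection.delete_file(file_name)
    collection.add(chunks, file_name)
    return len(chunks)
```

Ingesting the same file name twice leaves only the chunks from the second call, while records from other files are untouched.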
@@ -31,50 +23,39 @@ def run_query(collection_name: str, query_text: str) -> QueryResult:
|
|||||||
return query_data(collection_name, [query_text])
|
return query_data(collection_name, [query_text])
|
||||||
|
|
||||||
|
|
||||||
def format_query_result(result: QueryResult) -> list[str]:
|
def is_probably_text_file(path: str | Path, sample_size: int = 8192) -> bool:
|
||||||
ids = result.get("ids", [[]])
|
"""
|
||||||
documents = result.get("documents", [[]])
|
Return whether a file appears to contain text.
|
||||||
distances = result.get("distances", [[]])
|
|
||||||
metadatas = result.get("metadatas", [[]])
|
|
||||||
|
|
||||||
first_ids = ids[0] if ids else []
|
Args:
|
||||||
first_documents = documents[0] if documents else []
|
path (str | Path): The path to the file to inspect.
|
||||||
first_distances = distances[0] if distances else []
|
sample_size (int): The maximum number of bytes to read from the file.
|
||||||
first_metadatas = metadatas[0] if metadatas else []
|
|
||||||
|
|
||||||
if not first_ids:
|
Returns:
|
||||||
return ["No results found."]
|
bool: ``True`` if the sampled bytes decode as UTF-8, UTF-8 with BOM,
|
||||||
|
UTF-16, or UTF-32, or if the file is empty. Otherwise, ``False``.
|
||||||
|
"""
|
||||||
|
|
||||||
lines = [Rule(title="Query results")]
|
path = Path(path)
|
||||||
|
|
||||||
for index, document_id in enumerate(first_ids, start=1):
|
with path.open("rb") as f:
|
||||||
# lines.append(f"{index}.\tid: {document_id}")
|
sample = f.read(sample_size)
|
||||||
lines.append(
|
|
||||||
Text.from_markup(f"[bold]{index}[/].\t[green]id[/]\t\t{document_id}")
|
|
||||||
)
|
|
||||||
i = index - 1
|
|
||||||
|
|
||||||
if i < len(first_distances):
|
if not sample:
|
||||||
lines.append(
|
return True
|
||||||
Text.from_markup(f"\t[green]distance[/]\t{first_distances[i]}")
|
|
||||||
)
|
|
||||||
|
|
||||||
if i < len(first_metadatas):
|
encodings = (
|
||||||
metadata = first_metadatas[i]
|
"utf-8",
|
||||||
|
"utf-8-sig",
|
||||||
|
"utf-16",
|
||||||
|
"utf-32",
|
||||||
|
)
|
||||||
|
|
||||||
if isinstance(metadata, Mapping):
|
for encoding in encodings:
|
||||||
file_name = metadata.get("file_name")
|
try:
|
||||||
|
sample.decode(encoding)
|
||||||
|
return True
|
||||||
|
except UnicodeDecodeError:
|
||||||
|
pass
|
||||||
|
|
||||||
if file_name:
|
return False
|
||||||
lines.append(
|
|
||||||
Text.from_markup(f"\t[green]file_name[/]\t{file_name}")
|
|
||||||
)
|
|
||||||
|
|
||||||
if i < len(first_documents):
|
|
||||||
lines.append(Text.from_markup("\n[bold green]Retrieved contents[/]\n"))
|
|
||||||
lines.append(first_documents[i])
|
|
||||||
|
|
||||||
# Print a separator between documents
|
|
||||||
lines.append(Rule())
|
|
||||||
|
|
||||||
return lines
|
|
||||||
|
|||||||
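One detail worth noting about sampling a fixed number of bytes: a strict UTF-8 decode can fail if a multi-byte character straddles the end of the sample, even though the file itself is valid UTF-8 (the later encodings in the list may still accept such a sample). A possible alternative for the UTF-8 check, sketched here as a suggestion and not as what the diff does, is an incremental decoder. `sample_decodes_as_utf8` is a hypothetical name.

```python
import codecs


def sample_decodes_as_utf8(sample: bytes) -> bool:
    """Return True if sample is valid UTF-8, allowing a character cut off at the end."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False buffers an incomplete trailing sequence instead of raising.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


euro = "€".encode("utf-8")  # a three-byte sequence
truncated = b"price: " + euro[:2]  # sample ends in the middle of the character
print(sample_decodes_as_utf8(truncated))  # True
```

Bytes that can never start a UTF-8 sequence, such as `0xFF`, still raise inside the decoder, so clearly binary samples are rejected by this check.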
+2
-2
@@ -24,7 +24,7 @@ dependencies = [
|
|||||||
chromy = "chromy.main:main"
|
chromy = "chromy.main:main"
|
||||||
|
|
||||||
[tool.setuptools]
|
[tool.setuptools]
|
||||||
packages = ["chromy", "chromy.handlers"]
|
packages = ["chromy", "chromy.chunking", "chromy.embedding", "chromy.handlers"]
|
||||||
|
|
||||||
[dependency-groups]
|
[dependency-groups]
|
||||||
dev = [
|
dev = [
|
||||||
@@ -72,7 +72,7 @@ module = [
|
|||||||
ignore_missing_imports = true
|
ignore_missing_imports = true
|
||||||
|
|
||||||
[[tool.mypy.overrides]]
|
[[tool.mypy.overrides]]
|
||||||
module = "chromy.chunk_functions"
|
module = "chromy.chunking.service"
|
||||||
disable_error_code = [
|
disable_error_code = [
|
||||||
"attr-defined",
|
"attr-defined",
|
||||||
]
|
]
|
||||||
|
|||||||
@@ -0,0 +1,43 @@
|
|||||||
|
---
|
||||||
|
name: chromy
|
||||||
|
description: This skill provides access to a RAG-like context enhancer that uses ChromaDB locally.
|
||||||
|
---
|
||||||
|
|
||||||
|
# Chromy
|
||||||
|
|
||||||
|
Whenever the user asks to "use chromy", you should invoke `chromy`, a CLI tool that performs RAG search.
|
||||||
|
The tool should be available in the `$PATH` as `chromy`.
|
||||||
|
|
||||||
|
You have access to these commands:
|
||||||
|
|
||||||
|
- `$ chromy lc` -> Lists the existing collections.
|
||||||
|
- `$ chromy q <collection> <query>` -> Performs a query. Be sure to quote the `<query>` if it consists of multiple words.
|
||||||
|
|
||||||
|
Then use the response from Chromy to enhance the context and give the user a refined response.
|
||||||
|
|
||||||
|
## A note on file sources
|
||||||
|
|
||||||
|
The Chromy response returns the metadata for the chunks it finds. This metadata includes `file_name`, which refers to the original file that was chunked and imported. **DO NOT ATTEMPT** to find or fetch these files. They most likely do not exist in the filesystem. However, you **SHOULD ALWAYS** correctly cite the files (**ONLY** those named in Chromy's metadata) that the information comes from.
|
||||||
|
|
||||||
|
## Example use case
|
||||||
|
|
||||||
|
**START**
|
||||||
|
|
||||||
|
User query:
|
||||||
|
|
||||||
|
> Search Chromy for information about Lovecraft's Dunwich Horror.
|
||||||
|
|
||||||
|
Step 1: Get the available collections with `chromy lc`. The output is:
|
||||||
|
|
||||||
|
```
|
||||||
|
lovecraft
|
||||||
|
documents
|
||||||
|
```
|
||||||
|
|
||||||
|
Most likely our information is in the `lovecraft` collection. We will use that for the query.
|
||||||
|
|
||||||
|
Step 2: Query using `chromy q lovecraft <query>`. The query _is up to you_; create one taking into account that this is a raw query on a vector DB. Be concise, extract keywords, avoid noise.
|
||||||
|
|
||||||
|
Step 3: Get the results, enhance the context, and respond to the user.
|
||||||
|
|
||||||
|
**END**
|
||||||
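The quoting advice in the skill file above matters only when the command passes through a shell. As a hedged illustration, the sketch below builds the `chromy q` invocation as an argument list, which sidesteps manual quoting; `build_query_command` is a hypothetical helper and is not part of chromy.

```python
import shlex


def build_query_command(collection, query):
    """Build the argument list for `chromy q` (hypothetical helper)."""
    # Each argument is its own list element, so a multi-word query
    # needs no manual quoting when the list is passed to subprocess.run.
    return ["chromy", "q", collection, query]


command = build_query_command("lovecraft", "Dunwich horror")
# shlex.join renders the equivalent, correctly quoted shell command line.
shell_line = shlex.join(command)
print(shell_line)
```

Assuming `chromy` is on the `$PATH`, the list could then be handed to `subprocess.run(command, capture_output=True, text=True)` to collect the results.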
+106
-13
@@ -31,11 +31,11 @@ class CliTests(unittest.TestCase):
|
|||||||
with patch(
|
with patch(
|
||||||
"chromy.handlers.list_collections.list_collections",
|
"chromy.handlers.list_collections.list_collections",
|
||||||
return_value=["books", "code"],
|
return_value=["books", "code"],
|
||||||
):
|
):
|
||||||
result = _invoke(["list-collections"])
|
result = _invoke(["list-collections"])
|
||||||
|
|
||||||
self.assertEqual(result.exit_code, 0)
|
self.assertEqual(result.exit_code, 0)
|
||||||
self.assertEqual(result.stdout, "books\ncode\n")
|
self.assertEqual(result.stdout, "· books\n· code\n")
|
||||||
|
|
||||||
def test_create_collection(self) -> None:
|
def test_create_collection(self) -> None:
|
||||||
with patch(
|
with patch(
|
||||||
@@ -51,15 +51,13 @@ class CliTests(unittest.TestCase):
|
|||||||
def test_create_collection_with_same_name(self) -> None:
|
def test_create_collection_with_same_name(self) -> None:
|
||||||
with patch(
|
with patch(
|
||||||
"chromy.handlers.create_collection.create_collection",
|
"chromy.handlers.create_collection.create_collection",
|
||||||
side_effect=InternalError()
|
side_effect=InternalError(),
|
||||||
|
|
||||||
) as create_collection:
|
) as create_collection:
|
||||||
result = _invoke(["create-collection", "notes"])
|
result = _invoke(["create-collection", "notes"])
|
||||||
|
|
||||||
create_collection.assert_called_once_with("notes")
|
create_collection.assert_called_once_with("notes")
|
||||||
self.assertEqual(result.exit_code, 1)
|
self.assertEqual(result.exit_code, 1)
|
||||||
self.assertEqual(
|
self.assertEqual(result.stdout, "Error: Collection 'notes' already exists.\n")
|
||||||
result.stdout, "Error: Collection 'notes' already exists.\n")
|
|
||||||
|
|
||||||
def test_delete_collection(self) -> None:
|
def test_delete_collection(self) -> None:
|
||||||
with patch(
|
with patch(
|
||||||
@@ -74,14 +72,13 @@ class CliTests(unittest.TestCase):
|
|||||||
def test_delete_non_existent_collection(self) -> None:
|
def test_delete_non_existent_collection(self) -> None:
|
||||||
with patch(
|
with patch(
|
||||||
"chromy.handlers.delete_collection.delete_collection",
|
"chromy.handlers.delete_collection.delete_collection",
|
||||||
side_effect=NotFoundError()
|
side_effect=NotFoundError(),
|
||||||
) as delete_collection:
|
) as delete_collection:
|
||||||
result = _invoke(["delete-collection", "notes"])
|
result = _invoke(["delete-collection", "notes"])
|
||||||
|
|
||||||
delete_collection.assert_called_once_with("notes")
|
delete_collection.assert_called_once_with("notes")
|
||||||
self.assertEqual(result.exit_code, 1)
|
self.assertEqual(result.exit_code, 1)
|
||||||
self.assertEqual(
|
self.assertEqual(result.stdout, "Error: Collection 'notes' does not exist.\n")
|
||||||
result.stdout, "Error: Collection 'notes' does not exist.\n")
|
|
||||||
|
|
||||||
def test_count(self) -> None:
|
def test_count(self) -> None:
|
||||||
with patch(
|
with patch(
|
||||||
@@ -92,7 +89,10 @@ class CliTests(unittest.TestCase):
|
|||||||
|
|
||||||
count_collection.assert_called_once_with("notes")
|
count_collection.assert_called_once_with("notes")
|
||||||
self.assertEqual(result.exit_code, 0)
|
self.assertEqual(result.exit_code, 0)
|
||||||
self.assertEqual(result.stdout, "7\n")
|
self.assertEqual(
|
||||||
|
result.stdout,
|
||||||
|
"The 'notes' collection contains 7 records.\n",
|
||||||
|
)
|
||||||
|
|
||||||
def test_import_data(self) -> None:
|
def test_import_data(self) -> None:
|
||||||
with patch(
|
with patch(
|
||||||
@@ -107,7 +107,101 @@ class CliTests(unittest.TestCase):
|
|||||||
)
|
)
|
||||||
self.assertEqual(result.exit_code, 0)
|
self.assertEqual(result.exit_code, 0)
|
||||||
self.assertEqual(
|
self.assertEqual(
|
||||||
result.stdout, "Added 3 records to collection 'notes'.\n")
|
result.stdout,
|
||||||
|
"Added 3 records from 'romeo_and_juliet.txt' to collection 'notes'.\n"
|
||||||
|
"Imported 1 file(s) successfully; 0 failed.\n",
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_import_data_accepts_multiple_files(self) -> None:
|
||||||
|
with patch(
|
||||||
|
"chromy.handlers.import_data.ingest_file",
|
||||||
|
side_effect=[3, 2],
|
||||||
|
) as ingest_file:
|
||||||
|
result = _invoke(
|
||||||
|
["import", "notes", "romeo_and_juliet.txt", "README.md"],
|
||||||
|
)
|
||||||
|
|
||||||
|
self.assertEqual(ingest_file.call_count, 2)
|
||||||
|
ingest_file.assert_any_call(
|
||||||
|
"notes",
|
||||||
|
self._fixture_path("romeo_and_juliet.txt"),
|
||||||
|
)
|
||||||
|
ingest_file.assert_any_call(
|
||||||
|
"notes",
|
||||||
|
self._fixture_path("README.md"),
|
||||||
|
)
|
||||||
|
self.assertEqual(result.exit_code, 0)
|
||||||
|
self.assertEqual(
|
||||||
|
result.stdout,
|
||||||
|
"Added 3 records from 'romeo_and_juliet.txt' to collection 'notes'.\n"
|
||||||
|
"Added 2 records from 'README.md' to collection 'notes'.\n"
|
||||||
|
"Imported 2 file(s) successfully; 0 failed.\n",
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_import_data_continues_after_missing_file(self) -> None:
|
||||||
|
with patch(
|
||||||
|
"chromy.handlers.import_data.ingest_file",
|
||||||
|
return_value=3,
|
||||||
|
) as ingest_file:
|
||||||
|
result = _invoke(
|
||||||
|
["import", "notes", "missing.txt", "romeo_and_juliet.txt"],
|
||||||
|
)
|
||||||
|
|
||||||
|
ingest_file.assert_called_once_with(
|
||||||
|
"notes",
|
||||||
|
self._fixture_path("romeo_and_juliet.txt"),
|
||||||
|
)
|
||||||
|
self.assertEqual(result.exit_code, 1)
|
||||||
|
self.assertEqual(
|
||||||
|
result.stdout,
|
||||||
|
"Error: The file 'missing.txt' was not found.\n"
|
||||||
|
"Added 3 records from 'romeo_and_juliet.txt' to collection 'notes'.\n"
|
||||||
|
"Imported 1 file(s) successfully; 1 failed.\n",
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_import_data_rejects_non_text_files(self) -> None:
|
||||||
|
with patch(
|
||||||
|
"chromy.handlers.import_data.is_probably_text_file",
|
||||||
|
return_value=False,
|
||||||
|
):
|
||||||
|
result = _invoke(["import", "notes", "romeo_and_juliet.txt"])
|
||||||
|
|
||||||
|
self.assertEqual(result.exit_code, 1)
|
||||||
|
self.assertEqual(
|
||||||
|
result.stdout,
|
||||||
|
"Error: The file 'romeo_and_juliet.txt' is not a text file.\n"
|
||||||
|
"Imported 0 file(s) successfully; 1 failed.\n",
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_import_data_treats_literal_glob_as_missing_file(self) -> None:
|
||||||
|
result = _invoke(["import", "notes", "*.md"])
|
||||||
|
|
||||||
|
self.assertEqual(result.exit_code, 1)
|
||||||
|
self.assertEqual(
|
||||||
|
result.stdout,
|
||||||
|
"Error: The file '*.md' was not found.\n"
|
||||||
|
"Imported 0 file(s) successfully; 1 failed.\n",
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_import_data_deduplicates_paths_within_single_invocation(self) -> None:
|
||||||
|
with patch(
|
||||||
|
"chromy.handlers.import_data.ingest_file",
|
||||||
|
return_value=3,
|
||||||
|
) as ingest_file:
|
||||||
|
result = _invoke(
|
||||||
|
["import", "notes", "README.md", "./README.md"],
|
||||||
|
)
|
||||||
|
|
||||||
|
ingest_file.assert_called_once_with(
|
||||||
|
"notes",
|
||||||
|
self._fixture_path("README.md"),
|
||||||
|
)
|
||||||
|
self.assertEqual(result.exit_code, 0)
|
||||||
|
self.assertEqual(
|
||||||
|
result.stdout,
|
||||||
|
"Added 3 records from 'README.md' to collection 'notes'.\n"
|
||||||
|
"Imported 1 file(s) successfully; 0 failed.\n",
|
||||||
|
)
|
||||||
|
|
||||||
def test_query(self) -> None:
|
def test_query(self) -> None:
|
||||||
query_result = {"ids": [["1"]], "documents": [["hello"]]}
|
query_result = {"ids": [["1"]], "documents": [["hello"]]}
|
||||||
@@ -139,8 +233,7 @@ class CliTests(unittest.TestCase):
|
|||||||
self.assertEqual(result.exit_code, 0)
|
self.assertEqual(result.exit_code, 0)
|
||||||
self.assertEqual(
|
self.assertEqual(
|
||||||
result.stdout,
|
result.stdout,
|
||||||
"Deleted 2 record(s) from collection 'notes' "
|
"Deleted 2 record(s) from collection 'notes' where file_name=play.txt.\n",
|
||||||
"where file_name=play.txt.\n",
|
|
||||||
)
|
)
|
||||||
|
|
||||||
def test_invalid_delete_filter_keeps_user_facing_error(self) -> None:
|
def test_invalid_delete_filter_keeps_user_facing_error(self) -> None:
|
||||||
|
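The deduplication test above expects `README.md` and `./README.md` to be imported only once. One way such behavior could be obtained is to compare resolved paths; the sketch below is an assumption about the approach, and `deduplicate_paths` is a hypothetical name rather than the handler's actual helper.

```python
from pathlib import Path


def deduplicate_paths(paths):
    """Drop repeated files, keeping the first spelling of each one."""
    seen = set()
    unique = []
    for raw in paths:
        # resolve() normalizes "./" prefixes and makes the path absolute,
        # so different spellings of the same file compare equal.
        resolved = Path(raw).resolve()
        if resolved not in seen:
            seen.add(resolved)
            unique.append(raw)
    return unique
```

Order is preserved so that per-file output lines would still appear in the order the user supplied them.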
|||||||
+20
-2
@@ -1,11 +1,29 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
import unittest
|
import unittest
|
||||||
|
from unittest.mock import patch
|
||||||
|
|
||||||
|
from chromy.embedding import embed
|
||||||
|
|
||||||
|
|
||||||
class EmbedTest(unittest.TestCase):
|
class EmbedTest(unittest.TestCase):
|
||||||
def test_embed_function(self) -> None:
|
def test_embed_returns_empty_list_for_empty_chunks(self) -> None:
|
||||||
self.assertEqual(0, 0)
|
self.assertEqual(embed([]), [])
|
||||||
|
|
||||||
|
def test_embed_pairs_text_with_list_embeddings(self) -> None:
|
||||||
|
with patch(
|
||||||
|
"chromy.embedding.service.DefaultEmbeddingFunction",
|
||||||
|
return_value=lambda chunks: ((1.0, 2.0), (3.0, 4.0)),
|
||||||
|
):
|
||||||
|
result = embed(["first", "second"])
|
||||||
|
|
||||||
|
self.assertEqual(
|
||||||
|
result,
|
||||||
|
[
|
||||||
|
{"text": "first", "embedding": [1.0, 2.0]},
|
||||||
|
{"text": "second", "embedding": [3.0, 4.0]},
|
||||||
|
],
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
|
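The new embedding tests pin down the output shape: each chunk is paired with its vector, and tuple vectors come back as lists. A minimal sketch of that pairing step follows, assuming the vectors have already been computed; `pair_chunks_with_embeddings` is a hypothetical name and not the function in `chromy.embedding`.

```python
def pair_chunks_with_embeddings(chunks, vectors):
    """Pair each text chunk with its embedding, converted to a plain list."""
    return [
        {"text": chunk, "embedding": list(vector)}
        for chunk, vector in zip(chunks, vectors)
    ]
```

With empty inputs the comprehension yields an empty list, which matches the behavior the first test expects from `embed([])`.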
|||||||
+132
-7
@@ -6,15 +6,15 @@ from collections.abc import Callable
|
|||||||
from contextlib import redirect_stdout
|
from contextlib import redirect_stdout
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from typing import TypeVar
|
from typing import TypeVar
|
||||||
from unittest.mock import patch
|
from unittest.mock import MagicMock, patch
|
||||||
|
|
||||||
from chromy.handlers.import_data import handle_import
|
|
||||||
from chromy.handlers.count_collection import handle_count_collection
|
from chromy.handlers.count_collection import handle_count_collection
|
||||||
from chromy.handlers.create_collection import handle_create_collection
|
from chromy.handlers.create_collection import handle_create_collection
|
||||||
from chromy.handlers.delete_collection import (
|
from chromy.handlers.delete_collection import (
|
||||||
handle_delete_collection,
|
handle_delete_collection,
|
||||||
handle_delete_records,
|
handle_delete_records,
|
||||||
)
|
)
|
||||||
|
from chromy.handlers.import_data import handle_import
|
||||||
from chromy.handlers.list_collections import handle_list_collections
|
from chromy.handlers.list_collections import handle_list_collections
|
||||||
from chromy.handlers.query import handle_query
|
from chromy.handlers.query import handle_query
|
||||||
|
|
||||||
@@ -47,7 +47,7 @@ class HandlerTests(unittest.TestCase):
|
|||||||
)
|
)
|
||||||
|
|
||||||
self.assertEqual(exit_code, 0)
|
self.assertEqual(exit_code, 0)
|
||||||
self.assertEqual(output, "notes\nplays\n")
|
self.assertEqual(output, "· notes\n· plays\n")
|
||||||
|
|
||||||
def test_create_collection_uses_typed_input(self) -> None:
|
def test_create_collection_uses_typed_input(self) -> None:
|
||||||
with patch(
|
with patch(
|
||||||
@@ -86,7 +86,7 @@ class HandlerTests(unittest.TestCase):
|
|||||||
|
|
||||||
count.assert_called_once_with("notes")
|
count.assert_called_once_with("notes")
|
||||||
self.assertEqual(exit_code, 0)
|
self.assertEqual(exit_code, 0)
|
||||||
self.assertEqual(output, "7\n")
|
self.assertEqual(output, "The 'notes' collection contains 7 records.\n")
|
||||||
|
|
||||||
def test_import_data_uses_typed_input(self) -> None:
|
def test_import_data_uses_typed_input(self) -> None:
|
||||||
with patch(
|
with patch(
|
||||||
@@ -96,7 +96,7 @@ class HandlerTests(unittest.TestCase):
|
|||||||
exit_code, output = _capture_output(
|
exit_code, output = _capture_output(
|
||||||
handle_import,
|
handle_import,
|
||||||
"notes",
|
"notes",
|
||||||
"romeo_and_juliet.txt",
|
["romeo_and_juliet.txt"],
|
||||||
)
|
)
|
||||||
|
|
||||||
ingest_file.assert_called_once_with(
|
ingest_file.assert_called_once_with(
|
||||||
@@ -104,7 +104,132 @@ class HandlerTests(unittest.TestCase):
|
|||||||
self._fixture_path("romeo_and_juliet.txt"),
|
self._fixture_path("romeo_and_juliet.txt"),
|
||||||
)
|
)
|
||||||
self.assertEqual(exit_code, 0)
|
self.assertEqual(exit_code, 0)
|
||||||
self.assertEqual(output, "Added 3 records to collection 'notes'.\n")
|
self.assertEqual(
|
||||||
|
output,
|
||||||
|
"Added 3 records from 'romeo_and_juliet.txt' to collection 'notes'.\n"
|
||||||
|
"Imported 1 file(s) successfully; 0 failed.\n",
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_import_data_continues_after_missing_file(self) -> None:
|
||||||
|
with patch(
|
||||||
|
"chromy.handlers.import_data.ingest_file",
|
||||||
|
return_value=3,
|
||||||
|
) as ingest_file:
|
||||||
|
exit_code, output = _capture_output(
|
||||||
|
handle_import,
|
||||||
|
"notes",
|
||||||
|
["missing.txt", "romeo_and_juliet.txt"],
|
||||||
|
)
|
||||||
|
|
||||||
|
ingest_file.assert_called_once_with(
|
||||||
|
"notes",
|
||||||
|
self._fixture_path("romeo_and_juliet.txt"),
|
||||||
|
)
|
||||||
|
self.assertEqual(exit_code, 1)
|
||||||
|
self.assertEqual(
|
||||||
|
output,
|
||||||
|
"Error: The file 'missing.txt' was not found.\n"
|
||||||
|
"Added 3 records from 'romeo_and_juliet.txt' to collection 'notes'.\n"
|
||||||
|
"Imported 1 file(s) successfully; 1 failed.\n",
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_import_data_rejects_non_text_files(self) -> None:
|
||||||
|
with patch(
|
||||||
|
"chromy.handlers.import_data.is_probably_text_file",
|
||||||
|
return_value=False,
|
||||||
|
):
|
||||||
|
exit_code, output = _capture_output(
|
||||||
|
handle_import,
|
||||||
|
"notes",
|
||||||
|
["romeo_and_juliet.txt"],
|
||||||
|
)
|
||||||
|
|
||||||
|
self.assertEqual(exit_code, 1)
|
||||||
|
self.assertEqual(
|
||||||
|
output,
|
||||||
|
"Error: The file 'romeo_and_juliet.txt' is not a text file.\n"
|
||||||
|
"Imported 0 file(s) successfully; 1 failed.\n",
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_import_data_deduplicates_files(self) -> None:
|
||||||
|
with patch(
|
||||||
|
"chromy.handlers.import_data.ingest_file",
|
||||||
|
return_value=3,
|
||||||
|
) as ingest_file:
|
||||||
|
exit_code, output = _capture_output(
|
||||||
|
handle_import,
|
||||||
|
"notes",
|
||||||
|
["README.md", "./README.md"],
|
||||||
|
)
|
||||||
|
|
||||||
|
ingest_file.assert_called_once_with(
|
||||||
|
"notes",
|
||||||
|
self._fixture_path("README.md"),
|
||||||
|
)
|
||||||
|
self.assertEqual(exit_code, 0)
|
||||||
|
self.assertEqual(
|
||||||
|
output,
|
||||||
|
"Added 3 records from 'README.md' to collection 'notes'.\n"
|
||||||
|
"Imported 1 file(s) successfully; 0 failed.\n",
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_import_data_suppresses_per_file_output_with_progress(self) -> None:
|
||||||
|
progress = MagicMock()
|
||||||
|
progress.__enter__.return_value = progress
|
||||||
|
progress.__exit__.return_value = None
|
||||||
|
progress.console.print = print
|
||||||
|
progress.add_task.return_value = 1
|
||||||
|
|
||||||
|
with (
|
||||||
|
patch("chromy.handlers.import_data.ingest_file", side_effect=[3, 2]),
|
||||||
|
patch(
|
||||||
|
"chromy.handlers.import_data._should_show_progress",
|
||||||
|
return_value=True,
|
||||||
|
),
|
||||||
|
patch("chromy.handlers.import_data.Progress", return_value=progress),
|
||||||
|
):
|
||||||
|
exit_code, output = _capture_output(
|
||||||
|
handle_import,
|
||||||
|
"notes",
|
||||||
|
["romeo_and_juliet.txt", "README.md"],
|
||||||
|
)
|
||||||
|
|
||||||
|
self.assertEqual(exit_code, 0)
|
||||||
|
self.assertEqual(output, "Imported 2 file(s) successfully; 0 failed.\n")
|
||||||
|
|
||||||
|
def test_import_data_truncates_long_file_names_in_progress(self) -> None:
|
||||||
|
progress = MagicMock()
|
||||||
|
progress.__enter__.return_value = progress
|
||||||
|
progress.__exit__.return_value = None
|
||||||
|
progress.console.print = print
|
||||||
|
progress.add_task.return_value = 1
|
||||||
|
|
||||||
|
with (
|
||||||
|
patch(
|
||||||
|
"chromy.handlers.import_data._get_absolute_path",
|
||||||
|
side_effect=[
|
||||||
|
"/tmp/this_is_a_very_long_file_name.txt",
|
||||||
|
self._fixture_path("README.md"),
|
||||||
|
"/tmp/this_is_a_very_long_file_name.txt",
|
||||||
|
self._fixture_path("README.md"),
|
||||||
|
],
|
||||||
|
),
|
||||||
|
patch("chromy.handlers.import_data._import_one", return_value=3),
|
||||||
|
patch(
|
||||||
|
"chromy.handlers.import_data._should_show_progress",
|
||||||
|
return_value=True,
|
||||||
|
),
|
||||||
|
patch("chromy.handlers.import_data.Progress", return_value=progress),
|
||||||
|
):
|
||||||
|
handle_import(
|
||||||
|
"notes",
|
||||||
|
["this_is_a_very_long_file_name.txt", "README.md"],
|
||||||
|
)
|
||||||
|
|
||||||
|
progress.update.assert_any_call(
|
||||||
|
1,
|
||||||
|
description="Importing [bold]this_is_a_very_lo...[/]...",
|
||||||
|
)
|
||||||
|
|
||||||
def test_query_uses_typed_input(self) -> None:
|
def test_query_uses_typed_input(self) -> None:
|
||||||
query_result = {"ids": [["1"]], "documents": [["hello"]]}
|
query_result = {"ids": [["1"]], "documents": [["hello"]]}
|
||||||
@@ -154,7 +279,7 @@ class HandlerTests(unittest.TestCase):
|
|||||||
|
|
||||||
def _capture_output(
|
def _capture_output(
|
||||||
handler: Callable[..., int],
|
handler: Callable[..., int],
|
||||||
*arguments: CommandT,
|
*arguments: object,
|
||||||
) -> tuple[int, str]:
|
) -> tuple[int, str]:
|
||||||
output = io.StringIO()
|
output = io.StringIO()
|
||||||
|
|
||||||
|
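The progress test above expects `this_is_a_very_long_file_name.txt` to be shown as `this_is_a_very_lo...`, which is 20 characters including the ellipsis. The sketch below reproduces that result under the assumption of a 20-character limit; both the limit and the name `truncate_name` are inferred for illustration and may differ from the handler's code.

```python
def truncate_name(name, max_length=20):
    """Shorten a file name to max_length characters, ending in '...'."""
    if len(name) <= max_length:
        return name
    # Reserve three characters for the trailing ellipsis.
    return name[: max_length - 3] + "..."
```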
|||||||
@@ -0,0 +1,67 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import unittest
|
||||||
|
from unittest.mock import MagicMock, call, patch
|
||||||
|
|
||||||
|
from chromy.utilities import ingest_file
|
||||||
|
|
||||||
|
|
||||||
|
class UtilityTests(unittest.TestCase):
|
||||||
|
def test_ingest_file_adds_new_file_without_deleting(self) -> None:
|
||||||
|
chunks = ["chunk 1", "chunk 2"]
|
||||||
|
embeddings = [
|
||||||
|
{"text": "chunk 1", "embedding": [0.1, 0.2]},
|
||||||
|
{"text": "chunk 2", "embedding": [0.3, 0.4]},
|
||||||
|
]
|
||||||
|
|
||||||
|
with (
|
||||||
|
patch("chromy.utilities.has_data_for_file", return_value=False) as has_data,
|
||||||
|
patch("chromy.utilities.delete_data") as delete_data,
|
||||||
|
patch("chromy.utilities.chunk_file", return_value=chunks) as chunk_file,
|
||||||
|
patch("chromy.utilities.embed", return_value=embeddings) as embed,
|
||||||
|
patch("chromy.utilities.add_data") as add_data,
|
||||||
|
):
|
||||||
|
records_added = ingest_file("notes", "/tmp/play.txt")
|
||||||
|
|
||||||
|
has_data.assert_called_once_with("notes", "/tmp/play.txt")
|
||||||
|
delete_data.assert_not_called()
|
||||||
|
chunk_file.assert_called_once_with("/tmp/play.txt")
|
||||||
|
embed.assert_called_once_with(chunks)
|
||||||
|
add_data.assert_called_once_with("notes", embeddings, "/tmp/play.txt")
|
||||||
|
self.assertEqual(records_added, 2)
|
||||||
|
|
||||||
|
def test_ingest_file_replaces_existing_file_records_before_adding(self) -> None:
|
||||||
|
chunks = ["chunk 1"]
|
||||||
|
embeddings = [{"text": "chunk 1", "embedding": [0.1, 0.2]}]
|
||||||
|
manager = MagicMock()
|
||||||
|
|
||||||
|
with (
|
||||||
|
patch("chromy.utilities.has_data_for_file", return_value=True) as has_data,
|
||||||
|
patch("chromy.utilities.delete_data") as delete_data,
|
||||||
|
patch("chromy.utilities.chunk_file", return_value=chunks) as chunk_file,
|
||||||
|
patch("chromy.utilities.embed", return_value=embeddings) as embed,
|
||||||
|
patch("chromy.utilities.add_data") as add_data,
|
||||||
|
):
|
||||||
|
manager.attach_mock(has_data, "has_data")
|
||||||
|
manager.attach_mock(delete_data, "delete_data")
|
||||||
|
manager.attach_mock(chunk_file, "chunk_file")
|
||||||
|
manager.attach_mock(embed, "embed")
|
||||||
|
manager.attach_mock(add_data, "add_data")
|
||||||
|
|
||||||
|
records_added = ingest_file("notes", "/tmp/play.txt")
|
||||||
|
|
||||||
|
self.assertEqual(
|
||||||
|
manager.mock_calls,
|
||||||
|
[
|
||||||
|
call.has_data("notes", "/tmp/play.txt"),
|
||||||
|
call.delete_data("notes", {"file_name": "/tmp/play.txt"}),
|
||||||
|
call.chunk_file("/tmp/play.txt"),
|
||||||
|
call.embed(chunks),
|
||||||
|
call.add_data("notes", embeddings, "/tmp/play.txt"),
|
||||||
|
],
|
||||||
|
)
|
||||||
|
self.assertEqual(records_added, 1)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
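The two utility tests above fix the call order inside `ingest_file`: check for existing records, delete them if present, then chunk, embed, and add. The following sketch mirrors that order with the collaborators passed in as parameters so it runs on its own; the parameter-injection style and the use of `len(embeddings)` as the return value are assumptions, not the actual implementation.

```python
def ingest_file_sketch(
    collection,
    file_path,
    *,
    has_data_for_file,
    delete_data,
    chunk_file,
    embed,
    add_data,
):
    """Replace any existing records for file_path, then add fresh ones."""
    # Remove stale records first so a re-import does not duplicate chunks.
    if has_data_for_file(collection, file_path):
        delete_data(collection, {"file_name": file_path})

    chunks = chunk_file(file_path)
    embeddings = embed(chunks)
    add_data(collection, embeddings, file_path)

    # Assumed: the count of added records equals the number of embeddings.
    return len(embeddings)
```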
||||||