Compare commits

23 Commits

Author SHA1 Message Date
mrosati 28ec29f8af add progress bar when importing multiple files
build / build (push) Successful in 11s
pytest / pytest (push) Failing after 28s
2026-05-01 15:45:41 +02:00
mrosati fb62d1b539 refactor chunking and embedding into their own modules
build / build (push) Successful in 45s
pytest / pytest (push) Successful in 26s
2026-05-01 11:01:30 +02:00
Matteo Rosati 26df98c08e add multi-file import support
build / build (push) Successful in 9s
pytest / pytest (push) Successful in 26s
2026-04-29 15:39:42 +02:00
Matteo Rosati 74e48fbcd5 replace existing file records on re-import
build / build (push) Successful in 9s
pytest / pytest (push) Successful in 25s
2026-04-29 14:46:41 +02:00
Matteo Rosati d1b1238897 decouple core data from CLI formatting
build / build (push) Successful in 49s
pytest / pytest (push) Successful in 30s
2026-04-29 12:44:28 +02:00
mrosati 615ab14a1a add skill
build / build (push) Successful in 9s
pytest / pytest (push) Successful in 24s
2026-04-24 22:49:36 +02:00
mrosati 508d036815 add logo, update README
build / build (push) Successful in 35s
pytest / pytest (push) Successful in 27s
2026-04-24 22:46:09 +02:00
mrosati 292d0eb139 update agents file
build / build (push) Successful in 10s
pytest / pytest (push) Successful in 24s
2026-04-24 18:48:37 +02:00
mrosati d71fce7a6a cannot import non-text files!
build / build (push) Successful in 39s
pytest / pytest (push) Successful in 35s
2026-04-24 18:40:51 +02:00
mrosati c6ad060e85 fix types and print middle dot in collections list 2026-04-24 18:28:03 +02:00
mrosati c5b6b196b5 fix syntax and types 2026-04-24 18:23:02 +02:00
mrosati 948f8500be types cleanup 2026-04-24 18:20:22 +02:00
Matteo Rosati 55bbd897f4 update agents file
build / build (push) Successful in 11s
pytest / pytest (push) Successful in 29s
2026-04-23 22:03:16 +02:00
Matteo Rosati ebf664a25a add empty test for embedding
build / build (push) Successful in 9s
pytest / pytest (push) Successful in 23s
2026-04-23 22:00:45 +02:00
Matteo Rosati a14edebafe add colors!
build / build (push) Successful in 12s
pytest / pytest (push) Successful in 29s
2026-04-23 21:49:46 +02:00
Matteo Rosati 3fcc3904b4 remove pointless test
build / build (push) Successful in 9s
pytest / pytest (push) Successful in 25s
2026-04-23 21:18:35 +02:00
Matteo Rosati 6861636794 add test test_delete_non_existent_collection
build / build (push) Successful in 10s
pytest / pytest (push) Successful in 24s
2026-04-23 21:13:35 +02:00
Matteo Rosati 7ee58939a6 add test test_create_collection_with_same_name 2026-04-23 21:06:41 +02:00
Matteo Rosati 13d2f525a9 fix syntax 2026-04-23 21:06:31 +02:00
mrosati 65b3edda1c add pytest workflow
build / build (push) Successful in 9s
pytest / pytest (push) Successful in 29s
2026-04-23 20:52:15 +02:00
mrosati 4a7804af19 add test test_list_existing_collections
build / build (push) Successful in 10s
2026-04-23 20:49:52 +02:00
mrosati a672633526 extract _get_absolute_path
build / build (push) Successful in 11s
2026-04-23 20:46:26 +02:00
mrosati e5b63ac6fb use absolute paths 2026-04-23 19:56:11 +02:00
26 changed files with 778 additions and 127 deletions
+28
View File
@@ -0,0 +1,28 @@
name: pytest
on:
push:
branches:
- main
pull_request:
jobs:
pytest:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Install uv
run: |
curl -Ls https://astral.sh/uv/install.sh | sh
echo "$HOME/.local/bin" >> $GITHUB_PATH
- name: Sync dependencies
run: |
uv sync --group dev
- name: Run pytest
run: |
uv run pytest -q
+9 -11
View File
@@ -2,32 +2,30 @@
## Project Structure & Module Organization
`chromy/` contains the Python package and CLI implementation. The entrypoint is `chromy/main.py`, parser wiring is in `chromy/cli_parser.py`, command dispatch is in `chromy/cli_app.py`, and typed command inputs live in `chromy/command_inputs.py`. Command-specific behavior belongs in `chromy/handlers/`. Shared Chroma, embedding, chunking, and formatting helpers live in package modules such as `chroma_functions.py`, `embed.py`, `chunk_functions.py`, and `utilities.py`.
`chromy/` contains the Python package and CLI implementation. The entrypoint is `chromy/main.py`, which loads environment variables and invokes the Typer app defined in `chromy/cli.py`. The active CLI commands are `list-collections`, `create-collection`, `delete-collection`, `count`, `import`, `query`, and `delete`. Command-specific behavior belongs in `chromy/handlers/`. Shared Chroma, embedding, chunking, querying, and output helpers live in package modules such as `chroma_functions.py`, `embed.py`, `chunk_functions.py`, and `utilities.py`.
`tests/` contains the pytest suite. `plans/` holds planning notes. Generated data and build outputs such as `chroma/`, `dist/`, `chromy.egg-info/`, caches, and `.venv/` are not source.
`tests/` contains the test suite for the CLI, handlers, and embedding helpers. `README.md` documents user-facing behavior, `pyproject.toml` defines packaging and tool configuration, and `romeo_and_juliet.txt` is a checked-in sample input used by tests and manual CLI runs. Treat generated or local-state directories such as `chroma/`, `dist/`, `chromy.egg-info/`, `.pytest_cache/`, `.mypy_cache/`, `.ruff_cache/`, `.venv/`, `__pycache__/`, and `main.onefile-build/` as non-source. The top-level `handlers/` directory currently contains only legacy bytecode artifacts and should not be treated as source.
## Build, Test, and Development Commands
- `uv sync`: install runtime and development dependencies from `pyproject.toml` and `uv.lock`.
- `uv run python -m chromy.main --help`: run the CLI from the source tree.
- `uv run chromy --help`: run the packaged console script inside the project environment.
- `uv run pytest -q`: run the test suite.
- `uv run ruff check .`: run lint checks.
- `uv run ruff format --check .`: verify formatting.
- `uv run mypy .`: run static type checks.
- `uv build`: build the source distribution and wheel into `dist/`.
- `uv tool install --editable .`: install the `chromy` command in editable mode for local CLI testing.
## Coding Style & Naming Conventions
Use Python 3.12+ syntax, type hints, and `from __future__ import annotations`. Follow the current style: 4-space indentation, snake_case functions and modules, PascalCase dataclasses/classes, and explicit command input objects instead of raw `argparse.Namespace` in handlers. Keep handlers focused on CLI orchestration; place reusable database or formatting logic in shared modules.
Use Python 3.12+ syntax, type hints, and `from __future__ import annotations`. Follow the current style: 4-space indentation, snake_case functions and modules, PascalCase classes, and Typer command functions in `chromy/cli.py` that delegate to small handler functions. Keep handlers focused on CLI orchestration and user-facing output; place reusable database, chunking, embedding, query, and formatting logic in shared modules. Prefer `rich` output for user-facing CLI messages to stay consistent with the existing commands.
## Testing Guidelines
Tests use pytest and currently include `unittest.TestCase`-style cases. Name test files `test_*.py` and test methods `test_*`. Prefer mocking Chroma-facing functions in handler tests so unit tests stay deterministic. Run `uv run pytest -q` before submitting changes, and add tests for new commands, parser behavior, handlers, and error paths.
## Commit & Pull Request Guidelines
Git history uses short, imperative, lowercase commit subjects, for example `move top-level modules into a real package` and `replace argparse.Namespace plumbing with typed command inputs`. Keep commits scoped to one logical change.
Pull requests should include a concise description, test results, and notes for any CLI behavior changes. Link related issues or plan files when applicable. Include terminal output examples or screenshots only when user-facing command output changes.
Tests run with pytest and are currently written in `unittest.TestCase` style. Name test files `test_*.py` and test methods `test_*`. Prefer mocking Chroma-facing and filesystem-facing functions in CLI and handler tests so unit tests stay deterministic. Run `uv run pytest -q` before submitting changes, and use `uv run ruff check .` plus `uv run mypy .` when touching typed code or shared modules. Add tests for new commands, Typer wiring, handlers, and error paths.
## Security & Configuration Tips
The CLI loads environment variables via `python-dotenv`; keep secrets in local `.env` files and do not commit them. Treat `chroma/` as local persistent database state. Avoid committing generated build artifacts, cache directories, or large ad hoc input files unless they are intentional fixtures.
The CLI loads environment variables via `python-dotenv`; keep secrets in local `.env` files and do not commit them. Treat `chroma/` as local persistent database state created by `chromadb.PersistentClient()`. Avoid committing generated build artifacts, cache directories, onefile build outputs, or large ad hoc input files unless they are intentional fixtures. If you change command names or examples, update both `README.md` and the tests so the documented CLI stays aligned with the implementation.
+19 -8
View File
@@ -1,6 +1,10 @@
# Chromy
A small command-line utility for working with a local Chroma database. It lets you create collections, ingest file contents as chunked embeddings, and run similarity queries against stored documents.
<div align="center">
<img src="logo.png" width=300 />
</div>
Chromy is a small, simple-to-use command-line utility for working with a local Chroma database. It lets you create collections, ingest files as chunked embeddings, and run similarity queries against stored documents. It can be used from agentic coding tools via simple skills (see an [example](./skills/chromy/SKILL.md) in the `skills` directory).
## What it does
@@ -120,7 +124,7 @@ list-collections
create-collection <collection>
delete-collection <collection>
count <collection>
add-data <collection> <file>
import <collection> <file> [<file> ...]
query <collection> <query_text>
delete <collection> --where <condition>=<value>
```
@@ -133,10 +137,12 @@ Create a collection:
chromy create-collection notes
```
Add a file:
Add one or more files:
```bash
chromy add-data notes ./docs/example.txt
chromy import notes ./docs/example.txt
chromy import notes ./docs/intro.md ./docs/setup.md
chromy import notes *.md
```
Count stored records:
@@ -171,7 +177,7 @@ chromy delete notes --where file_name=example.txt
## How ingestion works
When you run `add-data`, the file is:
When you run `import`, each file is:
1. read from disk
2. split into chunks
@@ -182,6 +188,11 @@ Query results include the stored document chunk, its id, distance, and file name
## Notes
- collections are stored in a local persistent Chroma database
- `add-data` requires the target collection to already exist
- the CLI prints friendly messages for common errors such as missing collections or missing files
- collections are stored in a local persistent Chroma database in the current directory
- `import` requires the target collection to already exist
- `import` accepts one or more file paths
- unquoted glob patterns such as `*.md` are expanded by the shell before `chromy` starts
- quoted glob patterns such as `"*.md"` are treated as literal paths and are not expanded by `chromy`
- unmatched unquoted globs may behave differently by shell: `zsh` commonly fails before `chromy` starts, while `bash` may pass the literal pattern through depending on shell settings
- the CLI reports file-specific import failures and continues with the remaining files
- when importing multiple files in an interactive terminal, the CLI shows a Rich progress bar
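The glob notes above can be illustrated with a small sketch. It is not part of `chromy`; the helper name `literal_and_expanded` and the file names are hypothetical. It only shows why a quoted `"*.md"` reaches the program as a literal path that does not exist, while an expansion (normally done by the shell) yields the real file names.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def literal_and_expanded(directory: Path, pattern: str) -> tuple[bool, list[str]]:
    """Return (does the pattern exist as a literal file name, names a glob would match).

    Hypothetical helper for illustration only; chromy itself does no glob expansion.
    """
    literal_exists = (directory / pattern).exists()
    expanded = sorted(p.name for p in directory.glob(pattern))
    return literal_exists, expanded


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "intro.md").write_text("intro", encoding="utf-8")
    (root / "setup.md").write_text("setup", encoding="utf-8")

    # "*.md" taken literally is just a file name that nobody created.
    literal_exists, expanded = literal_and_expanded(root, "*.md")
```

Here `literal_exists` is `False` and `expanded` holds the two Markdown files, which mirrors the "was not found" error reported for a quoted pattern.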
+10 -4
View File
@@ -9,7 +9,7 @@ from chromadb.api import ClientAPI
from chromadb.api.types import QueryResult, Where
from chromadb.errors import NotFoundError
from chromy.embed import EmbeddingRecord
from chromy.embedding import EmbeddingRecord
def _get_client_and_collection(
@@ -54,9 +54,16 @@ def delete_data(collection_name: str, where: dict[str, str]) -> int:
return int(result.get("deleted", 0))
def has_data_for_file(collection_name: str, file_name: str) -> bool:
_, collection = _get_client_and_collection(collection_name)
result = collection.get(where=cast(Where, {"file_name": file_name}))
ids = result.get("ids", [])
return len(ids) > 0
def count_collection(collection_name: str) -> int:
_, collection = _get_client_and_collection(collection_name)
return collection.count()
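The `has_data_for_file` check supports the "replace existing file records on re-import" behavior named in the commit list. A minimal sketch of that pattern follows, assuming an in-memory stand-in instead of a real Chroma collection. `InMemoryCollection` and `replace_file_records` are hypothetical names and do not mirror the chromadb API.

```python
from __future__ import annotations

from uuid import uuid4


class InMemoryCollection:
    """Hypothetical stand-in for a collection; NOT the chromadb API."""

    def __init__(self) -> None:
        self._records: dict[str, dict[str, str]] = {}

    def ids_where(self, key: str, value: str) -> list[str]:
        # Return the ids of all records whose metadata matches key == value.
        return [rid for rid, meta in self._records.items() if meta.get(key) == value]

    def delete_where(self, key: str, value: str) -> int:
        ids = self.ids_where(key, value)
        for rid in ids:
            del self._records[rid]
        return len(ids)

    def add(self, chunks: list[str], file_name: str) -> int:
        for chunk in chunks:
            self._records[str(uuid4())] = {"file_name": file_name, "document": chunk}
        return len(chunks)

    def count(self) -> int:
        return len(self._records)


def replace_file_records(
    collection: InMemoryCollection, chunks: list[str], file_name: str
) -> int:
    """Delete any records stored for file_name, then add the fresh chunks."""
    if collection.ids_where("file_name", file_name):
        collection.delete_where("file_name", file_name)
    return collection.add(chunks, file_name)


collection = InMemoryCollection()
replace_file_records(collection, ["a", "b", "c"], "/data/notes.txt")
# Re-importing the same path replaces the three old records with two new ones.
replace_file_records(collection, ["a", "b"], "/data/notes.txt")
final_count = collection.count()
```

Because records get fresh UUIDs on every import, deleting by `file_name` first is what keeps a re-import from duplicating chunks.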
@@ -70,8 +77,7 @@ def add_data(
_, collection = _get_client_and_collection(collection_name)
embeddings: list[Sequence[float]] = [record["embedding"]
for record in data]
embeddings: list[Sequence[float]] = [record["embedding"] for record in data]
collection.add(
ids=[str(uuid4()) for _ in data],
+5
View File
@@ -0,0 +1,5 @@
from __future__ import annotations
from chromy.chunking.service import chunk_file, chunk_text
__all__ = ["chunk_file", "chunk_text"]
@@ -3,7 +3,7 @@ from __future__ import annotations
from pathlib import Path
from typing import cast
import semchunk
from semchunk import semchunk
def chunk_text(text: str, chunk_size: int = 800) -> list[str]:
+13 -9
View File
@@ -4,14 +4,15 @@ from typing import Annotated, Callable
import typer
from chromadb.errors import InternalError, NotFoundError
from rich import print
from chromy.handlers.import_data import handle_import
from chromy.handlers.count_collection import handle_count_collection
from chromy.handlers.create_collection import handle_create_collection
from chromy.handlers.delete_collection import (
handle_delete_collection,
handle_delete_records,
)
from chromy.handlers.import_data import handle_import
from chromy.handlers.list_collections import handle_list_collections
from chromy.handlers.query import handle_query
@@ -27,7 +28,7 @@ def _run(handler: ExitCodeHandler) -> None:
def _fail(message: str) -> None:
typer.echo(message)
print("[bold red]Error[/]:", message)
raise typer.Exit(1)
@@ -104,24 +105,27 @@ def count(
# ------------------------------------------------------------------------------
@app.command(
"import",
help="Chunk, embed, and add a file to a collection in the local Chroma database.",
help=(
"Chunk, embed, and add one or more files to a collection in the "
"local Chroma database."
),
)
def import_data(
collection: Annotated[
str,
typer.Argument(help="Name of the target collection."),
],
file: Annotated[
str,
typer.Argument(help="Path to the file to chunk and add to the collection."),
files: Annotated[
list[str],
typer.Argument(
help="Path(s) to the file(s) to chunk and add to the collection."
),
],
) -> None:
try:
_run(lambda: handle_import(collection, file))
_run(lambda: handle_import(collection, files))
except NotFoundError:
_fail(f"Collection '{collection}' does not exist.")
except FileNotFoundError:
_fail(f"The file {file} was not found.")
# ------------------------------------------------------------------------------
+5
View File
@@ -0,0 +1,5 @@
from __future__ import annotations
from chromy.embedding.service import EmbeddingRecord, embed
__all__ = ["EmbeddingRecord", "embed"]
+5
View File
@@ -0,0 +1,5 @@
from __future__ import annotations
class UnsupportedTextFileError(Exception):
"""Raised when a file does not appear to contain supported text content."""
+4 -1
View File
@@ -1,8 +1,11 @@
from __future__ import annotations
from rich import print
from chromy.chroma_functions import count_collection
from chromy.output import format_count_message
def handle_count_collection(collection: str) -> int:
print(count_collection(collection))
print(format_count_message(collection, count_collection(collection)))
return 0
+3 -1
View File
@@ -1,9 +1,11 @@
from __future__ import annotations
from rich import print
from chromy.chroma_functions import create_collection
def handle_create_collection(collection: str) -> int:
collection_name = create_collection(collection)
print(f"Created collection '{collection_name}'.")
print(f"[bold green]Created[/]: collection '{collection_name}'.")
return 0
+6 -6
View File
@@ -1,5 +1,7 @@
from __future__ import annotations
from rich import print
from chromy.chroma_functions import delete_collection, delete_data
@@ -7,22 +9,20 @@ def _parse_where_clause(where_clause: str) -> dict[str, str]:
condition, separator, value = where_clause.partition("=")
if separator == "":
raise ValueError(
"Invalid --where value. Expected <condition>=<value>.")
raise ValueError("Invalid --where value. Expected <condition>=<value>.")
condition = condition.strip()
value = value.strip()
if not condition or not value:
raise ValueError(
"Invalid --where value. Expected <condition>=<value>.")
raise ValueError("Invalid --where value. Expected <condition>=<value>.")
return {condition: value}
def handle_delete_collection(collection: str) -> int:
delete_collection(collection)
print(f"Deleted collection '{collection}'.")
print(f"[bold green]Deleted[/] collection '{collection}'.")
return 0
@@ -31,7 +31,7 @@ def handle_delete_records(collection: str, where_clause: str) -> int:
deleted = delete_data(collection, where)
condition, value = next(iter(where.items()))
print(
f"Deleted {deleted} record(s) from collection '{collection}' "
f"[bold green]Deleted[/] {deleted} record(s) from collection '{collection}' "
f"where {condition}={value}."
)
return 0
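As a quick illustration of how the `--where` parsing behaves, here is a standalone copy of the same `partition`-based logic under the hypothetical name `parse_where_clause`. It shows that only the first `=` splits the clause and that surrounding whitespace is stripped.

```python
from __future__ import annotations


def parse_where_clause(where_clause: str) -> dict[str, str]:
    """Split '<condition>=<value>' on the first '=' and strip both sides."""
    condition, separator, value = where_clause.partition("=")
    if separator == "":
        raise ValueError("Invalid --where value. Expected <condition>=<value>.")

    condition = condition.strip()
    value = value.strip()
    if not condition or not value:
        raise ValueError("Invalid --where value. Expected <condition>=<value>.")

    return {condition: value}


simple = parse_where_clause("file_name=example.txt")
padded = parse_where_clause(" file_name = example.txt ")
# partition() splits on the first '=' only, so later '=' stay in the value.
nested = parse_where_clause("note=a=b")

try:
    parse_where_clause("file_name")
    missing_separator_rejected = False
except ValueError:
    missing_separator_rejected = True
```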
+126 -4
View File
@@ -1,9 +1,131 @@
from __future__ import annotations
import os
import sys
from pathlib import Path
from typing import Final
from rich import print
from rich.progress import (
BarColumn,
MofNCompleteColumn,
Progress,
SpinnerColumn,
TextColumn,
)
from chromy.errors import UnsupportedTextFileError
from chromy.utilities import ingest_file
from ..utilities import is_probably_text_file
def handle_import(collection: str, file: str) -> int:
records_added = ingest_file(collection, file)
print(f"Added {records_added} records to collection '{collection}'.")
return 0
SUCCESS_EXIT_CODE: Final = 0
FAILURE_EXIT_CODE: Final = 1
def _get_absolute_path(file: str) -> str:
"""
A helper method that, given a valid relative path to a file, returns its
absolute path.
Args:
file (str): The relative path to the file.
Raises:
FileNotFoundError(): If the file does not exist.
"""
if not os.path.exists(file):
raise FileNotFoundError()
file_path = Path(file)
return str(file_path.resolve())
def _import_one(collection: str, file: str) -> int:
absolute_path = _get_absolute_path(file)
if not Path(absolute_path).is_file():
raise FileNotFoundError()
if not is_probably_text_file(absolute_path):
raise UnsupportedTextFileError()
return ingest_file(collection, absolute_path)
def _should_show_progress(file_count: int) -> bool:
return file_count > 1 and sys.stdout.isatty()
def _truncate_file_name(file_name: str, max_length: int = 20) -> str:
if len(file_name) <= max_length:
return file_name
return f"{file_name[: max_length - 3]}..."
def handle_import(collection: str, files: list[str]) -> int:
successful_imports = 0
failed_imports = 0
seen_paths: set[str] = set()
unique_files: list[str] = []
for file in files:
try:
absolute_path = _get_absolute_path(file)
except FileNotFoundError:
unique_files.append(file)
continue
if absolute_path in seen_paths:
continue
seen_paths.add(absolute_path)
unique_files.append(file)
show_progress = _should_show_progress(len(unique_files))
with Progress(
SpinnerColumn(),
TextColumn("[progress.description]{task.description}"),
BarColumn(),
MofNCompleteColumn(),
transient=True,
disable=not show_progress,
) as progress:
task_id = progress.add_task("Importing files...", total=len(unique_files))
for file in unique_files:
file_name = _truncate_file_name(Path(file).name)
description = f"Importing [bold]{file_name}[/]..."
progress.update(task_id, description=description)
try:
records_added = _import_one(collection, file)
successful_imports += 1
if not show_progress:
progress.console.print(
"[bold green]Added[/] "
f"{records_added} records from '{file}' to "
f"collection '{collection}'."
)
except FileNotFoundError:
failed_imports += 1
progress.console.print(
f"[bold red]Error[/]: The file '{file}' was not found."
)
except UnsupportedTextFileError:
failed_imports += 1
progress.console.print(
f"[bold red]Error[/]: The file '{file}' is not a text file."
)
finally:
progress.advance(task_id)
print(
f"Imported {successful_imports} file(s) successfully; {failed_imports} failed."
)
if failed_imports:
return FAILURE_EXIT_CODE
return SUCCESS_EXIT_CODE
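The deduplication step in `handle_import` can be read as an order-preserving filter keyed on resolved paths. The sketch below restates that idea in isolation; `dedupe_by_resolved_path` is a hypothetical name, and paths that do not exist are kept (every occurrence) so they can still be reported as errors later, as in the handler.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def dedupe_by_resolved_path(files: list[str]) -> list[str]:
    """Keep the first spelling of each existing file; pass missing paths through."""
    seen: set[Path] = set()
    unique: list[str] = []
    for file in files:
        path = Path(file)
        if not path.exists():
            # Missing files are kept so the caller can report them.
            unique.append(file)
            continue
        resolved = path.resolve()
        if resolved in seen:
            continue
        seen.add(resolved)
        unique.append(file)
    return unique


with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.md"
    readme.write_text("# demo", encoding="utf-8")

    first_spelling = str(readme)
    second_spelling = f"{tmp}/./README.md"  # same file, different spelling
    missing = str(Path(tmp) / "missing.txt")

    result = dedupe_by_resolved_path([first_spelling, second_spelling, missing])
    expected = [first_spelling, missing]
```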
+4 -2
View File
@@ -1,14 +1,16 @@
from __future__ import annotations
from chromy.chroma_functions import list_collections
from chromy.utilities import print_lines
from chromy.output import format_collection_names, print_lines
def handle_list_collections() -> int:
collections = list_collections()
if not collections:
print("No collections found.")
return 0
print_lines(collections)
print_lines(format_collection_names(collections))
return 0
+2 -1
View File
@@ -1,6 +1,7 @@
from __future__ import annotations
from chromy.utilities import format_query_result, print_lines, run_query
from chromy.output import format_query_result, print_lines
from chromy.utilities import run_query
def handle_query(collection: str, query_text: str) -> int:
+70
View File
@@ -0,0 +1,70 @@
from __future__ import annotations
from collections.abc import Mapping, Sequence
from chromadb import QueryResult
from rich.console import Console
from rich.rule import Rule
from rich.text import Text
CONSOLE = Console()
def print_lines(lines: Sequence[Rule | Text | str]) -> None:
for line in lines:
CONSOLE.print(line)
def format_collection_names(collections: Sequence[str]) -> list[Text]:
return [Text(f"· {collection}") for collection in collections]
def format_count_message(collection_name: str, count: int) -> str:
return (
f"The '{collection_name}' collection contains [bold green]{count}[/] records."
)
def format_query_result(result: QueryResult) -> list[Rule | Text]:
ids = result.get("ids", [[]])
documents = result.get("documents", [[]])
distances = result.get("distances", [[]])
metadatas = result.get("metadatas", [[]])
first_ids = ids[0] if ids else []
first_documents = documents[0] if documents else []
first_distances = distances[0] if distances else []
first_metadatas = metadatas[0] if metadatas else []
if not first_ids:
return [Text.from_markup("[yellow]No results found.[/]")]
lines: list[Rule | Text] = [Rule(title="Query results")]
for index, document_id in enumerate(first_ids, start=1):
lines.append(
Text.from_markup(f"[bold]{index}[/].\t[green]id[/]\t\t{document_id}")
)
i = index - 1
if i < len(first_distances):
lines.append(
Text.from_markup(f"\t[green]distance[/]\t{first_distances[i]}")
)
if i < len(first_metadatas):
metadata = first_metadatas[i]
if isinstance(metadata, Mapping):
file_name = metadata.get("file_name")
if file_name:
lines.append(Text.from_markup(f"\t[green]file_name[/]\t{file_name}"))
if i < len(first_documents):
lines.append(Text.from_markup("\n[bold green]Retrieved contents[/]\n"))
lines.append(Text(first_documents[i]))
lines.append(Rule())
return lines
+35 -54
View File
@@ -1,26 +1,18 @@
from __future__ import annotations
from rich.text import Text
from rich.rule import Rule
from rich.console import Console
from collections.abc import Mapping, Sequence
from pathlib import Path
from chromadb import QueryResult
from chromy.chroma_functions import add_data, query_data
from chromy.chunk_functions import chunk_file
from chromy.embed import embed
CONSOLE = Console()
def print_lines(lines: Sequence[str]) -> None:
for line in lines:
CONSOLE.print(line)
from chromy.chroma_functions import add_data, delete_data, has_data_for_file, query_data
from chromy.chunking import chunk_file
from chromy.embedding import embed
def ingest_file(collection_name: str, file_path: str) -> int:
if has_data_for_file(collection_name, file_path):
delete_data(collection_name, {"file_name": file_path})
chunks = chunk_file(file_path)
embeddings = embed(chunks)
add_data(collection_name, embeddings, file_path)
@@ -31,50 +23,39 @@ def run_query(collection_name: str, query_text: str) -> QueryResult:
return query_data(collection_name, [query_text])
def format_query_result(result: QueryResult) -> list[str]:
ids = result.get("ids", [[]])
documents = result.get("documents", [[]])
distances = result.get("distances", [[]])
metadatas = result.get("metadatas", [[]])
def is_probably_text_file(path: str | Path, sample_size: int = 8192) -> bool:
"""
Return whether a file appears to contain text.
first_ids = ids[0] if ids else []
first_documents = documents[0] if documents else []
first_distances = distances[0] if distances else []
first_metadatas = metadatas[0] if metadatas else []
Args:
path (str | Path): The path to the file to inspect.
sample_size (int): The maximum number of bytes to read from the file.
if not first_ids:
return ["No results found."]
Returns:
bool: ``True`` if the sampled bytes decode as UTF-8, UTF-8 with BOM,
UTF-16, or UTF-32, or if the file is empty. Otherwise, ``False``.
"""
lines = [Rule(title="Query results")]
path = Path(path)
for index, document_id in enumerate(first_ids, start=1):
# lines.append(f"{index}.\tid: {document_id}")
lines.append(
Text.from_markup(f"[bold]{index}[/].\t[green]id[/]\t\t{document_id}")
)
i = index - 1
with path.open("rb") as f:
sample = f.read(sample_size)
if i < len(first_distances):
lines.append(
Text.from_markup(f"\t[green]distance[/]\t{first_distances[i]}")
)
if not sample:
return True
if i < len(first_metadatas):
metadata = first_metadatas[i]
encodings = (
"utf-8",
"utf-8-sig",
"utf-16",
"utf-32",
)
if isinstance(metadata, Mapping):
file_name = metadata.get("file_name")
for encoding in encodings:
try:
sample.decode(encoding)
return True
except UnicodeDecodeError:
pass
if file_name:
lines.append(
Text.from_markup(f"\t[green]file_name[/]\t{file_name}")
)
if i < len(first_documents):
lines.append(Text.from_markup("\n[bold green]Retrieved contents[/]\n"))
lines.append(first_documents[i])
# Print a separator between documents
lines.append(Rule())
return lines
return False
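To see how the decode-based heuristic behaves, the sketch below applies the same encoding loop directly to byte samples. `sample_looks_like_text` is a hypothetical name that works on bytes rather than a path. One observation worth flagging for review: a short, even-length binary sample can still decode as UTF-16, so the check is permissive.

```python
from __future__ import annotations

ENCODINGS = ("utf-8", "utf-8-sig", "utf-16", "utf-32")


def sample_looks_like_text(sample: bytes) -> bool:
    """Return True if the sample is empty or decodes with any listed encoding."""
    if not sample:
        return True
    for encoding in ENCODINGS:
        try:
            sample.decode(encoding)
            return True
        except UnicodeDecodeError:
            continue
    return False


empty = sample_looks_like_text(b"")
utf8_text = sample_looks_like_text("Two households, both alike in dignity".encode("utf-8"))
# Invalid UTF-8, and an odd byte count leaves truncated UTF-16/UTF-32 data.
truncated = sample_looks_like_text(b"\xff\xfe\xfd")
# The 8-byte PNG signature is invalid UTF-8 but forms four valid UTF-16 code units.
png_signature = sample_looks_like_text(b"\x89PNG\r\n\x1a\n")
```

The last case returns `True`, which suggests that a stricter check (for example, rejecting samples that contain NUL bytes) might be worth considering. That is a suggestion, not something this change set implements.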
BIN
View File
Binary file not shown. Size: 268 KiB
+3 -2
View File
@@ -13,6 +13,7 @@ dependencies = [
"openai>=2.32.0",
"pymupdf4llm>=1.27.2.2",
"python-dotenv>=1.2.2",
"rich>=15.0.0",
"semchunk>=4.0.0",
"tiktoken>=0.12.0",
"transformers>=5.5.4",
@@ -23,7 +24,7 @@ dependencies = [
chromy = "chromy.main:main"
[tool.setuptools]
packages = ["chromy", "chromy.handlers"]
packages = ["chromy", "chromy.chunking", "chromy.embedding", "chromy.handlers"]
[dependency-groups]
dev = [
@@ -71,7 +72,7 @@ module = [
ignore_missing_imports = true
[[tool.mypy.overrides]]
module = "chromy.chunk_functions"
module = "chromy.chunking.service"
disable_error_code = [
"attr-defined",
]
+43
View File
@@ -0,0 +1,43 @@
---
name: chromy
description: This skill provides access to a RAG-like context enhancer that uses Chromadb locally.
---
# Chromy
Whenever the user asks to "use chromy", you should invoke `chromy`, a CLI tool that performs RAG search.
The tool should be available in the `$PATH` as `chromy`.
You have access to these commands:
- `$ chromy lc` -> Lists the existing collections.
- `$ chromy q <collection> <query>` -> Performs a query. Be sure to quote the `<query>` if it consists of multiple words.
Then use the response from Chromy to enhance the context and give the user a refined response.
## A note on file sources
The Chromy response returns the metadata for the chunks it finds. This metadata includes `file_name`, which refers to the original file that was chunked and imported. **DO NOT ATTEMPT** to find or fetch these files. They most likely do not exist in the filesystem. You **SHOULD ALWAYS**, however, cite the files the information comes from, using **ONLY** the `file_name` values in Chromy's metadata.
## Example use case
**START**
User query:
> Search in Chromy information about lovecraft's Dunwich horror.
Step 1: Get the available collections with `chromy lc`. The output is:
```
lovecraft
documents
```
Most likely our information is in the `lovecraft` collection. We will use that for the query.
Step 2: Query using `chromy q lovecraft <query>`. The query _is up to you_; create one taking into account that this is a raw query on a vector DB. Be concise, extract keywords, avoid noise.
Step 3: Get the results, enhance the context, and respond to the user.
**END**
+146 -14
View File
@@ -2,8 +2,10 @@ from __future__ import annotations
import unittest
from collections.abc import Sequence
from pathlib import Path
from unittest.mock import patch
from chromadb.errors import InternalError, NotFoundError
from click.testing import Result
from typer.testing import CliRunner
@@ -11,7 +13,11 @@ from chromy.cli import app
class CliTests(unittest.TestCase):
def test_list_collections(self) -> None:
@staticmethod
def _fixture_path(path: str) -> str:
return str(Path(path).resolve())
def test_list_empty_collections(self) -> None:
with patch(
"chromy.handlers.list_collections.list_collections",
return_value=[],
@@ -21,6 +27,16 @@ class CliTests(unittest.TestCase):
self.assertEqual(result.exit_code, 0)
self.assertEqual(result.stdout, "No collections found.\n")
def test_list_existing_collections(self) -> None:
with patch(
"chromy.handlers.list_collections.list_collections",
return_value=["books", "code"],
):
result = _invoke(["list-collections"])
self.assertEqual(result.exit_code, 0)
self.assertEqual(result.stdout, "· books\n· code\n")
def test_create_collection(self) -> None:
with patch(
"chromy.handlers.create_collection.create_collection",
@@ -30,7 +46,18 @@ class CliTests(unittest.TestCase):
create_collection.assert_called_once_with("notes")
self.assertEqual(result.exit_code, 0)
self.assertEqual(result.stdout, "Created collection 'notes'.\n")
self.assertEqual(result.stdout, "Created: collection 'notes'.\n")
def test_create_collection_with_same_name(self) -> None:
with patch(
"chromy.handlers.create_collection.create_collection",
side_effect=InternalError(),
) as create_collection:
result = _invoke(["create-collection", "notes"])
create_collection.assert_called_once_with("notes")
self.assertEqual(result.exit_code, 1)
self.assertEqual(result.stdout, "Error: Collection 'notes' already exists.\n")
def test_delete_collection(self) -> None:
with patch(
@@ -42,6 +69,17 @@ class CliTests(unittest.TestCase):
self.assertEqual(result.exit_code, 0)
self.assertEqual(result.stdout, "Deleted collection 'notes'.\n")
def test_delete_non_existent_collection(self) -> None:
with patch(
"chromy.handlers.delete_collection.delete_collection",
side_effect=NotFoundError(),
) as delete_collection:
result = _invoke(["delete-collection", "notes"])
delete_collection.assert_called_once_with("notes")
self.assertEqual(result.exit_code, 1)
self.assertEqual(result.stdout, "Error: Collection 'notes' does not exist.\n")
def test_count(self) -> None:
with patch(
"chromy.handlers.count_collection.count_collection",
@@ -51,7 +89,10 @@ class CliTests(unittest.TestCase):
count_collection.assert_called_once_with("notes")
self.assertEqual(result.exit_code, 0)
self.assertEqual(result.stdout, "7\n")
self.assertEqual(
result.stdout,
"The 'notes' collection contains 7 records.\n",
)
def test_import_data(self) -> None:
with patch(
@@ -60,9 +101,107 @@ class CliTests(unittest.TestCase):
) as ingest_file:
result = _invoke(["import", "notes", "romeo_and_juliet.txt"])
ingest_file.assert_called_once_with("notes", "romeo_and_juliet.txt")
ingest_file.assert_called_once_with(
"notes",
self._fixture_path("romeo_and_juliet.txt"),
)
self.assertEqual(result.exit_code, 0)
self.assertEqual(result.stdout, "Added 3 records to collection 'notes'.\n")
self.assertEqual(
result.stdout,
"Added 3 records from 'romeo_and_juliet.txt' to collection 'notes'.\n"
"Imported 1 file(s) successfully; 0 failed.\n",
)
def test_import_data_accepts_multiple_files(self) -> None:
with patch(
"chromy.handlers.import_data.ingest_file",
side_effect=[3, 2],
) as ingest_file:
result = _invoke(
["import", "notes", "romeo_and_juliet.txt", "README.md"],
)
self.assertEqual(ingest_file.call_count, 2)
ingest_file.assert_any_call(
"notes",
self._fixture_path("romeo_and_juliet.txt"),
)
ingest_file.assert_any_call(
"notes",
self._fixture_path("README.md"),
)
self.assertEqual(result.exit_code, 0)
self.assertEqual(
result.stdout,
"Added 3 records from 'romeo_and_juliet.txt' to collection 'notes'.\n"
"Added 2 records from 'README.md' to collection 'notes'.\n"
"Imported 2 file(s) successfully; 0 failed.\n",
)
def test_import_data_continues_after_missing_file(self) -> None:
with patch(
"chromy.handlers.import_data.ingest_file",
return_value=3,
) as ingest_file:
result = _invoke(
["import", "notes", "missing.txt", "romeo_and_juliet.txt"],
)
ingest_file.assert_called_once_with(
"notes",
self._fixture_path("romeo_and_juliet.txt"),
)
self.assertEqual(result.exit_code, 1)
self.assertEqual(
result.stdout,
"Error: The file 'missing.txt' was not found.\n"
"Added 3 records from 'romeo_and_juliet.txt' to collection 'notes'.\n"
"Imported 1 file(s) successfully; 1 failed.\n",
)
def test_import_data_rejects_non_text_files(self) -> None:
with patch(
"chromy.handlers.import_data.is_probably_text_file",
return_value=False,
):
result = _invoke(["import", "notes", "romeo_and_juliet.txt"])
self.assertEqual(result.exit_code, 1)
self.assertEqual(
result.stdout,
"Error: The file 'romeo_and_juliet.txt' is not a text file.\n"
"Imported 0 file(s) successfully; 1 failed.\n",
)
def test_import_data_treats_literal_glob_as_missing_file(self) -> None:
result = _invoke(["import", "notes", "*.md"])
self.assertEqual(result.exit_code, 1)
self.assertEqual(
result.stdout,
"Error: The file '*.md' was not found.\n"
"Imported 0 file(s) successfully; 1 failed.\n",
)
def test_import_data_deduplicates_paths_within_single_invocation(self) -> None:
with patch(
"chromy.handlers.import_data.ingest_file",
return_value=3,
) as ingest_file:
result = _invoke(
["import", "notes", "README.md", "./README.md"],
)
ingest_file.assert_called_once_with(
"notes",
self._fixture_path("README.md"),
)
self.assertEqual(result.exit_code, 0)
self.assertEqual(
result.stdout,
"Added 3 records from 'README.md' to collection 'notes'.\n"
"Imported 1 file(s) successfully; 0 failed.\n",
)
def test_query(self) -> None:
query_result = {"ids": [["1"]], "documents": [["hello"]]}
@@ -94,23 +233,16 @@ class CliTests(unittest.TestCase):
self.assertEqual(result.exit_code, 0)
self.assertEqual(
result.stdout,
"Deleted 2 record(s) from collection 'notes' "
"where file_name=play.txt.\n",
"Deleted 2 record(s) from collection 'notes' where file_name=play.txt.\n",
)
def test_removed_alias_is_rejected(self) -> None:
result = _invoke(["lc"])
self.assertNotEqual(result.exit_code, 0)
self.assertIn("No such command", result.output)
def test_invalid_delete_filter_keeps_user_facing_error(self) -> None:
result = _invoke(["delete", "notes", "--where", "file_name"])
self.assertEqual(result.exit_code, 1)
self.assertEqual(
result.stdout,
"Invalid --where value. Expected <condition>=<value>.\n",
"Error: Invalid --where value. Expected <condition>=<value>.\n",
)
def test_delete_requires_where_option(self) -> None:
+30
@@ -0,0 +1,30 @@
from __future__ import annotations
import unittest
from unittest.mock import patch
from chromy.embedding import embed
class EmbedTest(unittest.TestCase):
def test_embed_returns_empty_list_for_empty_chunks(self) -> None:
self.assertEqual(embed([]), [])
def test_embed_pairs_text_with_list_embeddings(self) -> None:
with patch(
"chromy.embedding.service.DefaultEmbeddingFunction",
return_value=lambda chunks: ((1.0, 2.0), (3.0, 4.0)),
):
result = embed(["first", "second"])
self.assertEqual(
result,
[
{"text": "first", "embedding": [1.0, 2.0]},
{"text": "second", "embedding": [3.0, 4.0]},
],
)
if __name__ == "__main__":
unittest.main()
+142 -9
@@ -4,16 +4,17 @@ import io
import unittest
from collections.abc import Callable
from contextlib import redirect_stdout
from pathlib import Path
from typing import TypeVar
from unittest.mock import patch
from unittest.mock import MagicMock, patch
from chromy.handlers.import_data import handle_import
from chromy.handlers.count_collection import handle_count_collection
from chromy.handlers.create_collection import handle_create_collection
from chromy.handlers.delete_collection import (
handle_delete_collection,
handle_delete_records,
)
from chromy.handlers.import_data import handle_import
from chromy.handlers.list_collections import handle_list_collections
from chromy.handlers.query import handle_query
@@ -21,6 +22,10 @@ CommandT = TypeVar("CommandT")
class HandlerTests(unittest.TestCase):
@staticmethod
def _fixture_path(path: str) -> str:
return str(Path(path).resolve())
def test_list_collections_prints_empty_message(self) -> None:
with patch(
"chromy.handlers.list_collections.list_collections", return_value=[]
@@ -42,7 +47,7 @@ class HandlerTests(unittest.TestCase):
)
self.assertEqual(exit_code, 0)
self.assertEqual(output, "notes\nplays\n")
self.assertEqual(output, "· notes\n· plays\n")
def test_create_collection_uses_typed_input(self) -> None:
with patch(
@@ -56,7 +61,7 @@ class HandlerTests(unittest.TestCase):
create_collection.assert_called_once_with("notes")
self.assertEqual(exit_code, 0)
self.assertEqual(output, "Created collection 'notes'.\n")
self.assertEqual(output, "Created: collection 'notes'.\n")
def test_delete_collection_uses_typed_input(self) -> None:
with patch("chromy.handlers.delete_collection.delete_collection") as delete:
@@ -81,7 +86,7 @@ class HandlerTests(unittest.TestCase):
count.assert_called_once_with("notes")
self.assertEqual(exit_code, 0)
self.assertEqual(output, "7\n")
self.assertEqual(output, "The 'notes' collection contains 7 records.\n")
def test_import_data_uses_typed_input(self) -> None:
with patch(
@@ -91,12 +96,140 @@ class HandlerTests(unittest.TestCase):
exit_code, output = _capture_output(
handle_import,
"notes",
"romeo_and_juliet.txt",
["romeo_and_juliet.txt"],
)
ingest_file.assert_called_once_with("notes", "romeo_and_juliet.txt")
ingest_file.assert_called_once_with(
"notes",
self._fixture_path("romeo_and_juliet.txt"),
)
self.assertEqual(exit_code, 0)
self.assertEqual(output, "Added 3 records to collection 'notes'.\n")
self.assertEqual(
output,
"Added 3 records from 'romeo_and_juliet.txt' to collection 'notes'.\n"
"Imported 1 file(s) successfully; 0 failed.\n",
)
def test_import_data_continues_after_missing_file(self) -> None:
with patch(
"chromy.handlers.import_data.ingest_file",
return_value=3,
) as ingest_file:
exit_code, output = _capture_output(
handle_import,
"notes",
["missing.txt", "romeo_and_juliet.txt"],
)
ingest_file.assert_called_once_with(
"notes",
self._fixture_path("romeo_and_juliet.txt"),
)
self.assertEqual(exit_code, 1)
self.assertEqual(
output,
"Error: The file 'missing.txt' was not found.\n"
"Added 3 records from 'romeo_and_juliet.txt' to collection 'notes'.\n"
"Imported 1 file(s) successfully; 1 failed.\n",
)
def test_import_data_rejects_non_text_files(self) -> None:
with patch(
"chromy.handlers.import_data.is_probably_text_file",
return_value=False,
):
exit_code, output = _capture_output(
handle_import,
"notes",
["romeo_and_juliet.txt"],
)
self.assertEqual(exit_code, 1)
self.assertEqual(
output,
"Error: The file 'romeo_and_juliet.txt' is not a text file.\n"
"Imported 0 file(s) successfully; 1 failed.\n",
)
def test_import_data_deduplicates_files(self) -> None:
with patch(
"chromy.handlers.import_data.ingest_file",
return_value=3,
) as ingest_file:
exit_code, output = _capture_output(
handle_import,
"notes",
["README.md", "./README.md"],
)
ingest_file.assert_called_once_with(
"notes",
self._fixture_path("README.md"),
)
self.assertEqual(exit_code, 0)
self.assertEqual(
output,
"Added 3 records from 'README.md' to collection 'notes'.\n"
"Imported 1 file(s) successfully; 0 failed.\n",
)
def test_import_data_suppresses_per_file_output_with_progress(self) -> None:
progress = MagicMock()
progress.__enter__.return_value = progress
progress.__exit__.return_value = None
progress.console.print = print
progress.add_task.return_value = 1
with (
patch("chromy.handlers.import_data.ingest_file", side_effect=[3, 2]),
patch(
"chromy.handlers.import_data._should_show_progress",
return_value=True,
),
patch("chromy.handlers.import_data.Progress", return_value=progress),
):
exit_code, output = _capture_output(
handle_import,
"notes",
["romeo_and_juliet.txt", "README.md"],
)
self.assertEqual(exit_code, 0)
self.assertEqual(output, "Imported 2 file(s) successfully; 0 failed.\n")
def test_import_data_truncates_long_file_names_in_progress(self) -> None:
progress = MagicMock()
progress.__enter__.return_value = progress
progress.__exit__.return_value = None
progress.console.print = print
progress.add_task.return_value = 1
with (
patch(
"chromy.handlers.import_data._get_absolute_path",
side_effect=[
"/tmp/this_is_a_very_long_file_name.txt",
self._fixture_path("README.md"),
"/tmp/this_is_a_very_long_file_name.txt",
self._fixture_path("README.md"),
],
),
patch("chromy.handlers.import_data._import_one", return_value=3),
patch(
"chromy.handlers.import_data._should_show_progress",
return_value=True,
),
patch("chromy.handlers.import_data.Progress", return_value=progress),
):
handle_import(
"notes",
["this_is_a_very_long_file_name.txt", "README.md"],
)
progress.update.assert_any_call(
1,
description="Importing [bold]this_is_a_very_lo...[/]...",
)
def test_query_uses_typed_input(self) -> None:
query_result = {"ids": [["1"]], "documents": [["hello"]]}
@@ -146,7 +279,7 @@ class HandlerTests(unittest.TestCase):
def _capture_output(
handler: Callable[..., int],
*arguments: CommandT,
*arguments: object,
) -> tuple[int, str]:
output = io.StringIO()
+67
@@ -0,0 +1,67 @@
from __future__ import annotations
import unittest
from unittest.mock import MagicMock, call, patch
from chromy.utilities import ingest_file
class UtilityTests(unittest.TestCase):
def test_ingest_file_adds_new_file_without_deleting(self) -> None:
chunks = ["chunk 1", "chunk 2"]
embeddings = [
{"text": "chunk 1", "embedding": [0.1, 0.2]},
{"text": "chunk 2", "embedding": [0.3, 0.4]},
]
with (
patch("chromy.utilities.has_data_for_file", return_value=False) as has_data,
patch("chromy.utilities.delete_data") as delete_data,
patch("chromy.utilities.chunk_file", return_value=chunks) as chunk_file,
patch("chromy.utilities.embed", return_value=embeddings) as embed,
patch("chromy.utilities.add_data") as add_data,
):
records_added = ingest_file("notes", "/tmp/play.txt")
has_data.assert_called_once_with("notes", "/tmp/play.txt")
delete_data.assert_not_called()
chunk_file.assert_called_once_with("/tmp/play.txt")
embed.assert_called_once_with(chunks)
add_data.assert_called_once_with("notes", embeddings, "/tmp/play.txt")
self.assertEqual(records_added, 2)
def test_ingest_file_replaces_existing_file_records_before_adding(self) -> None:
chunks = ["chunk 1"]
embeddings = [{"text": "chunk 1", "embedding": [0.1, 0.2]}]
manager = MagicMock()
with (
patch("chromy.utilities.has_data_for_file", return_value=True) as has_data,
patch("chromy.utilities.delete_data") as delete_data,
patch("chromy.utilities.chunk_file", return_value=chunks) as chunk_file,
patch("chromy.utilities.embed", return_value=embeddings) as embed,
patch("chromy.utilities.add_data") as add_data,
):
manager.attach_mock(has_data, "has_data")
manager.attach_mock(delete_data, "delete_data")
manager.attach_mock(chunk_file, "chunk_file")
manager.attach_mock(embed, "embed")
manager.attach_mock(add_data, "add_data")
records_added = ingest_file("notes", "/tmp/play.txt")
self.assertEqual(
manager.mock_calls,
[
call.has_data("notes", "/tmp/play.txt"),
call.delete_data("notes", {"file_name": "/tmp/play.txt"}),
call.chunk_file("/tmp/play.txt"),
call.embed(chunks),
call.add_data("notes", embeddings, "/tmp/play.txt"),
],
)
self.assertEqual(records_added, 1)
if __name__ == "__main__":
unittest.main()
Generated
+2
@@ -261,6 +261,7 @@ dependencies = [
{ name = "openai" },
{ name = "pymupdf4llm" },
{ name = "python-dotenv" },
{ name = "rich" },
{ name = "semchunk" },
{ name = "tiktoken" },
{ name = "transformers" },
@@ -281,6 +282,7 @@ requires-dist = [
{ name = "openai", specifier = ">=2.32.0" },
{ name = "pymupdf4llm", specifier = ">=1.27.2.2" },
{ name = "python-dotenv", specifier = ">=1.2.2" },
{ name = "rich", specifier = ">=15.0.0" },
{ name = "semchunk", specifier = ">=4.0.0" },
{ name = "tiktoken", specifier = ">=0.12.0" },
{ name = "transformers", specifier = ">=5.5.4" },