Separating Storage From llm Embeddings
The `llm` program uses SQLite to store embeddings. I want to store my embeddings in PostgreSQL. Rather than try to alter the built-in storage, I opted for the more unix-friendly approach of simply printing the results to standard output.

This was fairly easy to implement based on code stolen from `llm/cli.py` and `llm/embeddings.py`. It also leaves the logging code in place, which still uses SQLite.
The llm-vector project implements this. As of this writing it lacks tests, because I haven't yet figured out how to adapt the tests from `llm embed` and `llm embed-multi`. The two top-level functions are `llm vector` and `llm vector-multi`, which closely resemble the corresponding `embed` calls. See the module or its help for details.
A couple of differences are worth calling out. Because there is no embedding database, the model becomes a mandatory argument. The `vector` versions do not filter out entries based on an input hash, and there is no accommodation for metadata or collections. Under the hood it's somewhat simpler than the original.
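The stdout-first idea can be sketched in a few lines. This is a hypothetical simplification in Python, not the actual llm-vector code; the function name and output formats are illustrative:

```python
import json

def format_row(key, vector, content, oformat="json"):
    """Render one embedding as a single line of text.

    Sketch of the idea behind llm-vector: instead of persisting to
    SQLite, each (key, vector, content) triple is emitted as text so
    downstream tools can consume it from a pipe.
    """
    if oformat == "tsv":
        # Tab-separated columns: key, vector, content -- the shape that
        # PostgreSQL's COPY ... FROM STDIN WITH (FORMAT text) expects.
        vec = "[" + ",".join(repr(v) for v in vector) + "]"
        return f"{key}\t{vec}\t{content}"
    # Default: one JSON object per line.
    return json.dumps({"id": key, "embedding": vector, "content": content})

# Print a couple of rows to standard output, one line per embedding.
for key, vec, text in [("a.txt", [0.1, 0.2], "hello"), ("b.txt", [0.3, 0.4], "world")]:
    print(format_row(key, vec, text, oformat="tsv"))
```

Because every record is a single line on stdout, the output composes with `psql`, `grep`, redirection, or anything else in a pipeline.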
What can one do with this new code?
Print the embedding for a single string:

```shell
llm vector my-favorite-model -c "The llm-vector module prints embeddings of one or more inputs to standard output."
```
Write multiple embeddings to a file:

```shell
llm vector-multi my-favorite-model --files files-dir '**/*' > output
```
Load embeddings directly into a PostgreSQL table:

```shell
llm vector-multi 3-small --files files-dir '**/*' --oformat tsv | psql connection-string -c 'COPY mytable(key,vector,content) from stdin with (FORMAT text)'
```
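That pipeline assumes a table whose columns line up with the TSV output. A hypothetical definition, assuming the pgvector extension and a model that emits 1536-dimensional vectors (adjust the dimension to your model):

```sql
-- Hypothetical target table for the COPY pipeline above.
-- Assumes the pgvector extension is available.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE mytable (
    key     text PRIMARY KEY,
    vector  vector(1536),  -- adjust to your model's output dimension
    content text
);
```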
I already use PostgreSQL for my other database work, and now I can integrate llm into my workflow. Once again, unix-style programming for the win!