Split for Embedding

Different embedding models have different limits on input tokens. When you want to create embeddings for a large corpus, one major annoyance is splitting the content so that each piece fits within the model's limit.

My recent work has centered on Simon Willison’s excellent llm tool. He also wrote ttok, a command-line tool, built on tiktoken, for counting tokens. While it lacks an option for splitting, it was not difficult to add one in the form of a --chunksize int option.

When invoked with --chunksize, ttok prints its output in groups of the given size. That doesn’t make much sense when printing the text itself, as happens with --truncate. It’s more sensible, but not very useful, with --tokens, which prints the tokens in legible form. With --encode, though, it means you get one line of token numbers per chunk. That lets you do this:

ttok -m model -i infile --encode --chunksize 1000 | split -l 1 - outprefix.
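
Here split reads the encoded lines from standard input and, with -l 1, writes each line to its own file, named outprefix.aa, outprefix.ab, and so on. Each file holds the token IDs for one chunk of at most 1000 tokens.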

Iterating over the resulting files with ttok --decode yields your text chunks. Very simple.
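
For example, a loop along these lines (a sketch, assuming the outprefix. files produced above and the same model name) turns each chunk back into text:

for f in outprefix.*; do
  ttok -m model --decode -i "$f" > "$f.txt"
done

Each resulting .txt file is then a piece of the original content that fits within the embedding model’s token limit.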

After implementing this I discovered a much earlier conversation that covers splitting and converting back to text in one pass. As it happens, I started with that in mind as well, but saw that working with the encoded output gave line-oriented results. That meant ttok could do less and split could do the rest, which appealed to my taste for Unix-style simplicity.

You can find ttok with --chunksize on the “chunksize” branch here. The packaging is incomplete; if it is useful to you, let me know and I’ll complete the work needed for a proper pull request.