Split for Embedding
Different embedding models have different limits on input tokens. When you want to create embeddings for a large corpus, one major annoyance is splitting the content so that each piece fits within the model’s limit.
My recent work has centered around Simon Willison’s excellent llm tool. He also wrote ttok, a command-line tool for counting tokens based on tiktoken. While it lacks options to split, it was not difficult to add one in the form of a --chunksize int option.
When invoked with --chunksize, ttok prints output in sets of the given size. This doesn’t make much sense when printing the text out, as happens with --truncate. It’s more sensible but not very useful with --tokens, which prints the tokens in legible form. But with --encode, you get one line of token numbers per chunk. That lets you do this:
ttok -m model -i infile --encode --chunksize 1000 | split -l 1 - outprefix.

Each line, and with it each chunk, ends up in its own file: outprefix.aa, outprefix.ab, and so on. (The lone - tells split to read from the pipe, so outprefix. is taken as the output prefix.)
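If numbered chunk files are easier to work with, GNU split can produce those too; a variant along these lines (the four-digit suffix width is just an example) writes outprefix.0000, outprefix.0001, and so on:

ttok -m model -i infile --encode --chunksize 1000 | split -l 1 -d -a 4 - outprefix.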
Iterating over the chunked output with ttok --decode then yields your chunks. Very simple.
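For instance, a minimal loop over the chunk files might look like this (assuming the files came from the split invocation above, and that ttok accepts the token numbers on standard input just as it does as command-line arguments):

for f in outprefix.*; do
  ttok -m model --decode < "$f" > "$f.txt"
done

Each resulting .txt file then holds one decoded chunk, ready to hand to the embedding model.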
After implementing this I discovered a much earlier conversation that covers splitting and converting back to text in one pass. As it happens, I started with that in mind as well, but saw that working with encoded output gave line-oriented results. This meant ttok could do less, and split could do the rest, which appealed to my taste for Unix-style simplicity. You can find ttok with --chunksize on the “chunksize” branch here. The packaging is incomplete. If it is useful to you, let me know and I’ll complete the work needed for a proper pull request.