cat somebook.txt | llm -q "Write a summary of each chapter, 3 bullet points per chapter"
Context windows are crazy long these days. I wrote a little utility that pipes stdin plus your question into an LLM. The result is a tool that can query 50-minute transcript files or 132-page books. I was planning on making a genius RAG system, but it turned out to be all in vain. What I learned is to just chill out and enjoy the long context window.
Retrieval-augmented generation (RAG) is a basic technique for bringing context into an LLM query. You build a search system that can return chunks of relevant information, then give the LLM the ability to call a function that searches for and adds that context to its query. The result is a response with references, better answers, and the ability to pull from a far larger set of sources than would fit directly in the context window.
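To make that concrete, here is a toy sketch of the retrieval step in Go. I'm using naive keyword matching as a stand-in for a real embedding index, and the chunk text and function names are made up for illustration:

package main

import (
    "fmt"
    "sort"
    "strings"
)

// scoreChunk is a stand-in for real vector similarity: it just counts
// how many words of the question appear in the chunk.
func scoreChunk(chunk, question string) int {
    score := 0
    chunkLower := strings.ToLower(chunk)
    for _, w := range strings.Fields(strings.ToLower(question)) {
        if strings.Contains(chunkLower, w) {
            score++
        }
    }
    return score
}

// retrieve returns the topK highest-scoring chunks for the question.
func retrieve(chunks []string, question string, topK int) []string {
    ranked := append([]string(nil), chunks...) // copy so we don't reorder the caller's slice
    sort.SliceStable(ranked, func(i, j int) bool {
        return scoreChunk(ranked[i], question) > scoreChunk(ranked[j], question)
    })
    if topK > len(ranked) {
        topK = len(ranked)
    }
    return ranked[:topK]
}

func main() {
    chunks := []string{
        "Chapter 1 introduces the main character and the setting.",
        "Chapter 2 covers the journey across the mountains.",
        "Chapter 3 is about the final confrontation.",
    }
    question := "What happens in the mountains?"

    // Build the augmented prompt: retrieved context first, question last.
    context := strings.Join(retrieve(chunks, question, 2), "\n")
    prompt := "Context:\n" + context + "\n\nQuestion: " + question
    fmt.Println(prompt)
}

A real system would embed the chunks and the question and rank by vector similarity, but the shape is the same: score the chunks, pick the top few, and prepend them to the prompt.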
I wanted to start asking questions of a 48-minute video transcript. Naturally I assumed this would be way too much content for an LLM to produce decent results from. As a fun exercise I decided to experiment with an algorithm for chunking up the data for a RAG system to pull from.
My idea was:
1. Split the transcript into chunks.
2. Give the LLM the next chunk along with the running list of bullet points.
3. Ask it to rewrite the list so it also covers the new chunk.
4. Repeat until the whole transcript has been processed.
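In code, the idea boils down to a loop like this. This is a stripped-down sketch with the model call stubbed out, not the exact script I ran:

package main

import "fmt"

// summarise carries a running bullet list forward, merging in one chunk
// of the transcript per iteration. askLLM stands in for whatever client
// actually calls the model.
func summarise(chunks []string, askLLM func(prompt string) (string, error)) (string, error) {
    bullets := ""
    for i, chunk := range chunks {
        prompt := "Current bullet points:\n" + bullets +
            "\n\nNew transcript chunk:\n" + chunk +
            "\n\nRewrite the bullet points so they also cover the new chunk."
        updated, err := askLLM(prompt)
        if err != nil {
            return "", fmt.Errorf("iteration %d: %w", i, err)
        }
        bullets = updated // the only state carried between iterations
    }
    return bullets, nil
}

func main() {
    chunks := []string{"first minute of the talk...", "second minute...", "third minute..."}
    // Fake model call so the sketch runs on its own.
    fake := func(prompt string) (string, error) {
        return "- placeholder bullet (a real call would go to the model)", nil
    }
    out, _ := summarise(chunks, fake)
    fmt.Println(out)
}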
After looping through steps 1-4 about 100 times, I did actually get a set of bullet points out of the LLM. You could look through the output and see the full history of points. The algorithm did a kind-of-OK job of summarising the content.
The problem was that the meaning really got lost along the way. Points would get repeated for a number of iterations, then suddenly be dropped, even when they clearly should have been kept. The LLM was losing context and not weighting the important points correctly.
Would this approach eventually work? With incremental improvements, probably. My prompt was far from optimal. If you looked at the unique points in the final list (maybe 50-100 sentences), you could get the gist of the text, but it was still far too repetitive. A second layer of vector indexing over the key bullet points, used to search for context, might have been a cool experiment to enhance the result.
It was not long after this first dumb attempt at using RAG that I came up with an even dumber idea: just give the entire transcript, with no preprocessing, to the LLM and ask the question directly.
I decided to do this after looking at the spec for gemini-1.5-flash and seeing that it supports an input of just over 1 million tokens. WOW! Surprisingly, you can skip the fancy splitting and summarising techniques and just give it the whole content, and that content can be larger than you think if you've spent the last six months not following LLM development.
The 48-minute transcript I passed to the model turned out to be about 400k input tokens, so there was plenty of space for more. I was able to ask simple questions about the content and get an answer back.
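If you want to check the size up front before sending anything, the Gemini Go SDK can count tokens for you. A quick standalone sketch, assuming the github.com/google/generative-ai-go/genai package (my actual tool just logs usage after the call):

package main

import (
    "context"
    "fmt"
    "log"
    "os"

    "github.com/google/generative-ai-go/genai"
    "google.golang.org/api/option"
)

func main() {
    ctx := context.Background()
    client, err := genai.NewClient(ctx, option.WithAPIKey(os.Getenv("GEMINI_API_KEY")))
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    transcript, err := os.ReadFile("transcript.txt")
    if err != nil {
        log.Fatal(err)
    }

    // Ask the model how many tokens the transcript would consume,
    // before deciding whether it fits in the ~1M token window.
    model := client.GenerativeModel("gemini-1.5-flash")
    resp, err := model.CountTokens(ctx, genai.Text(string(transcript)))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("input tokens:", resp.TotalTokens)
}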
I created a command line llm binary that takes input from stdin, with a parameter for an additional question. No fancy promptcraft or engineering was needed. With just a simple input plus question, I was able to chat with video transcripts and 132-page books. The whole thing has been an amazing little experiment. Now I can quickly ask a question from the terminal, or give it a code file and have it tell me something about it. Given the huge context window you can just chill out and not worry about clever chunking or RAG algorithms.
Here is the entire code listing:

package main

import (
    "context"
    "flag"
    "fmt"
    "go-agent/llm"
    "io"
    "log/slog"
    "os"
)

var (
    keyPath  = flag.String("key", mustStr(os.UserHomeDir)+"/key", "Path to the API key file")
    question = flag.String("q", "", "Question to ask the model")
)

func main() {
    flag.Parse()

    ctx := context.Background()
    b, err := os.ReadFile(*keyPath)
    if err != nil {
        slog.Error("Error reading key file", "err", err)
        os.Exit(1)
    }
    chat, err := llm.NewClientFromKeyfile(ctx, string(b))
    if err != nil {
        slog.Error("Error creating genai client", "err", err)
        os.Exit(1)
    }

    // read input from stdin if something was piped in
    stat, err := os.Stdin.Stat()
    if err != nil {
        slog.Error("Error getting stdin stat", "err", err)
        os.Exit(1)
    }
    isPipe := (stat.Mode() & os.ModeCharDevice) == 0

    var input []byte
    if isPipe {
        input, err = io.ReadAll(os.Stdin)
        if err != nil {
            slog.Error("Error reading stdin", "err", err)
            os.Exit(1)
        }
    }

    // the prompt is simply the piped content followed by the -q question
    fullQuestion := string(input) + "\n" + *question
    slog.Debug("Read input", "input", fullQuestion)
    slog.Info("Sending message to model")
    msg, err := chat.Send(ctx, fullQuestion)
    if err != nil {
        slog.Error("Error sending message", "err", err)
        os.Exit(1)
    }

    fmt.Println(msg.Text())
    slog.Info("Tokens used:", "input", chat.Tokens.Input, "output", chat.Tokens.Output)
}

// mustStr unwraps a (string, error) pair or exits.
func mustStr(s func() (string, error)) string {
    v, err := s()
    if err != nil {
        slog.Error("Error getting string", "err", err)
        os.Exit(1)
    }
    return v
}
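The go-agent/llm package is a thin wrapper and isn't shown above. If you want to reproduce it, a minimal version built on the official github.com/google/generative-ai-go/genai SDK could look roughly like this. It's a sketch of the shape, not my exact code:

package llm

import (
    "context"
    "errors"
    "strings"

    "github.com/google/generative-ai-go/genai"
    "google.golang.org/api/option"
)

// TokenCounts accumulates usage across calls.
type TokenCounts struct {
    Input  int32
    Output int32
}

// Chat wraps a single Gemini model.
type Chat struct {
    model  *genai.GenerativeModel
    Tokens TokenCounts
}

// Message wraps a model response.
type Message struct {
    text string
}

func (m *Message) Text() string { return m.text }

// NewClientFromKeyfile builds a client from the raw contents of a key file.
func NewClientFromKeyfile(ctx context.Context, key string) (*Chat, error) {
    client, err := genai.NewClient(ctx, option.WithAPIKey(strings.TrimSpace(key)))
    if err != nil {
        return nil, err
    }
    return &Chat{model: client.GenerativeModel("gemini-1.5-flash")}, nil
}

// Send submits the whole prompt (piped content plus question) in one request.
func (c *Chat) Send(ctx context.Context, prompt string) (*Message, error) {
    resp, err := c.model.GenerateContent(ctx, genai.Text(prompt))
    if err != nil {
        return nil, err
    }
    if len(resp.Candidates) == 0 || resp.Candidates[0].Content == nil {
        return nil, errors.New("no candidates returned")
    }
    var sb strings.Builder
    for _, part := range resp.Candidates[0].Content.Parts {
        if t, ok := part.(genai.Text); ok {
            sb.WriteString(string(t))
        }
    }
    if resp.UsageMetadata != nil {
        c.Tokens.Input += resp.UsageMetadata.PromptTokenCount
        c.Tokens.Output += resp.UsageMetadata.CandidatesTokenCount
    }
    return &Message{text: sb.String()}, nil
}

With something like that in place, a plain go build -o llm . gives you the binary used in the command at the top of the post.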
If you are a developer, you should definitely be exploring this stuff. You can get a lot of use out of small command line utilities, and the capabilities of LLMs are rapidly evolving. The development experience with this kind of project is also pretty fun: you get a really enjoyable non-deterministic output that is a bit surprising sometimes. It reminds me a lot of game development, where you play with the program a bit to see how it feels before tapping out your next iteration.