Here at Kablamo we engineer software solutions. What does that mean exactly? Well, I could build you a bridge. I would buy some timber and some screws and join them all up. It might look OK and it would work to some extent - but how would you know to what extent exactly? Could you drive across it with your family? Would you choose to depend on it for your livelihood? How would you calculate the bridge's risk over the long term?
If you get an engineer to build you a bridge, they will tell you exactly how much load the bridge can bear and what its likely failure modes will be.
That's engineering.
Consider the following Go function. It redacts the word "go", replacing it with "XX".
```go
func redacter(infilename, outfilename string) error {
	// Read the entire input file into memory
	res, err := ioutil.ReadFile(infilename)
	if err != nil {
		return err
	}
	// Build a second in-memory copy with every "go" replaced by "XX"
	redacted := strings.ReplaceAll(string(res), "go", "XX")
	if err := ioutil.WriteFile(outfilename, []byte(redacted), 0644); err != nil {
		return err
	}
	return nil
}
```
This code might look OK, and it will work to some extent. It is great for a GoDoc example or Stack Overflow post because it is short and you can debug it easily by printing out `res` and `redacted`. It is comfortable for people who are used to developing in interpreted languages because they can step through it line by line, almost like having a REPL.
So can you use it in production? Can you stake the profitability of your business and your reputation on it?
This function is guaranteed to crash its host process at some point.
Why?
It reads a complete file into memory with no knowledge of, or control over, the maximum size of that file. This function will clearly use at least twice the size of the input file in RAM: once for the `res` byte slice and again for the `redacted` string.
You could kick the can down the road by over-engineering the infrastructure to have "more than enough" RAM. That is hoping for the best from users who should be assumed to be malicious, or at least clumsy. Fuzzing a target with large, random and malformed inputs is an elementary vulnerability scanning technique that an attacker can use to exfiltrate sensitive data in the worst case, or take you offline in the absolute best case.
Furthermore, there is no way to calculate the degree of parallelism at which you could run this redactor, even in its over-engineered state. Go's clear and concise concurrency model is one of its major advantages over other languages, and this function makes it impossible to use.
So - as software and infrastructure engineers - how can we confidently calculate this task's resource usage, to ensure high service availability and high performance by utilising available resources in parallel?
When you open a video in a video player, it does not load it into memory. It buffers the contents in as required. You can watch an 8GB video on a phone with 2GB RAM.
This model is older than digital video. This is what Unix pipes do. When you `grep` a file, it loads part of the file into its buffer and returns results as it moves its buffer through the file. That way the next processor in the chain can begin working too. All before the file is finished reading in.
So what does a production-grade redactor that makes use of streamed input look like?
```go
func bufferedRedacter(infilename, outfilename string) error {
	// Open input stream NOT byte slice
	inFile, err := os.Open(infilename)
	if err != nil {
		return err
	}
	defer inFile.Close()

	// Open output stream
	outFile, err := os.Create(outfilename)
	if err != nil {
		return err
	}
	defer outFile.Close()

	// Redact the input line by line
	scanner := bufio.NewScanner(inFile)
	for scanner.Scan() {
		fmt.Fprintln(outFile, strings.ReplaceAll(scanner.Text(), "go", "XX"))
	}
	return scanner.Err()
}
```
The input and output file descriptors are opened. As the input file is copied to the output file, it is redacted. These two could be any `io.Reader`/`io.Writer` pair, so could just as easily be STDIN/STDOUT and the program used to edit the stream between Unix pipes like so:
('stream editor' ... catchy name. Where have I heard that before 🤔)
```bash
cat infile | redactor | grep "regex" > outfile
```
They could also be streamed HTTP Requests/Responses.
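As a sketch of that idea (the names here are illustrative, not code from the function above), pulling the redaction loop out into a function that accepts any `io.Reader`/`io.Writer` pair lets the same logic serve files, Unix pipes and streamed HTTP:

```go
// redact copies src to dst line by line, replacing "go" with "XX".
// Hypothetical generalisation of bufferedRedacter over any Reader/Writer pair.
func redact(dst io.Writer, src io.Reader) error {
	scanner := bufio.NewScanner(src)
	for scanner.Scan() {
		fmt.Fprintln(dst, strings.ReplaceAll(scanner.Text(), "go", "XX"))
	}
	return scanner.Err()
}

// Between Unix pipes:
//   err := redact(os.Stdout, os.Stdin)
//
// As a streamed HTTP handler:
//   http.HandleFunc("/redact", func(w http.ResponseWriter, r *http.Request) {
//       if err := redact(w, r.Body); err != nil {
//           http.Error(w, err.Error(), http.StatusInternalServerError)
//       }
//   })
```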
So how do the resource usage and execution time compare for the two options?
Using a 9.4MB input text file, we see the following performance:
The byte slice redactor uses a cumulative total of around twice the input file size in memory, 18.61MB, as we expect. This grows without bound as the input file grows.
In total the buffered redactor uses 12.75kB irrespective of the input size. This is approximately 3 times the size of the default read buffer.
It's unlikely that any infrastructure component we'd deploy to would have less than 128MB RAM. In practice we increase the read buffer to more fully utilise the available system memory and to reduce syscalls or network traffic, noting that this kind of task is CPU bound. These kinds of optimisations are not the focus of this blog post.
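For illustration only (the 1MB figure below is an assumption, not a tuned recommendation from this post), increasing the scanner's read buffer looks like this:

```go
// Hypothetical: give the scanner a larger read buffer to cut down on
// read syscalls. 1MB is an illustrative figure, not a tuned value.
const maxBuf = 1 << 20 // 1MB

scanner := bufio.NewScanner(inFile)
scanner.Buffer(make([]byte, maxBuf), maxBuf)
```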
The point here is that we can design processors that can work on arbitrarily large input files using on the order of 10kB of memory.
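That fixed footprint is also what makes the degree of parallelism calculable. As a rough sketch (a hypothetical helper, not code from this post), fanning the buffered redactor out across a bounded pool of goroutines might look like this:

```go
// Hypothetical fan-out: because each bufferedRedacter has a small, fixed
// memory footprint, the safe worker count is roughly available RAM divided
// by the per-worker cost (plus CPU considerations).
func redactAll(inputs []string, workers int) error {
	sem := make(chan struct{}, workers) // limit concurrent redactors
	errs := make(chan error, len(inputs))
	var wg sync.WaitGroup

	for _, in := range inputs {
		wg.Add(1)
		sem <- struct{}{}
		go func(in string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := bufferedRedacter(in, in+".redacted"); err != nil {
				errs <- err
			}
		}(in)
	}
	wg.Wait()
	close(errs)
	return <-errs // nil if no worker reported an error
}
```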
This is how Kablamo engineers solutions for rapid, parallel video processing tasks on input files of up to 100GB, using containers and AWS Lambda functions without persistent disk. Such tasks include transcoding, clipping, metadata extraction and audio waveform generation.