Streaming IO

Processing 100GB video on 128MB RAM


Code Vs Engineered Software

Here at Kablamo we engineer software solutions. What does that mean, exactly? Well, I could build you a bridge. I would buy some timber and some screws and join them all up. It might look OK, and it would work to some extent - but how would you know to what extent exactly? Could you drive across it with your family? Would you choose to depend on it for your livelihood? How would you calculate the bridge's long-term risk?

If you get an engineer to build you a bridge, they will tell you exactly how much load the bridge can bear and what its likely failure modes will be.

That's engineering.

Consider the following Go function. It redacts the word "go" and replaces it with "XX".

```go
func redacter(infilename, outfilename string) error {
	res, err := ioutil.ReadFile(infilename)
	if err != nil {
		return err
	}

	redacted := strings.ReplaceAll(string(res), "go", "XX")

	if err := ioutil.WriteFile(outfilename, []byte(redacted), 0644); err != nil {
		return err
	}
	return nil
}
```

This code might look OK, and it will work to some extent. It is great for a GoDoc example or Stack Overflow post because it is short, and you can debug it easily by printing out res and redacted. It is comfortable for people who are used to developing in interpreted languages, because they can step through line by line almost like having a REPL.

So can you use it in production? Can you stake the profitability of your business and your reputation on it?

This function is guaranteed to crash its host process at some point.

Why?

It reads a complete file into memory with no knowledge or control over the maximum size of that file. This function will clearly use at least 2 times the size of the input file in RAM.

You could kick the can down the road by over-engineering the infrastructure to have "more than enough" RAM. That is hoping for the best from users, who should be assumed to be malicious or at least clumsy. Fuzzing a target with random large and malformed inputs is an elementary vulnerability scanning technique that an attacker can use to exfiltrate sensitive data in the worst case and take you offline in the absolute best case.

Furthermore, there is no way to calculate the degree of parallelism you could run this redactor at, even in its over-engineered state. Go's clear and concise concurrency model is one of its major advantages over other languages. This function makes it impossible to use.

So - as software and infrastructure engineers - how can we confidently calculate this task's resource usage, to ensure high service availability and high performance by utilising available resources in parallel?

Pipes and Streams

When you open a video in a video player, it does not load it into memory. It buffers the contents in as required. You can watch an 8GB video on a phone with 2GB RAM.

This model is older than digital video. This is what Unix pipes do. When you grep a file, it loads part of the file into its buffer and returns results as it moves its buffer through the file. That way the next processor in the chain can begin working too. All before the file is finished reading in.

So what does a production grade redactor that makes use of streamed input look like?

```go
func bufferedRedacter(infilename, outfilename string) error {
	// Open an input stream, NOT a byte slice
	inFile, err := os.Open(infilename)
	if err != nil {
		return err
	}
	defer inFile.Close()

	// Open an output stream
	outFile, err := os.Create(outfilename)
	if err != nil {
		return err
	}
	defer outFile.Close()

	// Redact the input line by line
	scanner := bufio.NewScanner(inFile)
	for scanner.Scan() {
		fmt.Fprintln(outFile, strings.ReplaceAll(scanner.Text(), "go", "XX"))
	}
	return scanner.Err()
}
```

The input and output file descriptors are opened. As the input file is copied to the output file it is redacted. These two could be any io.Reader/io.Writer pair so could just as easily be STDIN/STDOUT and the program used to edit the stream between Unix pipes like so:

('stream editor' ... catchy name. Where have I heard that before 🤔)

```bash
cat infile | redactor | grep "regex" > outfile
```

They could also be streamed HTTP Requests/Responses.

So how do the resource usage and execution time compare for the two options?

Using an input text file of size 9.4MB, we see the following performance:

The byte slice redactor uses a cumulative total of around 2 times the input file in memory, as we expect: 18.61MB. This will keep growing with the input file size, without bound.

In total the buffered redactor uses 12.75kB irrespective of the input size. This is approximately 3 times the size of the default read buffer.

It's unlikely that any infrastructure component we'd deploy to would have less than 128MB RAM. In practice we increase the read buffer to make fuller use of the available system memory and to reduce syscalls or network traffic, noting that this kind of task is CPU bound. These kinds of optimisations are not the focus of this blog post.

Summary

The point here is that we can design processors that can work on arbitrarily large input files using less than 10kB of memory.

This is how Kablamo engineers solutions for rapid and parallel video processing tasks, on input files up to 100GB, using containers and AWS Lambda functions without persistent disk. Such tasks include transcoding, clipping, metadata extraction and audio waveform generation.