Unbuffering the Buffered

I've been writing unix-ish code for more than two decades (crap, I'm old!), but last week I discovered something I'd never used before: the stdbuf command. It solves (well, works around) one of my longstanding problems with command line programs: buffering.

How many times have you been working on some code where you need to invoke a command line program? It seems so easy. You just exec the command, using back-ticks or spawn or whatever exec equivalent your programming language has. The result is captured into a string that you then parse. Easy peasy.

Now suppose you need to invoke a long-running command line program. Perhaps it's downloading a file, or you want to monitor all the processes with 'ps'. Whatever it is, you want results back from the program before it finishes. You want the output as it's generated. This seems like it would be simple: if you run it from the command line the output is incremental, often one line at a time. Now call it from your program.

Nothing.  

No output comes into your code. Wait a few minutes. Oh, there it is. The output comes in one giant chunk instead of one line at a time. What happened? Welcome to Unix.

Everything is a file, except when it's not.

Unix's philosophy of "everything is a file" is great as far as it goes, but it's actually a lie. There are many things in Unix which appear to be files but are not. Sure, you can open them like files, but they actually produce data on their own timetable. They might queue up all their data until the program ends, then spit it out at once. Or they might send data in 4K chunks. However, when you run these programs interactively, everything is realtime. What gives? The answer is that the standard input/output streams have extra features to determine whether the other end is interactive or not. Let's dive a bit deeper.

When you are on the command line or in a terminal, you are actually talking to what's called a pseudo-terminal. According to Wikipedia these date back at least to the late 1960s. They process control characters which move the cursor around, set colors, resize the screen, and do other interactive things. When you run a program it receives data on the stdin (standard input) stream. If you run the program interactively then stdin comes from the terminal. If you run it as part of a pipeline, or from within another program, stdin comes from a file or some other input source.

Given that stdin could come from anywhere, how does a program know when to use these interactive control characters? It turns out there are APIs (like isatty) to determine whether the input to a program is actually an interactive terminal or really just a stream of bytes. This is how a program like Vim can be interactive through what is supposedly a file. And this is why the same program will behave differently depending on how you call it. This is why exec'ing a program from your code gives you unexpected behavior: the program is switching from interactive to streaming mode.
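You can see this check directly in Node, which exposes the isatty(3) test as the isTTY property on the standard streams. A minimal sketch:

```javascript
// isTTY is true when stdout is a terminal, and undefined (not false)
// when stdout is a pipe or a file, so normalize it with Boolean().
var interactive = Boolean(process.stdout.isTTY);

if (interactive) {
    console.log('stdout is a terminal, ' + process.stdout.columns + ' columns wide');
} else {
    console.log('stdout is a pipe or a file: just a stream of bytes');
}
```

Run the same script directly and then piped through cat, and you'll see it take different branches — the same switch the programs above are making.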

In Unix terminology, the output switches from line-buffered (or unbuffered) to fully buffered. The buffers have a default size, often 4 KB (depending on your operating system). That's why you get no output for a while and then suddenly get a bunch: the buffer is only flushed once you hit 4096 bytes of data.

So what can you do about it?

In the past I've used a couple of solutions. Some programs have explicit options to turn off the buffering; this might be called interactive mode. Other times I can find a different way to get the data. A last-resort option is to trick the program. In Node I can use a pty module to essentially create a pretend pseudo-terminal (a pseudo-pseudo-terminal?) which then runs the app for me. It's inefficient, but it gets the job done.

However, I've discovered a brand new solution. Or rather, a brand old one. The stdbuf command has been part of the standard GNU coreutils for years. According to the docs it will "allow one to modify the buffering operations of the three standard I/O streams associated with a program." Perfect. Just what we want.

It works like this: call stdbuf with options to change the buffering, followed by the original command you wanted to run. To disable buffering on the input stream use -i 0; use -o 0 for the output stream and -e 0 for the error stream. In my case I wanted to run a Bluetooth scanner with buffering disabled on all three streams. Here's the code in NodeJS:

var child_process = require('child_process');

var scanner = child_process.spawn(
    'stdbuf', //the stdbuf command
    [
        '-i0', '-o0', '-e0', //disable all buffering
        '/usr/bin/hcitool', 'lescan', '--duplicates' //the real program I want to run
    ]);

scanner.stdout.on('data', function(data) {
    console.log('scan: ' + data); //output now arrives as it's generated
});

And that's it! How does it work? Magic, probably. According to this it's an LD_PRELOAD hack, which is basically magic. So there you go.

How have I not heard of this magic program before?!


Posted May 14th, 2015

Tagged: unix nodejs