
Mawk or Awk? Munging Large Log Files


I’ve recently needed to process some pretty large log files. The files accrue about 2 GB a week. For now, it’s possible to read the files into R, but very soon it won’t be possible without a key-value store, or without serving them from a MySQL server as Jim Porzack suggests.

Big data is relatively new to me. I don’t have a Hadoop cluster to work on, so going that route doesn’t get me much mileage right now. R only runs on one core by default, though, so I’ve learned that efficiency techniques can make amazing things possible locally. Julia, for example, shines:


Figure: benchmark times relative to C (smaller is better, C performance = 1.0).

Source and all work attributable to: [http://julialang.org](http://julialang.org); C compiled by gcc 4.8.1, taking best timing from all optimization levels (-O0 through -O3). C, Fortran and Julia use OpenBLAS v0.2.8. The Python implementations of rand_mat_stat and rand_mat_mul use NumPy (v1.6.1) functions; the rest are pure Python implementations.
Benchmarks can also be seen here as a plot created with Gadfly.

However, awk and its cousin mawk are not shown in that table. Since I’m working with a log file that’s mostly text, I want a quick and easy way to apply regular expressions. The file is not fully structured, so it will need to be parsed.
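To give a flavor of what that kind of parsing looks like in awk, here is a minimal sketch. The log format, the field names (`user=`, `action=`) and the sample lines are all made up for illustration; my real log file looks different.

```shell
# Hypothetical sample log; the format is an assumption for illustration only.
log=$(mktemp)
cat > "$log" <<'EOF'
2014-03-01 12:00:01 INFO user=alice action=login
2014-03-01 12:00:05 ERROR user=bob action=upload code=500
garbage line without structure
2014-03-01 12:01:10 ERROR user=alice action=download code=404
EOF

# For every line containing ERROR, extract the user=... token with a
# regular expression and count the errors per user.
errors=$(awk '
  /ERROR/ {
    if (match($0, /user=[a-z]+/)) {
      # RSTART and RLENGTH are set by match(); skip the 5 chars of "user="
      user = substr($0, RSTART + 5, RLENGTH - 5)
      count[user]++
    }
  }
  END { for (u in count) print u, count[u] }
' "$log" | sort)

echo "$errors"
rm -f "$log"
```

Lines that don’t match the pattern are simply skipped, which is what makes this approach comfortable for files that are only partly structured.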

I’ve recently been exploring Unix tools. I moved from Windows to Unix about a year ago and haven’t looked back.

Last weekend I spent about three hours learning awk with this gem of a book.

I knew that the program was great at text processing, but I wasn’t sure if there was something better.

The response in this StackOverflow post pushed me over the edge:

If you quickly learn the basics of awk, you can indeed do amazing things on the command line.

But the real reason to learn awk is to have an excuse to read the superb book The AWK Programming Language by its authors Aho, Kernighan, and Weinberger. You would think, from the name, that it simply teaches you awk. Actually, that is just the beginning. Launching into the vast array of problems that can be tackled once one is using a concise scripting language that makes string manipulation easy — and awk was one of the first — it proceeds to teach the reader how to implement a database, a parser, an interpreter, and (if memory serves me) a compiler for a small project-specific computer language! If only they had also programmed an example operating system using awk, the book would have been a fairly complete survey introduction to computer science!

Famously clear and concise, like the original C Language book, it also is a wonderful example of friendly technical writing done right. Even the index is a piece of craftsmanship.

Awk? If you know it, you’ll use it at the command-line occasionally, but for anything larger you’ll feel trapped, unable to access the wider features of your system and the Internet that something like Python provides access to. But the book? You’ll always be glad you read it!

So I read the book. The poster is right: the book is very clear, and within an hour or so it lets you do some pretty neat things with the language.

As an aside, I’m also very interested in sed. It’s a stream editor, which means that it applies rules to lines of data one at a time and can process HUGE files without loading them into memory all at once.
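A minimal sketch of that line-at-a-time behavior, assuming a sed that supports `-E` for extended regular expressions (GNU and BSD sed both do). The input lines are invented for the example; sed reads them from the pipe, rewrites each one, and never needs the whole stream at once.

```shell
# Mask anything that looks like an IPv4 address, one line at a time.
# The three input lines are hypothetical sample data.
masked=$(printf '%s\n' 'ip=10.0.0.1 ok' 'no address here' 'ip=192.168.1.7 fail' \
  | sed -E 's/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/x.x.x.x/')

echo "$masked"
```

The same command works unchanged if the input is a multi-gigabyte file instead of three lines from `printf`.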

After playing around with awk a bit, I came across a blog post with a confusing title. The post compares mawk and awk in terms of speed. mawk is much faster and can apparently even beat C++ and Java! The post also talks about historical problems with mawk, but it appears that those have been fixed. Still, I definitely plan to cross-check mawk’s results against awk.

Here is my simple speed comparison, which just counts 5.2 million records:

mawk is about twice as fast as awk!
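A comparison of this kind can be sketched roughly as follows. This is not the original test: it assumes bash (for the `time` keyword), generates a hypothetical 100,000-line stand-in file rather than the 5.2 million real records, and skips any interpreter that isn’t installed.

```shell
# Build a throwaway file of 100000 lines (a stand-in for the real log).
data=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 100000; i++) print "record", i }' > "$data"

# Time each available interpreter counting records.
# NR holds the number of records read so far, so in END it is the total.
for interp in awk mawk gawk; do
  if command -v "$interp" >/dev/null 2>&1; then
    echo "== $interp"
    time "$interp" 'END { print NR }' "$data"
  fi
done

# Keep the count itself so it can be checked independently of the timings.
records=$(awk 'END { print NR }' "$data")
rm -f "$data"
```

On many Linux distributions `awk` is a symlink to either mawk or gawk, so it’s worth checking which implementation each name actually resolves to before reading anything into the timings.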

Finally, the post mentions something exciting. It imagines an awk on LLVM. Since the post is a couple of years old, I did a quick search and found what appears to be an implementation!