view all posts by Tom Limoncelli

Simple bucket-ized stats in awk

Tom Limoncelli Posted by Tom Limoncelli | Fri, 15 Aug 2014
see the original posting from Everything Sysadmin

Someone recently asked how to take a bunch of numbers from STDIN and then break them down into distribution buckets. This is simple enough that it should be do-able in awk.

Here's a simple script that will generate 100 random numbers. Bucketize them to the nearest multiple of 10, print based on # of items in bucket:

while true ; do echo $[ 1 + $[ RANDOM % 100 ]] ; done | head -100 | awk '{ bucket = int(($1 + 5) / 10) * 10 ; arr[bucket]++} END { for (i in arr) {print i, arr[i] }}' | sort -k2n,2 -k1n,1

Many people don't know that in bash, a single quote can go over multiple lines. This makes it very easy to put a little bit of awk right in the middle of your code, eliminating the need for a second file that contains the awk code itself. Since you can put newlines anywhere, you can make it very readable:

#!/bin/bash

while true ; do
  echo $[ 1 + $[ RANDOM % 100 ]]
done | head -100 | \
  awk '
      {
        bucket = int(($1 + 5) / 10) * 10 ;
        arr[bucket]++
      }
      END {
        for (i in arr) {
          print i, arr[i]
        }
      }
' | sort -k2n,2 -k1n,1

If you want to sort by the buckets, change the sort to sort -k1n,1 -k2n,2

If you want to be a little more fancy, separate out the bucket function into a separate function. What? awk can do functions? Sure it can. You can also import values from the environment using the -v flag.

#!/bin/bash

# Bucketize stdin to nearest multiple of argv[1], or 10 if no args given.
# "nearest" means 0..4.999 -> 0, 5..14.999 -> 10, etc.

# Usage:
# while true ; do echo $[ 1 + $[ RANDOM % 100 ]]; done | head -99 | bucket.sh 8

awk -v multiple="${1:-10}" '

function bucketize(a) {
  # Round to the nearest multiple of "multiple"
  #  (nearest... i.e. may round up or down)
  return int((a + (multiple/2)) / multiple) * multiple;
}

# All lines get bucketized.
{ arr[bucketize($1)]++ }

# When done, output the array.
END {
  for (i in arr) {
    print i, arr[i]
  }
}
' | sort -k2n,2 -k1n,1

I generally use Python for scripting but for something this short, awk makes sense. Sadly using awk has become a lost art.


see the original posting from Everything Sysadmin

Back to top