Posted By

tm on 06/02/10


Tagged

file Bash awk frequency word


Count Word Frequency in a File (bash4, awk)


 / Published in: Bash
 

# A "word" is defined as a whitespace-delimited token, i.e. "one" and "one." are different
# words. The awk expression acts as the tokenizer (it prints one token per line), and its
# output is used to build an associative array. Blank lines are counted under the key
# a[Blank Line], which cannot result from tokenizing because it contains a space.
unset a; declare -A a
while read -r; do
    [[ -z $REPLY ]] && REPLY="Blank Line"
    ((a[\$REPLY]++))
done < <(awk -v OFS=\\n '{$1=$1} 1' ./file)
# print the results, sorted by frequency (ascending)
for word in "${!a[@]}"; do
    printf '%d\t%s\n' "${a[$word]}" "$word"
done | sort -n

# NOTE: The "\$REPLY" (i.e. the backslash) is needed in bash 4; otherwise a token
# containing "[" or "]" would break the expansion/evaluation.
# NOTE ALSO: This is mainly a proof of concept. It is very slow and would be better
# implemented entirely in awk.
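
As the note above suggests, the counting can be done entirely in awk. What follows is only a minimal sketch of that idea, not the original author's code: the function name word_freq and the array name count are made up for illustration. It keeps the same conventions as the snippet (whitespace-delimited tokens, blank lines counted under "Blank Line", output sorted ascending by count).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original snippet):
# count whitespace-delimited tokens in the file given as $1, using awk only.
word_freq() {
    awk '
        # Empty or whitespace-only lines have no fields; count them separately.
        NF == 0 { count["Blank Line"]++; next }
        # Every field of a non-blank line is one "word".
        { for (i = 1; i <= NF; i++) count[$i]++ }
        # Print "count<TAB>word" for each distinct token (order is unspecified here).
        END { for (w in count) printf "%d\t%s\n", count[w], w }
    ' "$1" | sort -n   # numeric sort on the leading count, ascending
}
```

Usage would be along the lines of: word_freq ./file
Because awk indexes its arrays by plain strings, tokens containing "[" or "]" need no special escaping in this variant.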
