Count Word Frequency in a File (bash4, awk) - Bash Snipplr Social Repository

Revision: 27329

at January 11, 2011 11:01 by tm

Updated Code

# A "word" is defined as a "space delimited token" - i.e. "one" and "one." are different
# words. The awk expression is used as a "tokenizer"; its results are used to build an
# associative array. Blank lines are assigned to the array entry a[Blank Line] (which
# cannot result from the tokenizing, since it contains a space).
unset a; declare -A a;
while read -r; do
    ! [[ $REPLY ]] && REPLY="Blank Line"
    ((a[\$REPLY]++))
done < <(awk -v OFS=\\n '{$1=$1} 1' ./file)
# print the results, sorted by frequency
for word in "${!a[@]}"; do
    printf '%d\t%s\n' "${a[$word]}" "$word"
done | sort -n

# NOTE: The backslash in "\$REPLY" is needed in bash4 because otherwise a "[" or "]"
# in a token would break the expansion/evaluation.
# NOTE ALSO: This is mainly a "proof of concept". It is very slow and would be better
# implemented entirely in e.g. awk!
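
The note above suggests doing the whole job in awk. A minimal sketch of that idea follows, assuming the same definition of a "word" (whitespace-delimited token) and the same "Blank Line" sentinel; the function name `word_freq` is hypothetical and not part of the original snippet.

```shell
# Hypothetical helper: count whitespace-delimited tokens entirely in awk.
# Reads the files given as arguments, or stdin if none are given.
word_freq() {
    awk '
        # Lines with no fields (empty or whitespace-only) share one sentinel key.
        NF == 0 { count["Blank Line"]++; next }
        # Every field on the line is one token.
        { for (i = 1; i <= NF; i++) count[$i]++ }
        # Print "frequency<TAB>token" for each distinct token.
        END { for (w in count) printf "%d\t%s\n", count[w], w }
    ' "$@" | sort -n
}
```

Usage would be `word_freq ./file`. Because awk keeps the counts in its own associative array, no bash 4 features are needed and the per-line `read` loop disappears.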

Revision: 27328

at June 7, 2010 20:19 by tm

Updated Code

# A "word" is defined as a "space delimited token" - i.e. "one" and "one." are different
# words. The awk expression is used as a "tokenizer"; its results are used to build an
# associative array. Blank lines are assigned to the array entry a[Blank Line] (which
# cannot result from the tokenizing, since it contains a space).
unset a; declare -A a;
while read -r; do
    ! [[ $REPLY ]] && REPLY="Blank Line"
    ((a[\$REPLY]++))
done < <(awk -v OFS=\\n '{$1=$1} 1' ./file)
# print the results, sorted by frequency
for word in "${!a[@]}"; do
    printf '%d\t%s\n' "${a[$word]}" "$word"
done | sort -n

# NOTE: The backslash in "\$REPLY" is needed for some unclear reason, because
# otherwise a "[" or "]" would break the expansion/evaluation. Furthermore, it is unclear
# why bash4 has problems with the empty string as a key for the associative array.
# NOTE ALSO: This is mainly a "proof of concept". It is very slow and would be better
# implemented entirely in e.g. awk!
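
One way to sidestep the backslash question is to avoid putting the subscript inside `(( ))` at all and to use a plain assignment instead. The sketch below assumes bash 4 or later; the function name `count_lines` and the variable names are hypothetical. It keeps the "Blank Line" sentinel, since the empty string is not accepted as a key.

```shell
# Hypothetical helper: count identical lines from stdin in an associative array.
# Requires bash 4+ (associative arrays).
count_lines() {
    local -A counts=()
    local line key
    while IFS= read -r line; do
        # The empty string cannot be used as a key, so map it to a sentinel.
        [[ -n $line ]] || line="Blank Line"
        # Plain assignment: the subscript is expanded once, outside (( )).
        counts[$line]=$(( ${counts[$line]:-0} + 1 ))
    done
    # Print "frequency<TAB>line", sorted by frequency.
    for key in "${!counts[@]}"; do
        printf '%d\t%s\n' "${counts[$key]}" "$key"
    done | sort -n
}
```

It could be fed by the same tokenizer as the original, e.g. `awk -v OFS='\n' '{$1=$1} 1' ./file | count_lines`.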

Revision: 27327

at June 2, 2010 18:38 by tm

Initial Code

# A "word" is defined as a "space delimited token" - i.e. "one" and "one." are different
# words. The awk expression is used as a "tokenizer"; its results are used to build an
# associative array. Blank lines are assigned to the array entry a[Blank Line] (which
# cannot result from the tokenizing, since it contains a space).
unset a; declare -A a;
while read -r; do
    ! [[ $REPLY ]] && REPLY="Blank Line"
    ((a[\$REPLY]++))
done < <(awk -v OFS=\\n '{$1=$1} 1' ./file)
# print the results, sorted by frequency
for word in "${!a[@]}"; do
    printf '%d\t%s\n' "${a[$word]}" "$word"
done | sort -n

# NOTE: The "\$REPLY" (i.e. the backslash) is needed for some unclear reason, because
# otherwise a "[" or "]" would break the expansion/evaluation

Initial Title

Count Word Frequency in a File (bash4, awk)

Initial Tags

file, Bash

Initial Language

Bash
