Revision: 27329
Updated Code
at January 11, 2011 11:01 by tm
Updated Code
# "word" is defined as "space delimited token" - i.e. "one" and "one." are different
# words. The awk-expression is used as "Tokenizer", its results are used to build an
# associative array. Blank lines are assigned to the array-entry a[Blank Line] (which
# cannot result from the tokenizing, since it has a space)

unset a; declare -A a;
while read -r; do
  ! [[ $REPLY ]] && REPLY="Blank Line"
  ((a[\$REPLY]++))
done < <(awk -v OFS=\\n '{$1=$1} 1' ./file)

# print the results, sorted by frequency
for word in "${!a[@]}"; do
  printf '%d\t%s\n' "${a[$word]}" "$word"
done | sort -n

# NOTE: The "\$REPLY" (i.e. the backslash) is needed because in bash4
# a "[" or "]" would otherwise break the expansion/evaluation.
# NOTE ALSO: This is mainly sort of a "proof of concept". It is very slow and would better
# be implemented in e.g. awk completely!
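The closing note above suggests doing the whole job in awk. A minimal sketch of that idea follows, under some assumptions: `count_words` is a hypothetical helper name that does not appear in the original snippet, and lines that are empty or contain only whitespace are counted under the key "Blank Line", mirroring the bash version. It is an illustration, not a tested replacement for the code above.

```shell
#!/usr/bin/env bash
# Sketch: count space-delimited tokens entirely in awk.
# count_words is a hypothetical name chosen for this example.
count_words() {
  awk '
    # Empty or whitespace-only line: count it under a key that
    # cannot collide with a token, because the key contains a space.
    NF == 0 { count["Blank Line"]++; next }

    # Otherwise count every whitespace-delimited field on the line.
    { for (i = 1; i <= NF; i++) count[$i]++ }

    # Print "count<TAB>word" for every key; the order here is arbitrary.
    END { for (w in count) printf "%d\t%s\n", count[w], w }
  ' "$@" | sort -n   # sort numerically by frequency, as in the original
}
```

Usage would look like `count_words ./file`. Because awk builds the array in a single pass, this avoids the per-line `read` loop and the backslash workaround for "[" and "]" in the bash4 version.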
Revision: 27328
Updated Code
at June 7, 2010 20:19 by tm
Updated Code
# "word" is defined as "space delimited token" - i.e. "one" and "one." are different
# words. The awk-expression is used as "Tokenizer", its results are used to build an
# associative array. Blank lines are assigned to the array-entry a[Blank Line] (which
# cannot result from the tokenizing, since it has a space)

unset a; declare -A a;
while read -r; do
  ! [[ $REPLY ]] && REPLY="Blank Line"
  ((a[\$REPLY]++))
done < <(awk -v OFS=\\n '{$1=$1} 1' ./file)

# print the results, sorted by frequency
for word in "${!a[@]}"; do
  printf '%d\t%s\n' "${a[$word]}" "$word"
done | sort -n

# NOTE: The "\$REPLY" (i.e. the backslash) is needed for some unclear reason, because
# otherwise a "[" or "]" would break the expansion/evaluation. Furthermore, it's unclear
# why bash4 has problems with the empty string as key for the assoc. array.
# NOTE ALSO: This is mainly sort of a "proof of concept". It is very slow and would better
# be implemented in e.g. awk completely!
Revision: 27327
Initial Code
Initial URL
Initial Description
Initial Title
Initial Tags
Initial Language
at June 2, 2010 18:38 by tm
Initial Code
# "word" is defined as "space delimited token" - i.e. "one" and "one." are different
# words. The awk-expression is used as "Tokenizer", its results are used to build an
# associative array. Blank lines are assigned to the array-entry a[Blank Line] (which
# cannot result from the tokenizing, since it has a space)

unset a; declare -A a;
while read -r; do
  ! [[ $REPLY ]] && REPLY="Blank Line"
  ((a[\$REPLY]++))
done < <(awk -v OFS=\\n '{$1=$1} 1' ./file)

# print the results, sorted by frequency
for word in "${!a[@]}"; do
  printf '%d\t%s\n' "${a[$word]}" "$word"
done | sort -n

# NOTE: The "\$REPLY" (i.e. the backslash) is needed for some unclear reason, because
# otherwise a "[" or "]" would break the expansion/evaluation
Initial URL
Initial Description
Initial Title
Count Word Frequency in a File (bash4, awk)
Initial Tags
file, Bash
Initial Language
Bash