Posted By


ctran on 09/05/07

VietnameseAnalyzer.rb


Published in: Ruby

Converts accented Vietnamese characters to plain ASCII letters so that text can be indexed and searched with Ferret.


require 'ferret'
require 'unicode'

# Token filter that lower-cases token text with Unicode-aware case
# mapping, so accented capitals such as 'Đ' are handled too.
class UnicodeLowerCaseFilter
  def initialize(token_stream)
    @input = token_stream
  end

  # Forward new input text to the wrapped token stream.
  def text=(text)
    @input.text = text
  end

  # Return the next token with its text lower-cased, or nil at the end.
  def next
    t = @input.next
    return nil if t.nil?

    t.text = Unicode.downcase(t.text)
    t
  end
end

class VietnameseAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis

  # Maps each accented Vietnamese letter to its unaccented base letter
  # so that only plain ASCII characters get indexed.
  CHARACTER_MAPPINGS = {
    ['á','à','ạ','ả','ã','ă','ắ','ằ','ặ','ẳ','ẵ','â','ấ','ầ','ậ','ẩ','ẫ'] => 'a',
    ['đ'] => 'd',
    ['é','è','ẹ','ẻ','ẽ','ê','ế','ề','ệ','ể','ễ'] => 'e',
    ['í','ì','ị','ỉ','ĩ'] => 'i',
    ['ó','ò','ọ','ỏ','õ','ơ','ớ','ờ','ợ','ở','ỡ','ô','ố','ồ','ộ','ổ','ỗ'] => 'o',
    ['ú','ù','ụ','ủ','ũ','ư','ứ','ừ','ự','ử','ữ'] => 'u',
    ['ý','ỳ','ỵ','ỷ','ỹ'] => 'y'
  } unless defined?(CHARACTER_MAPPINGS)

  # Tokenize, lower-case, then fold accents. Lower-casing runs first
  # because the mapping table lists lower-case letters only.
  def token_stream(field, str)
    ts = StandardTokenizer.new(str)
    ts = UnicodeLowerCaseFilter.new(ts)
    MappingFilter.new(ts, CHARACTER_MAPPINGS)
  end
end
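
To sanity-check the mapping table without installing Ferret, the same lower-case-then-map idea can be sketched in plain Ruby. This is only an approximation of what the analyzer does, not Ferret's MappingFilter itself: the names GROUPED, FLAT, PATTERN and fold_vietnamese are made up for this example, it uses Ruby's built-in String#downcase (Unicode-aware since Ruby 2.4) instead of the unicode gem, and it assumes the input uses precomposed (NFC) characters.

```ruby
# Hypothetical stand-alone check; not part of the original analyzer.
# Stand-alone copy of the grouped table: { [accented letters] => base letter }.
GROUPED = {
  %w[á à ạ ả ã ă ắ ằ ặ ẳ ẵ â ấ ầ ậ ẩ ẫ] => 'a',
  %w[đ] => 'd',
  %w[é è ẹ ẻ ẽ ê ế ề ệ ể ễ] => 'e',
  %w[í ì ị ỉ ĩ] => 'i',
  %w[ó ò ọ ỏ õ ơ ớ ờ ợ ở ỡ ô ố ồ ộ ổ ỗ] => 'o',
  %w[ú ù ụ ủ ũ ư ứ ừ ự ử ữ] => 'u',
  %w[ý ỳ ỵ ỷ ỹ] => 'y'
}.freeze

# Flatten the grouped table into { accented letter => base letter }.
FLAT = GROUPED.each_with_object({}) do |(chars, ascii), flat|
  chars.each { |c| flat[c] = ascii }
end.freeze

# One regexp that matches any accented letter in the table.
PATTERN = Regexp.union(FLAT.keys)

# Lower-case first (the table lists lower-case letters only),
# then replace every matched letter with its base letter.
def fold_vietnamese(text)
  text.downcase.gsub(PATTERN, FLAT)
end
```

For example, fold_vietnamese('Đà Nẵng') returns 'da nang'. Text in decomposed form (base letter plus combining marks) would pass through unchanged, so it would need normalizing to NFC first.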
