Posted By

ctran on 09/05/07

Last Edited at 09/05/07 07:26pm

VietnameseAnalyzer.rb

Published in: Ruby

Convert Vietnamese characters into ASCII so they can be indexed and searched.

require 'ferret'
require 'unicode'
 
# Normalizes token text to lower case.
class UnicodeLowerCaseFilter
  def initialize(token_stream)
    @input = token_stream
  end
 
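  # Resets the wrapped stream so it tokenizes the new text.
  def text=(text)
    @input.text = text
  end
 
  # Returns the next token with its text lower-cased, or nil at end of stream.
  def next
    t = @input.next
    return nil if t.nil?
 
    t.text = Unicode.downcase(t.text)
    t
  end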
end
 
class VietnameseAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis
 
  # Maps each accented Vietnamese letter to its base ASCII letter
  # so that only plain ASCII characters get indexed
  CHARACTER_MAPPINGS = {
    ['á','à','ạ','ả','ã','ă','ắ','ằ','ặ','ẳ','ẵ','â','ấ','ầ','ậ','ẩ','ẫ'] => 'a',
    ['đ'] => 'd',
    ['é','è','ẹ','ẻ','ẽ','ê','ế','ề','ệ','ể','ễ'] => 'e',
    ['í','ì','ị','ỉ','ĩ'] => 'i',
    ['ó','ò','ọ','ỏ','õ','ơ','ớ','ờ','ợ','ở','ỡ','ô','ố','ồ','ộ','ổ','ỗ'] => 'o',
    ['ú','ù','ụ','ủ','ũ','ư','ứ','ừ','ự','ử','ữ'] => 'u',
    ['ý','ỳ','ỵ','ỷ','ỹ'] => 'y',
  } unless defined?(CHARACTER_MAPPINGS)
 
  def token_stream(field, str)
    ts = StandardTokenizer.new(str)
    ts = UnicodeLowerCaseFilter.new(ts)
    ts = MappingFilter.new(ts, CHARACTER_MAPPINGS)
  end
end
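
For a quick look at what the lower-casing and mapping steps do to a piece of text without installing Ferret, here is a minimal plain-Ruby sketch. It is not part of the original snippet: `FOLD_GROUPS`, `FOLD_TABLE`, `FOLD_PATTERN`, and `fold_vietnamese` are names made up for this illustration, and it assumes Ruby 2.4 or later, where `String#downcase` handles non-ASCII letters. It only mimics the effect of the filters and does no tokenizing.

```ruby
# Hypothetical, Ferret-free illustration of the folding done by the analyzer.
# All constant and method names here are invented for this sketch.
FOLD_GROUPS = {
  %w[á à ạ ả ã ă ắ ằ ặ ẳ ẵ â ấ ầ ậ ẩ ẫ] => 'a',
  %w[đ] => 'd',
  %w[é è ẹ ẻ ẽ ê ế ề ệ ể ễ] => 'e',
  %w[í ì ị ỉ ĩ] => 'i',
  %w[ó ò ọ ỏ õ ơ ớ ờ ợ ở ỡ ô ố ồ ộ ổ ỗ] => 'o',
  %w[ú ù ụ ủ ũ ư ứ ừ ự ử ữ] => 'u',
  %w[ý ỳ ỵ ỷ ỹ] => 'y'
}.freeze

# Flatten the grouped table into a single "accented char => ASCII char" lookup.
FOLD_TABLE = FOLD_GROUPS.each_with_object({}) do |(chars, ascii), table|
  chars.each { |c| table[c] = ascii }
end.freeze

# One regexp that matches any character present in the lookup table.
FOLD_PATTERN = Regexp.union(FOLD_TABLE.keys)

def fold_vietnamese(text)
  # NFC composes "base letter + combining marks" into the precomposed
  # characters used as table keys; downcase runs before the lookup because
  # the table only lists lower-case letters.
  text.unicode_normalize(:nfc).downcase.gsub(FOLD_PATTERN, FOLD_TABLE)
end

puts fold_vietnamese('Tiếng Việt')  # prints "tieng viet"
puts fold_vietnamese('Đà Nẵng')     # prints "da nang"
```

Because the same folding is applied to indexed text and to query text when one analyzer is used for both, a search for "viet" can match a document containing "Việt".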
