Wednesday 16 November 2011

Canonicalization

What is Canonicalization? (also known as c14n, standardization, or normalization)

Canonicalization is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

For example, the search text "Is there any side effect of taking paracetamol during cancer?" can also be phrased as "Side effects of paracetamol during cancer" or "During cancer taking paracetamol has what all side effects", but every phrasing is asking about the same thing. If you canonicalize them, they become something like "cancer during effect paracetamol side taking". All I did was remove the stop words, reduce each word to its stem ("effects" becomes "effect"), and sort the terms alphabetically. Every phrasing now maps to this canonical form, or to one that differs from it by a single term (the second phrasing has no "taking").

Q. Why Canonicalization?
There are many benefits:
1. After canonicalizing the text, you are left with its essential terms, whatever the original phrasing was.
2. Many variations of a phrasing can be mapped to a single title.
3. Search quality can be improved by searching on the relevant terms only.
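To illustrate the second benefit, here is a minimal sketch of an index keyed by canonical form, so that differently worded queries land on the same title. The stop-word list, the `canonical_key` and `lookup` helpers, and the title string are all made up for this illustration; this version does no stemming.

```python
# Hypothetical stop-word list, just large enough for the example queries.
STOP_WORDS = {"is", "there", "any", "of", "has", "what", "all"}

def canonical_key(text):
    """Lowercase, trim basic punctuation, drop stop words, sort the terms."""
    words = [w.strip("?.,!").lower() for w in text.split()]
    return " ".join(sorted(w for w in words if w and w not in STOP_WORDS))

# Index from canonical form to a single title (the title is invented).
titles = {}
titles[canonical_key("Side effect of taking paracetamol during cancer")] = (
    "Paracetamol side effects in cancer patients"
)

def lookup(query):
    """Return the title registered for the query's canonical form, or None."""
    return titles.get(canonical_key(query))

print(lookup("Is there any side effect of taking paracetamol during cancer?"))
print(lookup("During cancer, taking paracetamol has what side effect?"))
```

Both queries reduce to the same key, so a single dictionary entry serves both of them.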

Q. How to do Canonicalization?
1. Define your character set as per your domain and remove any characters that are not required. For example, if you are dealing with English-language text data, you can remove every character that is not alphanumeric (keeping the spaces that separate the words).

2. Remove the stop words

3. Do the stemming (reduce each word to its root form)

4. Sort the terms in alphabetical order
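The four steps can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the stop-word list is a tiny hand-picked one, `strip_suffix` is a deliberately crude stand-in for a real stemmer (it only drops a trailing plural "s"), and the terms are sorted alphabetically as in the paracetamol example.

```python
import re

# Assumed stop-word list; a real system would use a much larger one.
STOP_WORDS = {"is", "there", "any", "of", "has", "what", "all", "the", "a", "an"}

def strip_suffix(word):
    """Crude stand-in for a stemmer (assumption): drop a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def canonicalize(text):
    # Step 1: keep only lowercase letters, digits and whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Step 2: remove the stop words.
    terms = [t for t in cleaned.split() if t not in STOP_WORDS]
    # Step 3: stem each remaining term.
    stems = [strip_suffix(t) for t in terms]
    # Step 4: sort the terms alphabetically and join them back together.
    return " ".join(sorted(stems))

print(canonicalize("Is there any side effect of taking paracetamol during cancer?"))
print(canonicalize("During cancer taking paracetamol has what all side effects"))
```

Both calls print "cancer during effect paracetamol side taking". In practice you would swap `strip_suffix` for a proper stemming algorithm and tune the stop-word list to your domain.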