What is Jabber/XMPP?
Jabber is the original name of XMPP. The protocol's development is overseen by a not-for-profit organization, the XMPP Standards Foundation (formerly the Jabber Software Foundation).
XMPP: Extensible Messaging and Presence Protocol, an open, XML-based communications technology widely used for IM and chat.
There are many implementations of XMPP servers, and many Jabber clients as well. Gtalk, MSN, AOL, and Mac all support XMPP, so with any Jabber client you can communicate with buddies who hold accounts on any of them.
Q. Where to get a Jabber client?
A. You can find a list here: http://xmpp.org/xmpp-software/clients/
Q. Where to get an XMPP server?
A. I used Openfire, which you can download from here: http://www.igniterealtime.org/downloads/index.jsp
Steps for Making Your Chatter Bot
Step 1: Download the server and bring it up
Download an XMPP server and install it on a machine. If you are a Windows or Mac user, you can download the exe/dmg file and install it directly.
If you are installing on Linux, the simplest way is to download the tar file, extract it, and run the openfire
executable from the bin folder, like "$ ./openfire start". Before you start Openfire, make sure Java is set up on your system.
In the openfire script, set the environment variables that point to Java, like this:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export JRE_HOME=/usr/lib/jvm/java-6-openjdk/jre
While logging, Openfire might ask for access to certain directories or files. Instead of granting that access, you can edit the log4j.xml file (which you should find in the lib directory) and change the log file locations: either hardcode the paths or point them to a directory you already have access to.
On Windows and Mac you can start Openfire from Program Files or Applications.
Step 2: Install
Once Openfire has started, open http://localhost:9090/ in a browser and configure it. It's very simple; let me know if you face any problem with it. This step also needs a database; you might use MySQL or something similar.
Step 3: Writing a Client
Write a client using the Smack API. Before that, download Smack from the Openfire download link (above), or, precisely, this will do: http://www.igniterealtime.org/downloads/download-landing.jsp?file=smack/smack_3_2_1.zip
Smack has very good tutorials on how to get started, so use them to write a Java-based application that can act as a bot. Let's call it "xyz@abc.com".
Step 4: Certification
SSL certification: If you want to send friend requests to Gtalk, MSN, or any other Jabber service, they will most likely ask for an SSL certificate because they prefer to communicate securely. If you are deploying for local use (an intranet), you can create a self-signed certificate and deploy it in Openfire. Otherwise, create a CSR, give it to a CA, get a certificate file back, and deploy that in Openfire.
For how to generate the key file, the CSR, and so on, you can take help from this URL: http://www.igniterealtime.org/builds/openfire/docs/latest/documentation/ssl-guide.html
So, kool, everything is ready. A simple illustration of how it works: if you want to send a friend request to aramis_123@gmail.com, you can do that from the client "xyz". Once "aramis" confirms the friend request, "xyz" can send any chat message to aramis, and when aramis sends a message back, the client program of xyz receives it in the Chatlistener.
So you are reachable to whoever is online in their chat client...kool
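To make the "XML based" part concrete, here is a minimal sketch of what a chat message between the two accounts above might look like on the wire. It uses only Python's standard library to build the XML; it is not the Smack API (Smack generates this XML for you), and the function name build_chat_message is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_chat_message(sender, recipient, text):
    """Build an XMPP-style <message/> stanza as a string (illustration only)."""
    message = ET.Element(
        "message",
        {"from": sender, "to": recipient, "type": "chat"},
    )
    body = ET.SubElement(message, "body")
    body.text = text
    return ET.tostring(message, encoding="unicode")

# The bot "xyz" sending a chat message to "aramis".
print(build_chat_message("xyz@abc.com", "aramis_123@gmail.com", "hello"))
```

A real client also has to open a stream, authenticate, and handle presence, which is exactly the work a library such as Smack takes off your hands.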
Monday, 21 November 2011
Wednesday, 16 November 2011
Canonicalization
What is Canonicalization? (also known as c14n or standardization or normalization)
Canonicalization is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.
For example, consider the search text "Is there any side effect of taking paracetamol during cancer?". It can be written in other forms, like "Side effects of paracetamol during cancer" or "During cancer taking paracetamol has what all side effects", but every representation is talking about the same thing. If you canonicalize them, they become something like "cancer during effect paracetamol side taking". What I did was remove the stop words, stem the remaining terms, and sort them alphabetically. Now every representation maps to this canonical form.
Q. Why Canonicalization?
There are many benefits of it:
1. After Canonicalization you know the core meaning of the text, whatever its presentation.
2. Many variations of presentation can be mapped to a single title.
3. Search quality can be improved by searching on the relevant terms only.
Q. How to do Canonicalization?
1. Define your character set as per your domain and remove the characters that are not required. For example, if you are dealing with English-language text, you can remove every character other than alphanumerics.
2. Remove the stop words
3. Do the stemming
4. Sort in alphabetical order
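The four steps above can be sketched roughly as follows. This is only an illustration under assumptions: the stop word set is a tiny hand-picked subset, and naive_stem is a crude stand-in that strips a plural "s" (a real system would use a proper stemmer such as Porter's). The names canonicalize and naive_stem are hypothetical.

```python
import re

# Tiny illustrative stop list; a real one would be far longer.
STOP_WORDS = {"is", "there", "any", "of", "has", "what", "all"}

def naive_stem(word):
    """Crude stand-in for a real stemmer: strips a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def canonicalize(text, stop_words=STOP_WORDS):
    # Step 1: lowercase and replace everything except alphanumerics with spaces.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Step 2: remove the stop words.
    tokens = [t for t in cleaned.split() if t not in stop_words]
    # Step 3: stem each remaining term (the set also drops duplicates).
    stems = {naive_stem(t) for t in tokens}
    # Step 4: sort alphabetically and join into the canonical form.
    return " ".join(sorted(stems))

print(canonicalize("Is there any side effect of taking paracetamol during cancer?"))
print(canonicalize("During cancer taking paracetamol has what all side effects"))
```

Both calls print "cancer during effect paracetamol side taking", so the two phrasings can be compared with a plain string equality check.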
Labels:
Canonicalization,
search engine,
text mining,
text processing
Monday, 14 November 2011
Stop Word Listing
In text processing, stop words are terms that are removed before processing natural language text (data) because they carry no significant meaning.
A precise definition:
“Words that do not appear in the index in a particular database because they are either insignificant (i.e., articles, prepositions) or so common that the results would be higher than the system can handle (as in the case of IUCAT where terms such as United States or Department are stop words in keyword searching.) Stop words vary from system to system. Also, some systems will merely ignore stop words where use of stop words in other systems will result in retrieving zero hits. ”
You have to build your own stop word list (manually) to suit its use. Suppose you want to build a topic Canonicalization; then the stop word list could be quite big. For topics you should remove negations like "not" and "nothing", and you should remove question tokens like "why", "what", and "how". But if you are building a Canonicalization for questions, then you should keep negations and question tokens.
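As a small sketch of that difference, the snippet below filters the same sentence with a topic stop list and with a question stop list. The sets are short illustrative subsets of the full lists in this post, and the function name remove_stop_words is hypothetical.

```python
# Words treated as stop words for both topics and questions (small subset).
COMMON_STOP_WORDS = {"a", "is", "the", "of", "to", "it", "do"}
# Negations and question tokens: removed for topics, kept for questions.
NEGATIONS = {"not", "nothing", "never", "no"}
QUESTION_TOKENS = {"why", "what", "how", "when", "where", "which", "who"}

TOPIC_STOP_WORDS = COMMON_STOP_WORDS | NEGATIONS | QUESTION_TOKENS
QUESTION_STOP_WORDS = COMMON_STOP_WORDS

def remove_stop_words(text, stop_words):
    """Return the lowercase tokens of text that are not in stop_words."""
    return [t for t in text.lower().split() if t not in stop_words]

sentence = "why is the server not responding"
print(remove_stop_words(sentence, TOPIC_STOP_WORDS))     # ['server', 'responding']
print(remove_stop_words(sentence, QUESTION_STOP_WORDS))  # ['why', 'server', 'not', 'responding']
```

With the topic list, "server not responding" and "server responding" collapse to the same terms, which is acceptable for a topic but would lose the point of a question.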
Stop words for Topic Canonicalization
------------------------------------
a
about
above
across
again
against
all
almost
alone
along
already
also
although
always
am
among
an
and
another
any
anybody
anyone
anything
anywhere
are
area
areas
around
as
ask
asked
asking
asks
at
away
b
backed
backing
backs
be
became
because
become
becomes
been
began
behind
being
beings
best
better
between
big
both
but
by
c
came
can
cannot
case
cases
certain
certainly
clear
clearly
come
communityid
could
d
did
differ
do
does
done
down
downed
downing
downs
e
each
early
either
end
ended
ending
ends
enough
even
evenly
ever
every
everybody
everyone
everything
everywhere
f
face
faces
fact
facts
far
felt
few
find
finds
first
for
four
from
full
fully
further
furthered
furthering
furthers
g
gave
general
generally
get
gets
gif
give
given
gives
go
going
good
goods
got
great
greater
greatest
group
grouped
grouping
groups
h
had
has
have
having
he
her
here
herself
high
higher
highest
him
himself
his
how
however
i
icon
if
im
important
in
interest
interested
interesting
interests
into
is
it
its
itself
j
just
k
keep
keeps
kind
knew
know
known
knows
l
large
largely
last
later
latest
least
less
let
lets
like
likely
long
longer
longest
m
made
make
making
man
many
may
me
member
members
men
might
more
most
mostly
mr
mrs
much
must
my
myself
n
necessary
need
needed
needing
needs
never
new
newer
newest
next
no
nobody
non
noone
not
nothing
now
nowhere
number
numbers
o
of
off
often
older
oldest
on
once
one
only
open
opened
opening
opens
or
ordered
ordering
orders
other
others
our
out
over
p
part
parted
parting
parts
per
perhaps
pl
place
places
pls
plz
pointed
pointing
possible
presented
presenting
presents
put
puts
q
quite
r
rather
really
regname
right
room
rooms
s
said
same
saw
say
says
second
seconds
see
seem
seemed
seeming
seems
sees
several
shall
she
should
show
showed
showing
shows
side
sides
since
small
smaller
smallest
so
some
somebody
someone
something
somewhere
state
states
still
such
sure
t
take
taken
than
that
the
their
them
then
there
therefore
these
they
thing
things
think
thinks
this
those
though
thought
thoughts
three
through
thus
to
today
together
too
took
toward
turn
turned
turning
turns
two
u
under
until
up
upon
us
use
used
uses
v
very
w
want
wanted
wanting
wants
was
way
ways
we
well
wells
went
were
what
when
where
whether
which
while
who
whole
whose
why
will
with
within
without
work
worked
working
works
would
x
y
yet
you
young
younger
youngest
your
yours
yr
z
Stop words for Question Canonicalization
----------------------------------------------
a
about
above
across
again
against
all
almost
alone
already
also
although
always
am
among
an
and
any
anybody
anyone
anything
anywhere
are
area
areas
around
as
ask
at
away
b
backed
backing
backs
be
became
because
become
becomes
been
being
beings
best
better
but
by
c
came
cannot
communityid
d
differ
do
does
done
downed
downing
downs
e
each
either
end
ended
ending
ends
enough
even
evenly
ever
every
everybody
everyone
everything
everywhere
f
fact
facts
far
felt
few
find
finds
first
for
four
from
full
fully
further
furthered
furthering
furthers
g
general
generally
get
gets
gif
give
given
gives
greater
greatest
h
had
has
have
having
he
her
here
herself
him
himself
his
however
i
icon
if
im
important
in
into
is
it
its
itself
j
just
k
keep
keeps
l
large
largely
last
later
latest
least
less
let
lets
m
many
may
me
might
more
most
mostly
mr
mrs
much
must
my
myself
n
needed
needing
needs
new
newer
newest
nobody
noone
now
nowhere
o
of
off
on
one
or
others
our
out
p
parted
parting
per
perhaps
pl
please
pls
plz
put
puts
q
quite
r
rather
really
regname
s
said
same
saw
say
says
see
seem
seemed
seeming
seems
sees
several
she
so
some
somebody
someone
somewhere
such
sure
t
that
the
their
them
then
there
therefore
these
they
this
those
though
three
through
thus
to
today
too
toward
u
up
upon
us
v
very
w
wanted
wanting
was
we
well
wells
went
were
will
with
would
x
y
yet
you
your
yours
yr
z
Labels:
Canonicalization,
search engine,
stopwords,
text mining,
text processing
Thursday, 10 November 2011
Optimizing Solr
Check the stats and note these numbers (you can find them in the Solr admin console):
1. queryRequestHandlers - first find out which one you are using; most likely it is "dismax"
    1. requests
    2. avgTimePerRequest
2. queryResultCache
3. documentCache
Now the three things above can be optimized...
1. If the number of requests for the queryRequestHandler is substantial, say more than 10000, and avgTimePerRequest is more than 10ms, then I am pretty sure that it can be optimized.
2. Increase the documentCache size to the number of documents in the index or, if that is too big, to the number of documents you think could be requested per day; 20000 to 30000 should suffice if you have traffic of 50000.
3. Increase the queryResultCache size to cover the number of queries you think are requested frequently.
4. For both caches, queryResultCache and documentCache, aim for a hit ratio above 95%; only then does the caching make sense.
5. Observe the avgTimePerRequest for the query handler.
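The checks above can be sketched as a few small helpers. This assumes you copy the numbers by hand from the stats page (the cache entries there report lookup and hit counters); the function names and the sample numbers are hypothetical, and the thresholds are simply the rules of thumb from this post.

```python
def hit_ratio(hits, lookups):
    """Fraction of cache lookups that were served from the cache."""
    return hits / lookups if lookups else 0.0

def cache_needs_tuning(hits, lookups, target=0.95):
    """True when the cache hit ratio is below the 95% target."""
    return hit_ratio(hits, lookups) < target

def handler_needs_tuning(requests, avg_time_ms, min_requests=10000, max_avg_ms=10.0):
    """True when a busy handler is slower than 10 ms per request on average."""
    return requests > min_requests and avg_time_ms > max_avg_ms

# Hypothetical numbers read off the stats page.
print(hit_ratio(hits=9000, lookups=10000))                       # 0.9
print(cache_needs_tuning(hits=9000, lookups=10000))              # True
print(handler_needs_tuning(requests=25000, avg_time_ms=14.2))    # True
```

Re-running the same checks after each cache size change shows whether the change actually moved the hit ratio and the average request time.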