Saturday, 26 March 2016

Parsing and Analyzing Apache Log Files Using Linux Commands

An Apache access log is typically written in this format (the "combined" log format)

%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"

The variables are defined as follows

%h   =  IP address of the client (remote host) which made the request
%l   =  RFC 1413 identity of the client
%u   =  userid of the person requesting the document
%t   =  Time that the request was received
%r   =  Request line from the client in double quotes
%>s  =  Status code that the server sends back to the client
%b   =  Size of the object returned to the client, not including the response headers
%{Referer}i     =  Referer request header
%{User-agent}i  =  User-Agent request header

Sample Log file data

- - [26/Mar/2016:06:48:21 +0000] "GET /index.php HTTP/1.1" 200 6933 "" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"
- - [26/Mar/2016:06:50:53 +0000] "GET /robots.txt HTTP/1.1" 404 510 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +"
- - [26/Mar/2016:06:52:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0;"
- - [26/Mar/2016:06:57:33 +0000] "GET / HTTP/1.1" 200 6952 "-" "Sogou web spider/4.0(+"
- - [26/Mar/2016:06:57:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0;"
- - [26/Mar/2016:07:00:44 +0000] "GET /analytics-technology-solution.php HTTP/1.1" 200 6174 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"
- - [26/Mar/2016:07:01:20 +0000] "GET /images/tech/java_logo.png HTTP/1.1" 200 20524 "" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
- - [26/Mar/2016:07:02:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0;"
- - [26/Mar/2016:07:07:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0;"

Token Separation using awk command

cat access.log | awk '{print $1}' #IP Address (%h)

cat access.log | awk '{print $4,$5}' # date/time (%t)

cat access.log | awk '{print $9}' # status codes

cat access.log | awk '{print $10}' # size

cat access.log | awk -F\" '{print $2}' # Request line (%r): method, URI and protocol

cat access.log | awk -F\" '{print $4}' # Referer URL

cat access.log | awk -F\" '{print $6}' # Agents
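To see what each of the commands above picks out, here is a minimal sketch that runs them against one log line. The line itself is a hypothetical sample (203.0.113.7 is a documentation-only address, not from the real log), and the variable names are only for illustration.

```shell
# Hypothetical sample line in the combined log format.
line='203.0.113.7 - - [26/Mar/2016:06:48:21 +0000] "GET /index.php HTTP/1.1" 200 6933 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"'

# Whitespace-separated fields: IP, date/time, status code, size.
ip=$(printf '%s\n' "$line" | awk '{print $1}')
timestamp=$(printf '%s\n' "$line" | awk '{print $4, $5}')
status=$(printf '%s\n' "$line" | awk '{print $9}')
size=$(printf '%s\n' "$line" | awk '{print $10}')

# Double-quote-separated fields: request line, referer, user agent.
request=$(printf '%s\n' "$line" | awk -F'"' '{print $2}')
referer=$(printf '%s\n' "$line" | awk -F'"' '{print $4}')
agent=$(printf '%s\n' "$line" | awk -F'"' '{print $6}')

printf 'ip=%s status=%s size=%s\n' "$ip" "$status" "$size"
printf 'time=%s\nrequest=%s\nreferer=%s\nagent=%s\n' "$timestamp" "$request" "$referer" "$agent"
```

Splitting on the double quote is what keeps the request line and the user agent in one piece, since both contain spaces.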

Aggregation commands

sort - sort all the lines
uniq -c - collapse adjacent identical lines into one and prefix it with the count (so sort the input first)
uniq -ci - same as uniq -c, but ignoring case when comparing lines
sort -rg - sort in descending order (r), comparing numerically (g)
head -n - gives you n lines from the top


#Get top 20 agents
cat access.log | awk -F \" '{print $6}' |  sort | uniq -c | sort -rg | head -n 20

#Get top 20 IP address from where the requests came
cat access.log | awk '{print $1}' |  sort | uniq -c | sort -rg | head -n 20

#For a certain IP (put the address between the quotes), find all the agents
cat access.log | grep "" | awk -F \" '{print $6}' | sort | uniq -c | sort -rg

#Find which external referers are requesting your image files
awk -F\" '($2 ~ /\.(jpg|gif)/ && $4 !~ /^http:\/\/www\.tuskerdatalab\.com/){print $4}' access.log | sort | uniq -c | sort
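The same sort | uniq -c | sort pattern works for any field. As a small sketch, the lines below count requests per status code and find the busiest client on a made-up five-line log. The sample data and its addresses are hypothetical, and sort -rn is used here instead of -rg only because -n is available on every sort implementation.

```shell
# Hypothetical sample log; the addresses are documentation-only ranges.
sample_log='203.0.113.7 - - [26/Mar/2016:06:48:21 +0000] "GET /index.php HTTP/1.1" 200 6933 "-" "Mozilla/5.0"
203.0.113.7 - - [26/Mar/2016:06:50:53 +0000] "GET /robots.txt HTTP/1.1" 404 510 "-" "Mozilla/5.0"
198.51.100.4 - - [26/Mar/2016:06:52:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "UptimeRobot/2.0"
198.51.100.4 - - [26/Mar/2016:06:57:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "UptimeRobot/2.0"
203.0.113.7 - - [26/Mar/2016:07:00:44 +0000] "GET / HTTP/1.1" 200 6952 "-" "Mozilla/5.0"'

# Requests per status code, most frequent first.
status_counts=$(printf '%s\n' "$sample_log" | awk '{print $9}' | sort | uniq -c | sort -rn)

# The single IP address with the most requests.
top_ip=$(printf '%s\n' "$sample_log" | awk '{print $1}' | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

printf '%s\n' "$status_counts"
printf 'busiest client: %s\n' "$top_ip"
```

On a real server, replace the printf of the sample with cat access.log.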

Monday, 9 February 2015

5 step process to improve your SEO in a month

SEO (Search Engine Optimization) is somewhat intangible and unpredictable, which also makes it quite an interesting area. Lots of people want to do SEO, but hardly anyone understands what they want to achieve out of it.

Let's try to understand the intent of SEO and what it can achieve.

SEO can be understood as a marketing strategy of ranking high in Google, but that is only the first step of SEO. There are two more steps, which have remained almost hidden.
So what are the steps?

1. Short term strategy - Helps in quick marketing: ranking for a few keywords to increase your reach. Estimated time is 1 month
2. Long term strategy - Helps in building sustainable traffic. Estimated time is 6 months
3. Ecosystem strategy - Helps search engines in refining their knowledge and search. Estimated time is 2 years or more

Nowadays search engine algorithms are mature, intelligent, and intuitive. While there are quite a few things that require technical expertise, the overall theme of SEO is pretty simple. So if you are targeting a short term strategy and planning to do some keyword ranking optimization, it is quite doable.

Follow this 5 step process and it should improve your SERP ranking within a month.

1. High performance is a must: Your page should load as fast as possible. Going by the average page load time in GA, more than 7 seconds is bad, 5 seconds is good, and 4 seconds is awesome. Write lean HTML to save DOM processing time, use a CDN to deliver CSS, JS and images, use caching, etc.

2. Understand your business and customers, and write a quality blog that educates people about your product and talks about the periphery of your business domain, your product and maybe your competitors. This helps search engines a lot in judging how serious the product is and what need it serves, and as people start consuming your content, search engines will give more weight to your product and your website. The blog is there to educate readers and provide extra value around your product. Google's algorithm also considers social websites, blogs and social content for the Google News section, which is an advantage.

3. Keep watching your spammy links: Google's Penguin updates target and penalize websites that create spammy backlinks. With the October Penguin 3.0 update it's much clearer that any anchor text manipulation and spammy links will get penalized. It's good to keep watching your links through Webmaster Tools or some other tool and audit them at least once every 6 months. Also work towards getting good quality backlinks; a blog is one of the strategies to get them.

4. Content is king, and play on keyword density: Creating high quality content is still the central idea in SEO. Think about presenting your content in a more organized, creative and new style: provide good meta information, images and breadcrumbs for navigation, keep it readable, and add something new (be creative). Use keyword repetition intelligently on the page. You can use a tool like SEOquake to check the keyword density. Every keyword repetition should add value; it should not just be stuffed in.

5. Promote your website on social websites
Create a Facebook page, a Twitter handle and other social properties to talk about your product, gather people, and promote your website and links. This has 2 advantages: first, your social index score improves, which also plays a role in SEO, and second, you get extra traffic from social media.

And a few small things to remember:
  1. Don't publish or build a business on illegal stuff
  2. Websites served over HTTPS get extra value
  3. Don't play on aggregation and content farming; it doesn't work anymore
  4. Keyword stuffing alone can kill you
  5. A mobile friendly website helps a lot in receiving mobile traffic

It's still simple to achieve good results, but you have to be focused, consistent and dedicated. It works, it really works :)

Monday, 1 September 2014

Some concepts of Solr

Q. Difference between the simple query parser and the DisMax and eDisMax query parsers
A: DisMax is an abbreviation of Disjunction Max, and it is a popular query mode in Solr. The default (standard) query parser in Solr is quite unforgiving: it does not handle syntax errors smoothly, so if the user puts in some stray character, it may throw an exception. The DisMax query parser, on the other hand, is pretty safe against exceptions. It is designed for plain user-entered phrases, searches them across several fields, and lets you allocate a weight to each field, but it understands only a simplified subset of the query syntax. A query with logical operators and boosts, like ( "Cancer" AND "Blood" ) OR "Blood Cancer"^2, is best sent to eDisMax, which supports the full syntax.

The Extended DisMax Query Parser (eDisMax) is a robust parser designed to process advanced user input directly. It searches for the query words across multiple fields with different boosts, based on the significance of each field. Additional options let you influence the score based on rules specific to each use case (independent of user input).

Q. Using token based searching in Solr
A. It's simple :) Prepare a corpus of words and phrases and assign a weight to each based on its relevance and importance to your domain. Then, when a search query comes in, use an n-gram tokenizer to split it into tokens and phrases, boost each one by its weight from the corpus, and combine them wisely with AND and OR.

Example: "Blood cancer symptoms" can be converted into "Blood Cancer symptoms"^100 OR ("Blood Cancer" AND symptom)^10 OR (Blood AND Cancer AND symptom)^5 OR "Blood Cancer"

This query will give you much more relevant results than a simple query.
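As a rough sketch of that rewriting step, the function below builds a query of the same shape from a known phrase and one extra term. The function name build_boosted_query and the fixed boosts 100, 10 and 5 are assumptions for illustration only; in practice the weights would come from your own corpus, and the n-gram lookup that finds the phrase is not shown.

```shell
# Hypothetical helper: $1 is a phrase found in the corpus, $2 is the remaining term.
build_boosted_query() {
    phrase=$1
    term=$2
    # "Blood Cancer" becomes "Blood AND Cancer" for the all-words clause.
    and_terms=$(printf '%s' "$phrase" | sed 's/ / AND /g')
    # Exact phrase first, then phrase plus term, then all words, then the phrase alone.
    printf '"%s %s"^100 OR ("%s" AND %s)^10 OR (%s AND %s)^5 OR "%s"\n' \
        "$phrase" "$term" "$phrase" "$term" "$and_terms" "$term" "$phrase"
}

build_boosted_query "Blood Cancer" "symptoms"
```

The most specific clause carries the highest boost, so documents matching the whole phrase rank above those that only contain the individual words.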

Q. Highlighter to highlight the matched keywords
A: First, enable highlighting when building the query. It's very simple: you just call query.setHighlight(true), and you can also set other parameters.

        query.setParam("hl.fl", highlightingField);
        query.setParam("hl.snippets", "2");

Now when you get the response, you have to extract the snippets from it. The highlighting is a Map keyed by the document's identifier field; each value is itself a Map from field name to the List of String snippets for that field.

 if (response.getHighlighting() != null
       && response.getHighlighting().get(object.getId().toString()) != null) {
     // Snippets for this document, looked up by id and then by highlighted field
     List<String> highlightSnippets = response.getHighlighting().get(object.getId().toString()).get(Highlighted_Field_Name);
     StringBuilder contentToShow = new StringBuilder();
     for (String snippet : highlightSnippets) {
         if (contentToShow.length() < 170) {
             contentToShow.append(snippet).append(" ...");
         } else {
             break; // assumption: the original listing is cut off here; stop once enough text is collected
         }
     }
 }


Monday, 9 June 2014

Enabling JMX in Tomcat

Any Java program can be monitored using JConsole. JConsole is a user interface which ships by default with the JDK; you just need to type jconsole in your terminal and it will show you the interface. JConsole works through JMX, which has to be enabled in the respective Java program or JVM.

How to enable JMX in Tomcat?

You need to pass the JMX system properties to the JVM at startup. In Tomcat you can simply set them through the CATALINA_OPTS variable in a startup file in the bin folder of Tomcat.

-Dcom.sun.management.jmxremote.port={port to access}
-Djava.rmi.server.hostname={optional, host name or IP of this Tomcat that JMX clients connect to}

export CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote.port={port to access} -Djava.rmi.server.hostname={host name or IP}"
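A fuller sketch of such a startup file might look like the following. Everything specific in it is an assumption: the file name bin/setenv.sh (which catalina.sh reads if it exists), port 9010, and the address 192.0.2.10 are placeholders, and authentication and SSL are switched off only to keep the example short.

```shell
# Hypothetical bin/setenv.sh; port and host are placeholders, not recommendations.
JMX_PORT=9010
JMX_HOST=192.0.2.10

# Keep whatever CATALINA_OPTS already holds and append the JMX properties.
CATALINA_OPTS="${CATALINA_OPTS:-} -Dcom.sun.management.jmxremote"
CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT"
# Unauthenticated, unencrypted access: acceptable only on a trusted network.
CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote.authenticate=false"
CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote.ssl=false"
# Address that remote JMX clients use to reach this JVM.
CATALINA_OPTS="$CATALINA_OPTS -Djava.rmi.server.hostname=$JMX_HOST"
export CATALINA_OPTS
```

With these placeholder values you would point jconsole at 192.0.2.10:9010.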


After changing this, start Tomcat, then run jconsole from anywhere and connect using
"IP:port"; if authentication is enabled, enter the credentials as well. Now you can see all the JVM details: threads, memory usage, CPU usage, GCs performed, etc.

Friday, 25 April 2014

RSA server certificate CommonName (CN) '' does NOT match server name


I was getting this error on restarting Apache:

"RSA server certificate CommonName (CN) '' does NOT match server name"

There was one website which was able to tell me that something was wrong, which is sslshopper.

In my case, I had missed the "ServerName" directive in the Apache virtual host configuration.

<VirtualHost _default_:443>
ServerName {your domain name, as in the certificate's CN}
SSLEngine on
SSLCertificateFile /root/ssl/domain.crt
SSLCertificateKeyFile /root/ssl/server.key
SSLCertificateChainFile /root/ssl/bundle.crt
</VirtualHost>


So after adding the ServerName, the Apache error.log stopped showing the error message. The sslshopper diagnosis also started picking up the certificate, which made me believe that the issue was resolved.

In general, you should also check:
1. The DNS entry (host name and IP are correct)
2. The /etc/hosts file
3. Whether you missed the common name while creating the CSR file. In that case create a new CSR, re-key the certificate and deploy the new certificate.
4. That ServerName in Apache (or server_name in Nginx) matches the certificate; certificates are issued for a fixed domain.

Tuesday, 25 February 2014

fb:share_button/fb:login_button/fb: failed to resize in 45s

I saw one strange issue while using the fb_share button, and I would like to share it and how I fixed it.

When you use fb_share buttons with their stats, you are using some JavaScript to fetch the stats and show them beside the button. When the page first loaded it worked fine, but when you navigated to another page by clicking a link or button and then used the browser's back button to return, it tried to load the fb_share button again, and it kept the textarea locked for 45 seconds, which was quite strange.

I guess it has to do with how fb builds the button UI (size and all) while it fetches the data, and during that period it locks the textarea.

I started searching for the issue and found a hacky solution on Stack Overflow, which says to use this CSS:

.fb-share-button span,
.fb-share-button iframe {
    width: 120px !important;
    height: 25px !important;
}

And the amazing part is, it works :)

Thursday, 5 December 2013

Make an anchor tag with href but do not let user navigate away

This can be utilized a lot for SEO purposes. The anchor tag's href will be crawled by search engines, but when the user clicks it, the user is not navigated away from the page, and you can handle the onclick event any way you want to.

<a href="" onclick="dothis(event); return false;">Click me</a>

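function dothis(e) {
    // Old versions of IE expose the event as window.event
    if (!e) e = window.event;
    // Stop the click from bubbling up to parent handlers
    if (e.stopPropagation) e.stopPropagation();
    // Handle the click here; "return false" in the onclick attribute blocks navigation
    return false;
}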