Log Parser project - URL filter

Ashwini_Maddala · November 8, 2018, 9:31am

I am working on the log parser project using Spark, described at the following link: https://cloudxlab.com/assessment/slide/58/spark-project-log-parsing/630/spark-project-apache-log-parsing-top-10-requested-urls
I am trying to pull the URL from the logs. The problem statement is not quite clear about what exactly counts as the URL. Can someone clarify?

Which part of the input log is considered as the url here? There are two parts in some of the logs. For example,
"121.242.40.10 - - [03/Aug/2015:06:30:52 -0400] \"POST /mod_pagespeed_beacon?url=http%3A%2F%2Fwww.knowbigdata.com%2Fpage%2Fabout-us HTTP/1.1\" 204 206 \"http://www.knowbigdata.com/...\" \"Mozilla/5.0 (Windows NT 6.3; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0\"

This line has two URLs: one in the POST request inside the first pair of quotes, and another inside the second pair. Which is the actual URL here? Also, most of the log lines do not have two URLs. Is there anything we should filter out of the logs before we process them?
An example would be great.

Also, I am trying to use the pattern (["'])(?:(?=(\\?))\2.)*?\1 to get all the text within quotes. It works in the online regex validators but not in my Scala project. Is it something to do with how escape sequences are written in Scala?
Here are my lines of code:

val pattern = "([\"'])(?:(?=(\\?))\2.)*?\1".r
val result = pattern.findAllIn(line)
println(result)
for (i <- result) {
  println(i)
}

This gives me an empty iterator for the input mentioned above. Am I missing something here?

Mohammad_Shahrukh · November 8, 2018, 9:30am

Hi Ashwini,

A single Apache access log entry contains several pieces of information. For example, the last part of this log entry, "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0\", identifies the browser from which the request was made. The second-to-last part, "http://www.knowbigdata.com/...\", is the HTTP referrer, i.e. the URL of the page from which the request was made.
In our case, the request is a POST request to the URL "http%3A%2F%2Fwww.knowbigdata.com%2Fpage%2Fabout-us HTTP/1.1\", which is specified after “url=” in the entry. This request is made from a page whose URL is "http://www.knowbigdata.com/...\".

Keeping this information in mind, the exercise requires us to find the top 10 requested URLs. Hence we need to consider the HTTP referrer part, i.e. the URL of the page from which the request was made. Note that in cases where only one URL is present, you’ll have to use that URL.

Hope this explanation makes it clear.
Thanks

Ashwini_Maddala · November 8, 2018, 9:34am

In that case, we will need to consider the second part of the log as the url. Is that correct?

Mohammad_Shahrukh · November 8, 2018, 9:53am

Yes, exactly. You’ll have to consider the second part of the log as the URL, i.e. the URL of the page from which the request was made.
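
To make the rule above concrete, here is a rough sketch in plain Scala collections (no Spark, so it stays self-contained). The names TopUrls, urlOf and top are hypothetical, and the sketch assumes that no field contains an embedded double quote and that a missing referrer is logged as "-". It uses the referrer when one is present and falls back to the request target otherwise.

```scala
object TopUrls {
  // Assumption: splitting on '"' yields, in order:
  // prefix, request line, " status bytes ", referrer, " ", user agent.
  def urlOf(line: String): Option[String] = {
    val parts = line.split('"').toList
    // The referrer is the second quoted field; "-" is treated as "no referrer".
    val referrer = parts.lift(3).map(_.trim).filter(r => r.nonEmpty && r != "-")
    // Fallback: the request target, i.e. the second token of "METHOD target protocol".
    val target = parts.lift(1).flatMap(_.split(' ').toList.lift(1))
    referrer.orElse(target)
  }

  // Counts URLs and returns the n most frequent, ties broken alphabetically.
  def top(lines: Seq[String], n: Int = 10): Seq[(String, Int)] =
    lines.flatMap(line => urlOf(line).toList)
      .groupBy(identity)
      .map { case (url, hits) => (url, hits.size) }
      .toSeq
      .sortBy { case (url, count) => (-count, url) }
      .take(n)
}
```

On an RDD the same shape would presumably be a map to (url, 1) pairs followed by reduceByKey and taking the ten largest counts, but that part is not shown here.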

Ashwini_Maddala · November 8, 2018, 9:54am

Thank you very much. Can you also clarify on the second question regarding the regex usage?

Mohammad_Shahrukh · November 8, 2018, 3:19pm

Hi Ashwini, sorry for my delayed response. You might want to build your regex pattern around the format of the Apache log. For example, the first part is the IP address of the client (so you can use something like (\S+) to match it), and inside the square brackets we have the timestamp of the request, which ends with a four-digit time zone offset. I hope this helps you. In case you get really stuck you can have a look at this code for reference.
Thanks!
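
On the escaping question: in a regular Scala string literal the compiler processes the backslashes before the regex engine sees them. "\\?" becomes \? (a literal question mark), and "\1" and "\2" are treated as string escapes, not as backreferences. So the pattern being compiled is not the one tested online. Triple-quoted strings leave backslashes alone. Below is a minimal sketch along those lines, assuming the log follows the Apache combined format. The names AccessLog, Entry, parse and quotedParts are made up for illustration.

```scala
object AccessLog {
  // Assumed Apache "combined" layout, field by field:
  // host ident user [timestamp] "method target protocol" status bytes "referrer" "user-agent"
  // Triple quotes pass backslashes through untouched, so \S, \d and \] reach the regex engine.
  val LogLine =
    """^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) ?(\S*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$""".r

  // Hypothetical container for the fields this sketch keeps.
  final case class Entry(host: String, timestamp: String, method: String,
                         target: String, status: Int, referrer: String, userAgent: String)

  // Returns None for lines that do not follow the assumed layout.
  def parse(line: String): Option[Entry] = line match {
    case LogLine(host, _, _, ts, method, target, _, status, _, referrer, agent) =>
      Some(Entry(host, ts, method, target, status.toInt, referrer, agent))
    case _ => None
  }

  // The pattern from the question, unchanged except for the triple quotes,
  // so that \1 and \2 stay regex backreferences.
  val Quoted = """(["'])(?:(?=(\\?))\2.)*?\1""".r

  // Every quoted span in the line, quotes included.
  def quotedParts(line: String): List[String] = Quoted.findAllIn(line).toList
}
```

One caveat: if the input lines really contain backslash-escaped quotes, as they appear in the example above, the quote pattern will treat those as escaped characters and may still find nothing, so it is worth checking what the raw file looks like.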