I am working on a log parser project using Spark, described at the following link: https://cloudxlab.com/assessment/slide/58/spark-project-log-parsing/630/spark-project-apache-log-parsing-top-10-requested-urls
I am trying to pull the URL from the logs. The problem statement does not make clear what exactly counts as the URL. Can someone clarify?
Which part of the input log line is considered the URL here? Some log lines contain two. For example,
"121.242.40.10 - - [03/Aug/2015:06:30:52 -0400] \"POST /mod_pagespeed_beacon?url=http%3A%2F%2Fwww.knowbigdata.com%2Fpage%2Fabout-us HTTP/1.1\" 204 206 \"http://www.knowbigdata.com/...\" \"Mozilla/5.0 (Windows NT 6.3; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0\"
This line has two URLs: the POST request in the first pair of quotes, and another in the second pair. Which is the actual URL here? Most of the log lines do not look like this. Is there anything to filter out of the logs before we process them?
An example would be great.
Also, I am trying to use the pattern (["'])(?:(?=(\\?))\2.)*?\1
to get all the text within quotes. It works in the online regex validators but not in my Scala project. Does this have something to do with how escape sequences are written in Scala string literals?
Here is my code:
val pattern = "([\"'])(?:(?=(\\?))\2.)*?\1".r
val result = pattern.findAllIn(line)
println(result)
for (i <- result) {
  println(i)
}
This gives me an empty iterator for the input mentioned above. Am I missing something here?
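For comparison, here is a minimal sketch of the same extraction with the pattern written as a triple-quoted string, on the assumption that the escaping is the culprit. In an ordinary Scala string literal, \\? reaches the regex engine as \? (a literal question mark), and \1 and \2 are not passed through as backreferences. The object name QuotedFields, the helper quotedFields, and the shortened sample line are made up for illustration and are not part of the project.

```scala
object QuotedFields {
  // Triple-quoted string: backslashes reach the regex engine unchanged,
  // so \1 and \2 stay backreferences and \\? stays "optional backslash".
  val pattern = """(["'])(?:(?=(\\?))\2.)*?\1""".r

  // Returns every quoted field in the line, including its surrounding quotes.
  def quotedFields(line: String): List[String] =
    pattern.findAllIn(line).toList

  def main(args: Array[String]): Unit = {
    // Hypothetical, shortened log line in the same layout as the example above.
    val line = "1.2.3.4 - - [03/Aug/2015:06:30:52 -0400] " +
      "\"GET /about-us HTTP/1.1\" 200 512 " +
      "\"http://example.com/\" \"Mozilla/5.0\""
    quotedFields(line).foreach(println)
  }
}
```

If this version does return the quoted fields, the first one would be the request line (method, path, protocol) and the second the referrer, which may help in deciding which of the two counts as the URL.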