Spark project 10 requested URLs along with count - problem 1


#1

Hello,

Following Sandeep’s advice, the basic approach works, but it does not work with my actual regex. Please find the code below. (Here I want to extract the URL, for example “/shuttle/missions/sts-68/news/sts-68-mcc-05.txt”.)

val pattern = “”"(\ /)(\S+)(\S*)"""
val line1 = “”“in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] “GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0” 200 1839"”"
val pattern(ip,x,y)= line1

Please find the error.

scala.MatchError: in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] “GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0” 200 1839 (of class java.lang.String)
… 48 elided

I could manage to get the URL with the code below, but I think this is not good practice. Please help.

// The function below returns only the URL.
def extractURL(line: String): String = {
  val arr = line.split(" ")
  arr(6).trim // the seventh space-separated field is the URL
} // The function above works fine and returns only the URL.

scala> var nurlkeyval = urlaccesslogs.map(line=>(extractURL(line),1))
n_url: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[14] at map at :36

var urlcounts = nurlkeyval.reduceByKey((a,b) => (a+b))
var urlcountsOrdered = urlcounts.sortBy(f => f._2, false);
urlcountsOrdered.take(10)


#2

#3

#4

I think this has been addressed now in another thread.


#5

Hello,

I am sorry, I think I did not convey my question properly.

I tried the solution you advised, but it does not work once I switch to the correct pattern, as you can see from the code in my previous message. I have tried to explain it again with the details below.

val pattern = “”"(\ /)(\S+)(\S)"""*
val line1 = “”“in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] “GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0” 200 1839"”"
val pattern(ip,x,y)= line1

Please find the error.

scala.MatchError: in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] “GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0” 200 1839 (of class java.lang.String)
… 48 elided
thanks.


#6

I can see the following issues with the code:

  1. Special double-quote characters. Microsoft Word or Google Docs seems to convert straight double quotes into curly double quotes, which are special characters. Please replace those with straight quotes.
  2. The pattern string should have ".r" at the end of it; otherwise it is treated as a normal string, not a regular expression.
  3. The pattern is matched end to end, so I think the regular expression does not match. As per your pattern, line1 would have to start with a space followed by a slash, then one or more non-whitespace characters, followed by one non-whitespace character. This pattern does not match line1.
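
To illustrate the three points, here is a minimal sketch in plain Scala (no Spark needed to try it). The object name UrlRegexSketch, the helper names extractUrl and findUrl, and the exact shape of the full-line pattern are my own assumptions for illustration, based only on the sample line in this thread; real logs may contain lines that need a looser pattern.

```scala
import scala.util.matching.Regex

object UrlRegexSketch {
  // Straight quotes plus a trailing .r make this a Regex rather than a String.
  // The pattern describes the whole line, because an extractor pattern
  // has to match the input from start to end.
  val logLine: Regex =
    """(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)""".r

  // Returns the requested URL, or None for lines that do not match,
  // instead of throwing scala.MatchError.
  def extractUrl(line: String): Option[String] = line match {
    case logLine(_, _, _, url, _, _, _) => Some(url)
    case _                              => None
  }

  // Alternative: search for the request part anywhere in the line,
  // so the pattern does not have to cover the full line.
  val request: Regex = """(GET|POST|HEAD) (\S+)""".r

  def findUrl(line: String): Option[String] =
    request.findFirstMatchIn(line).map(_.group(2))
}
```

Both helpers return an Option, so a malformed line yields None rather than an exception. Note that the sample line must also be typed with straight quotes around the request for the full-line pattern to match.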

#7

Please help me solve this question. Please share the code.


#8

I am getting the same problem. Please help me.


#9

Could you post your code and error message with screenshot?