Spark project -- Write Spark code to find out unique HTTP codes (Problem 4)

Query 2: Spark project, Problem 4

Please find my code and the error below.

val logFile = sc.textFile("/data/spark/project/NASA_access_log_Aug95.gz")

def containsHTTP(line:String):Boolean = {
val pattern = """(\d{3})""".r
val res = pattern.findFirstMatchIn(line)
if (res.isEmpty)
{
return false
}
else
{
return true
}
}

var urlaccesslogs = logFile.filter(containsHTTP)

// the function below extracts only the HTTP code //
def extractHTTP(line:String):(String) = {
var arr = line.split(" ");
arr(8)
} // the function above works fine … it returns only the HTTP code

var HTTPval = urlaccesslogs.map(line=>(extractHTTP(line),1))

var HTTPcnts = HTTPval.reduceByKey((a,b) => (a+b))

var HTTPcountsOrdered = HTTPcounts.sortBy(f => f._2, false);
HTTPcountsOrdered.take(5).foreach(println)

Please find error as below.

17/09/19 10:59:38 ERROR scheduler.TaskSetManager: Task 0 in stage 53.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 53.0 failed 1 times, most recent failure: Lost task 0.0 in stage 53.0 (TID 37, localhost): java.lang.ArrayIndexOutOfBoundsException

What is happening here is that map() calls your function even on lines that are empty.

Therefore, you could apply a filter before that statement.
Here is example code:

val pattern = """(\d{3})""".r
var HTTPval = urlaccesslogs.filter(line => pattern.findFirstMatchIn(line).isDefined).map(line => (extractHTTP(line), 1))

Hello Sandeep,

Thanks for the reply, and I agree with you. However, I am already doing the same thing in my code below.

val logFile = sc.textFile("/data/spark/project/NASA_access_log_Aug95.gz")

// Here you can see where I am filtering out empty lines.
def containsHTTP(line:String):Boolean = {
val pattern = """(\d{3})""".r
val res = pattern.findFirstMatchIn(line)
if (res.isEmpty)
{
return false
}
else
{
return true
}
}

var urlaccesslogs = logFile.filter(containsHTTP)

// the function below extracts only the HTTP code //
def extractHTTP(line:String):(String) = {
var arr = line.split(" ");
arr(8)
}

var HTTPval = urlaccesslogs.map(line=>(extractHTTP(line),1))

// output //

scala> HTTPval.take(5).foreach(println)
(200,1)
(304,1)
(304,1)
(304,1)
(304,1)

// here my code does not work:
var HTTPcnts = HTTPval.reduceByKey((a,b) => (a+b))

var HTTPcountsOrdered = HTTPcounts.sortBy(f => f._2, false);
HTTPcountsOrdered.take(5).foreach(println)

What error are you getting?

I can see that there is a typing mistake in the variable names: HTTPcnts vs. HTTPcounts.

The wrong variable name is just a typo. However, the error occurs in the reduceByKey transformation.

Please find the errors as below

scala> var HTTPval = urlaccesslogs.map(line=>(extractHTTP(line),1))
HTTPval: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[8] at map at <console>:32

scala> var HTTPcnts = HTTPval.reduceByKey((a,b) => (a+b))
HTTPcnts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:34

scala> HTTPcnts.take(5).foreach(println)
17/09/25 08:00:12 ERROR executor.Executor: Exception in task 0.0 in stage 2.0 (TID 1)
java.lang.ArrayIndexOutOfBoundsException
17/09/25 08:00:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 1, localhost): java.lang.ArrayIndexOutOfBoundsException
17/09/25 08:00:12 ERROR scheduler.TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 1, localhost): java.lang.ArrayIndexOutOfBoundsException

The main error that is occurring is ArrayIndexOutOfBoundsException.

This may be due to some lines not having 9 fields: you are splitting by space and then accessing arr(8). You have to remove such invalid lines from the records using the filter method.
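For example, a length guard like the one below (a minimal sketch; validlogs is just an illustrative name, and index 8 matches the arr(8) used in extractHTTP) keeps only the lines that actually have a 9th field:

var validlogs = urlaccesslogs.filter(line => line.split(" ").length > 8)
var HTTPval = validlogs.map(line => (extractHTTP(line), 1))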

Could you please help me find what is wrong in the code below?

// the function below extracts only the HTTP code //
def extractHTTP(line:String):Array[String] = {
var arr = line.split(" ");
if (arr(8) != null || arr(8) != " ")
arr(8)
}

<console>:26: error: type mismatch;
 found   : String
 required: Array[String]
       arr(8)
       ^
<console>:25: error: type mismatch;
 found   : Unit
 required: Array[String]
       if (arr(8) != null || arr(8) != " ")
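The two type mismatches say, first, that arr(8) is a String while the declared return type is Array[String], and second, that an if without an else branch evaluates to Unit. For reference, here is a version that compiles (a minimal sketch; returning Option[String] is one way to let the caller skip short lines, e.g. via flatMap):

def extractHTTP(line: String): Option[String] = {
  val arr = line.split(" ")
  // Return the 9th field only when it exists; None signals an invalid line.
  if (arr.length > 8) Some(arr(8)) else None
}

// flatMap drops the None entries automatically.
var HTTPval = urlaccesslogs.flatMap(line => extractHTTP(line)).map(code => (code, 1))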

Hello, kindly ignore the previous post…

I have used the filter method, but still no luck.

Please find the whole code below (the error is the same). Kindly help; I am badly stuck.

Problem 4 -

Write Spark code to find out the unique HTTP codes returned by the server along with their counts (this information helps the DevOps team find out how many requests are failing so that appropriate action can be taken to fix the issue).

val logFile = sc.textFile("/data/spark/project/NASA_access_log_Aug95.gz")

def containsHTTP(line:String):Boolean = {
val pattern = """(\d{3})""".r
val res = pattern.findFirstMatchIn(line)
if (res.isEmpty)
{
return false
}
else
{
return true
}
}

var urlaccesslogs = logFile.filter(containsHTTP)

// the function below extracts only the HTTP code //

def extractHTTP(line:String):String = {
var arr = line.split(" ");
arr(8)
}

var notnullval = urlaccesslogs.filter(line=>(extractHTTP(line)) != null)

var HTTPval = notnullval.map(line=>(extractHTTP(line),1))

var HTTPcnts = HTTPval.reduceByKey((a,b) => (a+b))
var HTTPcountsOrdered = HTTPcnts.sortBy(f => f._2, false);
HTTPcountsOrdered.take(5).foreach(println)
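The error persists because the null check never fires: extractHTTP itself calls arr(8), so a line with fewer than 9 space-separated fields throws ArrayIndexOutOfBoundsException inside the filter, before its result could ever be compared with null. Filtering on the field count instead (a minimal sketch reusing the variable names above; validlogs is a new helper name) avoids the exception:

var validlogs = urlaccesslogs.filter(line => line.split(" ").length > 8)
var HTTPval = validlogs.map(line => (extractHTTP(line), 1))
var HTTPcnts = HTTPval.reduceByKey((a, b) => a + b)
var HTTPcountsOrdered = HTTPcnts.sortBy(f => f._2, false)
HTTPcountsOrdered.take(5).foreach(println)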