Log parsing project

Can anyone please explain this function line by line? It is used for log parsing in Python.

Please, it's urgent.

def parseLogLine(log):
    m = re.match(PATTERN, log)
    if m:
        return [Row(host=m.group(1), timeStamp=m.group(4), url=m.group(6), httpCode=int(m.group(8)))]
    else:
        return []

Thank you.

It tries to match the regular expression PATTERN against the “log” text, starting at the beginning of the string:

m = re.match(PATTERN, log)

re.match returns a match object if the line matches and None otherwise, so this condition is true only when a match was found:

if m:

The regular expression PATTERN contains capture groups (the parts in parentheses). group(1) returns the text matched by the first capture group:

m.group(1)

It returns a list containing a single Row object. The Row is created by calling its constructor with the keyword arguments host, timeStamp, url and httpCode, whose values are extracted from the “log” text by the regular expression. If the line did not match, the else branch returns an empty list instead:

return [Row(host=m.group(1), timeStamp=m.group(4),url=m.group(6), httpCode=int(m.group(8)))]
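To try the function outside Spark, here is a minimal sketch. Two things in it are assumptions, not part of the original code: the PATTERN is written with the square brackets around the timestamp escaped, and Row is a namedtuple used as a hypothetical stand-in for the Row class of the original (which is presumably Spark's Row). The sample line is the one posted in this thread.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the Row class, so the sketch runs without Spark.
Row = namedtuple("Row", ["host", "timeStamp", "url", "httpCode"])

# Assumed pattern: host, identity, user, [timestamp], "method url protocol", status, size.
PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(\S+) (\S+)(.*)" (\d{3}) (\S+)'

def parseLogLine(log):
    m = re.match(PATTERN, log)  # match object, or None if the line does not fit
    if m:
        # One-element list holding the fields taken from the capture groups.
        return [Row(host=m.group(1), timeStamp=m.group(4),
                    url=m.group(6), httpCode=int(m.group(8)))]
    else:
        return []  # line did not match: nothing to keep

line = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
rows = parseLogLine(line)
bad = parseLogLine("not a log line")
print(rows)
print(bad)
```

Returning a list with zero or one element is presumably meant for use with something like flatMap, where malformed lines simply drop out of the result.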

I am confused about how the groups are numbered.
I understood group(1), but I did not understand group(4), group(6) and group(8). Why have we not taken group(2) for the timestamp, and so on?

Please clarify this.

Please look at the data and at the pattern. You will see which fields have been skipped.
Maybe those fields were not needed.

Post one line from “log” and “PATTERN” here.

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

PATTERN = '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(\S+) (\S+)(.*)" (\d{3}) (\S+)'

[Row(host=m.group(1), timeStamp=m.group(4),url=m.group(6), httpCode=int(m.group(8)))]

Does group(6) contain all of /shuttle/missions/sts-68/news/sts-68-mcc-05.txt?

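Yes, up to the space. '(\S+)' matches one or more non-whitespace characters, so it stops at the first space. The rest of the request (here " HTTP/1.0") is picked up by the next group, '(.*)'.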
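A quick way to see why groups 4, 6 and 8 are used is to print every capture group for the sample line. This is only a sketch: it assumes the pattern with the square brackets around the timestamp escaped, and the variable names are mine.

```python
import re

# Assumed pattern, with the literal [ and ] around the timestamp escaped.
PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(\S+) (\S+)(.*)" (\d{3}) (\S+)'

line = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')

m = re.match(PATTERN, line)
groups = m.groups()  # tuple of all capture groups, in order

# Group numbers start at 1, so enumerate from 1 to mirror m.group(n).
for number, text in enumerate(groups, start=1):
    print(number, repr(text))
```

Groups 2 and 3 are the two "-" fields, group 5 is the request method (GET), group 7 is the protocol (" HTTP/1.0") and group 9 is the response size (1839). That leaves group 4 for the timestamp, group 6 for the URL and group 8 for the status code.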

One more question, sir. I hope I am not irritating you…

Please explain the (timestamp,1,14) command mentioned below,

timeStamp seems to be one of the columns of the nasa_log dataframe.

I am asking about (timestamp,1,14): what is the significance of (1,14)?