Log parsing project

Anubhav_Gupta · March 31, 2020, 3:22am

Can anyone please explain this function, used for log parsing in Python, line by line?

Please, it's urgent.

def parseLogLine(log):
    m = re.match(PATTERN, log)
    if m:
        return [Row(host=m.group(1), timeStamp=m.group(4), url=m.group(6), httpCode=int(m.group(8)))]
    else:
        return []

Thank you.

sgiri · March 31, 2020, 5:33am

This line tries to match the regular expression PATTERN against the start of the "log" text:

m = re.match(PATTERN, log)

If a match is found, m is a match object and this condition is true. Otherwise m is None and the condition is false:

if m:

The regular expression PATTERN contains capture groups (parenthesized sub-expressions). Groups are numbered from 1 in the order of their opening parentheses, so this refers to the first capture:

m.group(1)

It returns a list containing a single Row object. The Row is created by calling its constructor with the keyword arguments host, timeStamp, url and httpCode, each extracted from the "log" text by the regular expression:

return [Row(host=m.group(1), timeStamp=m.group(4),url=m.group(6), httpCode=int(m.group(8)))]
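To see the whole function run end to end, here is a minimal sketch. It assumes the pattern and sample line posted later in this thread, with the square brackets around the timestamp escaped. `Row` here is a hypothetical namedtuple stand-in so the example runs without Spark; in the original project it is presumably Spark's Row class.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the project's Row class, so this runs without Spark.
Row = namedtuple("Row", ["host", "timeStamp", "url", "httpCode"])

# Pattern from this thread, with the literal [ and ] around the timestamp escaped.
PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(\S+) (\S+)(.*)" (\d{3}) (\S+)'


def parseLogLine(log):
    # re.match returns a match object on success and None on failure.
    m = re.match(PATTERN, log)
    if m:
        # One-element list holding the fields of interest.
        return [Row(host=m.group(1), timeStamp=m.group(4),
                    url=m.group(6), httpCode=int(m.group(8)))]
    else:
        # Lines that do not fit the pattern yield an empty list.
        return []


sample = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
          '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')

parsed = parseLogLine(sample)
unparsed = parseLogLine("not a log line")
print(parsed)
print(unparsed)
```

Returning a list in both branches means a caller can treat good and bad lines the same way: a malformed line simply contributes no rows.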

Anubhav_Gupta · March 31, 2020, 5:50am

I am confused about how the groups are numbered.
I understand group(1), but not group(4), group(6) and group(8). Why is group(2) not used for the timestamp, and so on?

Please clarify this.

sgiri · March 31, 2020, 6:03am

Please look at the data and at the pattern. You will see which fields have been skipped.
Maybe those fields were not needed.

sgiri · March 31, 2020, 6:03am

Post one line from “log” and “PATTERN” here.

Anubhav_Gupta · March 31, 2020, 6:06am

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

PATTERN = '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(\S+) (\S+)(.*)" (\d{3}) (\S+)'

[Row(host=m.group(1), timeStamp=m.group(4),url=m.group(6), httpCode=int(m.group(8)))]
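One way to see why the timestamp is group(4) rather than group(2) is to print every capture for the sample line. This is a small sketch, assuming the pattern above with the literal square brackets around the timestamp escaped as `\[` and `\]`:

```python
import re

PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(\S+) (\S+)(.*)" (\d{3}) (\S+)'

line = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')

m = re.match(PATTERN, line)
groups = m.groups()  # tuple of all captures; groups[0] is m.group(1)

# Print each capture next to its group number.
for number, value in enumerate(groups, start=1):
    print(number, repr(value))
```

Groups 2 and 3 capture the two "-" placeholder fields that follow the host, so the timestamp lands in group 4. Group 5 is the request method (GET), group 6 the URL, group 7 the rest of the request (the protocol), group 8 the status code and group 9 the response size. The function only keeps groups 1, 4, 6 and 8.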

sgiri · March 31, 2020, 7:50am

Anubhav_Gupta · March 31, 2020, 8:52am

Does group(6) contain the whole of /shuttle/missions/sts-68/news/sts-68-mcc-05.txt?

sgiri · March 31, 2020, 9:07am

Yes, up to the space. '(\S+) ' means one or more non-whitespace characters followed by a space.
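A quick way to check this is to run only the request part of the pattern on the quoted request string from the sample line. This is just an illustrative sketch; the variable names are made up for the example.

```python
import re

# The text between the double quotes in the sample log line.
request = "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0"

# Same three captures as groups 5, 6 and 7 of the full pattern.
m = re.match(r'(\S+) (\S+)(.*)', request)

method = m.group(1)  # non-whitespace run before the first space
url = m.group(2)     # non-whitespace run before the second space
rest = m.group(3)    # everything left over, including the leading space

print(repr(method), repr(url), repr(rest))
```

Because `\S` cannot match a space, the URL capture has to stop at the space before HTTP/1.0, and the protocol ends up in the following `(.*)` group.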

Anubhav_Gupta · March 31, 2020, 11:36am

One more question, sir. I hope I am not irritating you.

Please explain the (timestamp,1,14) command mentioned below.

sgiri · March 31, 2020, 3:57pm

timeStamp seems to be one of the columns of the nasa_log dataframe.

Anubhav_Gupta · April 1, 2020, 4:13am

I am asking about (timestamp,1,14): what is the significance of (1,14)?
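The command itself was not posted in the thread, so the following is only a guess: if it is a SQL-style substring(timeStamp, 1, 14) call, then 1 is the starting position (counted from 1, not 0) and 14 is the number of characters to take. The helper `sql_substring` below is hypothetical and only emulates that behaviour in plain Python:

```python
def sql_substring(text, pos, length):
    """Emulate SQL-style substring(text, pos, length) for a 1-based pos >= 1."""
    start = pos - 1  # convert the 1-based position to a 0-based index
    return text[start:start + length]


time_stamp = "01/Aug/1995:00:00:01 -0400"

# Take 14 characters starting at the first character.
day_and_hour = sql_substring(time_stamp, 1, 14)
print(day_and_hour)
```

Under that assumption, the first 14 characters of the timestamp are the date plus the hour, which would be handy for grouping requests by hour.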