Analyzing Apache Log Files

  • This tutorial will cover the steps of processing log files
    • Read the log file as a dataframe
    • Parse each log line, extract relevant fields and create a structured dataframe
    • Check for bad records or bad data elements
    • Clean up the fields and create a clean data set
    • Do some basic exploratory analysis to find insights

Load the log file

The data set is from the NASA Kennedy Space Center web server in Florida. The full data set is freely available at http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html, and it contains all HTTP requests for the month of August 1995.
  • Read the log file into a dataframe using sqlContext.read.text(). Each log line becomes a row (record) with a single string column called value.
In [1]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
In [2]:
log_file = sqlContext.read.text( "file:///home/hadoop/lab/data/access_log_Aug95")
In [3]:
log_file.show( 5 )
+--------------------+
|               value|
+--------------------+
|in24.inetnebr.com...|
|uplherc.upl.com -...|
|uplherc.upl.com -...|
|uplherc.upl.com -...|
|uplherc.upl.com -...|
+--------------------+
only showing top 5 rows

In [4]:
log_file.show(5, truncate=False)
+--------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                     |
+--------------------------------------------------------------------------------------------------------------------------+
|in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839|
|uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0                                                   |
|uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0                          |
|uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0                        |
|uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0                           |
+--------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows

Parsing the log file and extracting relevant fields

  • The logs are in Common Log Format.
  • The fields are
    • remotehost rfc931 authuser [date] "request" status bytes

field        meaning
remotehost   Hostname or IP address of the machine which made the HTTP request.
rfc931       Remote logname of the user.
authuser     Username of the remote user, as authenticated by the HTTP server.
[date]       The date and time at which the HTTP request arrived at the web server.
"request"    The HTTP request line from the browser (method, URL and protocol).
status       The HTTP status code which the server sent back to the client.
bytes        The number of bytes (Content-Length) transferred back to the client.

The Spark SQL function regexp_extract() will be used to extract the fields from each log line.

Details of regular expressions can be found at the RegexOne web site.
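To see how these fields map onto a raw line, here is a minimal sketch in plain Python (using the standard re module, separate from the Spark code below; the variable names and the simplified pattern are just for illustration) that pulls the fields out of one Common Log Format line taken from the data set:

import re

# One raw line from the data set
sample = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
          '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')

# remotehost rfc931 authuser [date] "method url protocol" status bytes
clf = r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)$'
m = re.match(clf, sample)
remotehost, rfc931, authuser, date, method, url, protocol, status, size = m.groups()
print(remotehost)    # in24.inetnebr.com
print(date)          # 01/Aug/1995:00:00:01 -0400
print(url)           # /shuttle/missions/sts-68/news/sts-68-mcc-05.txt
print(status, size)  # 200 1839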

In [5]:
from pyspark.sql.functions import split, regexp_extract

# Pull the Common Log Format fields out of each line with regexp_extract()
split_df = log_file.select(
    regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),            # remotehost (the group keeps the trailing space)
    regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),  # [date]
    regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),                            # requested path
    regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),                       # HTTP status code
    regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))                    # bytes
split_df.show(2, truncate=False)
+------------------+--------------------------+-----------------------------------------------+------+------------+
|host              |timestamp                 |path                                           |status|content_size|
+------------------+--------------------------+-----------------------------------------------+------+------------+
|in24.inetnebr.com |01/Aug/1995:00:00:01 -0400|/shuttle/missions/sts-68/news/sts-68-mcc-05.txt|200   |1839        |
|uplherc.upl.com   |01/Aug/1995:00:00:07 -0400|/                                              |304   |0           |
+------------------+--------------------------+-----------------------------------------------+------+------------+
only showing top 2 rows

Cache the dataframe

In [6]:
split_df.cache()
Out[6]:
DataFrame[host: string, timestamp: string, path: string, status: int, content_size: int]
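
Note that cache() is lazy: the data is only materialized in memory on the first action that touches it. A rough way to see the effect is to time two identical actions (a sketch; the exact timings depend entirely on the machine, so none are shown here):

import time

t0 = time.time(); split_df.count(); print("first count (reads and caches):", round(time.time() - t0, 2), "s")
t0 = time.time(); split_df.count(); print("second count (served from cache):", round(time.time() - t0, 2), "s")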

Data Cleaning

Let's see how well our parsing logic worked. First, let's verify that there are no null rows in the original data set.
In [7]:
split_df.columns
Out[7]:
['host', 'timestamp', 'path', 'status', 'content_size']

How many null records are there in the original file?

In [8]:
log_file.filter( log_file["value"].isNull() ).count()
Out[8]:
0

Not a single null log entry was found in the original file.

So how many null values were found in the columns after parsing the records?

In [9]:
bad_rows_df = split_df.filter(split_df['host'].isNull() |
                            split_df['timestamp'].isNull() |
                            split_df['path'].isNull() |
                            split_df['status'].isNull() |
                           split_df['content_size'].isNull())
bad_rows_df.count()
Out[9]:
14178

There seem to be 14178 rows which contain null values after parsing the records with the regular expressions.
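
To put that number in perspective, one could compute what fraction of the whole file these bad rows represent; a small sketch (the result is not shown here):

total_rows = log_file.count()
bad_rows = bad_rows_df.count()
print("bad rows: %d of %d (%.2f%%)" % (bad_rows, total_rows, 100.0 * bad_rows / total_rows))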

In [10]:
from pyspark.sql import Column

# Number of rows in which the given column is null
def count_null( col_name ):
  return split_df.filter( split_df[col_name].isNull() ).count()
In [11]:
for col_name in split_df.columns:
  print( col_name, " : ", count_null( col_name ) )
host  :  0
timestamp  :  0
path  :  0
status  :  0
content_size  :  14178

Looks like there are many nulls in the content_size column. All the other columns seem to be fine.
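
As an aside, the same per-column null counts can be computed in a single pass over the data, instead of one Spark job per column as the loop above does; a minimal sketch:

from pyspark.sql.functions import col, sum as sql_sum

# Turn isNull() into a 0/1 flag per column and sum it over all rows
null_counts = split_df.select(
    [sql_sum(col(c).isNull().cast('integer')).alias(c) for c in split_df.columns])
null_counts.show()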

In [12]:
split_df.filter( split_df["content_size"].isNull() ).show( 5 )
+--------------------+--------------------+--------------------+------+------------+
|                host|           timestamp|                path|status|content_size|
+--------------------+--------------------+--------------------+------+------------+
|        gw1.att.com |01/Aug/1995:00:03...|/shuttle/missions...|   302|        null|
|js002.cc.utsunomi...|01/Aug/1995:00:07...|/shuttle/resource...|   404|        null|
|    tia1.eskimo.com |01/Aug/1995:00:28...|/pub/winvn/releas...|   404|        null|
|itws.info.eng.nii...|01/Aug/1995:00:38...|/ksc.html/facts/a...|   403|        null|
|grimnet23.idirect...|01/Aug/1995:00:50...|/www/software/win...|   404|        null|
+--------------------+--------------------+--------------------+------+------------+
only showing top 5 rows

Find the frequency of null values for each status code.

In [13]:
( split_df.filter(
  split_df["content_size"].isNull() )
  .groupby( "status").count() ).show()
+------+-----+
|status|count|
+------+-----+
|   501|   27|
|   302| 3836|
|   200|   78|
|   400|   10|
|   403|  171|
|   404|10056|
+------+-----+

Find the original log lines which convert to null values

  • Check with a regular expression whether the last field is not a digit
In [14]:
( log_file.filter(
      ~log_file["value"].rlike('''\d+$'''))
      .show( 5, truncate = False ) )
+---------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                      |
+---------------------------------------------------------------------------------------------------------------------------+
|gw1.att.com - - [01/Aug/1995:00:03:53 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -                            |
|js002.cc.utsunomiya-u.ac.jp - - [01/Aug/1995:00:07:33 -0400] "GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0" 404 -|
|tia1.eskimo.com - - [01/Aug/1995:00:28:41 -0400] "GET /pub/winvn/release.txt HTTP/1.0" 404 -                               |
|itws.info.eng.niigata-u.ac.jp - - [01/Aug/1995:00:38:01 -0400] "GET /ksc.html/facts/about_ksc.html HTTP/1.0" 403 -         |
|grimnet23.idirect.com - - [01/Aug/1995:00:50:12 -0400] "GET /www/software/winvn/winvn.html HTTP/1.0" 404 -                 |
+---------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows

These are requests for which the server did not send back any content, so the bytes field contains a dash (-) instead of a number, and casting "-" to integer produced a null.
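
A tiny illustration of that cast behaviour (the toy dataframe and its column names are made up just for this example):

# Casting a non-numeric string to integer yields null in Spark SQL
demo_df = sqlContext.createDataFrame([("1839",), ("-",)], ["bytes_str"])
demo_df.select(demo_df["bytes_str"].cast("integer").alias("bytes_int")).show()
# +---------+
# |bytes_int|
# +---------+
# |     1839|
# |     null|
# +---------+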

  • We can convert all null values to 0
In [15]:
cleaned_df = split_df.na.fill( 0 )
cleaned_df.cache()
Out[15]:
DataFrame[host: string, timestamp: string, path: string, status: int, content_size: int]

Do we have any more null values?

In [16]:
cleaned_df.filter( cleaned_df["content_size"].isNull() ).count()
Out[16]:
0
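
As an aside, the dash could also have been handled at parse time instead of filling nulls afterwards. A rough sketch, shown only for the content_size column and not how this tutorial proceeds:

from pyspark.sql.functions import regexp_extract, when, col

# Capture either the byte count or the dash, then map the dash to 0 explicitly
alt_df = log_file.select(regexp_extract('value', r'^.*\s+(\d+|-)$', 1).alias('raw_size'))
alt_df = alt_df.withColumn(
    'content_size',
    when(col('raw_size') == '-', 0).otherwise(col('raw_size').cast('integer')))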

Find the most frequently accessed pages

In [22]:
from pyspark.sql.functions import desc

top_10_pages = cleaned_df.groupby( "path" ).count().sort( desc( "count" ) )
In [23]:
top_10_pages.show( 10, truncate = False )
+---------------------------------------+-----+
|path                                   |count|
+---------------------------------------+-----+
|/images/NASA-logosmall.gif             |97275|
|/images/KSC-logosmall.gif              |75283|
|/images/MOSAIC-logosmall.gif           |67356|
|/images/USA-logosmall.gif              |66975|
|/images/WORLD-logosmall.gif            |66351|
|/images/ksclogo-medium.gif             |62670|
|/ksc.html                              |43618|
|/history/apollo/images/apollo-logo1.gif|37806|
|/images/launch-logo.gif                |35119|
|/                                      |30105|
+---------------------------------------+-----+
only showing top 10 rows

It seems like a lot of these are images which are embedded in pages.

Find the HTML pages which are most frequently accessed

In [29]:
only_html_pages = cleaned_df.filter( cleaned_df["path"].rlike( 'html' ) )
In [30]:
top_10_pages = only_html_pages.groupby( "path" ).count().sort( desc( "count" ) )
top_10_pages.show( 10, truncate = False )
+-----------------------------------------------+-----+
|path                                           |count|
+-----------------------------------------------+-----+
|/ksc.html                                      |43618|
|/shuttle/missions/sts-69/mission-sts-69.html   |24592|
|/shuttle/missions/missions.html                |22429|
|/software/winvn/winvn.html                     |10343|
|/history/history.html                          |10111|
|/history/apollo/apollo.html                    |8973 |
|/shuttle/countdown/liftoff.html                |7858 |
|/history/apollo/apollo-13/apollo-13.html       |7160 |
|/shuttle/technology/sts-newsref/stsref-toc.html|6506 |
|/shuttle/missions/sts-69/images/images.html    |5260 |
+-----------------------------------------------+-----+
only showing top 10 rows
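
A natural next exploration, sketched here as a suggestion rather than part of the original walkthrough, is to see which paths generate the most 404 (Not Found) responses:

from pyspark.sql.functions import desc

# Group the failed requests by path and rank them by frequency
top_404_paths = (cleaned_df
                 .filter(cleaned_df["status"] == 404)
                 .groupby("path").count()
                 .sort(desc("count")))
top_404_paths.show(10, truncate=False)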