cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Check out the JMP® Marketplace featured Capability Explorer add-in
Choose Language Hide Translation Bar
Craige_Hales
Super User
Web Logs

Log Files 

About 25% of the activity is 200-OKAbout 25% of the activity is 200-OK

I discovered my web server has been keeping some log files for a year, and some for a month. I've explored three types of log files that contain information about visitors to the server (there are other files that seem less interesting). Here's some typical examples:

The auth.log (IPs changed) 

Oct 23 06:02:24 server1 sshd[20172]: Failed password for invalid user admin from 192.168.1.1 port 64721 ssh2
Oct 23 06:09:01 server1 CRON[20181]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 23 06:09:01 server1 CRON[20181]: pam_unix(cron:session): session closed for user root
Oct 23 06:10:35 server1 sshd[20195]: Did not receive identification string from 192.168.1.1
Oct 23 06:10:40 server1 sshd[20196]: Invalid user admin from 192.168.1.1
Oct 23 06:10:40 server1 sshd[20196]: input_userauth_request: invalid user admin [preauth]
Oct 23 06:10:41 server1 sshd[20196]: pam_unix(sshd:auth): check pass; user unknown

Authorization attemps on a web server represent login attemps, either by me, software that belongs on my server (CRON, above), or another person or computer (someone trying a user ID of admin, above). Admin falls into the "Bad Guys" category.

The access.log (Apache server) (IPs changed)

192.168.2.2 - - [30/Aug/2016:08:30:29 -0400] "GET / HTTP/1.1" 200 1714 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
192.168.1.1 - - [30/Aug/2016:08:51:52 -0400] "GET /hndUnblock.cgi HTTP/1.0" 404 496 "-" "Wget(linux)"
192.168.1.1 - - [30/Aug/2016:08:51:52 -0400] "GET /tmUnblock.cgi HTTP/1.0" 404 495 "-" "Wget(linux)"
192.168.1.1 - - [30/Aug/2016:08:51:52 -0400] "GET /cgi/common.cgi HTTP/1.0" 404 496 "-" "Wget(linux)"
192.168.1.1 - - [30/Aug/2016:08:51:53 -0400] "GET /stssys.htm HTTP/1.0" 404 492 "-" "Wget(linux)"
192.168.1.1 - - [30/Aug/2016:08:51:53 -0400] "GET / HTTP/1.0" 200 1714 "-" "Wget(linux)"
192.168.1.1 - - [30/Aug/2016:08:51:54 -0400] "POST /command.php HTTP/1.0" 404 493 "-" "Wget(linux)"
192.168.3.3 - - [30/Aug/2016:09:09:01 -0400] "GET /?feed=rss2 HTTP/1.1" 200 794 "-" "Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)"

The access log entries represent attemps to get a web page from the server. The first line above represents a valid request by a web crawler (Baidu, a Chinese search engine). The get request was for / which is the main page. The 200 means the page was served OK. The next six lines all have a 404 (not found) result. This is a hacker trying to find a mis-configured web server. After several attempts, monitoring software (Fail2Ban) noticed what was happening and locked him out. I used to think Fail2Ban was really important (still is!), but then I found out about UFW, the Uncomplicated Fire Wall, and an even bigger threat.

Here's the UFW.log:

Oct  9 04:35:21 server1 kernel: [5110701.311961] [UFW BLOCK] IN=venet0 OUT= MAC= src=175.9.9.9 DST=192.168.1.1 LEN=60 TOS=0x00 PREC=0x00 TTL=40 ID=53848 DF PROTO=TCP SPT=10651 DPT=23 WINDOW=5808 RES=0x00 SYN URGP=0 
Oct  9 04:35:24 server1 kernel: [5110704.311558] [UFW BLOCK] IN=venet0 OUT= MAC= src=175.8.8.8 DST=192.168.1.1 LEN=60 TOS=0x00 PREC=0x00 TTL=40 ID=53849 DF PROTO=TCP SPT=10651 DPT=23 WINDOW=5808 RES=0x00 SYN URGP=0 
Oct  9 04:35:36 server1 kernel: [5110715.912365] [UFW BLOCK] IN=venet0 OUT= MAC= src=112.7.7.7 DST=192.168.1.1 LEN=40 TOS=0x00 PREC=0x00 TTL=39 ID=32234 PROTO=TCP SPT=60061 DPT=23 WINDOW=16244 RES=0x00 SYN URGP=0 
Oct  9 04:35:46 server1 kernel: [5110726.213736] [UFW BLOCK] IN=venet0 OUT= MAC= src=67.6.6.6 DST=192.168.1.1 LEN=44 TOS=0x00 PREC=0x40 TTL=239 ID=7084 PROTO=TCP SPT=57510 DPT=23 WINDOW=14600 RES=0x00 SYN URGP=0 
Oct  9 04:37:02 server1 kernel: [5110802.146865] [UFW BLOCK] IN=venet0 OUT= MAC= src=139.5.5.5 DST=192.168.1.1 LEN=40 TOS=0x00 PREC=0x00 TTL=39 ID=33753 PROTO=TCP SPT=6885 DPT=23 WINDOW=9693 RES=0x00 SYN URGP=0 
Oct  9 04:37:53 server1 kernel: [5110853.056320] [UFW BLOCK] IN=venet0 OUT= MAC= src=87.4.4.4 DST=192.168.1.1 LEN=60 TOS=0x00 PREC=0x00 TTL=54 ID=61204 DF PROTO=TCP SPT=47013 DPT=23 WINDOW=5808 RES=0x00 SYN URGP=0 
Oct  9 04:37:56 server1 kernel: [5110856.056659] [UFW BLOCK] IN=venet0 OUT= MAC= src=87.3.3.3 DST=192.168.1.1 LEN=60 TOS=0x00 PREC=0x00 TTL=54 ID=61205 DF PROTO=TCP SPT=47013 DPT=23 WINDOW=5808 RES=0x00 SYN URGP=0 
Oct  9 04:38:02 server1 kernel: [5110862.056195] [UFW BLOCK] IN=venet0 OUT= MAC= src=87.2.2.2 DST=192.168.1.1 LEN=60 TOS=0x00 PREC=0x00 TTL=54 ID=61206 DF PROTO=TCP SPT=47013 DPT=23 WINDOW=5808 RES=0x00 SYN URGP=0 
Oct  9 04:39:04 server1 kernel: [5110924.610656] [UFW BLOCK] IN=venet0 OUT= MAC= src=175.1.1.1 DST=192.168.1.1 LEN=40 TOS=0x00 PREC=0x00 TTL=43 ID=16953 PROTO=TCP SPT=35941 DPT=23 WINDOW=41293 RES=0x00 SYN URGP=0 

Each line of the UFW BLOCK lines represents another sort of attack on a server: looking for a mis-configured port. All of these lines are testing DPT=23, which is telnet, an ancient protocol. Notice this is happening at least once a minute. All day long. Every day.  

Load the data into a data table

Each of the files is formatted differently, the dates may not contain the current year, fields might have name=value or just be positional. I moved all the files from the Linux server to my windows machine and used FilesInDirectory to get a list of filenames in JSL. Then I wrote a set of patterns to efficiently parse the three types of files. The data table columns were chosen to represent the data I wanted to capture from each of the files:

Columns in tableColumns in table

ipaddr32 is a formula column to turn 1.2.3.4 into 1<<24 + 2<<16 + 3<<8 + 4 for a continuous axis on a graph. ipname will be looked up later. kind2 is a formula to reduce all the different user IDs into just "login".

path = "f:/logs/";
list = Files In Directory( path, "recursive" );
global:dt = New Table( "logData",  
    New Column( "datetime", Numeric, "Continuous", Format( "ddMonyyyy:h:m:s", 18, 0 ), Input Format( "ddMonyyyy:h:m:s", 0 )),
    New Column( "ipaddr", Character, "Nominal" ),
    New Column( "kind", Character, "Nominal" ),
    New Column( "detail1", Character, "Nominal" ),
    New Column( "detail2", Character, "Nominal" )
);

Efficient parsing

For maximum speed with large files, JSL parsing needs to be done in a single PatMatch function call. Each call to PatMatch has some overhead, and if you are making multiple calls then you are also probably chopping the input string up as you go, which adds more overhead. So, lets do it with one PatMatch per log file. The keys are two pattern matching functions: PatRepeat and PatTest. PatRepeat will let the pattern repeat for each line in the file. PatTest will let the pattern run some JSL during the match (nominally to test a condition, but our test result will always be "true"). A few helper patterns, and then into the good stuff...

ipPattern = patRegex("(\d{1,3}[-.]\d{1,3}[-.]\d{1,3}[-.]\d{1,3})|([a-fA-F0-9:]+)"); // approximation
ipMonth=patRegex("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec");
datePattern1 = "[" + patregex("\d{2}")>>day + "/" + ipMonth>>month + "/" + patregex("\d{4}")>>year + ":" + patregex("\d\d:\d\d:\d\d")>>time + patregex(" [-+]\d{4}") +"]"; // [18/Sep/2016:06:14:25 -0400] 
datePattern2 = ipMonth>>month + patspan(" ") + patregex("\d{1,2}")>>day + " " + patregex("\d\d:\d\d:\d\d")>>time + pattest(year="2016";1); //vs Oct 16 04:54:31   the unknown year is 2016 for these logs 
quote = "\!"";
quotepattern = quote + patrepeat(patbreak(quote||"\") +(("\"+(quote|""))|"")) + quote; // allows \" escaped embedded quotation marks

the two date patterns match the two date formats. The >> operator puts the matched text on the left into the variable on the right. PatTest(year="2016";1) always returns true after assigning 2016 to year; this is for the log files without an explicit year. I use a mix'n'match of patterns and regular expressions to build up bigger and bigger patterns. Here's the Apache access.log pattern:

apachePat = ipPattern>>ip+" "+patbreak(" ")+" "+patbreak(" ")+" "
        +datePattern1+" "
        +quotePattern>>request+" "
        +patbreak(" ")>>code+" "
        +patbreak(" ")>>length+" "+patbreak(" ")+" "
        +(quotePattern|patbreak("\!n\!r"))>>browser
        +pattest(
            dt1<<addrows(1);
            dt1:datetime = inFormat( day||month||year||":"||time, "ddMonyyyy:h:m:s" );
            dt1:ipaddr = ip;
            dt1:kind = code;
            dt1:detail1 = request;
            dt1:detail2 = browser;
            ;1);

Compare what this does to a line from the access.log

192.168.1.1 - - [30/Aug/2016:08:51:52 -0400] "GET /tmUnblock.cgi HTTP/1.0" 404 495 "-" "Wget(linux)"

and notice the pattest(...) at the end; it adds a row to the table dt1 and populates the columns with the values just parsed.

There are also UFW and auth patterns. Here's the function to parse a file:


global:skipPattern = (patbreak("\!n\!r") + patlen(1))>> unknown + pattest(show(filename,unknown);stop();1);
global:whitespace = patregex("\s*");

parser = Function( {filename, pattern, dt1},  {Default Local},
    txt = Load Text File( filename );
    rc = Pat Match( //
        txt, //
        Pat Pos( 0 ) //
        + Pat Repeat( ((global:whitespace + pattern + global:whitespace) | global:skipPattern) + Pat Fence() ) //
        + Pat R Pos( 0 ) //
    ); //
);

The function loads a log from filename and uses the pattern parameter to parse it and stores the results in dt1. The PatMatch function builds an alternation pattern that allows unrecognized lines to be skipped. PatFence() is another optimization for large files; it means "if you got this far, there is no point backing up and trying something different", and it can both speed up the match and reduce the memory required. Finaly, here's the driver for the function:

While( N Items( file = Remove From( list, 1 ) ) == 1, //
    file = file[1];//
    If(//
        !Is Missing( Regex( file, "ssl_access.log" ) ), //
            parser( path || file, apachePat, global:dt );//
    ,  // else 
        !Is Missing( Regex( file, "other_vhosts_access.log" ) ), //
            0//parser( path || file, apachePat, global:dt );//
    ,  // else
        !Is Missing( Regex( file, "access.log" ) ), //
            parser( path || file, apachePat, global:dt );//
    ,  // else
        !Is Missing( Regex( file, "ufw.log" ) ), 
            parser( path || file, ufwPat, global:dt );//
    ,  // else
        !Is Missing( Regex( file, "auth.log" ) ), //
            parser( path || file, authPat, global:dt );//
    );//	
);//

Points of Interest

The Raspberry PI computer seems to attract a lot of attacks using username PI. PHP scripting must be mis-configured pretty often; there are a lot of checks for certain PHP "admin" files. And finding an open port that has a weakness must be really valuable; the vast majority of the attacks are testing ports. If you manage a web server and you are not using fail2ban and firewall rules (like UFW), you should consider installing and configuring them. You can slow down a brute force attack by allowing a few attemps per day instead of multiple attempts per second.

Dashboard view of web trafficDashboard view of web traffic

Last Modified: Dec 2, 2016 10:44 AM