Monday, December 22, 2014

Mumble + Splunk

Who's on the Server? Splunk it!


Mumble is a great VOIP solution for latency sensitive situations; I run several servers for different applications, and monitoring those has always been a bit of a challenge since it only generates text logs and doesn't do any historic usage tracking. Fortunately we've got a tool to solve that problem in both real time and historic situations: Splunk.

In this article we'll walk through a simple example of data ingestion, parsing, and dashboard creation in Splunk. When we're finished we'll be able to tell who is online at any given time and how popular the server has been in the recent past.


Step 1: Data Ingestion

This section assumes the following:

  • Mumble (or another app if using this as a general guideline) installed and logging to a consistent location.
  • Logging is not set to rename logfiles as part of the log rollover process; renamed logs can still be handled, but they complicate the monitor setup, so we avoid them here.
  • The Splunk forwarder is installed on the target machine(s) and is already configured to output to the indexer(s). (install location referred to henceforth as $SPLUNK_HOME) What's that? You've got thousands of boxes and no time to install? We can fix that!

I'm using Windows boxes in this example, but there's no reason all of this won't work on Linux as well with some minor tweaks.
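If you need a refresher on the forwarder-output assumption above, the output side is typically just a couple of stanzas in outputs.conf on each forwarder. This is only a minimal sketch: the group name, indexer hostname, and port below are placeholders (9997 is simply the conventional receiving port), so substitute your own values.

    # $SPLUNK_HOME/etc/system/local/outputs.conf on the forwarder
    # (group name, hostname, and port are placeholders - adjust to your environment)
    [tcpout]
    defaultGroup = primary_indexers

    [tcpout:primary_indexers]
    server = splunk-indexer.example.com:9997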


To get the data into Splunk we'll first need to identify where the data is and how to ensure it gets into the target indexer(s). In my case I'll be targeting the following data sources (four Murmur instances plus performance counters) from one server:
  • C:\Apps\Murmur1LowQual\murmur.log
  • C:\Apps\Murmur2LowQual\murmur.log
  • C:\Apps\Murmur1HighQual\murmur.log
  • C:\Apps\Murmur2HighQual\murmur.log
  • Performance data: Network interface, CPU Usage

If the application to be monitored is distributed widely and consistently across your enterprise, you will want to make the input changes in an "app" for deployment, assuming you use "Forwarder Management" (see the sketch below). In this case it is a one-off configuration, so I will be specifying these inputs manually.
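For reference, if you do go the Forwarder Management route, the same inputs would live in a small app on the deployment server rather than in system/local. A rough sketch of that layout follows; the app name "murmur_inputs" is just an example, and you would still map the app to a server class in Forwarder Management.

    $SPLUNK_HOME/etc/deployment-apps/murmur_inputs/
        local/
            inputs.conf      # the same monitor:// and perfmon:// stanzas shown below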


For our one-off case, open the $SPLUNK_HOME/etc/system/local/inputs.conf file for editing (for background, make sure you understand "About Configuration Files") and add the following entries:

    [monitor://C:\Apps\Murmur1LowQual\murmur.log]
    disabled = false
    sourcetype = murmur
    
    [monitor://C:\Apps\Murmur1HighQual\murmur.log]
    disabled = false
    sourcetype = murmur

    [monitor://C:\Apps\Murmur2HighQual\murmur.log]
    disabled = false
    sourcetype = murmur
    
    [monitor://C:\Apps\Murmur2LowQual\murmur.log]
    disabled = false
    sourcetype = murmur
    
    [perfmon://CPU Load]
    counters = % Processor Time
    instances = _Total
    interval = 20
    object = Processor
    
    [perfmon://Network Interface]
    counters = Bytes Received/sec;Bytes Sent/sec
    instances = *
    interval = 15
    object = Network Interface
    
Note these entries are Windows specific; the paths and the perfmon stanzas would need to be changed on a Linux host (see the sketch at the end of this step).

Where:
  • monitor://<logfilePath> specifies the path to the logfile to be ingested (Murmur is Mumble's server component)
  • sourcetype = murmur specifies the sourcetype to store this information under in Splunk. This is critical for properly sorting the data.
  • perfmon://<counter>, counters = <counter1>;<counter2>, instances = *, and object = <object in question> specify the performance information we want to bring in. These stanzas are Windows specific and need to be changed for *nix.
  • interval = <interval in seconds> is the collection interval for that performance data. A lower interval means more granular performance information but more data to store.

This may or may not be sufficient in your case. For more information, see "Edit inputs.conf" on the Splunk site.
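As a point of comparison for Linux hosts, the monitor stanza looks the same apart from the path. The path below assumes the Debian/Ubuntu mumble-server package's default log location; check the logfile setting in your murmur.ini if yours differs. The perfmon stanzas have no direct *nix equivalent, and CPU/network metrics typically come from the Splunk Add-on for Unix and Linux instead.

    [monitor:///var/log/mumble-server/mumble-server.log]
    disabled = false
    sourcetype = murmur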


Step 2: Pre-Format Data


Now we need to take the steps necessary to easily create searches against this data. At a minimum, any Splunk admin should take the time to do proper field extractions from a new data source. Creating custom fields ensures there will be meaningful information for users to search on. This part of Splunk operation is often the make-or-break point in many organizations, as proper field extraction can be the difference between an end user figuring out how to create meaningful searches and giving up and going back to the original log files.

In this example I'll use one regex to extract three fields from the Murmur logfile under the "search" app context. To set up this field extraction:

  1. Log in to your Splunk web interface
  2. Change to the "search" app context
  3. Navigate to "Settings -> Fields"


  4. Click on "Field extractions"


  5. Click "New"
  6. Leave the "destination app" as "search" (we'll work in search for now, but you could make this into an app)
  7. Name it per your enterprise standard. In my case that is <AppName_fields extracted>, so "Murmur_sessionID_UserName_AdminStatus".
  8. Change "Apply to" to "sourcetype" named "murmur"
  9. Keep "Type" set to "Inline"
  10. For the Extraction/Transform insert your regex statement. To extract the Session ID (as session_ID), UserName (as u_name), and AdminStatus (as userIsAdmin) from a Murmur logfile use this: "=> <(?<session_ID>[^:]+):(?<u_name>[^\(]+)\((?<userIsAdmin>[^\)]+)"
  11. Click "Save"


  12. To make this usable for everyone, click "Permissions" under "Sharing" to the right of the name on the "Field extractions" screen. 


  13. Select "This app only (search)" (change this later if you use a different app), check "Read" under "Everyone", and click "Save". 

We have now configured field extractions for Mumble. While there is much more one should do (data sizing, business use cases, etc.) to onboard an application, this will be enough for now to develop a basic dashboard.
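If you prefer to manage extractions in configuration files rather than the UI, the inline extraction above is roughly equivalent to an EXTRACT entry in props.conf on the search head. This is a sketch only; the class name is arbitrary and the sample log line is illustrative rather than copied from a real server.

    # props.conf - search-time field extraction for the murmur sourcetype
    # Illustrative Murmur log line this is meant to match:
    #   <W>2014-12-22 20:15:01.123 1 => <5:SomeUser(-1)> Authenticated
    [murmur]
    EXTRACT-murmur_session_user_admin = => <(?<session_ID>[^:]+):(?<u_name>[^\(]+)\((?<userIsAdmin>[^\)]+)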


Step 3: Searches and a Dashboard!


Now we'll make use of this data. This is the beauty of Splunk; you can format almost all the data in a meaningful way, and even create new data points inferred from the other available data. To illustrate what I mean, one of these searches will determine who is online right now from the logon/logoff information in the log file. Let's tackle that and a few others:


Search 1: Who is online right now?

sourcetype=murmur |transaction session_ID,u_name maxspan=24h|search authenticated NOT closed|eval AdminRange = case(userIsAdmin < 0, "False", userIsAdmin >= 1, "True")|table u_name,session_ID,AdminRange,_time | rename u_name as "User Name", session_ID as "Session ID", AdminRange as "User is Administrator"

Will generate a table like this:


Where:

sourcetype=murmur : only search the appropriate sourcetype. Note you may want to further limit by host, index, or other locations.

transaction session_ID,u_name maxspan=24h : uses the powerful "transaction" command to string events into a transaction by the listed fields. Note: This command can be computationally expensive so be careful when using it!

search authenticated NOT closed : look for sessions where a user has connected but the session has not yet closed. Note: again, since we're using a "NOT" clause this search could be expensive depending on your data volume; fortunately the earlier parts of the search have already narrowed the data considerably.

eval AdminRange = case(userIsAdmin < 0, "False", userIsAdmin >= 1, "True") : use the eval command to determine admin status as Boolean

table u_name,session_ID,AdminRange,_time : Render as a table

rename u_name as "User Name", session_ID as "Session ID", AdminRange as "User is Administrator" : Rename table fields to be more meaningful to the end user
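If the transaction command proves too expensive on your data volume (per the note above), a rough stats-based alternative is sketched below. It assumes, as the search above does, that the log lines literally contain the words "authenticated" and "closed", and it should be run over the same 24-hour window:

sourcetype=murmur (authenticated OR closed) | eval action=if(searchmatch("closed"),"closed","authenticated") | stats latest(action) as last_action latest(_time) as _time by session_ID,u_name | where last_action="authenticated" | table u_name,session_ID,_time

The idea is simply that a user whose most recent event for a given session is the authentication, rather than the close, is presumably still online.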

Since we're limiting a session span to 24 hours in the transaction search, you may as well reduce the time range for this search to 24 hours as well. For a slight performance tweak on searches that display a table or graphic of specific fields you can change the search mode from "smart" to "fast". Now let's add this to a new dashboard:
  1. After the results come up, click "Save As" and choose "Dashboard Panel" in the upper right-hand corner.
  2. Select "New"
  3. Insert an appropriate title, i.e. "Mumble Statistics" following enterprise standards if applicable. Generally it is OK to let the Dashboard ID auto-populate with this name.
  4. Write a description if desired and change the Dashboard Permissions to "Shared in App" so we can share this information with others. 
  5. Type an appropriate Panel Title such as "Users Online Now!". Accept the defaults for the remaining options and click "Save".
For fun you can make a little tachometer displaying this information by opening the saved search and saving it to the existing dashboard under a different name, then changing the display to "Radial Gauge". End users generally like these "at a glance" graphics for important information, especially at the top of a dashboard.
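For the curious, the panel ends up as Simple XML behind the scenes, which you can view and tweak via the dashboard's "Edit Source" option. A minimal sketch of what the radial gauge panel might look like is below; the exact markup varies by Splunk version, so treat this as illustrative rather than copy-paste:

    <dashboard>
      <label>Mumble Statistics</label>
      <row>
        <panel>
          <title>Users Online Now!</title>
          <chart>
            <search>
              <query>sourcetype=murmur | transaction session_ID,u_name maxspan=24h | search authenticated NOT closed | stats count</query>
              <earliest>-24h</earliest>
              <latest>now</latest>
            </search>
            <!-- render the single count as a radial gauge -->
            <option name="charting.chart">radialGauge</option>
          </chart>
        </panel>
      </row>
    </dashboard>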

Search 2: How many people have logged on per day for the last 30 days?

sourcetype=murmur u_name=* authenticated |timechart count

Will generate:


Where:

sourcetype=murmur u_name=* authenticated : only search the appropriate sourcetype for events where u_name is populated and the word "authenticated" is present.

timechart count : charts all results using the very easy timechart command. Make sure you limit the search scope (below).

Set the search scope to 30 days, execute, then save to our existing Mumble Statistics dashboard as a panel named "Logins Per Day - Last 30 Days". I prefer this as a bar chart, which you can select at the dashboard level.
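A small variation worth considering: the search above counts login events, so one user reconnecting repeatedly inflates the chart. If you would rather count unique users per day, something like this (same assumptions as above) should work:

sourcetype=murmur u_name=* authenticated |timechart span=1d dc(u_name) as "Unique Users Per Day"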

Search 3: How much bandwidth did the server use in the last day?

host=hostname_here sourcetype="Perfmon:Network Interface"|eval DataSrc=case((instance=="Microsoft Hyper-V Network Adapter" AND counter=="Bytes Received/sec"),"ETH0bitsSecIN",(instance=="Microsoft Hyper-V Network Adapter" AND counter=="Bytes Sent/sec"),"ETH0bitsSecOUT")|eval Kbits_Sec=Value*.008| timechart span=5m avg(Kbits_Sec) by DataSrc| rename ETH0bitsSecIN as "ETH0 Kbits/sec IN" ETH0bitsSecOUT as "ETH0 Kbits/sec OUT"

Will generate:


Where:

host=hostname_here sourcetype="Perfmon:Network Interface" : Specify events to search. You will need to change the hostname and potentially the sourcetype depending on your host platform, etc.

eval DataSrc=case((instance=="Microsoft Hyper-V Network Adapter" AND counter=="Bytes Received/sec"),"ETH0bitsSecIN",(instance=="Microsoft Hyper-V Network Adapter" AND counter=="Bytes Sent/sec"),"ETH0bitsSecOUT") : Here is where we use eval to map interfaces to directions. If you have multiple interfaces you'll need to address them on this line, and note you will need to change the "names" of the adapters to match your data.

eval Kbits_Sec=Value*.008 : Convert Bytes/Sec to Kbits/Sec

timechart span=5m avg(Kbits_Sec) by DataSrc : Chart the data by interface direction on a 5 minute average. Note if you build a longer-term chart you'll need to widen the span so the number of data points stays low enough to chart.

rename ETH0bitsSecIN as "ETH0 Kbits/sec IN" ETH0bitsSecOUT as "ETH0 Kbits/sec OUT" : Rename the data points to match the names assigned in the eval statement above.

Set the search scope to 24 hours, execute, then save to our existing Mumble Statistics dashboard as a panel named "Network Traffic 5m Average Last 24 Hours".
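As an example of the span adjustment mentioned above, a seven-day version of the same panel might simply widen the average to an hourly span so the chart stays at a reasonable number of data points (roughly 168 in this case):

host=hostname_here sourcetype="Perfmon:Network Interface"|eval DataSrc=case((instance=="Microsoft Hyper-V Network Adapter" AND counter=="Bytes Received/sec"),"ETH0bitsSecIN",(instance=="Microsoft Hyper-V Network Adapter" AND counter=="Bytes Sent/sec"),"ETH0bitsSecOUT")|eval Kbits_Sec=Value*.008| timechart span=1h avg(Kbits_Sec) by DataSrc| rename ETH0bitsSecIN as "ETH0 Kbits/sec IN" ETH0bitsSecOUT as "ETH0 Kbits/sec OUT"

Set the search scope to 7 days if you go this route.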

How many users on the server? Plenty.

Those three should get you started; clearly there is substantially more one could do with all the data available to us. After you decide what else to add, make sure you go through your dashboard and reposition/edit each panel as necessary. Keep in mind that you can rename the x/y axes as well as change the way data is rendered. Hopefully this tutorial has demonstrated that even with a simple application you can use a tool like Splunk to make its day-to-day use and impact on an organization much more transparent. This methodology could easily be turned into an app and distributed throughout your enterprise and/or the Splunk community.
