Sunday, 31 January 2016

Real-time Oracle Database Monitoring Dashboard in ASP.NET

Introduction

Oracle Performance Dashboard (OPD) is a small ASP.NET website that shows you performance & problems of one or more Oracle instances in near real-time. It uses the Dynamic Performance Views (DPV) and runs some popular DBA scripts in order to get meaningful, easy to understand information out of the server. You can use it to quickly spot blocking queries, who is blocking who, expensive query that are consuming high CPU or disk, see if there's unusual locks, very high disk activity and so on.

Demo

You can see a live demo of this from here:
http://odp.omaralzabir.com

Get the code

The binaries are here, which you can just extract into an IIS folder, put the connection strings in the web.config file, and you are ready to roll. No need to install any Oracle client software on the server.
GitHub Project Binaries
You can get the source code from the GitHub project site:
https://github.com/oazabir/OraclePerformanceDashboard

Why not use OEM

OEM is pretty good, no doubt. But what I need is something that runs complex DBA scripts, processes the output and tells me what is wrong exactly. On standard monitoring tools like OEM, you get to see the key performance indicators and sometimes some generic alerts, but a much more powerful way to investigate problems in the server is to run those DBA scripts that you find in Oracle Performance Tuning books and blogs, that really get into the heart of the problem and shows you where things are going wrong. There are many powerful scripts available on the books, forums, blogs that are quite handy to detect issues on the server, gather relevant useful information about your database health. This tool tries to give those powerful scripts a nice front-end and easy to discover clues for those, who aren't professional Oracle DBAs like me.

Feature walkthrough

OPD comes with the following features in V1:
  • Summary of all your instances in one screen, showing important indicators on each instance. Quick way to check if all your databases are doing fine or not.
  • Instance Dashboard showing details of an instance:
    • CPU usage on the OS.
    • CPU consumed by each session
    • Important System Statistics like Buffer Gets, Parse to execute ratio which would indicate some common problems on the server.
    • Sessions and what the sessions are doing, how much resource they are consuming, whether they are hogging the disk or not.
    • Waits, Blocks, Locks, deadlocks that make database suffer.
    • Historical analysis on the databse showing you some very useful stuff:
      • Most expensive queries in terms of CPU and Buffer Get, which are immediate convern for your devs to sort out.
      • IO usage on data files. You can see if some data file is unusually loaded and getting hammered by physical IO.
      • Tablespace usage. Alerts you if some tablespace is over 85% full.
      • Stale stats on tables. You should always keep this clean.
      • Killer Indexes that will actually blow up your system and confuse Oracle query optimizer. You need to get rid of those indexes and rewrite queries that you thought will hit those indexes for better performance. They won't. They will kill your database.

Summary Screen

When you launch OPD, you see a quick summary of all your instances and indicators showing you whether there's something off in any of the instances. You can then click on the instance name and zoom into more details on each instance.
The summary is generated by running this long query which collects some important indicators from the server, does some calculation to decide whether something is off the chart or not.

SELECT 'Block' as Name, (select 
   nvl(sum(seconds_in_wait),0)
from 
   v$session
where 
   blocking_session is not NULL) as Value from Dual

union all

select 'Locks' as Name, (select    count(1)
from     v$session sn, 
    v$lock m    
where     
     ((sn.SID = m.SID and m.REQUEST != 0) 
or     (sn.SID = m.SID and m.REQUEST = 0 and LMODE != 4 and (ID1, ID2) in
        (select s.ID1, s.ID2 
         from     v$lock S 
         where     REQUEST != 0 
         and     s.ID1 = m.ID1 
         and     s.ID2 = m.ID2)))) as Value FROM Dual

union all

select 'Waits' as Name, (select count(1) FROM v$session_wait w, v$session s, dba_objects o WHERE  s.sid = w.sid AND w.p2 = o.object_id) as Value from Dual

union all

select 'Long op (sec rem)' as Name, (SELECT 
       nvl(sum(time_remaining),0) 
  FROM v$session_longops sl
INNER JOIN v$session s ON sl.SID = s.SID AND sl.SERIAL# = s.SERIAL#
WHERE time_remaining > 0) as Value FROM Dual

union all

select 'CPU' as Name, (
    with AASSTAT as (
      select
      decode(n.wait_class,'User I/O','User I/O',
      'Commit','Commit',
      'Wait') CLASS,
      sum(round(m.time_waited/m.INTSIZE_CSEC,3)) AAS,
      BEGIN_TIME ,
      END_TIME
      from v$waitclassmetric m,
      v$system_wait_class n
      where m.wait_class_id=n.wait_class_id
      and n.wait_class != 'Idle'
      group by decode(n.wait_class,'User I/O','User I/O', 'Commit','Commit', 'Wait'), BEGIN_TIME, END_TIME
      union
      select 'CPU_ORA_CONSUMED' CLASS,
      round(value/100,3) AAS,
      BEGIN_TIME ,
      END_TIME
      from v$sysmetric
      where metric_name='CPU Usage Per Sec'
      and group_id=2
      union
      select 'CPU_OS' CLASS ,
      round((prcnt.busy*parameter.cpu_count)/100,3) AAS,
      BEGIN_TIME ,
      END_TIME
      from
      ( select value busy, BEGIN_TIME,END_TIME from v$sysmetric where metric_name='Host CPU Utilization (%)' and group_id=2 ) prcnt,
      ( select value cpu_count from v$parameter where name='cpu_count' ) parameter
      union
      select
      'CPU_ORA_DEMAND' CLASS,
      nvl(round( sum(decode(session_state,'ON CPU',1,0))/60,2),0) AAS,
      cast(min(SAMPLE_TIME) as date) BEGIN_TIME ,
      cast(max(SAMPLE_TIME) as date) END_TIME
      from v$active_session_history ash
      where SAMPLE_TIME >= (select BEGIN_TIME from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2 )
      and SAMPLE_TIME < (select END_TIME from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2 )
    )
    select
      round(
        decode(sign(CPU_OS-CPU_ORA_CONSUMED), -1, 0, (CPU_OS - CPU_ORA_CONSUMED )) +
        CPU_ORA_CONSUMED +
        decode(sign(CPU_ORA_DEMAND-CPU_ORA_CONSUMED), -1, 0, (CPU_ORA_DEMAND - CPU_ORA_CONSUMED )) +
        COMMIT +
        READIO +
        WAIT
      , 2)   
      CPU_TOTAL
    from (
    select
      min(BEGIN_TIME) BEGIN_TIME,
      max(END_TIME) END_TIME,
      sum(decode(CLASS,'CPU_ORA_CONSUMED',AAS,0)) CPU_ORA_CONSUMED,
      sum(decode(CLASS,'CPU_ORA_DEMAND' ,AAS,0)) CPU_ORA_DEMAND,
      sum(decode(CLASS,'CPU_OS' ,AAS,0)) CPU_OS,
      sum(decode(CLASS,'Commit' ,AAS,0)) COMMIT,
      sum(decode(CLASS,'User I/O' ,AAS,0)) READIO,
      sum(decode(CLASS,'Wait' ,AAS,0)) WAIT
    from AASSTAT)

)
as Value FROM Dual

UNION ALL
select 'Space' as Name, (
  select count(1) FROM (
  SELECT df.tablespace_name,
         df.file_name,
         df.size_mb,
         f.free_mb,
         df.max_size_mb,
         f.free_mb + (df.max_size_mb - df.size_mb) AS max_free_mb,
         ROUND((df.max_size_mb-(f.free_mb + (df.max_size_mb - df.size_mb)))/max_size_mb,0) AS used_pct
  FROM   (SELECT file_id,
                 file_name,
                 tablespace_name,
                 TRUNC(bytes/1024/1024) AS size_mb,
                 TRUNC(GREATEST(bytes,maxbytes)/1024/1024) AS max_size_mb
          FROM   dba_data_files) df,
         (SELECT TRUNC(SUM(bytes)/1024/1024) AS free_mb,
                 file_id
          FROM dba_free_space
          GROUP BY file_id) f
  WHERE  df.file_id = f.file_id (+)
  ORDER BY df.tablespace_name,
           df.file_name
  ) where USED_PCT > 85) as Value from Dual
  
UNION ALL
  
select 'Invalid Objects' as Name, (SELECT count(1)
FROM   dba_objects
WHERE  status = 'INVALID') as Value FROM Dual

Instance Dashboard

When you zoom into an instance, you get the Dashboard for that instance.
The dashboard shows you the resource usage on the instance and some key system statistics. Here's the query used to collect various CPU counts from the server, which is run twice with one second interval and then the delta is calculated so derive the values for the chart:
SELECT systimestamp as DATETIME,
(SELECT value
  FROM v$osstat
  WHERE stat_name = 'NUM_CPUS') CPU,
          sum(decode(stat_name,'IDLE_TIME', value, NULL)) as idle_time,
           sum(decode(stat_name,'USER_TIME', value, NULL)) as user_time,
           sum(decode(stat_name,'SYS_TIME', value, NULL)) as sys_time,
           sum(decode(stat_name,'IOWAIT_TIME', value, NULL)) as iowait_time,
           sum(decode(stat_name,'NICE_TIME', value, NULL)) as nice_time           
    FROM v$osstat
    WHERE stat_name in ('IDLE_TIME','USER_TIME','SYS_TIME','IOWAIT_TIME','NICE_TIME')
    GROUP BY systimestamp

Resource graph

On the left, it shows the CPU usage on the OS and shows you how much CPU is used by user, sys, IO. Usually when there's a IO bottleneck, you will see the IO line goes up.
On the right, it shows the CPU consumed by each session. If some session is off the chart, you know that session is the culprit.
At the bottom, there are 4 key indicators. They will be red if they aren't optimal or at least what the best practice guideline says they should be. For example, Buffer Cache hit Ratio should be over 95%. You can see on the graph that sometimes it becomes green, because sometimes it is over 95%.

Session details

This shows details about the ACTIVE sessions. You can see if there's any session consuming high CPU, high physical read, hard parse and so on. The yellow warning indicator will come up if there's something off.
Here's the query that collects data for this:
SELECT 
      SID,
      v$session.STATUS,
      USERNAME || &#39; (&#39; || OSUSER || &#39;)&#39; &quot;USER&quot;, 
      MACHINE,
      Logon_time,      
      /*SYS.AUDIT_ACTIONS.NAME || ' ' || OBJECT_NAME &quot;Command&quot;,*/
      SYS.AUDIT_ACTIONS.NAME &quot;Command&quot;,
     (case 
        when v$session.STATUS = 'ACTIVE' and v$session.ROW_WAIT_OBJ# > 0 then (select OBJECT_NAME FROM dba_objects WHERE v$session.ROW_WAIT_OBJ# = dba_objects.object_ID)
        else ''
      end) &quot;Wait Object&quot;,
      sql_text,
         (SELECT ROUND (VALUE / 1024 / 1024, 2)
            FROM v$sesstat
           WHERE STATISTIC# = 25 AND v$sesstat.SID = v$session.sid)
            AS &quot;PGA Memory, in MB&quot;,
         (SELECT VALUE
            FROM v$sesstat
           WHERE STATISTIC# = 12 AND v$sesstat.SID = v$session.sid)
            AS &quot;CPU, used by session&quot;,
         ROUND ( (SELECT VALUE
                    FROM v$sesstat
                   WHERE STATISTIC# = 339 AND v$sesstat.SID = v$session.sid)
                / (SELECT DECODE (VALUE, 0, 1, VALUE)
                     FROM v$sesstat
                    WHERE STATISTIC# = 338 AND v$sesstat.SID = v$session.sid),
                2)
            AS &quot;Hard Parse, %&quot;,
         (SELECT ROUND (VALUE / 1024 / 1024, 2)
            FROM v$sesstat
           WHERE STATISTIC# = 58 AND v$sesstat.SID = v$session.sid)
            AS &quot;Physical read bytes, in MB&quot;,
         (SELECT ROUND (VALUE / 1024 / 1024, 2)
            FROM v$sesstat
           WHERE STATISTIC# = 66 AND v$sesstat.SID = v$session.sid)
            AS &quot;Physical write bytes, in MB&quot;,
         (SELECT ROUND (VALUE / 1024 / 1024, 2)
            FROM v$sesstat
           WHERE STATISTIC# = 139 AND v$sesstat.SID = v$session.sid)
            AS &quot;Redo size, in MB&quot;,
         (SELECT ROUND (VALUE / 1024 / 1024, 2)
            FROM v$sesstat
           WHERE STATISTIC# = 344 AND v$sesstat.SID = v$session.sid)
            AS &quot;Received from client, in MB&quot;,
         (SELECT ROUND (VALUE / 1024 / 1024, 2)
            FROM v$sesstat
           WHERE STATISTIC# = 343 AND v$sesstat.SID = v$session.sid)
            AS &quot;Sent to client, in MB&quot;,
         SERIAL#,
         PROGRAM,         
         OWNERID,
         PROCESS         
    FROM    v$session
         /*LEFT OUTER JOIN
            DBA_OBJECTS
         ON v$session.ROW_WAIT_OBJ# = dba_objects.object_ID*/
           
         LEFT OUTER JOIN
            v$sqlarea
         ON 
          v$session.sql_hash_value = v$sqlarea.hash_value 
          AND v$session.sql_address  = v$sqlarea.address  
        
         LEFT OUTER JOIN SYS.AUDIT_ACTIONS ON v$session.command = SYS.AUDIT_ACTIONS.action
         
   WHERE v$session.LOGON_TIME BETWEEN TRUNC (SYSDATE) AND SYSDATE
        AND
        v$session.sid != userenv('SESSIONID') 
        AND 
        rownum < 10
    ORDER BY 2, 10 DESC 
The output of this query is bound to a GridView, which then looks at each row and decides whether to show any warning or not:
<asp:GridView CssClass="table table-striped" ID="GridView1" runat="server" DataSourceID="sqlDataSource" EnableModelValidation="True">
    <Columns>
        <asp:TemplateField>
            <ItemTemplate>
                <%#HandleDbNull(Eval("Hard Parse, %"))>0 ? "<span class='label label-warning'>Hard Parse</span>" : ""%>
                <%#HandleDbNull(Eval("PGA Memory, in MB"))>10 ? "<span class='label label-warning'>PGA</span>" : ""%>
                <%#HandleDbNull(Eval("CPU, used by session"))>3000 ? "<span class='label label-warning'>CPU</span>" : ""%>
                <%#HandleDbNull(Eval("Physical read bytes, in MB"))>100 ? "<span class='label label-warning'>Physical Read</span>" : ""%>
                <%#HandleDbNull(Eval("Physical write bytes, in MB"))>100 ? "<span class='label label-warning'>Physical Write</span>" : ""%>
                <%#HandleDbNull(Eval("Redo size, in MB"))>10 ? "<span class='label label-warning'>Redo</span>" : ""%>
                
                
            </ItemTemplate>
        </asp:TemplateField>
    </Columns>                    
    
</asp:GridView>

Bad Indexes

This is very handy for developers because you can quickly see which indexes are bad by design, because you have columns in the index which have very low selectivity. For example, on the screenshot, you can see there are some killer index. Trying to use those indexes will actually kill your server. You can see the first example that there are just 7 unique values on column PRODUCT_ID, but there are over 5 million rows. So, each PRODUCT_ID has probably got over 1 million rows against it.
Some developer thought that there's a query that has WHERE PRODUCT_ID=X and thus created the index on PRODUCT_ID. But it is not going to work in this case.
Sometimes during development and testing, we do not have representative data from production database and thus we do not catch such issues. Only when we go-live and go down the next hour, we realize what we have done wrong. This view of Bad Indexes will help identify such wrong index designs.
Here's the query that finds these bad indexes:
select i.table_name,i.index_name,
(select rtrim (xmlagg (xmlelement (e, column_name || ', ')).extract ('//text()'), ',') column_names from USER_IND_COLUMNS where index_name=i.index_name) column_names,
TO_CHAR(ROUND((i.distinct_keys/(i.num_rows+1))*100, 3), '999.000') as Uniqueness,
case 
when i.distinct_keys/(i.num_rows+1) = 0 then 'IGNORE'
when i.distinct_keys/(i.num_rows+1) < 0.1 AND num_rows > 1000 then 'KILLER!'
when i.distinct_keys/(i.num_rows+1) < 0.1 AND num_rows < 1000 then 'IGNORE'
when i.distinct_keys/(i.num_rows+1) < 0.2 AND num_rows > 1000 then 'SERIOUSLY POOR'
when i.distinct_keys/(i.num_rows+1) < 0.5 AND num_rows > 1000 then 'POOR'
when i.distinct_keys/(i.num_rows+1) < 0.8 then 'OK'
when i.distinct_keys/(i.num_rows+1) < 0.9 then 'GOOD'
else 'VERY GOOD'
end Quality,
i.distinct_keys,i.num_rows,i.blevel,i.leaf_blocks,i.avg_leaf_blocks_per_key,i.avg_data_blocks_per_key from user_indexes i 
where i.table_name in (select table_name from user_tables) 
and i.num_rows > 0
group by table_name,i.index_name,i.distinct_keys,i.num_rows,i.blevel,i.leaf_blocks,i.avg_leaf_blocks_per_key,i.avg_data_blocks_per_key
order by Uniqueness, num_rows desc

Locks, Waits

Locks and waits are common culprits causing performance degradation on the servers. You can quickly see from this tool what locks are being held, and what are the expensive waits.
It will show you the exact query holding lock and causing contention or even deadlocks. Here you see that session ID1 and ID2 grabbing exclusive lock and getting into a deadlock.

How it works

Look ma, no AJAX!

You will notice that the panels are refreshing periodically. You might think I am using AJAX to call some serverside web service in order to get JSON/XML response, and then use some jQuery template to render the html output. Nope. I am using what our ancestors have happily used for generations. The natural, organic IFRAME solution, with no side effect. The html output for each panel comes from individual ASP.NET pages, via IFRAMEs and then they get injected into a DIV on the main Dashboard page.
There are several benefits to this approach:
  • The widgets are individual pages, which user can browse directly in full browser view.
  • Each widget is a plain ASP.NET page. No need to build webservices to return data in JSON/XML format. No need for any entity classes either that you usually use to serialize into JSON/XML.
  • The HTML content is generated server side, using regular ASP.NET. Thus there's no need to use any Javascript based HTML templating library.
  • As there's no need for AJAX or html templating, there's no need to worry about jQuery or its plugin breaking compatibility in new versions, and updating javascript libraries regularly.
Let's see how to do this. First the HTML markup to draw the panels:
<div class="row">
    <div class="panel panel-success">
        <div class="panel-heading"><a href="WhoIsActive.aspx?c=<%= ConnectionString %>">What's going on</a></div>
        <div class="panel-body panel-body-height" id="WhoIsActive">
            <div class="progress">
                <div class="progress-bar progress-bar-striped" style="width: 60%"><span class="sr-only">100% Complete</span></div>
            </div>
        </div>
        <iframe class="content_loader" onload="setContent(this, 'WhoIsActive')" src="WhoIsActive.aspx?c=<%= ConnectionString %>" style="width: 100%; height: 100%; border: none; display: none" frameborder="0"></iframe>
    </div>
</div>
This is the markup taken from the Twitter Bootstrap theme.
You will notice there's an invisible IFRAME there. When the IFRAME loads, it calls the setContent function. That function takes the whole content of the IFRAME and injects inside the panel-body div.
function setContent(iframe, id) {
    ...
    $('#' + id)
        .html($(iframe).contents().find("form").html())
        .dblclick(function () {
            iframe.contentWindow.location.reload();
        })
        ...
}
There you go, clean AJAX-like solution without any AJAX: no XMLHTTP, no JSON plumbing, no HTML templating, no server-side webservice.
Now this would not work for any event handler that is hooked inside the IFRAME. So, how does the click on a query show the popup window with the full query? Also if it was an IFRAME, shouldn't the popup actually come inside the IFRAME?
The click functionality is done on the main Dashboard page. After injecting the content into the DIV, it hooks the click handlers that shows the popup on the main page:
function setContent(iframe, id) {
    $('#' + id)
        .find('td.large-cell').off('click');

    if ($('#' + id).scrollLeft() == 0) {
        $('#' + id)
            .html($(iframe).contents().find("form").html())
            .dblclick(function () {
                iframe.contentWindow.location.reload();
            })
            .find('td.large-cell').find('div').click(function () {
                $('#content_text').text($(this).html());
                $('#basic-modal-content').modal();
            });
    }
Here it looks for any <TD> having the class large-cell. It then hooks the click even on it and shows the modal dialog box. The modal dialog box is from Eric Martin's SimpleModal plugin.

Plotting the charts

The chart uses the jQuery plugin Flot to render some of the performance counters as running charts.
There's a SysStats.aspx which is responsible for rendering the table showing the performance counters. It picks some important counters, and marks them to appear on the chart. First it runs through the table, looking for the counters, and marks the label of the counter as x-axis and value of the counter as y-axis:
var plot = ["Buffer cache hit ratio (95+)", "Parse CPU to total CPU ratio (<30)", "Shared pool free (~0)", "Parse to execute ratio (<0.3)"];
$('td').each(function (i, e) {
    td = $(e);
    if (td.text().trim().length > 0) {
        for (var i = 0; i < plot.length; i ++) {
            if (plot[i] == td.text().trim()) {
                td.addClass("x-axis");
                td.next().addClass("y-axis");
            }
        }
    }
})
Now this page is hosted on the Dashboard page inside an IFRAME. So, the Dashboard page scans the IFRAME content, looks for these labels, picks their values and passes to the Flot chart plugin:
$(iframe).contents().find("form").find(".x-axis").each(function (i, e) {
    var x = $(e);
    var y = x.next('.y-axis');
    var xname = x.text();
    var yvalue = parseInt(y.text());
    if (datasets[xname]) {
        var data = datasets[xname].data;
        data.pop();
        data.splice(0, 0, yvalue);
    }
});
Rest of the job of updating the Flot chart is done by the usual Flot code:
function updatePlot() {
    var index = 0;

    $.each(datasets, function (key, val) {
        var items = [];
        for (var i = 0; i < val.data.length; i++)
            items.push([i, val.data[i]]);

        var data = { color: val.color, data: items };

        if (plots[index] != null) {
            plot = plots[index];
            plot.setData([data]);
            plot.draw();
        }
        else {
            plot = $.plot("#placeholder" + (index + 1), [data], {
                series: {
                    //shadowSize: 0 // Drawing is faster without shadows
                },
                lines: { show: true, fill: true },
                grid: {
                    hoverable: true,
                    clickable: true
                },
                yaxis: {
                    min: 0,
                    max: val.ymax
                },
                xaxis: {
                    show: false
                }
            });
That's it! Again, no AJAX, no webservice, no html templating, no JSON plubing. Pure organic IFRAME and html.

No comments:

Post a Comment