Create a running average with SQL I was doing running averages on a flat file using Perl, and decided I had enough data to warrant changing to a database. I did some research, played around a bit, and finally figured out how to do running averages in SQL.
Running averages (moving averages) are very useful in statistics. The way they work is to replace a given value with the average of it and its predecessors. The more predecessors you use, the smoother your curve, but the more data that is hidden by the averaging.
For example, take the following 10 values:
15 25 8 98 17 22 100 35 65 76
If these averages are just plotted, it is difficult to see any trends in them. Are the values going up, on average, or staying the same.
However, a three period running average will give you a smoother curve. To create a three point running average, simply replace each value with the average of itself and its two predecessors.
15 15.0 just itself, since it has no predecessors 25 20.0 average of 20 and 15 8 16.0 now we are really working. Average of 8, 25 and 15 98 43.7 Average of 98, 8 and 25 17 41.0 22 45.7 100 46.3 35 52.3 65 66.7 76 58.7
What was basically a random looking string of numbers is now showing a clear trend towards increasing. Actually, with a little study you could have seen this, but image hundreds, or thousands of data points.
The example I have is a simple e-mail statistic storage. This is real life tool. I manage multiple mail servers, and use pflogsumm to summarize the throughput on the Postfix server every morning. While I am interested in spikes (generally spam storms), I am also interested in "are my clients increasing or decreasing their use". This allows me to tell whether I need to add more resources to e-mail.
Viewing a plot is the easiest way to see trends. Plots showing every daily point are good for showing whether the spammers are attacking again, but these plots are radically different from day to day. As a simple example, weekends show little use on a commercial account, but the users tend to make up for it on Mondays. To easily view overall trends, I wanted to smooth out the daily spikes and see things on a weekly basis. A 7 point running average shows that to me quite easily.
My primary table is named mailstats, and has only four columns: reportDate, server_id, statsCategory_id and value. Value is the number coming from pflogsumm, reportDate is the date of the report, and server_id is a foreign key into a server table. I do not know that the stats I am saving are the only ones I will want in the future, so instead of creating a separate column for bytes received, bytes sent, count recieved,and count sent, I created another table that tells what stat I am storing. The key into that table is statsCategory_id.
To get the runnign total, simply join the table back into itself, filtering the second instance of the table by the date range you want to average. If you want to do a 7 day running average like I did, the second table should return the average of the current value and the previous six dates.
/* basic seven day running average */ select t1.reportDate, t1.server_id, t1.statsCategory_id, avg(t2.value) from mailstats t1 left join mailstats t2 on t1.server_id = t2.server_id and t1.statsCategory_id = t2.statsCategory_id and t2.reportDate <= t1.reportDate and t2.reportDate > date_sub(t1.reportDate, interval 7 day) group by t1.reportDate, t1.server_id, t1.statsCategory_id
The only thing about the column list is that we specifically denote using the alias t1 for all columns except the average, which we get from the t2 alias. t1 and t2 are aliases into the same table, mailstats.
In the from clause, we set up the join of the table into itself, but we designate that t2 should only return the previous 7 days of data. Thus, for t1 on 2000-01-07, t2 will return all values found between 2000-01-01 and 2000-01-07.
The group by then returns a value for each date (from t1), but with the second column being the average of the previous seven days.
What I really wanted was to plot this with some additional, nice information such as the server name, the stats name (bytes received, count sent, etc...). The basic query for that follows. This query should be self expanatory as it is a simple three table join, with the servers and statsCategory tables being used simply to get the server name and the statistic name from the joined tables.
/* plot all points */ select serverName, reportDate, statsName, value from mailstats join statsCategory using (statsCategory_id) join servers using (server_id) order by reportDate, serverName, statsName
A simple solution now presents itself. We use the first query as a sub query in the from clause, substituting for the table mailstats in the second query. This results in a fairly efficient way to retrieve your data, with the added bonus that you can use the same code to plot it as you used with the standard three table join. Note that I named the subquery mailstats, so it looked like the previous query.
/* 7 day running average */ select serverName, reportDate, statsName, value from ( select t1.server_id,t1.statsCategory_id, t1.reportDate, avg(t2.value) value from mailstats t1 left join mailstats t2 on t1.server_id = t2.server_id and t1.statsCategory_id = t2.statsCategory_id and t2.reportDate <= t1.reportDate and t2.reportDate > date_sub(t1.reportDate, interval 7 day) group by t1.reportDate, t1.statsCategory_id, t1.server_id ) mailstats join statsCategory using (statsCategory_id) join servers using (server_id) order by reportDate, serverName,
Create a running average with SQL