Wednesday, January 23, 2013

SQL Part 2 - Counting and More

In the previous section, I only showed how to pull user-level data from the tables in question, which people usually aren't interested in because there are generally too many users for you to care about their individual actions. Suppose we wanted to know how many logins to WoW there were on January 1st. The query for this would be

SELECT
     count(*)
FROM logins
WHERE
     game_id=1 AND login_time::date='2013-1-1'1;


The count(*)2 function gives the number of rows that meet the criterion, and therefore (in this case) return one record with one column, with the value there being the count. You don't have to put a * in the parenthesis, you can also put a field in there. If we had put count(user_id) it would have counted the number of rows where user_id isn't NULL. NULL is a special value a field may have that means that no information was supplied. In this case, it would return the same result, since every login record should contain a user_id.

The above example is a bit silly, since if someone logged in multiple times on January 1st, they would have more than one record in logins on that date. Chances are we want a count of the distinct users that logged in on that particular date, which we would code as

SELECT
     count(distinct user_id)
FROM logins
WHERE
     game_id=1 AND login_time::date='2013-1-1';


Adding the distinct modifier inside the count() will have it count how many different user_ids appear in the selection. Combining all of this, we can figuring how many users logged into WoW on January 1st and how many times they logged in on that day, on average, as

SELECT
     count(distinct user_id) as "users",
     count(*)/count(distinct user_id) as "logins/user"
FROM logins
WHERE
     game_id=1 AND login_time::date='2013-1-1';


The first column will tell us how many people logged in on January 1st and the second column tells us the average number of times those users logged in on that date. It does so by taking the total number of logins and dividing by how many people did those logins. The 'as "X"' parts give our calculated columns names. This is something you don't have to do with a calculated column, since the DBMS will supply some generic name for it, but you should do. You can also do it for non-calculated columns. The double-quotes aren't necessarily except in some situations, like if you want a space in the name (eg: "logins per user").

This was a bit of a potpourri section to cover a couple topics before we dig into something more complicated next time.

Question: Given the char_info table outlined here, calculate the average number of quests that have been completed by a level 90 character. Hint: remember that an average is calculated as the sum of the values divided by how many values there are. Where count() counts how many non-NULL entries are in a selection, the sum() function will add up the values.

Answer: here.


1 The "::date" is necessary because login_time is a timestamp and we want to see if it's equal to a date. In short, the database considers  '2013-1-1 12:31:56' and '2013-1-1' to be different. '2013-1-1' is considered to be equal to '2013-1-1 00:00:00' (the midnight joining 2012-12-31 and 2013-1-1).
2 count() and many other similar functions that summarize data are called "aggregate functions".