A TPS at a university may be the payroll system for the faculty and staff or the student grade system. For each system you would need to determine the number of entries per year, and the size of the fields contained in each entry. For instance, consider a simple payroll system where the data generated each pay period are just the social security number and the amount paid per employee. The social security number is a 9 digit field and the amount paid might be an 8 digit field. Given these dimensions and an employee count of 2,000, the amount of data generate each year would be :
This is a very abbreviated look at a system, other fields would be included in most systems, and it is worth revisiting the data model on page 173 to get some ideas of these other fields.
The university may need a variety of data for decision making, coming from within the university and from external sources.
For instance, to plan for incoming students, demographic information would be important, as would drop out rate (internal data), dorm capacity (internal rate), and the faculty/staff availability.
To hire new faculty, the university may need information about average salaries for a particular field, current department size, the teaching load, number of students enrolled in that college, and average salary of current faculty members in that college.
There are many more decision making areas for the university and each will have its own needs for data (both internal and external).
Dirty data that may be found in a university's database system include:
A university may not need to revise its data warehouse frequently. Perhaps once term would be adequate.
Examples of verification questions:
Examples of data discovery questions:
Apply the basic prompts
Identifiers | When | Reservation date and time Flight date and time |
Where | Customer postal code (infer socioeconomic group) | |
What | Flight number Origin city Destination city Class of travel |
|
How | Reservation channel (phone, Web, travel agent) | |
Who | Frequent flyer status, Million miler status, demographic data | |
Why | Purpose of travel (business or leisure) | |
Variables | Outcomes | Ticket price Passenger costs Departure time differential Arrival time differential Miles Aircraft load factor Aircraft costs Flight complaints Flight bags lost Aircraft capacity |
Question | Method of analysis |
Which flights are on average more than 30 minutes late? | SQL |
Which flights are the most profitable? | analysis of variance |
What socioeconomic groups fly first class from Atlanta to Paris? | contingency table analysis |
Is there a relationship for a flight between the number of customer complaints and number of lost bags? | regression analysis |
What is the customer response (complain or not) to the length of the arrival delay? | logistic regression |
A telephone company needs your advice on the data it should include in its multidimensional database. It has an extensive relational database that captures details (e.g., calling and called phone numbers, time of day, cost, length of call) of every call. As well, it has access to extensive demographic data so it can allocate customers to one of 50 lifestyle categories. What data would you load into the multidimensional database? What aggregations would you use? It might help to identify initially the identifier and variable dimensions (use Table 15-6 on page 450).
Identifiers | When | Time of day, day of week, type of day (e.g., holiday) |
Where | Location of calling and called phone | |
What | Regular call, emergency, toll-free | |
How | Cell or land line | |
Who | Residential, business, family, lifestyle category | |
Variables | Outcomes | Length |
It is possible that data mining will uncover relationships that are not causal relationships. For instance, it may find that half of all the people who bought Spam also live in a specific zip code. The purchase of Spam may have nothing to do with zip code but instead be related to dwelling type (i.e. apartments). Therefore, targeting a specific zip code would not be as effective as targeting a dwelling type. To avoid such pitfalls, the data needs to be scrutinized by the analysts to ensure the conclusions made about data patterns are causal and not just coincidental. Also, do data mining on a portion of the database and test the findings against the remaining data to test that the relationship still holds.
Sum of REVENUE | ||||
WHERE | Catalog | Store | Web | Grand Total |
London |
50,310 | 151,015 | 13,009 | 214,334 |
New York | 8,712 | 28,060 | 2,351 | 39,123 |
Paris | 32,166 | 104,083 | 7,054 | 143,303 |
Sydney | 5,471 | 21,769 | 2,749 | 29,989 |
Tokyo | 12,103 | 42,610 | 2,003 | 56,716 |
Grand Total | 108,762 | 347,537 | 27,166 | 483,465 |
Continue to use the PivotTable Wizard to answer the following questions:
.01%
18.24%
WHEN | WHAT | QUANTITY | REVENUE |
1/9/97 | Hat - polar explorer | 4 | 136 |
1/9/97 | Hammock | 5 | 275 |
Total Value | 411 |
Qtr1 | 13,156 |
Qtr2 | 14,950 |
Qtr3 | 24,518 |
Qtr4 | 8,372 |
Jan | 0 |
Feb | 0 |
Mar | 5 |
Apr | 7 |
May | 2 |
Jun | 0 |
Jul | 1 |
Aug | 5 |
Sep | 2 |
Oct | 0 |
Nov | 0 |
Dec | 7 |
This page is part of the promotional and support
material for Data Management (open edition) by Richard T. Watson For questions and comments please contact the author |