Untitled Document

1.

Identify data captured by a TPS at your university. Estimate how much data are generated in a year.

A TPS at a university may be the payroll system for the faculty and staff or the student grade system. For each system you would need to determine the number of entries per year, and the size of the fields contained in each entry. For instance, consider a simple payroll system where the data generated each pay period are just the social security number and the amount paid per employee. The social security number is a 9 digit field and the amount paid might be an 8 digit field. Given these dimensions and an employee count of 2,000, the amount of data generate each year would be :

(9+8) * 2000 * 26 = 884,000

This is a very abbreviated look at a system, other fields would be included in most systems, and it is worth revisiting the data model on page 173 to get some ideas of these other fields.

2.

What data does your university need to support decision making? Does the data come from internal or external sources?

The university may need a variety of data for decision making, coming from within the university and from external sources.

For instance, to plan for incoming students, demographic information would be important, as would drop out rate (internal data), dorm capacity (internal rate), and the faculty/staff availability.

To hire new faculty, the university may need information about average salaries for a particular field, current department size, the teaching load, number of students enrolled in that college, and average salary of current faculty members in that college.

There are many more decision making areas for the university and each will have its own needs for data (both internal and external).

3.

If your university were to implement a data warehouse, what examples of dirty data might you expect to find?

Dirty data that may be found in a university's database system include:

Duplicate student records (one in the billing department, one in the grading system, and perhaps one in payroll);
Duplicate faculty records (one in the payroll and one in the grading system);
Multiple entries for the same student (for instance, if they changed their name or they took some time off and re-enrolled in the university);
Misuse of data fields (a database may have been designed prior to the widespread use of e-mail and used a non-populated field such as fax number for the e-mail addresses).

4.

How frequently do you think a university should refresh its data warehouse?

A university may not need to revise its data warehouse frequently. Perhaps once term would be adequate.

5.

Write five data verification questions for a university data warehouse.

Examples of verification questions:

What is the average GPA for MIS students?
What is the average age of a university student?
How many students have a 4.0 GPA?
How many students have a home address from the same state in which the university resides?
What is the average number of class hours a student registers for in a term?

6.

Write five data discovery questions for a university data warehouse.

Examples of data discovery questions:

What is the best predictor of GPA?
What is the best predictor of dropping out?
What is the profile of the best teachers in the university?
What is the best predictor of a student becoming a philosophy major?
What is the best predictor of the timeliness of student tuition being paid?

7.

Imagine you work as an analyst for a major global auto manufacturer. What techniques would you use for the following questions?

7a.

How do sports car buyers differ from other customers?

A discovery question using data mining.

7b.

How should the market for trucks be segmented?

Clustering.

7c.

Where does our major competitor have its dealers? d. How much money is a dealer likely to make from servicing a customer who buys a luxury car?

Visualization. An SQL query will give a list of the locations, but a map (visualization) is likely to give you a better feel for the competitor's impact.

7d.

How much money is a dealer likely to make from servicing a customer who buys a luxury car?

Calculate the predicted value of a customer.

7e.

What do people who buy midsize sedans have in common?

Find classification characteristics of people who buy midsize sedans.

7f.

What products do customers buy within six months of buying a new car?

Sequential pattern.

7g.

Who are the most likely prospects to buy a luxury car?

Classify customers into luxury and non-luxury car buyers, and then examine those attributes that discriminate between the two groups.

7h.

What were last year's sales of compacts in Europe by country and quarter?

Use a multidimensional database with a verification question.

7i.

We know a great deal about the sort of car a customer will buy based on demographic data (e.g., age, number of children, and type of job). What is a simple visual aid we can provide to sales personnel to help them show customers the car they are most likely to buy?

Decision tree. By asking a series of questions based on the decision tree, a salesperson could identify the car a person is most likely to buy.

8.

An international airline has commissioned you to design a multidimensional database for its marketing department. Choose identifier and variable dimensions (see the table in the chapter). List some of the analyses that could be performed against this database and the statistical technique that might be appropriate for them.

Apply the basic prompts

Identifiers	When	Reservation date and time Flight date and time
	Where	Customer postal code (infer socioeconomic group)
	What	Flight number Origin city Destination city Class of travel
	How	Reservation channel (phone, Web, travel agent)
	Who	Frequent flyer status, Million miler status, demographic data
	Why	Purpose of travel (business or leisure)
Variables	Outcomes	Ticket price Passenger costs Departure time differential Arrival time differential Miles Aircraft load factor Aircraft costs Flight complaints Flight bags lost Aircraft capacity

Question	Method of analysis
Which flights are on average more than 30 minutes late?	SQL
Which flights are the most profitable?	analysis of variance
What socioeconomic groups fly first class from Atlanta to Paris?	contingency table analysis
Is there a relationship for a flight between the number of customer complaints and number of lost bags?	regression analysis
What is the customer response (complain or not) to the length of the arrival delay?	logistic regression

9.

A telephone company needs your advice on the data it should include in its multidimensional database. It has an extensive relational database that captures details (e.g., calling and called phone numbers, time of day, cost, length of call) of every call. As well, it has access to extensive demographic data so it can allocate customers to one of 50 lifestyle categories. What data would you load into the multidimensional database? What aggregations would you use? It might help to identify initially the identifier and variable dimensions (use Table 15-6 on page 450).

Identifiers	When	Time of day, day of week, type of day (e.g., holiday)
	Where	Location of calling and called phone
	What	Regular call, emergency, toll-free
	How	Cell or land line
	Who	Residential, business, family, lifestyle category
Variables	Outcomes	Length Cost Revenue

10.

What are the possible dangers of data mining? How might you avoid these?

It is possible that data mining will uncover relationships that are not causal relationships. For instance, it may find that half of all the people who bought Spam also live in a specific zip code. The purchase of Spam may have nothing to do with zip code but instead be related to dwelling type (i.e. apartments). Therefore, targeting a specific zip code would not be as effective as targeting a dwelling type. To avoid such pitfalls, the data needs to be scrutinized by the analysts to ensure the conclusions made about data patterns are causal and not just coincidental. Also, do data mining on a portion of the database and test the findings against the remaining data to test that the relationship still holds.

11.

Download the file exped.xls from the book's Web site and open it with MS Excel. This file is a sample of 1,000 sales transactions for The Expeditioner. For each sale, there is a row recording when it was sold, where it was sold, what was sold, how it was sold, the quantity sold, and the sales revenue. Use the PivotTable Wizard (Data>PivotTable Report) to produce the following report:

Sum of REVENUE	HOW
WHERE	Catalog	Store	Web	Grand Total
London	50,310	151,015	13,009	214,334
New York	8,712	28,060	2,351	39,123
Paris	32,166	104,083	7,054	143,303
Sydney	5,471	21,769	2,749	29,989
Tokyo	12,103	42,610	2,003	56,716
Grand Total	108,762	347,537	27,166	483,465

Continue to use the PivotTable Wizard to answer the following questions:

11a.

What was the value of catalog sales for London in the first quarter?

7,810 (assuming first quarter is Jan.- March)

11b.

What percent of the total were Tokyo Web sales in the fourth quarter?

.01%

11c.

What percent of Sydney's annual sales were catalog sales?

18.24%

11d.

What was the value of catalog sales for London in January? Give details of the transactions.

WHEN	WHAT	QUANTITY	REVENUE
1/9/97	Hat - polar explorer	4	136
1/9/97	Hammock	5	275
	Total Value		411

11e.

What was the value of Camel saddle sales for Paris in 2002 by quarter?

Qtr1	13,156
Qtr2	14,950
Qtr3	24,518
Qtr4	8,372

11f.

How many Elephant polo sticks were sold in New York in each month of 2002?

Jan	0
Feb	0
Mar	5
Apr	7
May	2
Jun	0
Jul	1
Aug	5
Sep	2
Oct	0
Nov	0
Dec	7

14 - Organizational Intelligence

Answers to exercises

Date revised: 10-Dec-2021