Chapter 1:
Beginning of the End … Or the End of the Beginning?
The past few years have been challenging for Good Tunes & More (GT&M), a
business that traces its roots to Good Tunes, a store that exclusively sold music
CDs and vinyl records.
GT&M first broadened its merchandise to include home entertainment
and computer systems (the “More”), and then undertook an expansion to take
advantage of prime locations left empty by bankrupt former competitors. Today,
GT&M finds itself at a crossroads. Hoped-for increases in revenues that have
failed to occur and declining profit margins due to the competitive pressures of
online sellers have led management to reconsider the future of the business.
While some investors in the business have argued for an orderly retreat, closing
stores and limiting the variety of merchandise, GT&M CEO Emma Levia has decided
to “double down” and expand the business by purchasing Whitney Wireless, a
successful three-store chain that sells smartphones and other mobile devices.
Levia foresees creating a brand new “A-to-Z” electronics retailer but
first must establish a fair and reasonable price for the privately held Whitney
Wireless.
To do so, she has asked a group of analysts to identify the data that
would be helpful in setting a price for the wireless business. As part of that
group, you quickly realize that you need the data that would help to verify the
contents of the wireless company’s basic financial statements.
You focus on data associated with the company’s profit and loss statement
and quickly realize the need for sales and expense-related variables. You begin to
think about what the data for such variables would look like and how to collect those
data. You realize that you are starting to apply the DCOVA framework to the objective
of helping Levia acquire Whitney Wireless.
Chapter 1 Defining and Collecting Data
contents
1.1 Defining Variables
1.2 Collecting Data
1.3 Types of Sampling Methods
1.4 Types of Survey Errors
Think About This: New Media Surveys/Old Sampling Problems
Using Statistics: Beginning of the End … Revisited
Chapter 1 Excel Guide
Chapter 1 Minitab Guide
Objectives
Understand issues that arise when defining variables
How to define variables
How to collect data
Identify the different ways to collect a sample
Understand the types of survey errors
Business Statistics: A First Course, Seventh Edition, by David M. Levine, Kathryn A. Szabat, and David F. Stephan. Published by Pearson.
Copyright © 2016 by Pearson Education, Inc.
ISBN: 978-1-323-26258-0
1.1 Defining Variables
When Emma Levia decides to purchase Whitney Wireless, she has defined a new
goal or business objective for GT&M. Business objectives can arise from any
level of management and can be as varied as the following:
• A marketing analyst needs to assess the effectiveness of a new online advertising campaign.
• A pharmaceutical company needs to determine whether a new drug is more effective
than those currently in use.
• An operations manager wants to improve a manufacturing or service process.
• An auditor needs to review a company’s financial transactions to determine whether the
company is in compliance with generally accepted accounting principles.
Establishing an objective marks the end of a problem definition process. This end triggers
the new process of identifying the correct data to support the objective. In the GT&M scenario,
having decided to buy Whitney Wireless, Levia needs to identify the data that would be helpful
in setting a price for the wireless business. This process of identifying the correct data triggers
the start of applying the tasks of the DCOVA framework. In other words, the end of problem
definition marks the beginning of applying statistics to business decision making.
Identifying the correct data to support a business objective is a two-part job that requires
defining variables and collecting the data for those variables. These tasks are the first two tasks
of the DCOVA framework, first defined in Section GS.1, and can be restated here as:
• Define the variables that you want to study to solve a problem or meet an objective.
• Collect the data for those variables from appropriate sources.
This chapter discusses these two tasks, which must always be done before the Organize, Visualize,
and Analyze tasks.
Defining variables at first may seem to be the simple process of making the list of things one
needs to help solve a problem or meet an objective. However, consider the GT&M scenario.
Most would quickly agree that yearly sales of Whitney Wireless would be part of the data
needed to meet Levia’s objective, but just placing “yearly sales” on a list could lead to confusion
and miscommunication: Does this variable refer to sales per year for the entire chain or
for individual stores? Does the variable refer to net or gross sales? Are the yearly sales values
expressed in number of units or as currency amounts such as U.S. dollar sales?
These questions illustrate that for each variable of interest that you identify, you must supply
an operational definition, a universally accepted meaning that is clear to all associated
with an analysis. Operational definitions should also classify the variable, as explained in the
next section, and may include additional facts such as units of measure, allowed range of
values, and definitions of specific variable values, depending on how the variable is classified.
Classifying Variables by Type
When you operationally define a variable, you must classify the variable as being either categorical
or numerical. Categorical variables (also known as qualitative variables) take categories
as their values. Numerical variables (also known as quantitative variables) have values
that represent a counted or measured quantity. Classification also affects a variable’s operational
definition, and getting the classification correct is important because certain statistical methods
can be applied correctly only to one type or the other, while other methods may need a specific mix
of variable types.
Categorical variables can take the form of yes-and-no questions such as “Do you have a
Twitter account?” (in which yes and no form the variable’s two categories) or describe a trait
or characteristic that has many categories such as undergraduate class standing (which might
have the defined categories freshman, sophomore, junior, and senior). When defining a categorical
variable, the list of permissible category values must be included and each category
value should be defined, too, e.g., that a “freshman” is a student who has completed fewer
than 32 credit hours. Overlooking these requirements can lead to confusion and incorrect data
collection. In one famous example, when persons were asked by researchers to fill in a value
for the categorical variable sex, many answered yes and not male or female, the values that the
researchers intended. (Perhaps this is the reason that gender has replaced sex on many data collection
forms: gender’s operational definition is more self-apparent.)
Student Tip
Providing operational
definitions for concepts
is important, too, when
writing a textbook! The
end-of-chapter Key
Terms gives you an index
of operational definitions
and the most fundamental
definitions are
presented in boxes such
as the page 3 box that
defines variable and data.
The operational definitions of numerical variables are affected by whether the variable being
defined is discrete or continuous. Discrete variables such as “number of items purchased”
or “total amount paid” are numerical values that arise from a counting process. Continuous
variables such as “time spent on checkout line” or “distance from home to store” have numerical
values that arise from a measuring process and those values depend on the precision of the
measuring instrument used. For example, “time spent on checkout line” might be 2, 2.1, 2.14,
or 2.143 minutes, depending on the precision of the timing instrument being used. Units of
measure and the level of precision should be part of the operational definitions of continuous
variables, e.g., “tenths of a second” for “time spent on checkout line.” The definitions of any
numerical variable can include the allowed range of values, such as “must be greater than 0”
for “number of items purchased.”
When defining variables for survey collection (discussed in Section 1.2), thinking about
the responses you seek helps classify variables as Table 1.1 demonstrates. Thinking about how
a variable will be used to solve a problem or meet an objective can also be helpful when you
define a variable. The variable age might be a numerical (discrete) variable in some cases or
might be categorical with categories such as child, young adult, middle-aged, and retirement
aged in other contexts.
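The parts of an operational definition described in this section can be written down as a small record. The following Python sketch is only an illustration: the class name VariableDefinition, its field names, and the accepts method are hypothetical and do not come from this book, Excel, or Minitab.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VariableDefinition:
    """Hypothetical record holding one operational definition."""
    name: str
    meaning: str                          # the agreed-upon meaning
    kind: str                             # "categorical" or "numerical"
    subtype: Optional[str] = None         # "discrete" or "continuous"
    units: Optional[str] = None           # units of measure, if any
    categories: Tuple[str, ...] = ()      # permissible category values
    greater_than: Optional[float] = None  # exclusive lower limit, if any

    def accepts(self, value) -> bool:
        """Return True if value is allowed by this definition."""
        if self.kind == "categorical":
            return value in self.categories
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if self.subtype == "discrete" and not float(value).is_integer():
            return False  # counts must be whole numbers
        return self.greater_than is None or value > self.greater_than


items_purchased = VariableDefinition(
    name="number of items purchased",
    meaning="count of items in one sales transaction",
    kind="numerical",
    subtype="discrete",
    units="items",
    greater_than=0,
)

class_standing = VariableDefinition(
    name="undergraduate class standing",
    meaning="standing based on completed credit hours",
    kind="categorical",
    categories=("freshman", "sophomore", "junior", "senior"),
)

print(items_purchased.accepts(3))         # allowed count
print(items_purchased.accepts(0))         # violates "must be greater than 0"
print(class_standing.accepts("junior"))   # a defined category
```

Writing the definition down this way forces you to state the classification, the units, and the permissible values before any data are collected.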
Problems for Section 1.1
Learning the Basics
1.1 Four different beverages are sold at a fast-food restaurant:
soft drinks, tea, coffee, and bottled water. Explain why the
type of beverage sold is an example of a categorical variable.
1.2 U.S. businesses are listed by size: small, medium, and large. Explain
why business size is an example of a categorical variable.
1.3 The time it takes to download a video from the Internet is
measured. Explain why the download time is a continuous
numerical variable.
Applying the Concepts
SELF Test
1.4 For each of the following variables, determine
whether the variable is categorical or numerical. If the
variable is numerical, determine whether the variable is discrete or
continuous.
a. Number of cellphones in the household
b. Monthly data usage (in MB)
c. Number of text messages exchanged per month
d. Voice usage per month (in minutes)
e. Whether the cellphone is used for email
1.5 The following information is collected from students upon
exiting the campus bookstore during the first week of classes.
a. Amount of time spent shopping in the bookstore
b. Number of textbooks purchased
c. Academic major
d. Gender
Classify each of these variables as categorical or numerical. If the
variable is numerical, determine whether the variable is discrete or
continuous.
1.6 For each of the following variables, determine whether the
variable is categorical or numerical. If the variable is numerical,
determine whether the variable is discrete or continuous.
a. Name of Internet service provider
b. Time, in hours, spent surfing the Internet per week
c. Whether the individual uses a mobile phone to connect to the
Internet
d. Number of online purchases made in a month
e. Where the individual uses social networks to find sought-after
information
Learn More
Read the Short Takes for
Chapter 1 for more examples
of classifying variables as either
categorical or numerical.
Table 1.1
Identifying Types of Variables

Question                                                       Responses         Variable Type
Do you have a Facebook profile?                                ❑ Yes ❑ No        Categorical
How many text messages have you sent in the past three days?   ______            Numerical (discrete)
How long did the mobile app update take to download?           ______ seconds    Numerical (continuous)
1.7 For each of the following variables, determine whether the
variable is categorical or numerical. If the variable is numerical,
determine whether the variable is discrete or continuous.
a. Amount of money spent on clothing in the past month
b. Favorite department store
c. Most likely time period during which shopping for clothing
takes place (weekday, weeknight, or weekend)
d. Number of pairs of shoes owned
1.8 Suppose the following information is collected from Robert
Keeler on his application for a home mortgage loan at the Metro
County Savings and Loan Association.
a. Monthly payments: $2,227
b. Number of jobs in past 10 years: 1
c. Annual family income: $96,000
d. Marital status: Married
Classify each of the responses by type of data.
1.9 One of the variables most often included in surveys is income.
Sometimes the question is phrased “What is your income
(in thousands of dollars)?” In other surveys, the respondent is
asked to “Select the circle corresponding to your income level”
and is given a number of income ranges to choose from.
a. In the first format, explain why income might be considered
either discrete or continuous.
b. Which of these two formats would you prefer to use if you
were conducting a survey? Why?
1.10 If two students score a 90 on the same examination,
what arguments could be used to show that the underlying
variable—test score—is continuous?
1.11 The director of market research at a large department store
chain wanted to conduct a survey throughout a metropolitan area
to determine the amount of time working women spend shopping
for clothing in a typical month.
a. Indicate the type of data the director might want to collect.
b. Develop a first draft of the questionnaire needed in (a) by writing
three categorical questions and three numerical questions
that you feel would be appropriate for this survey.
1.2 Collecting Data
After defining the variables that you want to study, you can proceed with the data collection
task. Collecting data is a critical task because if you collect data that are flawed by biases,
ambiguities, or other types of errors, the results you will get from using such data with even
the most sophisticated statistical methods will be suspect or in error. (For a famous example of
flawed data collection leading to incorrect results, read the Think About This essay on page 21.)
Data collection consists of identifying data sources, deciding whether the data you collect
will be from a population or a sample, cleaning your data, and sometimes recoding variables.
The rest of this section explains these aspects of data collection.
Data Sources
You collect data from either primary or secondary data sources. You are using a primary data
source if you collect your own data for analysis. You are using a secondary data source if the
data for your analysis have been collected by someone else.
You collect data by using any of the following:
• Data distributed by an organization or individual
• The outcomes of a designed experiment
• The responses from a survey
• The results of conducting an observational study
• Data collected by ongoing business activities
Market research companies and trade associations distribute data pertaining to specific industries
or markets. Investment services provide business and financial data on publicly listed
companies. Syndicated services such as The Nielsen Company provide consumer research data to
telecom and mobile media companies. Print and online media companies also distribute data that
they may have collected themselves or may be republishing from other sources.
The outcomes of a designed experiment are a second data source. For example, a consumer
electronics company might conduct an experiment that compares the sales of mobile
electronics merchandise for different store locations. Note that developing a proper experimental
design is mostly beyond the scope of this book, but Chapter 10 discusses some of the
fundamental experimental design concepts.
Survey responses represent a third type of data source. People being surveyed are asked
questions about their beliefs, attitudes, behaviors, and other characteristics. For example,
people could be asked which store location for mobile electronics merchandise is preferable.
(Such a survey could lead to data that differ from the data collected from the outcomes of the
designed experiment of the previous paragraph.) Surveys can be affected by any of the four
types of errors that are discussed in Section 1.4.
Observational study results are a fourth data source. A researcher collects data by directly
observing a behavior, usually in a natural or neutral setting. Observational studies are a common
tool for data collection in business. For example, market researchers use focus groups
to elicit unstructured responses to open-ended questions posed by a moderator to a target audience.
Observational studies are also commonly used to enhance teamwork or improve the
quality of products and services.
Data collected by ongoing business activities are a fifth data source. Such data can be
collected from operational and transactional systems that exist in both physical “bricks-and-mortar”
and online settings but can also be gathered from secondary sources such as third-party
social media networks and online apps and website services that collect tracking and usage data.
For example, a bank might analyze a decade’s worth of financial transaction data to identify
patterns of fraud, and a marketer might use tracking data to determine the effectiveness of a
website.
Sources for big data (see Section GS.3) tend to be a mix of primary and secondary sources
of this last type. For example, a retailer interested in increasing sales might mine Facebook and
Twitter accounts to identify sentiment about certain products or to pinpoint top influencers and
then match those data to its own data collected during customer transactions.
Populations and Samples
You collect your data from either a population or a sample. A population consists of all the
items or individuals about which you want to reach conclusions. All the GT&M sales transactions
for a specific year, all the full-time students enrolled in a college, and all the registered
voters in Ohio are examples of populations. In Chapter 3, you will learn that when you analyze
data from a population you compute parameters.
A sample is a portion of a population selected for analysis. The results of analyzing a
sample are used to estimate characteristics of the entire population. From the three examples
of populations just given, you could select a sample of 200 GT&M sales transactions randomly
selected by an auditor for study, a sample of 50 full-time students selected for a marketing
study, and a sample of 500 registered voters in Ohio contacted via telephone for a political
poll. In each of these examples, the transactions or people in the sample represent a portion of
the items or individuals that make up the population. In Chapter 3, you will learn that when
you analyze data from a sample you compute statistics.
You collect data from a sample when any of the following applies:
• Selecting a sample is less time consuming than selecting every item in the population.
• Selecting a sample is less costly than selecting every item in the population.
• Analyzing a sample is less cumbersome and more practical than analyzing the entire
population.
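To make the distinction between a population and a sample concrete, the following Python sketch draws a sample of 200 transactions from a population and computes the same summary for both. The transaction amounts are synthetic values generated only for illustration; they are not GT&M data, and the population size of 10,000 is an assumption.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Synthetic population: 10,000 made-up transaction amounts in dollars.
population = [round(rng.uniform(5, 500), 2) for _ in range(10_000)]

# A summary computed from every item in the population is a parameter.
parameter = mean(population)

# Select 200 transactions at random, without replacement.
sample = rng.sample(population, 200)

# The same summary computed from the sample is a statistic,
# which is used to estimate the parameter.
statistic = mean(sample)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The sample mean will usually differ somewhat from the population mean, but it is obtained from 200 items instead of 10,000.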
Structured Versus Unstructured Data
The data you collect may be formatted in a variety of ways, some of which add to the data
collection task. For example, suppose that you wanted to collect electronic financial data
about a sample of companies. That data might exist as tables of data, the contents of standardized
documents such as fill-in-the-blank surveys, a continuous stream of data such as a
stock ticker, or text messages or emails delivered from email systems or social media websites.
Some of these forms, such as a set of text messages, have very little or no repeating
structure and are examples of unstructured data. Although unstructured data can form a
part of a big data collection, collecting data in unstructured forms for the statistical methods
discussed in this book requires conversion of the data to a structured form. For example,
after collecting text messages, you could convert their contents to a structured form by defining
a set of variables that might include a numerical variable that counts the number of
words in the message and various categorical variables that help classify the content of the
message.
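The conversion of text messages to a structured form can be sketched in a few lines of Python. The messages, the variable names word_count, mentions_price, and is_question, and the list of price-related words are all hypothetical choices made for this illustration; a real analysis would need its own operational definitions.

```python
# Hypothetical words that signal a message about price.
PRICE_WORDS = {"price", "cost", "expensive", "cheap"}


def to_record(text: str) -> dict:
    """Convert one unstructured message into a row of defined variables."""
    words = text.split()
    cleaned = {w.strip(".,!?$").lower() for w in words}
    return {
        # numerical (discrete) variable: number of words in the message
        "word_count": len(words),
        # categorical variables that help classify the content
        "mentions_price": "yes" if cleaned & PRICE_WORDS else "no",
        "is_question": "yes" if text.strip().endswith("?") else "no",
    }


messages = [
    "How much does the new phone cost?",
    "Love the store near campus!",
]

table = [to_record(m) for m in messages]
for row in table:
    print(row)
```

Each message becomes one row with the same three variables, which is the repeating structure that the statistical methods in this book require.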
Learn More
Read the Short Takes
for Chapter 1 for a further
discussion about data
sources.
Student Tip
To help remember the
difference between a
sample and a population,
think of a pie. The
entire pie represents the
population, and the pie
slice that you select is
the sample.
Electronic Formats and Encodings
The same form of data can exist in more than one electronic format, with some formats more
immediately usable than others. For example, a table of data might exist as a scanned image
or as data in a worksheet file. The worksheet data could be immediately used in a statistical
analysis, but the scanned image would need to be first converted to worksheet data using a
character-scanning program that can recognize numbers in an image.
Data can also be encoded in more than one way, as you may have learned in an information
systems course. Different encodings may affect the recorded precision of values for
continuous variables and lead to less precise values or to values that convey a false sense of
precision, such as a time measurement that gets encoded in ten-thousandths of a second when
the original measurement was only in tenths of a second. This changed precision can violate
the operational definition of a continuous variable and sometimes affect calculated results.
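One way to see how an encoding can convey a false sense of precision is to store a value in a 32-bit floating-point format and read it back. The Python sketch below assumes a time of 2.1 seconds measured only to tenths of a second; the choice of a 32-bit encoding is an assumption made for illustration and is not tied to Excel or Minitab.

```python
import struct

measured = 2.1  # seconds, measured only to tenths of a second

# Encode the value as a 32-bit float and decode it again,
# imitating a file format that stores numbers in single precision.
stored = struct.unpack("<f", struct.pack("<f", measured))[0]

# The decoded value displays many more digits than were ever measured.
print(repr(stored))

# Re-applying the precision from the operational definition
# recovers the value as it was originally recorded.
restored = round(stored, 1)
print(restored)
```

The extra digits in the decoded value are a by-product of the encoding, not additional information about the measurement.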
Data Cleaning
Whatever ways you choose to collect data, you may find irregularities in the values you collect
such as undefined or impossible values. For a categorical variable, an undefined value would
be a value that does not represent one of the categories defined for the variable. For a numerical
variable, an impossible value would be a value that falls outside a defined range of possible
values for the variable. For a numerical variable without a defined range of possible values,
you might also find outliers, values that seem excessively different from most of the rest of the
values. Such values may or may not be errors, but they demand a second review.
Values that are missing are another type of irregularity. A missing value is a value that
could not be collected (and is therefore not available for analysis). For example, you would
record a nonresponse to a survey question as a missing value. You can represent missing values
in Minitab by using an asterisk value for a numerical variable or by using a blank value for a
categorical variable, and such values will be properly excluded from analysis. The more limited
Excel has no special values that represent a missing value. When using Excel, you must
find and then exclude missing values manually.
When you spot an irregularity in the data you have collected, you may have to “clean” the
data. Although a full discussion of data cleaning is beyond the scope of this book (see reference
8), you can learn more about the ways you can use Excel or Minitab for data cleaning in
the Short Takes for Chapter 1.
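The irregularities described in this section can be flagged automatically before any cleaning decisions are made. The Python sketch below is a minimal illustration: the function find_irregularities is hypothetical, and it assumes that a missing value has been recorded as None. It only reports irregularities and leaves the decision about each one to you.

```python
def find_irregularities(values, *, categories=None, low=None, high=None):
    """Return (position, problem) pairs for irregular values.

    Pass categories for a categorical variable, or low and/or high
    (inclusive limits) for a numerical variable with a defined range.
    """
    problems = []
    for position, value in enumerate(values):
        if value is None:
            problems.append((position, "missing"))
        elif categories is not None:
            if value not in categories:
                problems.append((position, "undefined category"))
        elif (low is not None and value < low) or (
            high is not None and value > high
        ):
            problems.append((position, "impossible value"))
    return problems


standing = ["freshman", "yes", None, "senior"]
print(find_irregularities(
    standing,
    categories={"freshman", "sophomore", "junior", "senior"},
))

items_purchased = [3, -1, None, 12]
print(find_irregularities(items_purchased, low=1))
```

Outliers are not handled here because identifying them requires judgment about what counts as excessively different from the rest of the values.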
Recoding Variables
After you have collected data, you may discover that you need to reconsider the categories that
you have defined for a categorical variable or that you need to transform a numerical variable
into a categorical variable by assigning the individual numeric data values to one of several
groups. In either case, you can define a recoded variable that supplements or replaces the
original variable in your analysis.
For example, having already defined the variable undergraduate class standing with the categories
freshman, sophomore, junior, and senior, you realize that you are more interested in investigating
the differences between lowerclassmen (defined as freshman or sophomore) and upperclassmen
(junior or senior). You can create a new variable UpperLower and assign the value Upper if a
student is a junior or senior and assign the value Lower if the student is a freshman or sophomore.
When recoding variables, be sure that the category definitions cause each data value to
be placed in one and only one category, a property known as being mutually exclusive. Also
ensure that the set of categories you create for the new, recoded variable includes all the data
values being recoded, a property known as being collectively exhaustive. If you are recoding
a categorical variable, you can preserve one or more of the original categories, as long as your
recodings are both mutually exclusive and collectively exhaustive.
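The UpperLower recoding can be sketched in Python as a lookup table. Because each original category appears once as a key and maps to exactly one new category, the recoding is mutually exclusive by construction; the check inside the hypothetical function recode_standing guards the collectively exhaustive property. The sample values are made up for illustration.

```python
# Each original category maps to exactly one recoded category.
RECODE = {
    "freshman": "Lower",
    "sophomore": "Lower",
    "junior": "Upper",
    "senior": "Upper",
}


def recode_standing(values):
    """Return the UpperLower value for each class standing value."""
    # Collectively exhaustive check: every value must have a new category.
    not_covered = sorted(set(values) - set(RECODE))
    if not_covered:
        raise ValueError(f"values not covered by the recoding: {not_covered}")
    return [RECODE[value] for value in values]


standings = ["junior", "freshman", "senior", "sophomore", "freshman"]
upper_lower = recode_standing(standings)
print(upper_lower)
```

Keeping the original variable alongside the recoded one lets the new variable supplement, rather than replace, the data you collected.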
When recoding numerical variables, pay particular attention to the operational definitions
of the categories you create for the recoded variable, especially if the categories are not self-defining
ranges. For example, while the recoded categories Under 12, 12–20, 21–34, 35–54,
and 55 and Over are self-defining for age, the categories Child, Youth, Young Adult, Middle
Aged, and Senior need their own operational definitions.
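Recoding a numerical variable into self-defining ranges can be sketched as follows. This Python example assumes that age is recorded in completed years as a whole number; the function name age_group is hypothetical. Storing only the lower limit of each range after the first makes the ranges mutually exclusive and collectively exhaustive for any non-negative age.

```python
from bisect import bisect_right

# Lower limits of the second through fifth categories.
LOWER_LIMITS = [12, 21, 35, 55]
LABELS = ["Under 12", "12–20", "21–34", "35–54", "55 and Over"]


def age_group(age: int) -> str:
    """Return the recoded category for an age in completed years."""
    if age < 0:
        raise ValueError("age cannot be negative")
    # bisect_right counts how many lower limits are <= age,
    # which is the position of the matching label.
    return LABELS[bisect_right(LOWER_LIMITS, age)]


for age in (11, 12, 20, 21, 54, 55):
    print(age, age_group(age))
```

Categories such as Child or Young Adult could reuse the same limits, but only after their operational definitions state which ages each one covers.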
Student Tip
While encoding issues
go beyond the scope
of this book, the Short
Takes for Chapter 1
includes an experiment
that you can perform in
either Microsoft Excel
or Minitab that illustrates
how data encoding can
affect the precision of
values.
Data cleaning will not be
necessary when you use the
(previously cleaned) data for
the examples and problems
in this book.
Problems for Section 1.2
Applying the Concepts
1.12 The Data and Story Library (DASL) is an online library of
data files and stories that illustrate the use of basic statistical methods.
Visit lib.stat.cmu.edu/index.php, click DASL, and explore a
data set of interest to you. Which of the five sources of data best
describes the sources of the data set you selected?
1.13 Visit the website of the Gallup organization at www.gallup.com. Read today’s top story.
What type of data source is the top story based on?
1.14 Visit the website of the Pew Research organization at www.pewresearch.org. Read today’s
top story. What type of data source is the top story based on?
1.15 Transportation engineers and planners want to address the
dynamic properties of travel behavior by describing in detail the
driving characteristics of drivers over the course of a month. What
type of data collection source do you think the transportation engineers
and planners should use?
1.16 Visit the opening page of the Statistics Portal “Statista” at
(statista.com). Examine the “CHART OF THE DAY” panel on
the page. What type of data source is the information presented
here based on?
1.3 Types of Sampling Methods
When you collect data by selecting a sample, you begin by defining the frame. The frame is
a complete or partial listing of the items that make up the population from which the sample
will be selected. Inaccurate or biased results can occur if a frame excludes certain groups or
portions of the population. Using different frames to collect data can lead to different, even opposite,
conclusions.
Using your frame, you select either a nonprobability sample or a probability sample. In
a nonprobability sample, you select the items or individuals without knowing their probabilities
of selection. In a probability sample, you select items based on known probabilities.
Whenever possible, you should use a probability sample because such a sample allows you to
make inferences about the population being analyzed.
Nonprobability samples can have certain advantages, such as convenience, speed, and low
cost. Such samples are typically used to obtain informal approximations or as small-scale initial
or pilot analyses. However, because the theory of statistical inference depends on probability
sampling, nonprobability samples cannot be used for statistical inference, and this more
than offsets those advantages in more formal analyses.
Figure 1.1 shows the subcategories of the two types of sampling. A nonprobability sample
can be either a convenience sample or a judgment sample. To collect a convenience sample,
you select items that are easy, inexpensive, or convenient to sample. For example, in a warehouse
of stacked items, selecting only the items located on the top of each stack and within
easy reach would create a convenience sample. So, too, would the responses to surveys that
the websites of many companies offer visitors. While such surveys can provide large amounts
of data quickly and inexpensively, the convenience samples selected from these responses will
consist of self-selected website visitors. (Read the Think About This essay “New Media
Surveys/Old Sampling Problems” in Section 1.4 for a related story.)
Figure 1.1 Types of samples
Nonprobability samples: judgment sample, convenience sample
Probability samples: simple random sample, systematic sample, stratified sample, cluster sample
To collect a judgment sample, you collect the opinions of preselected experts in the subject
matter. Although the experts may be well informed, you cannot generalize their results to
the population.
The types of probability samples most commonly used include simple random, systematic,
stratified, and cluster samples. These four types of probability samples vary in terms of
cost, accuracy, and complexity, and they are the subject of the rest of this section.
Simple Random Sample
In a simple random sample, every item from a frame has the same chance of selection as every
other item, and every sample of a fixed size has the same chance of selection as every other
sample of that size. Simple random sampling is the most elementary random sampling technique.
It forms the basis for the other random sampling techniques. However, simple random
sampling has its disadvantages. Its results are often subject to more variation than other sampling
methods. In addition, when the frame used is very large, carrying out a simple random
sample may be time consuming and expensive.
With simple random sampling, you use n to represent the sample size and N to represent
the frame size. You number every item in the frame from 1 to N. The chance that you will select
any particular member of the frame on the first selection is 1/N.
You select samples with replacement or without replacement. Sampling with replacement
means that after you select an item, you return it to the frame, where it has the same
probability of being selected again. Imagine that you have a fishbowl containing N business
cards, one card for each person. On the first selection, you select the card for Grace Kim. You
record pertinent information and replace the business card in the bowl. You then mix up the
cards in the bowl and select a second card. On the second selection, Grace Kim has the same
probability of being selected again, 1/N. You repeat this process until you have selected the
desired sample size, n.
Typically, you do not want the same item or individual to be selected again in a sample.
Sampling without replacement means that once you select an item, you cannot select
it again. The chance that you will select any particular item in the frame (for example, the
business card for Grace Kim) on the first selection is 1/N. The chance that you will select any
card not previously chosen on the second selection is now 1 out of N - 1. This process continues
until you have selected the desired sample of size n.
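The difference between the two selection schemes can be sketched in Python. This is an illustration, not a prescribed procedure: the frame of 800 numbered items, the sample size of 40, the seed, and the function names are all assumptions made for the example, and Python's pseudorandom generator stands in for a physical draw.

```python
import random

def sample_with_replacement(frame, n, rng):
    """Each draw is made from the full frame, so the same item can be selected again."""
    return [rng.choice(frame) for _ in range(n)]

def sample_without_replacement(frame, n, rng):
    """Once an item is selected, it cannot be selected again."""
    return rng.sample(frame, n)

N = 800
frame = list(range(1, N + 1))    # items numbered from 1 to N
rng = random.Random(2016)        # arbitrary seed, used only to make the run repeatable

with_repl = sample_with_replacement(frame, 40, rng)
without_repl = sample_without_replacement(frame, 40, rng)

# Count the distinct items in each sample; without replacement this is always n.
print(len(set(with_repl)), len(set(without_repl)))
```

Sampling without replacement always yields n distinct items, while sampling with replacement may yield fewer because an item can repeat.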
When creating a simple random sample, you should avoid the “fishbowl” method of selecting
a sample because this method lacks the ability to thoroughly mix the cards and, therefore,
randomly select a sample. You should use a more rigorous selection method.
One such method is to use a table of random numbers, such as Table E.1 in Appendix E,
for selecting the sample. A table of random numbers consists of a series of digits listed in
a randomly generated sequence. To use a random number table for selecting a sample, you
first need to assign code numbers to the individual items of the frame. Then you generate the
random sample by reading the table of random numbers and selecting those individuals from
the frame whose assigned code numbers match the digits found in the table. Because the number
system uses 10 digits (0, 1, 2, ..., 9), the chance that you will randomly generate any
particular digit is equal to the probability of generating any other digit. This probability is 1
out of 10. Hence, if you generate a sequence of 800 digits, you would expect about 80 to be the
digit 0, 80 to be the digit 1, and so on. Because every digit or sequence of digits in the table is
random, the table can be read either horizontally or vertically. The margins of the table designate
row numbers and column numbers. The digits themselves are grouped into sequences of
five in order to make reading the table easier.
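The procedure just described can be imitated in Python. In this sketch a seeded pseudorandom generator produces the digit sequence as a stand-in for a printed table such as Table E.1; the seed, the frame size of 800, the sample size of 40, and the function name are assumptions made for illustration. Codes outside the frame and repeated codes are skipped, which corresponds to sampling without replacement.

```python
import random
from collections import Counter

rng = random.Random(1)     # arbitrary seed; a stand-in for a printed random number table
digits = [rng.randrange(10) for _ in range(800)]

counts = Counter(digits)   # each digit should appear roughly 80 times
print(sorted(counts.items()))

def select_from_digits(digits, frame_size, n):
    """Read consecutive code numbers from a digit sequence.

    Keeps codes between 1 and frame_size and skips codes already chosen.
    """
    width = len(str(frame_size))          # digits per code, for example 3 when N = 800
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        code = int("".join(str(d) for d in digits[i:i + width]))
        if 1 <= code <= frame_size and code not in chosen:
            chosen.append(code)
        if len(chosen) == n:
            break
    return chosen

sample = select_from_digits(digits, frame_size=800, n=40)
print(sample)
```

For instance, the digit sequence 008 999 008 028 read against a frame of 800 items gives the codes 8 and 28: the code 999 lies outside the frame, and the second 008 is a repeat.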
Learn More
Learn to use a table of random numbers to select a simple random sample in a
Chapter 1 online section.
Systematic Sample
In a systematic sample, you partition the N items in the frame into n groups of k items, where
k = N/n
You round k to the nearest integer. To select a systematic sample, you choose the first item to
be selected at random from the first k items in the frame. Then, you select the remaining n - 1
items by taking every kth item thereafter from the entire frame.
If the frame consists of a list of prenumbered checks, sales receipts, or invoices, taking a
systematic sample is faster and easier than taking a simple random sample. A systematic sample
is also a convenient mechanism for collecting data from membership directories, electoral
registers, class rosters, and consecutive items coming off an assembly line.
To take a systematic sample of n = 40 from the population of N = 800 full-time employees,
you partition the frame of 800 into 40 groups, each of which contains 20 employees. You
then select a random number from the first 20 individuals and include every twentieth individual
after the first selection in the sample. For example, if the first random number you select
is 008, your subsequent selections are 028, 048, 068, 088, 108, ..., 768, and 788.
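The employee example can be reproduced with a short Python sketch. The function name and its parameters are hypothetical; the fixed start of 8 mirrors the example above, and when no start is supplied the sketch draws one at random from the first k items.

```python
import random

def systematic_sample(N, n, rng=None, start=None):
    """Select every kth item number from a frame numbered 1 to N."""
    # Group size, rounded to the nearest integer
    # (Python rounds exact halves to the even integer).
    k = round(N / n)
    if start is None:
        rng = rng or random.Random()
        start = rng.randint(1, k)         # random start among the first k items
    # Drop any selections that would run past the end of the frame.
    return [start + i * k for i in range(n) if start + i * k <= N]

picks = systematic_sample(N=800, n=40, start=8)
print(picks[:5], picks[-2:])   # [8, 28, 48, 68, 88] [768, 788]
```

Because N/n is rarely a whole number in practice, rounding k can leave the sample slightly short of n items; the sketch simply drops any selection that would fall past the end of the frame.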
Simple random sampling and systematic sampling are simpler than other, more sophisticated,
probability sampling methods, but they generally require a larger sample size. In addition,
systematic sampling is prone to selection bias that can occur when there is a pattern in
the frame. To overcome the inefficiency of simple random sampling and the potential selection
bias involved with systematic sampling, you can use either stratified sampling methods or
cluster sampling methods.
Stratified Sample
In a stratified sample, you first subdivide the N items in the frame into separate subpopulations,
or strata. A stratum is defined by some common characteristic, such as gender or year
in school. You select a simple random sample within each of the strata and combine the results
from the separate simple random samples. Stratified sampling is more efficient than either
simple random sampling or systematic sampling because you are ensured of the representation
of items across the entire population. The homogeneity of items within each stratum provides
greater precision in the estimates of underlying population parameters. In addition, stratified
sampling enables you to reach conclusions about each stratum in the frame. However, using a
stratified sample requires that you determine the variable(s) on which to base the stratification,
and it can be expensive to implement.
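A stratified selection can be sketched in Python as follows. The sketch assumes proportional allocation, in which each stratum contributes to the sample in proportion to its size; the text does not prescribe an allocation rule, so treat that choice, the made-up strata, the seed, and the function name as assumptions.

```python
import random

def stratified_sample(strata, n, rng):
    """Draw a simple random sample within each stratum.

    strata maps a stratum label to the list of items in that stratum.
    Stratum sample sizes are proportional to stratum sizes (an assumption).
    """
    N = sum(len(items) for items in strata.values())
    result = {}
    for label, items in strata.items():
        n_stratum = round(n * len(items) / N)        # proportional allocation
        result[label] = rng.sample(items, n_stratum) # without replacement
    return result

# Made-up frame: 800 item numbers split into three strata of different sizes.
strata = {
    "stratum A": list(range(1, 401)),      # 400 items
    "stratum B": list(range(401, 701)),    # 300 items
    "stratum C": list(range(701, 801)),    # 100 items
}
sample = stratified_sample(strata, n=40, rng=random.Random(7))
sizes = {label: len(items) for label, items in sample.items()}
print(sizes)   # {'stratum A': 20, 'stratum B': 15, 'stratum C': 5}
```

With other stratum sizes, rounding can make the stratum sample sizes add up to slightly more or less than n, so a real design would need a rule for reconciling the total.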
Cluster Sample
In a cluster sample, you divide the N items in the frame into clusters that contain several
items. Clusters are often naturally occurring groups, such as counties, election districts, city
blocks, households, or sales territories. You then take a random sample of one or more clusters
and study all items in each selected cluster.
Cluster sampling is often more cost-effective than simple random sampling, particularly
if the population is spread over a wide geographic region. However, cluster sampling often requires
a larger sample size to produce results as precise as those from simple random sampling
or stratified sampling. A detailed discussion of systematic sampling, stratified sampling, and
cluster sampling procedures can be found in references 2, 4, and 5.
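The two steps of cluster sampling, choosing clusters at random and then studying every item in each chosen cluster, can be sketched in Python. The sales territories, the customer labels, the number of clusters selected, and the function name are all made up for this illustration.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters, then keep every item in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    items = [item for label in chosen for item in clusters[label]]
    return chosen, items

# Made-up frame: 10 sales territories with 80 customers each.
clusters = {
    f"territory {t:02d}": [f"T{t:02d}-C{c:02d}" for c in range(1, 81)]
    for t in range(1, 11)
}
chosen, items = cluster_sample(clusters, n_clusters=2, rng=random.Random(3))
print(chosen, len(items))
```

Only the clusters are selected at random; no further selection takes place inside a chosen cluster, which is what distinguishes this design from stratified sampling.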
Learn More
Learn how to select a stratified sample in a Chapter 1 online section.
Problems for Section 1.3
Learning the Basics
1.17 For a population containing N = 902 individuals, what
code number would you assign for
a. the first person on the list?
b. the fortieth person on the list?
c. the last person on the list?
1.18 For a population of N = 902, verify that by starting in row 05,
column 01 of the table of random numbers (Table E.1), you need only
six rows to select a sample of n = 60 without replacement.
1.19 Given a population of N = 93, starting in row 29, column 01
of the table of random numbers (Table E.1), and reading across the
row, select a sample of n = 15
a. without replacement.
b. with replacement.
Applying the Concepts
1.20 For a study that consists of personal interviews with participants
(rather than mail or phone surveys), explain why simple random
sampling might be less practical than some other sampling methods.
1.21 You want to select a random sample of n = 1 from a population
of three items (which are called A, B, and C). The rule for
selecting the sample is as follows: Flip a coin; if it is heads, pick
item A; if it is tails, flip the coin again; this time, if it is heads,
choose B; if it is tails, choose C. Explain why this is a probability
sample but not a simple random sample.
1.22 A population has four members (called A, B, C, and D). You
would like to select a random sample of n = 2, which you decide
to do in the following way: Flip a coin; if it is heads, the sample will
be items A and B; if it is tails, the sample will be items C and D.
Although this is a random sample, it is not a simple random sample.
Explain why. (Compare the procedure described in Problem
1.21 with the procedure described in this problem.)
1.23 The registrar of a university with a population of N = 4,000
full-time students is asked by the president to conduct a survey
to measure satisfaction with the quality of life on campus. The
following table contains a breakdown of the 4,000 registered
full-time students, by gender and class designation:
The registrar intends to take a probability sample of n = 200 students
and project the results from the sample to the entire population
of full-time students.
a. If the frame available from the registrar’s files is an alphabetical
listing of the names of all N = 4,000 registered full-time
students, what type of sample could you take? Discuss.
b. What is the advantage of selecting a simple random sample
in (a)?
c. What is the advantage of selecting a systematic sample in (a)?
d. If the frame available from the registrar’s files is a list of the
names of all N = 4,000 registered full-time students compiled
from eight separate alphabetical lists, based on the gender and
class designation breakdowns shown in the class designation
table, what type of sample should you take? Discuss.
e. Suppose that each of the N = 4,000 registered full-time students
lived in one of the 10 campus dormitories. Each dormitory
accommodates 400 students. It is college policy to fully
integrate students by gender and class designation in each dormitory.
If the registrar is able to compile a listing of all students
by dormitory, explain how you could take a cluster sample.
SELF Test
1.24 Prenumbered sales invoices are kept in a
sales journal. The invoices are numbered from 0001
to 5000.
a. Beginning in row 16, column 01, and proceeding horizontally
in a table of random numbers (Table E.1), select a simple random
sample of 50 invoice numbers.
b. Select a systematic sample of 50 invoice numbers. Use the random
numbers in row 20, columns 05–07, as the starting point
for your selection.
c. Are the invoices selected in (a) the same as those selected in
(b)? Why or why not?
1.25 Suppose that 10,000 customers in a retailer’s customer database
are categorized by three customer types: 3,500 prospective
buyers, 4,500 first time buyers, and 2,000 repeat (loyal) buyers.
A sample of 1,000 customers is needed.
a. What type of sampling should you do? Why?
b. Explain how you would carry out the sampling according to the
method stated in (a).
c. Why is the sampling in (a) not simple random sampling?
Table for Problem 1.23: Registered full-time students, by gender and class designation
Gender    Fr.     So.    Jr.    Sr.    Total
Female    700     520    500    480    2,200
Male      560     460    400    380    1,800
Total     1,260   980    900    860    4,000
1.4 Types of Survey Errors
As you learned in Section 1.2, responses from a survey represent a source of data. Nearly
every day, you read or hear about survey or opinion poll results in newspapers, on the
Internet, or on radio or television. To identify surveys that lack objectivity or credibility,
you must critically evaluate what you read and hear by examining the validity of the survey
results. First, you must evaluate the purpose of the survey, why it was conducted, and for
whom it was conducted.
The second step in evaluating the validity of a survey is to determine whether it was based
on a probability or nonprobability sample (as discussed in Section 1.3). You need to remember
that the only way to make valid statistical inferences from a sample to a population is by using
a probability sample. Surveys that use nonprobability sampling methods are subject to serious
biases that may make the results meaningless.
Even when surveys use probability sampling methods, they are subject to four types of
potential survey errors:
• Coverage error
• Nonresponse error
• Sampling error
• Measurement error
Well-designed surveys reduce or minimize these four types of errors, often at considerable cost.
Coverage Error
The key to proper sample selection is having an adequate frame. Coverage error occurs if
certain groups of items are excluded from the frame so that they have no chance of being selected
in the sample or if items are included from outside the frame. Coverage error results in
a selection bias. If the frame is inadequate because certain groups of items in the population
were not properly included, any probability sample selected will provide only an estimate of
the characteristics of the frame, not the actual population.
Nonresponse Error
Not everyone is willing to respond to a survey. Nonresponse error arises from failure to collect
data on all items in the sample and results in a nonresponse bias. Because you cannot always
assume that persons who do not respond to surveys are similar to those who do, you need
to follow up on the nonresponses after a specified period of time. You should make several
attempts to convince such individuals to complete the survey and possibly offer an incentive
to participate. The follow-up responses are then compared to the initial responses in order to
make valid inferences from the survey (see references 2, 4, and 5). The mode of response you
use, such as face-to-face interview, telephone interview, paper questionnaire, or computerized
questionnaire, affects the rate of response. Personal interviews and telephone interviews usually
produce a higher response rate than do mail surveys—but at a higher cost.
Sampling Error
When conducting a probability sample, chance dictates which individuals or items will or will
not be included in the sample. Sampling error reflects the variation, or “chance differences,”
from sample to sample, based on the probability of particular individuals or items being selected
in the particular samples.
When you read about the results of surveys or polls in newspapers or on the Internet, there
is often a statement regarding a margin of error, such as “the results of this poll are expected
to be within ±4 percentage points of the actual value.” This margin of error is the sampling
error.
You can reduce sampling error by using larger sample sizes. Of course, doing so increases
the cost of conducting the survey.
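A small simulation can make the effect of sample size visible. In the Python sketch below, the population of 10,000 with exactly 60% holding an opinion, the sample sizes, the number of repetitions, the seed, and the function name are all assumptions chosen for illustration. The sketch measures how much the sample proportion varies from sample to sample, which is one way to picture sampling error.

```python
import random
import statistics

def spread_of_sample_proportions(population, n, repeats, rng):
    """Standard deviation of the sample proportion across repeated simple random samples."""
    proportions = []
    for _ in range(repeats):
        sample = rng.sample(population, n)      # simple random sample without replacement
        proportions.append(sum(sample) / n)
    return statistics.pstdev(proportions)

rng = random.Random(11)   # arbitrary seed, used only to make the run repeatable
# Made-up population of 10,000 in which exactly 60% hold some opinion (coded 1).
population = [1] * 6000 + [0] * 4000

for n in (100, 400, 1600):
    print(n, round(spread_of_sample_proportions(population, n, 500, rng), 4))
```

In runs of this kind, the spread shrinks as n grows, but each fourfold increase in sample size only roughly halves it, which is why reducing sampling error becomes costly.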
Measurement Error
In the practice of good survey research, you design surveys with the intention of gathering
meaningful and accurate information. Unfortunately, the survey results you get are often only a
proxy for the ones you really desire. Unlike height or weight, certain information about behaviors
and psychological states is impossible or impractical to obtain directly.
When surveys rely on self-reported information, the mode of data collection, the respondent
to the survey, and/or the survey itself can be possible sources of measurement error.
Satisficing, social desirability, reading ability, and/or interviewer effects can be dependent
on the mode of data collection. The social desirability bias or cognitive/memory limitations
of a respondent can affect the results. And vague questions, double-barreled questions
that ask about multiple issues but require a single response, or questions that ask the
respondent to report something that occurs over time but fail to clearly define the extent
of time about which the question asks (the reference period) are some of the survey flaws
that can cause errors.
To minimize measurement error, you need to standardize survey administration and respondent
understanding of questions, but there are many barriers to this (see references 1, 3,
and 10).
Ethical Issues About Surveys
Ethical considerations arise with respect to the four types of survey error. Coverage error
can result in selection bias and becomes an ethical issue if particular groups or individuals
are purposely excluded from the frame so that the survey results are more favorable to the
survey’s sponsor. Nonresponse error can lead to nonresponse bias and becomes an ethical
issue if the sponsor knowingly designs the survey so that particular groups or individuals
are less likely than others to respond. Sampling error becomes an ethical issue if the findings
are purposely presented without reference to sample size and margin of error so that
the sponsor can promote a viewpoint that might otherwise be inappropriate. Measurement
error can become an ethical issue in one of three ways: (1) a survey sponsor chooses leading
questions that guide the respondent in a particular direction; (2) an interviewer, through
mannerisms and tone, purposely makes a respondent feel obligated to please the interviewer
or otherwise guides the respondent in a particular direction; or (3) a respondent willfully
provides false information.
Ethical issues also arise when the results of nonprobability samples are used to form conclusions
about the entire population. When you use a nonprobability sampling method, you
need to explain the sampling procedures and state that the results cannot be generalized
beyond the sample.
Think About This: New Media Surveys/Old Sampling Problems
A software company executive decides to create a “customer experience improvement program”
to record how customers use its products, with the goal of using the collected data to make
product enhancements. An editor of a news website decides to create an instant poll to ask
website visitors about important political issues. A marketer of products aimed at a specific
demographic decides to use a social networking site to collect consumer feedback. What do
these decisions have in common with a dead-tree publication that went out of business over
70 years ago?
By 1932, long before the Internet, “straw polls” conducted by the magazine Literary Digest
had successfully predicted five U.S. presidential elections in a row. For the 1936 election, the
magazine promised its largest poll ever and sent about 10 million ballots to people all across
the country. After receiving and tabulating more than 2.3 million ballots, the Digest confidently
proclaimed that Alf Landon would be an easy winner over Franklin D. Roosevelt. As things
turned out, FDR won in a landslide, with Landon receiving the fewest electoral votes in U.S.
history. The reputation of Literary Digest was ruined; the magazine would cease publication
less than two years later.
The failure of the Literary Digest poll was a watershed event in the history of sample surveys
and polls. This failure refuted the notion that the larger the sample is, the better. (Remember
this the next time someone complains about a political survey’s “small” sample size.) The
failure opened the door to new and more modern methods of sampling discussed in this
chapter. Using the predecessors of those methods, George Gallup, the “Gallup” of the famous
poll, and Elmo Roper, of the eponymous reports, both first gained widespread public notice
for their correct “scientific” predictions of the 1936 election.
The failed Literary Digest poll became fodder for several postmortems, and the reason for the
failure became almost an urban legend. Typically, the explanation is coverage error: The
ballots were sent mostly to “rich people,” and this created a frame that excluded poorer
citizens (presumably more inclined to vote for the Democrat Roosevelt than the Republican
Landon). However, later analyses suggest that this was not true; instead, low rates of response
(2.3 million ballots represented less than 25% of the ballots distributed) and/or nonresponse
error (Roosevelt voters were less likely to mail in a ballot than Landon voters) were significant
reasons for the failure (see reference 9).
When Microsoft first revealed its Office Ribbon interface, a manager explained how Microsoft
had applied data collected from its “Customer Experience Improvement Program” to the user
interface redesign. This led others to speculate that the data were biased toward beginners
(who might be less likely to decline participation in the program) and that this bias, in turn,
had led Microsoft to create a user interface that ended up perplexing more experienced users.
This was another case of nonresponse error!
The editor’s instant poll mentioned earlier is targeted to the visitors of the news website,
and the social network–based survey is aimed at “friends” of a product; such polls can also
suffer from nonresponse errors. Often, marketers extol how much they “know” about survey
respondents, thanks to data that can be collected from a social network community. But no
amount of information about the respondents can tell marketers who the nonrespondents are.
Therefore, new media surveys fall prey to the same old type of error that proved fatal to
Literary Digest way back when.
Today, companies establish formal surveys based on probability sampling and go to great
lengths, and spend large sums, to deal with coverage error, nonresponse error, sampling error,
and measurement error. Instant polling and tell-a-friend surveys can be interesting and fun,
but they are not replacements for the methods discussed in this chapter.
Problems for Section 1.4
Applying the Concepts
1.26 A survey indicates that the vast majority of college students
own their own personal computers. What information would you
want to know before you accepted the results of this survey?
1.27 A simple random sample of n = 300 full-time employees
is selected from a company list containing the names of
all N = 5,000 full-time employees in order to evaluate job
satisfaction.
a. Give an example of possible coverage error.
b. Give an example of possible nonresponse error.
c. Give an example of possible sampling error.
d. Give an example of possible measurement error.
SELF Test
1.28 The results of a 2013 Adobe Systems study on
retail apps and buying habits reveal insights on perceptions
and attitudes toward mobile shopping using retail apps and
browsers, providing new direction for retailers to develop their
digital publishing strategies (adobe.ly/11gt8Rq). Increased consumer
interest in using shopping applications means retailers
must adapt to meet the rising expectations for specialized mobile
shopping experiences. The results indicate that tablet users (55%)
are almost twice as likely as smartphone users (28%) to use their
device to purchase products and services. The findings also reveal
that retail and catalog apps are rapidly catching up to mobile
browsers as a viable shopping channel: nearly half of all mobile
shoppers are interested in using apps instead of a mobile browser
(45% of tablet shoppers and 49% of smartphone shoppers). The
research is based on an online survey with a sample of 1,003 consumers.
Identify potential concerns with coverage, nonresponse,
sampling, and measurement errors.
1.29 A recent PwC Global Supply Chain survey indicated that
companies that acknowledge the supply chain as a strategic
asset achieve 70% higher performance (pwc.to/VaFpGz). The
“Leaders” in the survey point to next-generation supply chains,
which are fast, flexible, and responsive. They are more concerned
with skills that separate a company from the crowd: 51% say differentiating
capabilities is the real key to success. What additional
information would you want to know about the survey before you
accepted the results of the study?
1.30 A recent survey points to a next generation of consumers
seeking a more mobile TV experience. The 2013 KPMG
International Consumer Media Behavior study found that while
TV is still the most popular media activity with 88% of U.S.
consumers watching TV, a relatively high proportion of U.S. consumers,
14%, now prefer to watch TV via their mobile device or
tablet for greater flexibility (bit.ly/Wb8Jv9). What additional
information would you want to know about the survey before you
accepted the results of the study?
Using Statistics
Beginning of the End … Revisited
The analysts charged by GT&M CEO Emma Levia to identify, define, and collect the data that
would be helpful in setting a price for Whitney Wireless have completed their task. The group
has identified a number of variables to analyze. In the course of doing this work, the group
realized that most of the variables to study would be discrete numerical variables based on
data that (ac)counts the financials of the business. These data would mostly be from the
primary source of the business itself, but some supplemental variables about economic
conditions and other factors that might affect the long-term prospects of the business might
come from a secondary data source, such as an economic agency.
Problems for Section 1.4
Applying the Concepts
1.26 A survey indicates that the vast majority of college students
own their own personal computers. What information would you
want to know before you accepted the results of this survey?
1.27 A simple random sample of n = 300 full-time employees
is selected from a company list containing the names of
all N = 5,000 full-time employees in order to evaluate job
satisfaction.
a. Give an example of possible coverage error.
b. Give an example of possible nonresponse error.
c. Give an example of possible sampling error.
d. Give an example of possible measurement error.
SELF
Test
1.28 The results of a 2013 Adobe Systems study on
retail apps and buying habits reveal insights on perceptions
and attitudes toward mobile shopping using retail apps and
browsers, providing new direction for retailers to develop their
digital publishing strategies (adobe.ly/11gt8Rq). Increased consumer
interest in using shopping applications means retailers
must adapt to meet the rising expectations for specialized mobile
shopping experiences. The results indicate that tablet users (55%)
are almost twice as likely as smartphone users (28%) to use their
device to purchase products and services. The findings also reveal
that retail and catalog apps are rapidly catching up to mobile
browsers as a viable shopping channel: nearly half of all mobile
shoppers are interested in using apps instead of a mobile browser
(45% of tablet shoppers and 49% of smartphone shoppers). The
research is based on an online survey with a sample of 1,003 consumers.
Identify potential concerns with coverage, nonresponse,
sampling, and measurement errors.
1.29 A recent PwC Supply Global Chain survey indicated that
companies that acknowledge the supply chain as a strategic
asset achieve 70% higher performance (pwc.to/VaFpGz). The
“Leaders” in the survey point to next-generation supply chains,
which are fast, flexible, and responsive. They are more concerned
with skills that separate a company from the crowd: 51% say differentiating
capabilities is the real key to success. What additional
information would you want to know about the survey before you
accepted the results of the study?
1.30 A recent survey points to a next generation of consumers
seeking a more mobile TV experience. The 2013 KPMG
International Consumer Media Behavior study found that while
TV is still the most popular media activity with 88% of U.S.
consumers watching TV, a relatively high proportion of U.S. consumers,
14%, now prefer to watch TV via their mobile device or
tablet for greater flexibility (bit.ly/Wb8Jv9). What additional
information would you want to know about the survey before you
accepted the results of the study?
Using Statistics
Beginning of the End… Revisited
The analysts charged by GT&M CEO Emma Levia to
identify, define, and collect the data that would be helpful
in setting a price for Whitney Wireless have completed
their task. The group has identified a number of variables
to analyze. In the course of doing this work, the group realized
that most of the variables to study would be discrete
numerical variables based on data that (ac)counts the financials
of the business. These data would mostly be from the
primary source of the business itself, but some supplemental
variables about economic conditions and other factors that might
affect the long-term prospects of the business might
come from a secondary data source, such as an economic
agency.
the sponsor can promote a viewpoint that might otherwise be inappropriate. Measurement
error can become an ethical issue in one of three ways: (1) a survey sponsor chooses leading
questions that guide the respondent in a particular direction; (2) an interviewer, through
mannerisms and tone, purposely makes a respondent feel obligated to please the interviewer
or otherwise guides the respondent in a particular direction; or (3) a respondent willfully
provides false information.
Ethical issues also arise when the results of nonprobability samples are used to form conclusions
about the entire population. When you use a nonprobability sampling method, you
need to explain the sampling procedures and state that the results cannot be generalized beyond
the sample.
Business Statistics: A First Course, Seventh Edition, by David M. Levine, Kathryn A. Szabat, and David F. Stephan. Published by Pearson.
Copyright © 2016 by Pearson Education, Inc.
ISBN: 978-1-323-26258-0
Summary
In this chapter, you learned about the various types of
variables used in business. In addition, you learned about
different methods of collecting data, several statistical
sampling methods, and issues involved
in taking samples.
In the next two chapters, you will study a variety of tables
and charts and descriptive measures that are used to present
and analyze data.
References
1. Biemer, P. B., R. M. Graves, L. E. Lyberg, A. Mathiowetz, and
S. Sudman. Measurement Errors in Surveys. New York: Wiley
Interscience, 2004.
2. Cochran, W. G. Sampling Techniques, 3rd ed. New York: Wiley, 1977.
3. Fowler, F. J. Improving Survey Questions: Design and Evaluation,
Applied Special Research Methods Series, Vol. 38, Thousand Oaks, CA:
Sage Publications, 1995.
4. Groves, R. M., F. J. Fowler, M. P. Couper, J. M. Lepkowski,
E. Singer, and R. Tourangeau. Survey Methodology, 2nd ed.
New York: John Wiley, 2009.
5. Lohr, S. L. Sampling Design and Analysis, 2nd ed. Boston,
MA: Brooks/Cole Cengage Learning, 2010.
6. Microsoft Excel 2013. Redmond, WA: Microsoft Corporation,
2012.
7. Minitab Release 16. State College, PA: Minitab, Inc., 2010.
8. Osbourne, J. Best Practices in Data Cleaning. Thousand Oaks,
CA: Sage Publications, 2012.
9. Squire, P. “Why the 1936 Literary Digest Poll Failed.” Public
Opinion Quarterly 52 (1988): 125–133.
10. Sudman, S., N. M. Bradburn, and N. Schwarz. Thinking About
Answers: The Application of Cognitive Processes to Survey
Methodology. San Francisco, CA: Jossey-Bass, 1993.
Key Terms
categorical variable 11
cluster 18
cluster sample 18
collect 11
collectively exhaustive 15
continuous variable 12
convenience sample 16
coverage error 20
define 11
discrete variable 12
frame 16
judgment sample 17
margin of error 20
measurement error 20
missing value 15
mutually exclusive 15
nonprobability sample 16
nonresponse bias 20
nonresponse error 20
numerical variable 11
operational definition 11
outlier 15
parameter 14
population 14
primary data source 13
probability sample 16
qualitative variable 11
quantitative variable 11
recoded variable 15
sample 14
sampling error 20
sampling with replacement 17
sampling without replacement 17
secondary data source 13
selection bias 20
simple random sample 17
statistics 14
strata 18
stratified sample 18
systematic sample 18
table of random numbers 17
unstructured data 14
The group foresaw that examining several categorical variables
related to the customers of both GT&M and Whitney
Wireless would be necessary. The group discovered that the affinity
(“shopper’s card”) programs of both firms had already
collected demographic data of interest when customers enrolled
in those programs. That primary source, when combined
with secondary data gleaned from the social media networks
to which the business belongs, might prove useful in getting a
rough approximation of the profile of a typical customer that
might be interested in doing business with an “A-to-Z” electronics retailer.
Checking Your Understanding
1.31 What is the difference between a sample and a population?
1.32 What is the difference between a statistic and a parameter?
1.33 What is the difference between a categorical variable and a
numerical variable?
1.34 What is the difference between a discrete numerical variable
and a continuous numerical variable?
1.35 What is the difference between probability sampling and nonprobability
sampling?
Chapter Review Problems
1.36 Visit the official website for either Excel (www.office.microsoft.com/excel)
or Minitab (www.minitab.com/products/minitab). Read about the program you chose and then think about the
ways the program could be useful in statistical analysis.
1.37 Results of a 2013 Adobe Systems study on retail apps and
buying habits reveal insights on perceptions and attitudes toward
mobile shopping using retail apps and browsers, providing new direction
for retailers to develop their digital publishing strategies.
Increased consumer interest in using shopping applications means
retailers must adapt to meet the rising expectations for specialized
mobile shopping experiences. The results indicate that tablet users
(55%) are almost twice as likely as smartphone users (28%) to
use their device to purchase products and services. The findings
also reveal that retail and catalog apps are rapidly catching up to
mobile browsers as a viable shopping channel: Nearly half of all
mobile shoppers are interested in using apps instead of a mobile
browser (45% of tablet shoppers and 49% of smartphone shoppers).
The research is based on an online survey with a sample
of 1,003 18–54 year olds who currently own a smartphone and/or
tablet; it includes consumers who use and do not use these devices
to shop (adobe.ly/11gt8Rq).
a. Describe the population of interest.
b. Describe the sample that was collected.
c. Describe a parameter of interest.
d. Describe the statistic used to estimate the parameter in (c).
1.38 The Gallup organization releases the results of recent polls
at its website, www.gallup.com. Visit this site and read an article
of interest.
a. Describe the population of interest.
b. Describe the sample that was collected.
c. Describe a parameter of interest.
d. Describe the statistic used to estimate the parameter in (c).
1.39 A recent PwC Global Supply Chain survey indicated that companies
that acknowledge the supply chain as a strategic asset achieve
70% higher performance. The “Leaders” in the survey point to next-generation
supply chains, which are fast, flexible, and responsive. They
are more concerned with skills that separate a company from the crowd:
51% say differentiating capabilities is the real key to success (pwc.to/VaFpGz).
The results are based on a survey of 503 supply chain
executives in a wide range of industries representing a mix of company
sizes from across three global regions: Asia, Europe, and the
Americas.
a. Describe the population of interest.
b. Describe the sample that was collected.
c. Describe a parameter of interest.
d. Describe the statistic used to estimate the parameter in (c).
1.40 The Data and Story Library (DASL) is an online library of
data files and stories that illustrate the use of basic statistical methods.
Visit lib.stat.cmu.edu/index.php, click DASL, and explore a
data set of interest to you.
a. Describe a variable in the data set you selected.
b. Is the variable categorical or numerical?
c. If the variable is numerical, is it discrete or continuous?
1.41 Download and examine the U.S. Census Bureau’s “Business and
Professional Classification Survey (SQ-CLASS),” available through
the Get Help with Your Form link at www.census.gov/econ/.
a. Give an example of a categorical variable included in the survey.
b. Give an example of a numerical variable included in the survey.
1.42 Three professors examined awareness of four widely disseminated
retirement rules among employees at the University of Utah.
These rules provide simple answers to questions about retirement planning
(R. N. Mayer, C. D. Zick, and M. Glaittle, “Public Awareness of
Retirement Planning Rules of Thumb,” Journal of Personal Finance,
2011 10(1), 12–35). At the time of the investigation, there were approximately
10,000 benefited employees, and 3,095 participated in the
study. Demographic data collected on these 3,095 employees included
gender, age (years), education level (years completed), marital status,
household income ($), and employment category.
a. Describe the population of interest.
b. Describe the sample that was collected.
c. Indicate whether each of the demographic variables mentioned
is categorical or numerical.
1.43 A manufacturer of cat food is planning to survey households in the United States to determine purchasing habits of cat owners.
Among the variables to be collected are the following:
i. The primary place of purchase for cat food
ii. Whether dry or moist cat food is purchased
iii. The number of cats living in the household
iv. Whether any cat living in the household is pedigreed
a. For each of the four items listed, indicate whether the variable
is categorical or numerical. If it is numerical, is it discrete or
continuous?
b. Develop five categorical questions for the survey.
c. Develop five numerical questions for the survey.
Cases for Chapter 1
Managing Ashland MultiComm Services
Ashland MultiComm Services (AMS) provides high-quality
communications networks in the Greater Ashland
area. AMS traces its roots to Ashland Community Access
Television (ACATV), a small company that redistributed the
broadcast television signals from nearby major metropolitan
areas but has evolved into a provider of a wide range of
broadband services for residential customers.
AMS offers subscription-based services for digital cable
video programming, local and long-distance telephone
services, and high-speed Internet access. Recently, AMS has
faced competition from other network providers that have
expanded into the Ashland area. AMS has also seen decreases
in the number of new digital cable installations and
the rate of digital cable renewals.
AMS management believes that a combination of increased
promotional expenditures, adjustment in subscription
fees, and improved customer service will allow AMS
to successfully face the competition from other network
providers. However, AMS management worries about the
possible effects that new Internet-based methods of program
delivery may have had on their digital cable business. They
decide that they need to conduct some research and organize
a team of research specialists to examine the current status
of the business and the marketplace in which it competes.
The managers suggest that the research team examine
the company’s own historical data for number of subscribers,
revenues, and subscription renewal rates for the past
few years. They direct the team to examine year-to-date data
as well, as the managers suspect that some of the changes
they have seen have been a relatively recent phenomenon.
1. What type of data source would the company’s own
historical data be? Identify other possible data sources
that the research team might use to examine the current
marketplace for residential broadband services in a city
such as Ashland.
2. What type of data collection techniques might the team
employ?
3. In their suggestions and directions, the AMS managers
have named a number of possible variables to study, but
offered no operational definitions for those variables.
What types of possible misunderstandings could arise if
the team and managers do not first properly define each
variable cited?
CardioGood Fitness
CardioGood Fitness is a developer of high-quality cardiovascular
exercise equipment. Its products include treadmills,
fitness bikes, elliptical machines, and e-glides. CardioGood
Fitness looks to increase the sales of its treadmill products
and has hired The AdRight Agency, a small advertising
firm, to create and implement an advertising program. The
AdRight Agency plans to identify particular market segments
that are most likely to buy their clients’ goods and
services and then locate advertising outlets that will reach
that market group. This activity includes collecting data on
clients’ actual sales and on the customers who make the
purchases, with the goal of determining whether there is a
distinct profile of the typical customer for a particular product
or service. If a distinct profile emerges, efforts are made
to match that profile to advertising outlets known to reflect
the particular profile, thus targeting advertising directly to
high-potential customers.
CardioGood Fitness sells three different lines of treadmills.
The TM195 is an entry-level treadmill. It is as dependable
as other models offered by CardioGood Fitness,
but with fewer programs and features. It is suitable for individuals
who prefer minimal programming and want a simple way
to start their walk or hike. The TM195 sells
for $1,500.
The middle-line TM498 adds two user programs and an
elevation upgrade of up to 15% to the features of the entry-level
model. The TM498 is suitable for individuals who are
walkers at a transitional stage from walking to running or
midlevel runners. The TM498 sells for $1,750.
The top-of-the-line TM798 is structurally larger and
heavier and has more features than the other models. Its
unique features include a bright blue backlit LCD console,
quick speed and incline keys, a wireless heart rate monitor
with a telemetric chest strap, remote speed and incline controls,
and an anatomical figure that specifies which muscles
are minimally and maximally activated. This model features
a nonfolding platform base that is designed to handle rigorous,
frequent running; the TM798 is therefore appealing
to someone who is a power walker or a runner. The selling
price is $2,500.
As a first step, the market research team at AdRight is
assigned the task of identifying the profile of the typical
customer for each treadmill product offered by CardioGood
Fitness. The market research team decides to investigate
whether there are differences across the product lines with
respect to customer characteristics. The team decides to collect
data on individuals who purchased a treadmill at a CardioGood
Fitness retail store during the prior three months.
The team decides to use both business transactional
data and the results of a personal profile survey that every
purchaser completes as their sources of data. The team
identifies the following customer variables to study: product
purchased (TM195, TM498, or TM798); gender; age,
in years; education, in years; relationship status, single or
partnered; annual household income ($); mean number
of times the customer plans to use the treadmill each week;
mean number of miles the customer expects to walk/run
each week; and self-rated fitness on a 1-to-5 scale, where
1 is poor shape and 5 is excellent shape. For this set of
variables:
1. Which variables in the survey are categorical?
2. Which variables in the survey are numerical?
3. Which variables are discrete numerical variables?
Clear Mountain State Student Surveys
1. The Student News Service at Clear Mountain State
University (CMSU) has decided to gather data about
the undergraduate students who attend CMSU. They
create and distribute a survey of 14 questions and
receive responses from 62 undergraduates (stored
in UndergradSurvey). Download (see Appendix C) and
review the survey document CMUndergradSurvey.pdf.
For each question asked in the survey, determine
whether the variable is categorical or numerical. If
you determine that the variable is numerical, identify
whether it is discrete or continuous.
2. The dean of students at CMSU has learned about the
undergraduate survey and has decided to undertake a similar
survey for graduate students at CMSU. She creates and
distributes a survey of 14 questions and receives responses
from 44 graduate students (stored in GradSurvey). Download
(see Appendix C) and review the survey document
CMGradSurvey.pdf. For each question asked in the survey,
determine whether the variable is categorical or numerical.
If you determine that the variable is numerical,
identify whether it is discrete or continuous.
Learning with the Digital Cases
As you have already learned in this book, decision makers
use statistical methods to help analyze data and communicate
results. Every day, somewhere, someone misuses these
techniques either by accident or intentional choice. Identifying
and preventing such misuses of statistics is an important
responsibility for all managers. The Digital Cases give you
the practice you need to help develop the skills necessary
for this important task.
Each chapter’s Digital Case tests your understanding of
how to apply an important statistical concept taught in the
chapter. As in many business situations, not all of the information
you encounter will be relevant to your task, and you
may occasionally discover conflicting information that you
have to resolve in order to complete the case.
To assist your learning, each Digital Case begins with
a learning objective and a summary of the problem or issue
at hand. Each case directs you to the information necessary
to reach your own conclusions and to answer the case
questions. Many cases, such as the sample case worked out
next, extend a chapter’s Using Statistics scenario. You can
download digital case files for later use or retrieve them online
from a MyStatLab course for this book, as explained in
Appendix C.
To illustrate learning with a Digital Case, open the
Digital Case file WhitneyWireless.pdf that contains summary
information about the Whitney Wireless business.
Recall from the Using Statistics scenario for this chapter
that Good Tunes & More (GT&M) is a retailer seeking to
expand by purchasing Whitney Wireless, a small chain that
sells mobile media devices. Apparently, from the claim on
the title page, this business is celebrating its “best sales
year ever.”
Review the Who We Are, What We Do, and What We
Plan to Do sections on the second page. Do these sections
contain any useful information? What questions does this
passage raise? Did you notice that while many facts are presented,
no data that would support the claim of “best sales
year ever” are presented? And were those mobile “mobilemobiles”
used solely for promotion? Or did they generate
any sales? Do you think that a talk-with-your-mouth-full
event, however novel, would be a success?
Continue to the third page and the Our Best Sales Year
Ever! section. How would you support such a claim? With
a table of numbers? Remarks attributed to a knowledgeable
source? Whitney Wireless has used a chart to present
“two years ago” and “latest twelve months” sales data by
category. Are there any problems with what the company
has done? Absolutely!
First, note that there are no scales for the symbols
used, so you cannot know what the actual sales volumes
are. In fact, as you will learn in Section 2.7, charts that incorporate
icons as shown on the third page are considered
examples of chartjunk and would never be used by people
seeking to properly visualize data. The use of chartjunk
symbols creates the impression that unit sales data are being
presented. If the data are unit sales, does such data best
support the claim being made, or would something else,
such as dollar volumes, be a better indicator of sales at the
retailer?
For the moment, let’s assume that unit sales are being
visualized. What are you to make of the second row,
in which the three icons on the right side are much wider
than the three on the left? Does that row represent a newer
(wider) model being sold or a greater sales volume? Examine
the fourth row. Does that row represent a decline in sales
or an increase? (Do two partial icons represent more than
one whole icon?) As for the fifth row, what are we to think?
Is a black icon worth more than a red icon or vice versa?
At least the third row seems to tell some sort of tale of
increased sales, and the sixth row tells a tale of constant
sales. But what is the “story” about the seventh row? There,
the partial icon is so small that we have no idea what product
category the icon represents.
Perhaps a more serious issue is those curious chart labels.
“Latest twelve months” is ambiguous; it could include
months from the current year as well as months from one
year ago and therefore may not be an equivalent time period
to “two years ago.” But the business was established in 2001,
and the claim being made is “best sales year ever,” so why
hasn’t management included sales figures for every year?
Are the Whitney Wireless managers hiding something,
or are they just unaware of the proper use of statistics? Either
way, they have failed to properly organize and visualize
their data and therefore have failed to communicate a vital
aspect of their story.
In subsequent Digital Cases, you will be asked to provide
this type of analysis, using the open-ended case questions
as your guide. Not all the cases are as straightforward
as this example, and some cases include perfectly appropriate
applications of statistical methods.
EG1.1 Defining Variables
Classifying Variables by Type
Microsoft Excel infers the variable type from the data you enter
into a column. If Excel discovers a column that contains numbers,
it treats the column as a numerical variable. If Excel discovers a
column that contains words or alphanumeric entries, it treats the
column as a non-numerical (categorical) variable.
This imperfect method works most of the time, especially if
you make sure that the categories for your categorical variables are
words or phrases such as “yes” and “no.” However, because you
cannot explicitly define the variable type, Excel can mistakenly
offer or allow you to do nonsensical things such as using a statistical
method that is designed for numerical variables on categorical
variables. If you must use coded values such as 1, 2, or 3, enter
them preceded with an apostrophe, as Excel treats all values that
begin with an apostrophe as non-numerical data. (You can check
whether a cell entry includes a leading apostrophe by selecting a
cell and viewing the contents of the cell in the formula bar.)
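The risk described above is not specific to Excel. The following Python sketch is illustrative only: the region codes and the `summarize` helper are hypothetical and are not part of the Excel Guide. It shows why storing coded categories as text keeps a numerical method, such as the mean, from being applied to them.

```python
from collections import Counter
from statistics import mean

# Hypothetical coded categorical data: 1 = East, 2 = Central, 3 = West.
region_codes_as_numbers = [1, 3, 2, 3, 3, 1]

# Stored as numbers, nothing stops a meaningless calculation.
meaningless_mean = mean(region_codes_as_numbers)

# Stored as text (the analogue of Excel's leading apostrophe),
# the same values can only be counted, which is the appropriate summary.
region_codes_as_text = [str(code) for code in region_codes_as_numbers]


def summarize(values):
    """Return a mean for numbers and a frequency table for text values."""
    if all(isinstance(v, (int, float)) for v in values):
        return mean(values)
    return dict(Counter(values))


category_counts = summarize(region_codes_as_text)
print(meaningless_mean)
print(category_counts)
```

The "mean region" printed first has no interpretation, while the frequency table printed second is a sensible summary of a categorical variable.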
EG1.2 Collecting Data
Recoding Variables
Key Technique To recode a categorical variable, you first copy
the original variable’s column of data and then use the find-and-replace
function on the copied data. To recode a numerical variable,
enter a formula that returns a recoded value in a new column.
Example Using the DATA worksheet of the Recoded workbook,
create the recoded variable UpperLower from the categorical
variable Class and create the recoded variable Dean’s List
from the numerical variable GPA.
In-Depth Excel Use the RECODED worksheet of the
Recoded workbook as a model.
The worksheet already contains UpperLower, a recoded version
of Class that uses the operational definitions on page 15, and
Dean’s List, a recoded version of GPA, in which the value No recodes
all GPA values less than 3.3 and Yes recodes all values of 3.3
or greater. The RECODED_FORMULAS worksheet in
the same workbook shows how formulas in column I use the IF
function to recode GPA as the Dean’s List variable.
These recoded variables were created by first opening to the
DATA worksheet in the same workbook and then following these
steps:
1. Right-click column D (right-click over the shaded “D” at the
top of column D) and click Copy in the shortcut menu.
2. Right-click column H and click the first choice in the Paste
Options gallery.
3. Enter UpperLower in cell H1.
4. Select column H. With column H selected, click Home ➔
Find & Select ➔ Replace.
In the Replace tab of the Find and Replace dialog box:
5. Enter Senior as Find what, Upper as Replace with, and
then click Replace All.
6. Click OK to close the dialog box that reports the results of
the replacement command.
7. Still in the Find and Replace dialog box, enter Junior as
Find what (replacing Senior), and then click Replace All.
8. Click OK to close the dialog box that reports the results of
the replacement command.
9. Still in the Find and Replace dialog box, enter Sophomore
as Find what, Lower as Replace with, and then click
Replace All.
10. Click OK to close the dialog box that reports the results of
the replacement command.
11. Still in the Find and Replace dialog box, enter Freshman as
Find what and then click Replace All.
12. Click OK to close the dialog box that reports the results of
the replacement command.
(This creates the recoded variable UpperLower in column H.)
13. Enter Dean’s List in cell I1.
14. Enter the formula =IF(G2 < 3.3, "No", "Yes") in cell I2.
15. Copy this formula down the column to the last row that contains
student data (row 63).
(This creates the recoded variable Dean’s List in column I.)
The RECODED worksheet uses the IF function (See
Appendix F) to recode the numerical variable into two categories.
Numerical variables can also be recoded into multiple categories
by using the VLOOKUP function. Read the Short Takes for Chapter
1 to learn more about this advanced recoding technique.
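The two recodings above can also be expressed in a few lines of code. The Python sketch below is a minimal illustration, not the workbook's method: the four student records are hypothetical stand-ins for the data in the Recoded workbook, and the `recode` function and `UPPER_LOWER` mapping are names invented for this example. The mapping mirrors the four find-and-replace passes, and the conditional mirrors the IF formula entered in cell I2.

```python
# Hypothetical student records; the real data are in the Recoded workbook.
students = [
    {"Class": "Senior", "GPA": 3.6},
    {"Class": "Junior", "GPA": 3.1},
    {"Class": "Sophomore", "GPA": 3.3},
    {"Class": "Freshman", "GPA": 2.8},
]

# Mapping that mirrors the four find-and-replace passes on the Class column.
UPPER_LOWER = {
    "Senior": "Upper",
    "Junior": "Upper",
    "Sophomore": "Lower",
    "Freshman": "Lower",
}


def recode(student):
    """Return a copy of the record with the two recoded variables added."""
    recoded = dict(student)
    recoded["UpperLower"] = UPPER_LOWER[student["Class"]]
    # Same rule as =IF(G2 < 3.3, "No", "Yes").
    recoded["Dean's List"] = "No" if student["GPA"] < 3.3 else "Yes"
    return recoded


recoded_students = [recode(s) for s in students]
for row in recoded_students:
    print(row["Class"], row["UpperLower"], row["GPA"], row["Dean's List"])
```

Note that a GPA of exactly 3.3 is recoded as Yes, matching the operational definition used in the worksheet.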
EG1.3 Types of Sampling Methods
Simple Random Sample
Key Technique Use the RANDBETWEEN(smallest integer,
largest integer) function to generate a random integer that can
then be used to select an item from a frame.
Example 1 Create a simple random sample with replacement of
size 40 from a population of 800 items.
In-Depth Excel Enter a formula that uses this function and
then copy the formula down a column for as many rows as is necessary.
For example, to create a simple random sample with replacement
of size 40 from a population of 800 items, open to a
new worksheet.
Enter Sample in cell A1 and enter the formula
=RANDBETWEEN(1, 800) in cell A2. Then copy the formula
down the column to cell A41.
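For comparison, the same with-replacement sample can be sketched in Python. This is an illustrative equivalent rather than part of the Excel procedure; the seed value and the variable names are arbitrary choices made for this example. Like RANDBETWEEN(1, 800), `randint(1, 800)` includes both endpoints.

```python
import random

POPULATION_SIZE = 800  # items in the frame, numbered 1 to 800
SAMPLE_SIZE = 40

rng = random.Random(2016)  # arbitrary seed so that the run is repeatable

# Each draw is independent, so the same item number can appear more than once.
sample_with_replacement = [
    rng.randint(1, POPULATION_SIZE) for _ in range(SAMPLE_SIZE)
]
print(sample_with_replacement)
```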
Excel contains no functions to select a random sample without
replacement. Such samples are most easily created using an
add-in such as PHStat, as described in Example 2. (The Analysis ToolPak Sampling procedure described next samples with replacement.)
Analysis ToolPak Use Sampling to create a random sample
with replacement.
For the example, open to the worksheet that contains the population
of 800 items in column A and that contains a column heading
in cell A1. Select Data ➔ Data Analysis. In the Data Analysis
dialog box, select Sampling from the Analysis Tools list and then
click OK. In the procedure’s dialog box (shown below):
1. Enter A1:A801 as the Input Range and check Labels.
2. Click Random and enter 40 as the Number of Samples.
3. Click New Worksheet Ply and then click OK.
Example 2 Create a simple random sample without replacement
of size 40 from a population of 800 items.
PHStat Use Random Sample Generation.
For the example, select PHStat ➔ Sampling ➔ Random Sample
Generation. In the procedure’s dialog box (shown in next column):
1. Enter 40 as the Sample Size.
2. Click Generate list of random numbers and enter 800 as
the Population Size.
3. Enter a Title and click OK.
Unlike most other PHStat results worksheets, the worksheet created
contains no formulas.
In-Depth Excel Use the COMPUTE worksheet of the
Random workbook as a template.
The worksheet already contains 40 copies of the formula
=RANDBETWEEN(1, 800) in column B. Because the
RANDBETWEEN function samples with replacement as discussed
at the start of this section, you may need to add additional copies of
the formula in new column B rows until you have 40 unique values.
If your intended sample size is large, you may find it difficult
to spot duplicates. Read the Short Takes for Chapter 1 to learn
more about an advanced technique that uses formulas to detect duplicate
values.
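The idea of spotting duplicates and adding draws until 40 unique values remain can be sketched in Python. This is not the formula-based technique from the Short Takes; it is a hypothetical illustration, and the function names find_duplicates and top_up_to_unique are invented for this example.

```python
import random
from collections import Counter

def find_duplicates(values):
    """Return the values that occur more than once, in ascending order."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n > 1)

def top_up_to_unique(values, population_size, target, rng):
    """Drop repeated values, then draw new ones until `target` unique values exist."""
    if target > population_size:
        raise ValueError("target cannot exceed the population size")
    unique = list(dict.fromkeys(values))[:target]  # keeps the first occurrence of each value
    seen = set(unique)
    while len(unique) < target:
        draw = rng.randint(1, population_size)
        if draw not in seen:
            seen.add(draw)
            unique.append(draw)
    return unique

rng = random.Random(3)  # arbitrary seed, for repeatability only
draws = [rng.randint(1, 800) for _ in range(40)]  # sampled with replacement
print("Duplicated values:", find_duplicates(draws))
final_sample = top_up_to_unique(draws, 800, 40, rng)
print("Unique sample size:", len(final_sample))
```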
MG1.1 Defining Variables
Classifying Variables by Type
When Minitab adds a “-T” suffix to a column name, it is classifying
the column as a categorical, or text, variable. When Minitab
does not add a suffix, it is classifying the column as a numerical
variable. (A column name with the “-D” suffix is a date variable, a
special type of a numerical variable.)
Sometimes, Minitab will misclassify a variable, for example,
mistaking a numerical variable for a categorical (text) variable. In
such cases, select the column, then select Data ➔ Change Data
Type, and then select one of the choices, for example, Text to
Numeric when Minitab has mistaken a numerical variable for a
categorical variable.
MG1.2 Collecting Data
Recoding Variables
Use the Replace command to recode a categorical variable and
Calculator to recode a numerical variable.
For example, to create the recoded variable UpperLower from
the categorical variable Class (C4-T), open to the DATA worksheet
of the Recode project and:
1. Select the Class column (C4-T).
2. Select Editor ➔ Replace.
In the Replace in Data Window dialog box:
3. Enter Senior as Find what, Upper as Replace with, and
then click Replace All.
4. Click OK to close the dialog box that reports the results of
the replacement command.
Chapter 1 Minitab Guide
Business Statistics: A First Course, Seventh Edition, by David M. Levine, Kathryn A. Szabat, and David F. Stephan. Published by Pearson.
Copyright © 2016 by Pearson Education, Inc.
ISBN: 978-1-323-26258-
5. Still in the Find and Replace dialog box, enter Junior as
Find what (replacing Senior), and then click Replace All.
6. Click OK to close the dialog box that reports the results of
the replacement command.
7. Still in the Find and Replace dialog box, enter Sophomore as
Find what, Lower as Replace with, and then click Replace All.
8. Click OK to close the dialog box that reports the results of
the replacement command.
9. Still in the Find and Replace dialog box, enter Freshman as
Find what, and then click Replace All.
10. Click OK to close the dialog box that reports the results of
the replacement command.
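The four replacements above amount to a lookup table from Class to UpperLower. A minimal Python sketch of that mapping follows; the sample class values are hypothetical and do not come from the Recode project, and the function name is an assumption. Values that match none of the four categories are left unchanged.

```python
# Mapping taken from the replacement steps: Senior and Junior become Upper,
# Sophomore and Freshman become Lower.
RECODE = {
    "Senior": "Upper",
    "Junior": "Upper",
    "Sophomore": "Lower",
    "Freshman": "Lower",
}

def recode_class(values):
    """Recode each class value; anything not in RECODE passes through as is."""
    return [RECODE.get(v, v) for v in values]

# Hypothetical sample data, for illustration only.
class_column = ["Senior", "Freshman", "Junior", "Sophomore", "Senior"]
upper_lower = recode_class(class_column)
print(upper_lower)
```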
To create the recoded variable Dean’s List from the numerical
variable GPA (C7), with the DATA worksheet of the Recode project
still open:
1. Enter Dean’s List as the name of the empty column C8.
2. Select Calc ➔ Calculator.
In the Calculator dialog box (shown below):
3. Enter C8 in the Store result in variable box.
4. Enter IF(GPA < 3.3, "No", "Yes") in the Expression box.
5. Click OK.
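The Calculator expression IF(GPA < 3.3, "No", "Yes") can be mirrored in Python to confirm how the boundary is handled: a GPA of exactly 3.3 is not less than 3.3, so it is recoded as "Yes". This is a sketch only; the function name and the GPA values are hypothetical.

```python
def deans_list(gpa, cutoff=3.3):
    """Return "No" when gpa is below the cutoff, otherwise "Yes"."""
    return "No" if gpa < cutoff else "Yes"

# Hypothetical GPA values, for illustration only.
gpa_column = [2.9, 3.3, 3.75, 3.29]
deans_list_column = [deans_list(g) for g in gpa_column]
print(deans_list_column)
```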
Variables can also be recoded into multiple categories by using the
Data ➔ Code command. Read the Short Takes for Chapter 1 to
learn more about this advanced recoding technique.
MG1.3 Types of Sampling Methods
Simple Random Samples
Use Sample From Columns.
For example, to create a simple random sample with replacement
of size 40 from a population of 800 items, first create the list
of 800 employee numbers in column C1.
Select Calc ➔ Make Patterned Data ➔ Simple Set of Numbers.
In the Simple Set of Numbers dialog box (shown below):
1. Enter C1 in the Store patterned data in box.
2. Enter 1 in the From first value box.
3. Enter 800 in the To last value box.
4. Click OK.