3 Data Architecture

Data Architecture3—What do we really mean by data?

Data are pieces of information about individuals or observations organized into variables. By an individual or observation, we mean a particular person or object. By a variable, we mean a particular characteristic of the individual or observation.

A dataset is a collection of information, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given individual (or observation) within the dataset.

Relying on datasets, statistics pulls all of the behavioral, physical, and social sciences together. It’s arguably the one language that we all have in common. While you may think that data is very, very different from discipline to discipline, it is not. What you measure is different, and your research question is obviously dramatically different; whom you observe and whom you collect data from—or what you collect data from—can be very different, but once you have the data, approaches to analyzing it statistically are quite similar regardless of individual discipline.

Example: Medical Recordings The following dataset shows medical records from a particular survey:

In this example, the individuals are patients, and the variables are Gender, Age, Weight, Height, Smoking, and Race. Each row, then, gives us all the information about a particular individual or observation (in this case, patient), and each column gives us the information about a particular characteristic of all the patients.

Variables can be classified into one of two types: quantitative or categorical.

  • Quantitative Variables take numerical values and represent some kind of measurement.
  • Categorical Variables take category or label values and place an individual into one of several groups.

These types can be further refined; categorical can be nominal or ordinal and quantitative can be interval or ratio.

  • Nominal responses are categories with names without order, such as hair color or religion.
  • Ordinal responses are categories with a sensible order, but no fixed distances between the levels. A list such as “small”, “medium”, and “large” is ordinal. “Likert” scales with responses such as “disagree”, “neutral”, and “agree” have this structure.
  • Interval responses are numeric with meaningful spacing so that the difference between two numbers is meaningful, but the value 0 is not meaningful. In other words, addition and subtraction make sense, but multiplication and division do not. Temperature is an example, where there is a 10 degree difference between 60 and 70 or between 10 and 20, but 20 is not twice as hot as 10.
  • Ratio responses are numeric with a meaningful 0 point where multiplication makes sense. Examples include height and weight.

In our example of medical records, there are several variables of each type:

  • Age, Weight, and Height are quantitative ratio variables.
  • Race and Gender are categorical nominal variables.
  • Smoking is a categorical ordinal variable.

Notice that the values of the categorical variable, Smoking, have been coded as the numbers 0 or 1. It is quite common to code the values of a categorical variable as numbers, but you should remember that these are just codes (often called indicator codes). They have no arithmetic meaning (i.e., it does not make sense to add, subtract, multiply, divide, or compare the magnitude of such values).

A unique identifier is a variable that is meant to distinctively define each of the individuals or observations in your data set. Examples might include serial numbers (for data on a particular product), social security numbers (for data on individual persons), or random numbers (generated for any type of observations). Every data set should have a variable that uniquely identifies the observations. In this example, the patient number (1 through 75) is a unique identifier.

Medical Records Assignment

Although you will be working with previously collected data, it is important to understand what data looks like as well as how it is coded and entered into a spreadsheet or dataset for analysis. Pretend you are designing an intake form for a hospital emergency room. Create abbreviated medical records for 5 patients seeking treatment.

  1. Select 4 variables recorded on the medical forms (one should be a unique identifier, at least one should be a quantitative variable and at least one should be a categorical variable)
  2. Select a brief name (ideally 8 characters or less) for each variable
  3. Determine what range of values is needed for recording each variable (create indicator codes as needed)
  4. Label variables within a spreadsheet
  5. Enter data for each patient in the spreadsheet
  6. List the variable names, labels, types, and, response codes below the data set (i.e., the code book).
  7. Compile your Rmd file, have a peer review it for you, then submit it.


Personal Code Book (AKA Research Question) Assignment

One of the simplest research questions that can be asked is whether two constructs are associated. For example: a) Is medical treatment seeking associated with socio-economic status?

  1. Is water fluorination associated with number of cavities during dentist visits?

  2. Is humidity associated with caterpillar reproduction?


After looking through the codebook for the NESARC study, I have decided that I am particularly interested in nicotine dependence. I am not sure which variables I will use regarding nicotine dependence (e.g., symptoms or diagnosis) so for now I will include all of the relevant variables in my personal codebook.

At this point, you should continue to explore the code book for the data set you have selected.

After choosing a data set, you should:

  1. Identify a specific topic of interest

  2. Prepare a codebook of your own (i.e., print individual pages (or copy screen and paste into a new document) from the larger codebook that includes the questions/items/variables that measure your selected topics.)


While nicotine dependence is a good starting point, I need to determine what it is about nicotine dependence that I am interested in. It strikes me that friends and acquaintances that I have known through the years that became hooked on cigarettes did so across very different periods of time. Some seemed to be dependent soon after their first few experiences with smoking and others after many years of generally irregular smoking behavior. I decide that I am most interested in exploring the association between level of smoking and nicotine dependence. I add to my codebook variables reflecting smoking levels (e.g., smoking quantity and frequency).

During a second review of the codebook for the dataset that you have selected, you should:

  1. Identify a second topic that you would like to explore in terms of its association with your original topic

  2. Add questions/items/variables documenting this second topic to your personal codebook.

Following completion of the steps described above, show your instructor and peer mentor (in class) a hard copy of your personal codebook. Keep this in a folder or binder for your own use throughout the course.