Databases and SQL - Library edition

Data Hygiene

Learning Objectives

  • Explain what an atomic value is.
  • Distinguish between atomic and non-atomic values.
  • Explain why every value in a database should be atomic.
  • Explain what a primary key is and why every record should have one.
  • Identify primary keys in database tables.
  • Explain why database entries should not contain redundant information.
  • Identify redundant information in databases.

Now that we have seen how joins work, we can see why the relational model is so useful and how best to use it. The first rule is that every value should be atomic, i.e., not contain parts that we might want to work with separately. We store personal and family names in separate columns instead of putting the entire name in one column so that we don’t have to use substring operations to get the name’s components. More importantly, we store the two parts of the name separately because splitting on spaces is unreliable: just think of a name like “Eloise St. Cyr” or “Jan Mikkel Steubart”.

The second rule is that every record should have a unique primary key. This can be a serial number that has no intrinsic meaning (like the work_ID field in the Works table), one of the values in the record (the barcode field in the Items table could have been used instead of the item_ID), or even a combination of values.

The third rule is that there should be no redundant information. For example, we could have a single table combining the data from the Items and the Works tables, like this:

Item_ID Barcode Acquired Status Title ISBN Date Place Publisher Edition Pages
1 081722942611 2009 Loaned SQL in a nutshell 9780596518844 2009 Sebastopol O’Reilly 3rd ed. 578
2 492437609065 2011 On shelf SQL in a nutshell 9780596518844 2009 Sebastopol O’Reilly 3rd ed. 578
3 172480710952 2013 On shelf SQL for dummies 9781118607961 2013 Hoboken Wiley 8th ed.
4 708014968732 2013 Missing PHP & MySQL 9781449325572 2013 Sebastopol O’Reilly 2nd ed. 532
5 819783404942 2014 Loaned PHP & MySQL 9781449325572 2013 Sebastopol O’Reilly 2nd ed. 532

In fact, we could use a single table that recorded all the information about each item in each row, just as a spreadsheet would, including the author information in individual columns. The problem is that it’s very hard to keep data organized this way consistent: if we realize that the bibliographic information for a particular title is wrong, we have to change all the item records for this title. Storing the author/contributor information would pose another challenge still: our spreasheet would need to have as many columns to store this information so as to fit the records with the greatest number of contributors in our database. Those columns would mostly be empty for records that have a smaller number of contributors. What’s worse, if we added a title with an even greater number of contributors to the database, we would need to alter our table to add even more columns to it.

The fourth rule is that the units for every value should be stored explicitly. If alongside the number of pages we were also recording the size of each item in our library, we would need to specify whether that size is expressed in inches or centimeters, for example.

Stepping back, data and the tools used to store it have a symbiotic relationship: we use tables and joins because it’s efficient, provided our data is organized a certain way, but organize our data that way because we have tools to manipulate it efficiently. As anthropologists say, the tool shapes the hand that shapes the tool.

Identifying Atomic Values

Which of the following are atomic values? Which are not? Why?

  • New Zealand
  • 87 Turing Avenue
  • January 25, 1971
  • the XY coordinate (0.5, 3.3)

Identifying a Primary Key

What is the primary key in this table? I.e., what value or combination of values uniquely identifies a record?

Work_ID Title ISBN Date Place Publisher
16 Microsoft SQL server 2012 9780132977661 2013 Indianapolis Sams