home Relational Database tutorial - Data Modeling

Free
Relational Database
Tutorial

Home Lesson 1, 2, 3, 4


Home

E-mail

Free Relational Database Tutorial and Tips For Beginners!

Intro To Data Modeling


Lesson 4

How many files do I need to create?

Actually, each situation requires a different number of files. Sometimes one file, sometimes 300 or more. Here's an oversimplified way to design your files.

First, gather all information regarding your project. Let's say, you're creating a computer system to run a college. Get the college info (name of college, building locations, class room locations, etc., address, phone numbers, etc.), all info kept regarding students (not the actual data for every student, but the types of info in general, for example, take note that the college maintains this info: student name, student address, student date of birth, etc.), professor information, student course info (subject taken, semester taken, professor name, etc.).

Collect a sample of each type of info, for example, student report card, professor schedule, any and all documents, existing computer files, etc.

Okay, now you have giant pile of every piece of info regarding the college. Next step, is figuring out what the entities are. An oversimplified way of describing what entities are is to say that they can be a person, place or thing. Here's what some of the entities would be for our college example. The college, the students, the professors, the buildings, student grades, courses, course schedule, etc. Separate out all of the entities.

Next step, find all of the properties of the entities and list them under the entities. For example, student address is a property of student entity. Here's more examples:

STUDENT
student name
student address
student social security number
student date of birth
etc.

COURSE
course name
course description
course unique id
department course is taught in
etc.

PROFESSOR
professor name
professor social security number
professor home address
professor office location
professor phone number
etc.

So we started out with this giant pile of information, and now we're picking up each piece from the big pile and dropping them in little entity piles where they belong, until the big pile is gone. How about the course schedule? What properties would that have? Well, we can see that the course schedule entity, is an intersection file like in the library example.

Here's what the course schedule would look like:

COURSE SCHEDULE
course unique id (matches id from the course file)
semester (from the semester file)
classroom location (matches id from the classroom file)
class time (E.g., Mondays at 10 am)
professor social security number (matches SSN from the professor file)

Notice, like in the library example, that you don't have to redundantly type in lots of information in the course schedule file to uniquely identify a specific class. For example, you don't have to type in the course description over and over again, because the software pulls that from the course file! The course unique id is enough to let you know exactly which course it is and the software connects the two files when you need to see all of the info! Also, you don't have to type in the professor's name over and over again, as that is pulled from the professor file. Just the professors' social security numbers are all that are required in the course schedule file.

You may need many intersection files. For example, the grades file would be another intersection file.

GRADES
student social security number
course unique id
semester
classroom location
class time
grade (e.g., A, B, C, D, F)

For the grades file, notice that you don't have to type in the student's name or address over and over again. Same with the the course; just unique course id is enough so you don't have to type in the course description over and over again thousands of times for each student taking it! You only need to type in the course description once, in the course file. The software will pull the course description in and merge it with the grades and display it on a computer screen or printout when someone wants to see it.

Looking at data modeling like this, its possible to develop an intuitive feel for setting up your files with the goal of removing unnecessary redundancy. Anybody can get this! Just remember that one of the properties for each file must uniquely identify each instance of the entity. For example, the professor file absolutely must contain some property that uniquely identifies each professor, e.g., social security number. Same with the building file, it absolutely must contain a property that uniquely identifies it from all other buildings in the college, e.g., building name. This is called putting the data in third normal form, which you can study more formally in the books on the side bars.

  • Each instance of the entity is called a "record". For example, in the professor file, John Doe and his properties make up a record, Susan Smith and her properties constitute another record.
  • Each property is called a "field". E.g., name field, address field, date of birth field, social security number field, etc.
  • The property that uniquely identifies each instance is called the "key". For example, the key for the professor records is professor social security number.
  • Designing data files is called "data modeling".
  • Different software programs sometimes use a different naming system.

The software connects the various files by matching keys or other fields. E.g., it will match professor social security number from the course schedule file with the social security number for the professor in the professor file. For each use that you want to make of the data, you find the files that contain your information and merge it onto your computer screen or paper printout. This merge is called a "join". The join doesn't create permanent new files, it just temporarily matches the information from the various files that you're interested in and displays them on your computer screen or paper printout.

Here's an example: the dean wants a printed report of every classroom in the college with its street address. You don't need the student or professor files for this. Here's two files that you would need:

BUILDING
building name
street address

CLASSROOM
building name
classroom number
classroom size

There is a right way and a wrong way to design files. Notice that you do not want to have a "street address" field in the classroom file because that can already be derived from the building file. If you put the street address field in the classroom file, you would be unnecessarily repeating the street address over and over again. You know which building the classroom is in, and the building file has the street address in it. Are you feeling your intuition working yet? You will. Once you start seeing actual examples of your experiments, let's say the street address is repeated hundreds of times in the class room file, then you kind of start seeing that the street address really belongs in the building file. When designing your files, put some actual data in them and look for high repetitions of the same information. See if your intuition is telling you that maybe a field in one file doesn't belong there but really belongs in another file.

Continuing on with the task that the dean has assigned you, you would "join" both files, matching them by the building name, which is a property of both files. Building name is the key to the records of the building file. The result would look like this:

BUILDING NAME BUILDING STREET ADDRESS CLASSROOM NUMBER
Art Building 100 Van Ness Ave., S.F. Room 300
Science Building 234 Van Ness Ave., S.F. Room 120

But the above would not be another file. Its just derived from the other files and displayed on the computer screen or paper printout at the time the viewer wants to see it.

Don't create unnecessary files (for example, a file that looks like the one above) if the information can be derived from the already existing files.

Sometimes the key of a record is made up of more than one field (property). For example, in the course schedule file above, you can't uniquely identify the schedule for a specific course without including course unique id, semester, classroom location, class time. Think about it, the exact same class could be given 3 or 4 times in the same classroom, so you absolutely need to include the time that the class meets to distinguish it from the same course being taught in the same room but at another time.

When designing your data files, notice that they do not look like the paper documents in the college! For example, the course schedule has a lot of info trimmed off compared to the paper printout version because it can derive what it doesn't have from the related course, building and professor files. This confuses a lot of beginners but, you get used to designing the data files like this. The intersection files have just the bare minimum required to uniquely identify the entities that they will be pulling the rest of the info from.

When designing files, get rid of unnecessary redundancy! Some redundancy is necessary though, for example in the course schedule file, the semester that the courses are taught in will have to be repeated over and over again, to make sure that students and professors know which semester each class will be held in. We're talking about getting rid of unnecessary redundancy, not necessary redundancy.

A single file is called a "flat" file, but the collection of related files is called a "database".

When creating reports from your college database, you may want to know how to retrieve only certain records that you are searching for. To solve this problem, software languages have been invented to manipulate and retrieve data selectively. SQL (Structured Query Language) is one such language. In essence, using a software language, you can frame questions to your system of files that you have designed, like this: "Please match all of the course, and course schedule records, and all of the professor records using the "professor social security number" and "unique course id" to match them with, then list only those courses in the sociology department this semester."

By finding a unique key for each file, then adding its attributes (fields) and removing redundancy you've designed the files so that they can be grabbed or "joined" by the software in the most flexible and most mathematically efficient manner possible! The fields that match in two or more files are like bridges that connect the files.

Now all kinds of reports can be generated on the fly by just joining the files and printing them out. Even unusual requests for reports can be easily generated on the fly, for example, lets say the dean wants a printed out report of every student that has had a specific professor. The way the files are designed, this is a snap. You would use your computer language, like SQL, to select only the professor that you are looking for. Then take that record from the professor file and join that with the course schedule file using the professor social security number as the matching field to build a bridge between the two files. Then added to that, you would join the student file using the student social security number as the matching field to build a bridge to the already joined files.

In general, you first model your data, design your files so that each entity has a unique key and then add the rest of the fields that are attributes of that entity. Then the rest is easy, just create desired reports from your data by joining your files. This means matching the keys from relevant files with the matching fields in the related files. You'll be amazed how very complicated problems can be easily solved using relational databases.

In real life, we don't want to use social security number as a person's unique ID any more because of the growing problem of identity theft. Instead, you, or often the computer program, can make up a unique ID number for each person in your database.

Got suggestions on how to clarify or improve this? Send me an e-mail!