Tasked with creating an enterprise data model? Need to build a relational database that can house terabytes worth of data?
STEP BY STEP
Don’t worry about tables just yet. It’s obvious that you’re building a database, and databases (relational ones anyway) are primarily made up of tables, which are made up of rows and columns (tuples and attributes if you’re really into it).
Worry about entity relationships. Your first goal is to map out the relationships that different business objects have. This is the “logical modeling” portion. The “physical model” is the actual implementation. Confuse / combine the two at your peril.
The requirements are hard to get and painful. A talented business analyst at this time would be Heaven sent.
Be prepared to wage a solo war, with only you dedicated to quality normalization. Most databases are pieces of garbage because the people who design them are lazy and “just want to get something out there. We can always fix it later.” Yeah, right.
Once it’s time to write tables, concentrate on lookups and type tables (zip codes, statuses, product categories, etc). You’ll need them for foreign key relationships on your “real” tables. Plus, it gives you a little warm up before you get into the core transactional tables.
As a rule of thumb: don’t store data that can be inferred from other fields. If you know the date of birth and some start date, then you also know the age at the start date, so don’t include this age in the table.
No nulls. A null value represents an undefined attribute of an entity. If entities can have or not have a particular attribute, then it needs to be handled via an intersect table.
Contradiction NULL values are in themselves useful to identify attributes that have not yet been populated by users. This is especially useful when a user needs to select a default value in order to determine proper business rules to apply. Case 2 how would you design an address table where Address1 was populated and Address2 was not required, but if Address2 was populated it must conform to the business rules of the field. Sure you could default an empty space, is this better than knowing the user did not edit the field? Try 3rd normal form on an international address… Can it be done probably but look at the complexity of restructuring the data in a meaningful way.
NULL/NOT NULL Check any database forums and this is a hot topic advocates on both sides advantages / disadvantages for each.
However all agree that you should never allow nulls in key variables. These are fields that are being used to identify a record uniquely, e.g. a customer identification number.
The null school say that you should use the nulls freely in all other fields. For instance customers are not obliged to have a cell phone, nor to tell you their number. Using a null and nothing but a null is the most efficient to record that cell phone is not available.
If it is really important to know why it not there it is better to introduce a new variable that states the reason, as opposed to introducing fancy codes to be stored in the placeholder for cell phone numbers. Be reluctant to add fields like these, because a) the customer is also not obliged to tell the reason why he is not giving his cell phone number, nor does this question make a nice conversation, nor is he likely to tell his reasons spontaneously, and b) nobody will ever look at them because of a). Why missing variables generally just waste time.
Be aware that yes/no variables (booleans) often can’t hold a null. Therefore they often contain useless information, such as “either he was republican, or he refused to answer”.
Get comfortable with intersect (many to many) tables. You’ll use them everywhere if you built things right. One example would be a high school database where one table is a list of teachers and another for students. Students have more than one teacher, and teachers have more than one student, so the intersect table, separate from ‘teacher’ and ‘student’, would have two columns : foreign keys pointing at both these two. The primary key would then be the combination of the two.
Use a good naming convention. For the invoices, put them in a table called “invoice”. Products go in “product”. The intersect would be “invoiceProduct”, or “productInvoice”, depending upon which table is really the center of the relationship.
If you’re going to have replication or log shipping, try to have that set up as you develop so you can see how it works.
Inner joins are great, but there’s probably a lot of LEFT OUTER JOIN statements that you’ll be doing as well. Get used to the different join statements (except UNION).
If you have to deal with a legacy application, build your schema independent of its (don’t even look at it). Focus on the business rules and relationships that it is trying to enforce, but you can get distracted if you look at the way that someone set it up. Refer to step #4.
Migrating from your legacy systems into a tighter model with proper normalization is difficult, but can be made a little more manageable by using temporary tables for your imports. Also, keep tabs on the legacy IDs for people to search by.
- Backups are important in development too. Make sure they are at least nightly. Verify them every week to make sure you don’t lose months worth of work (and possibly your job) if there’s a massive hardware casualty.
- In regards to entity relationships, as an example, a client can have many phone numbers. A client can also have many contacts, and each of those contacts can have many phone numbers. An invoice can only be associated with one client, however. There is one account rep that can be assigned to a client, except in certain cases where there are two, etc. These are the types of things you need to map out well before you write a single line of sql.
- Since you’re defining absolute relationships, a good way to get questions from reluctant participants is to ask things like “So is it absolutely true that there can only be one client on an invoice?” Questions like that tend to invoke a response in people.
- Leave the “s” off your table names (“invoice” table); it’s understood that since it’s a database there’s a good chance that there will be more than one invoice in there.
- Don’t expect replication, log shipping, or mirroring to work when you go into production. Develop and test with it from the start. Make it a part of your application.
- The flexibility and power from a relational model is mind boggling compared to a flatter structure.
- There are some extremely specific cases where you need to denormalize tables for performance reasons. But that’s easy to do; focus on the hard part which is proper normalization.
- Data modeling is a critical skill, and very few people building relational databases are any good at it.
- Don’t cheap out and do bad work just because it’s easier. If they wanted garbage, they would have hired someone else.
- If you do cheap out (nulls, denormalized tables for bad reasons), it will have direct and real consequences (orphans, crummy data integrity, joins wont work, etc).
- If you do not normalize correctly then the reports made from the database data will sometimes be plain wrong, and your boss will be very upset.