Data Design Analysis

By: Gregory Keller

Abstract: Jason Tiret's paper on designing and standardizing database designs

rule

A bottom-up approach to designing and standardizing database design

By Jason Tiret

ER/Studio Product Manager, Embarcadero Technologies, Inc.

August, 2004

rule

Corporate Headquarters

EMEA Headquarters

Embarcadero Technologies

100 California Street, 12th Floor

San Francisco, CA 941111

www.embarcadero.com

Embarcadero Technologies, Ltd.

Thames House
17 Marlow Road
Maidenhead
Berkshire
SL6 7AA
United Kingdom

www.embarcadero.co.uk


Table of Contents

    Introduction

This paper takes a bottom-up approach to dissecting and resolving design problems. It then discusses how teams can work together in a secure collaborative environment where standards can be enforced in the maintenance and construction of database designs.

    Problem Overview

Organizations face many challenges when building and maintaining database systems. Enforcing standards in database construction is very important as a database structure evolves, development times decrease and teams grow. Without specific guidelines and standards in place, potential problems can arise in the database that can lead to data inconsistencies, data integration issues, and even data loss. From development to implementation to maintenance, various groups are responsible for care and evolution of the database. Developers down the hall from each other may be throwing new tables into existing databases as change requests are submitted, but who is to say that every one is on the same page? What happens if one developer is using VARCHAR(30) for “name” columns while another developer is using VARCHAR(50)? Inconsistencies in the design can arise if those developers don’t have a glossary of data elements and business rules defined with specific standards in mind. This can in turn, have costly trickle down effects such as complex data integration and reporting inaccuracies.

    Solution Overview

Modeling tools can help tremendously in our pursuit design standardization and data integrity analysis. For this problem we will use Embarcadero’s data modeling solution, ER/Studio, because of its sophisticated data dictionary system and advanced collaborative server, ER/Studio Repository. In order to perform our analysis we will need to visualize the database. We will start with reverse engineering an Oracle schema. ER/Studio will connect to the database and build a logical and physical model based on the metadata in the system catalog. This will then allow us to dissect the schema to see potential problems in the design that could lead to data integration problems. The problems can then be corrected directly in the modeling environment and the lessons learned can be applied to new designs that are created in the future.

    Building the Roadmap for Analysis

The roadmap we will use will be the ER/Studio diagram. ER/Studio will build for us a data dictionary with a first-cut list of domains and tie them to a logical and physical model. Once we have a data model constructed, we will have the blueprint we need to assist in our standardization of the database.

    Reverse engineering an existing database

  1. Open ER/Studio.
  2. Go to File > New > Reverse Engineer
  3. In the first page, choose “native connection,” enter the Oracle service name and login information.
  4. In the second page, choose the schema and objects desired from the database. In this case, tables only will be selected, but note that many DBMS-specific objects can be included.
  5. In the third page, this where the specific objects from the selected schema can be chosen. For example, if there are test tables or store objects that are not needed, they can be removed.
  6. In the fourth page, keep the defaults and make sure to check the “Infer Domains” option.

Hide image
Click to see full-sized image

IMPORTANT NOTE: This option will build a domain for every column that is unique by name AND data type. If ER/Studio comes across two columns that have the same name and data type, then they will map to the same domain, otherwise multiple domains are created and each respective column will have its own domain. This is how we will do our analysis of the database design.

  1. In the fifth page, hit Finish.

    What has been done?

  • We have just built a “roadmap” or “blueprint” of the Oracle database
  • We have built a physical model AND a logical model that are managed as separate models with independent metadata underwriting their respective objects, i.e., we can make changes to one without affecting the other. Changes can be selectively merged back and forth using the compare and merge system.
  • New physical models can be created from the logical model if we need to do some prototyping on other DBMS platforms, need to build a test or development environment, etc.
  • Automatically built a list of domains that reference all of the columns in each of the logical and physical models. This can be used for analysis and is a starting point in our analysis and standardization of the database design.

    Analyzing the consistency of the database design

Now that the model has been created and the domains have been inferred, the analysis can begin. We will start by looking at the various data elements in the database tables. This is done by going to the domains. The domains will be the “short list” of data elements (i.e., columns) from the database. The infer operation will show us where the properties of the data elements differ in any manner. Duplicate domains are a red flag, since ER/Studio will automatically create new domains if it finds the same column name with different properties, like data type, rule, constraint etc. Let’s take a look at the list.

Hide image

  1. Navigate to the Data Dictionary Tab. It is the second tab below the data model explorer:
  2. Go to the Domains node.
  3. Expand the node if it is not already expanded.

Below is part of the list that was created from the infer operation.

Hide image
Click to see full-sized image

Notice we have two DESCRIPTION domains. By taking a closer look in the respective domains editors, you will see that one is VARCHAR(255) and the other is VARCHAR(500). This is a potential example of a divergence from standards. The difference here may be on-purpose or not. If it is on-purpose, then we should look at possibly improving naming conventions to DESCRIPTION for short strings and NOTES for longer strings. If it was not on-purpose, then we have just isolated a problem in the design that could lead to data loss if data from each of these columns ever needed to be integrated.

To perform further analysis, you can right click on the domain and going to “View Domain Bindings.” This will show two things:

  1. Where the domain lives – more specifically, what attributes and columns it is bound to. You can now visualize the relationship between common data elements and the associated attributes and columns.
  2. Impact analysis can be done before making a change to groups of attributes or columns. Any change made to the domain will propagate down to the attribute/column it is bound to. Global updates can be made in the domain editor rather than individually changing common columns when business decisions are driving changes.

If we do this for DESCRIPTION we get:

Hide image

This tells us exactly where this domain is used in the logical and physical models. You can also see this information in the “Binding Information” tab of the domain editor. That tab can also be used to assign the domain to other columns or attributes in either the physical or logical models.

    Correcting potential problems

Assuming that the conflicting data types are in fact a problem, the problem now needs to be corrected. Since domains are used, this is very easy to do. Because DESCRIPTION_2 is the domain that has the larger width – 500 – this is what we will use as our standard for description data. What needs to be done is all of the columns and attributes that reference DESCRIPTION need to be directed to DESCRIPTION_2. We can also look through the domain list to see any other domains that could be candidates for description data.

We have already seen how we can determine where the domain is used by using the context menu or the “Binding Information” tab. The easiest way to do this, since both domains are currently in use, is to leverage ER/Studio’s Automation Interface and employ one of the macros that ships with the product, Switch Domain Bindings.


  1. Hide image
    Navigate to the Macro tab below the data model explorer tree.
  2. Right click on the macro and select Run.
  3. This will launch the below UI. In the source list select DESCRIPTION. In the target list, select DESCRIPTION_2.

Hide image
Click to see full-sized image

  1. Hit Switch to move the domain binding from the source domain to the target. While we are at it, we can look through the source list to see if there are other domains we want to standardize on this DESCRIPTION_2 domain. RELEASE_COMMENTS looks like a good one. You can repeat Steps 3 and 4 for that domain.
  2. Hide image
  3. We can verify that the domain bindings have been switched by once again using the “View Domain Bindings” option for DESCRIPTION_2 domain. It should look like the example to the right.
  1. We can now safely delete the DESCRIPTION and RELEASE_COMMENTS domains so that no one else uses them going forward.

We have now re-factored the domain list so that we have only one standard for description data listed. Now for any developer, DBA, or data architect who uses the domain on a new column, the standard properties will be used and the organizational standards will be followed.

    Moving to a Collaborative Environment

What we have done thus far is analyze and enforce standards within the scope of a single Oracle schema. We have done nothing to make this available to our peers who may need to run their own analysis, make changes to the logical or physical design, or possibly generate reports on the metadata. We also haven’t incorporated any of our organizational business rules, standard data elements, or extended metadata that may be defined on a more global scope across applications. In order to broaden the scope of our development and standards enforcement across a team, we will leverage ER/Studio’s collaborative server, ER/Studio Repository. This will allow real-time, team-based access to diagrams managed within the Repository collaboration server in a safe and secure manner. The Repository installation can be deployed on Oracle 8.x/9.x, DB2 UDB 7.x/8.x, SQL Server 2000, Sybase 12.x.

    Setting up the Collaborative Environment

In order to work collaboratively on the model that was reverse-engineered, it will need to be added to the ER/Studio Repository. This will save the diagram information into the Repository database so that any other ER/Studio user can access this model provided they have the proper permissions to do so.

The diagram can be initialized in the following dialog:

Hide image
Click to see full-sized image

    Component sharing, reuse, and impact analysis

During the process of putting the diagram in the Repository, we can add any enterprise data dictionaries that exist in the Repository to it. The Enterprise Data Dictionary system will allow us to globally define domains and other dictionary objects, such as rules, attachments, defaults and UDT, so that we can enforce standards across any diagram in the ER/Studio Repository. Notice in the screen shot on the previous page that we added the EMBT dictionary to our diagram before it got added to the repository. This will attach all of the standard components from that dictionary to the diagram so that we can leverage everything contained in that dictionary.

Hide image
Click to see full-sized image

We can slowly start to promote attributes and columns that don’t have domains or that are still leveraging the local data dictionary to reference domains in the EMBT dictionary allowing us to globally manage them more easily. This will also allow us to perform the same impact analysis, but this time across the entire Repository because that dictionary is used on other diagrams that are housed in the Repository. For example, we may have an Identifier domain in our EMBT dictionary that we use to manage all of our ID attributes and columns. By binding ID columns from various models (NOTE: This can be done through the Binding Information tab of the enterprise domain or by using the macro to promote attributes/columns to enterprise domains), we will be able to manage all of our ID attributes/columns through the use of one domain. We will also have traceability across our collaborative environment so that we can see where Identifier is used.

Hide image
Click to see full-sized image

    Enforcing the standards

It is now time to decide what we want to manage through the enterprise dictionary and what may be okay if left referenced to the local dictionary. We still have the list of domains that were inferred during the reverse engineering operation. Some of those can be used for application-specific purposes on this particular diagram that manages the Oracle schema. Others we may want to start promoting to reference domains in EMBT dictionary. The description attributes/columns may be good candidates for that. We consolidated them to one domain in the local dictionary, but looking in the EMBT dictionary, there is a “Long String” domain that can be used for similar types of data elements. Looking at the properties of the domain, it looks like it is the right fit to manage the attributes/columns that were assigned to the DESCRIPTION_2 domains.

Hide image

It makes sense to manage all of these by one global domain. That way if organization-wide initiatives are driving changes to database structures, they can be swiftly implemented, and standards are enforced on all applications throughout the development process. To promote the description data elements that reference the local DESCRIPTION_2 domain, we can employ the same Switch Domain Bindings macro. This time we need to be careful about selecting the proper source and target dictionaries and the domains.

Hide image
Click to see full-sized image

The DESCRIPTION_2 domain can then be deleted from the local dictionary so that no one uses it by mistake.

    Conclusion

With the help of ER/Studio and its collaboration server, ER/Studio Repository, we were able to isolate potential integration problems within an Oracle database. What we saw by leveraging domains and the infer functionality within the product was how conflicting data types between common data elements can surface in a live database. How they got there may not be known, but this can lead to a variety of data integration issues. The problem was efficiently corrected by consolidating on one standard while new standards were brought into the design process with the repository so that the problem wouldn’t surface again. Data integrity problems can arise in many ways, but can be prevented by incorporating and enforcing standards when working in a team environment.

Embarcadero, the Embarcadero Technologies logos and all other Embarcadero Technologies product or service names are trademarks of Embarcadero Technologies, Inc.   All other trademarks are property of their respective owners.

Server Response from: ETNASC04