Within one Web application I am working on, all database operations are abstracted through generic repositories defined over the Entity Framework ORM.

However, in order to keep the design of the generic repositories simple, all involved tables must define a unique integer column (Int32 in C#, int in SQL). Until now, this has always been both the PK of the table and the IDENTITY.

Foreign keys are heavily used and they reference these integer columns. They are required both for consistency and for generating navigation properties in the ORM.
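For illustration, here is a minimal sketch of the convention (table and column names are made up):

    CREATE TABLE dbo.Category
    (
        Id   INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_Category PRIMARY KEY,
        Name NVARCHAR(100) NOT NULL
    );

    CREATE TABLE dbo.Product
    (
        Id         INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_Product PRIMARY KEY,
        -- every foreign key references the int IDENTITY PK of the parent table
        CategoryId INT NOT NULL CONSTRAINT FK_Product_Category REFERENCES dbo.Category (Id),
        Name       NVARCHAR(200) NOT NULL
    );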

The application layer typically does the following operations:

  • Initial data load from table (*) - SELECT * FROM table
  • Update - UPDATE table SET Col1 = Val1 WHERE Id = IdVal
  • Delete - DELETE FROM table WHERE Id = IdVal
  • Insert - INSERT INTO table (cols) VALUES (...)

Less frequent operations:

  • Bulk insert - BULK INSERT ... into table, followed by a full data load (*) to retrieve the generated identifiers (sketched below)
  • Bulk delete - this is a normal delete operation, but "bulky" from ORM's perspective: DELETE FROM table where OtherThanIdCol = SomeValue
  • Bulk update - this is a normal update operation, but "bulky" from ORM's perspective: UPDATE table SET SomeCol = SomeVal WHERE OtherThanIdCol = OtherValue

* All small tables are cached at the application level, so almost all SELECTs never reach the database. A typical pattern is an initial load followed by lots of INSERTs, UPDATEs and DELETEs.
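For illustration, the bulk insert pattern (*) looks roughly like this (file path and table names are made up):

    -- bulk load; by default the server ignores any identity values in the file
    -- and generates new ones
    BULK INSERT dbo.Product
    FROM 'C:\load\products.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

    -- full reload so the application cache picks up the generated identifiers
    SELECT * FROM dbo.Product;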

Based on current application usage, there is a very small chance of ever reaching 100M records in any of the tables - comfortably within the int limit of 2^31 - 1 (about 2.1 billion).

Question: From a DBA's perspective, are there significant problems I can run into by having this table design limitation?

This link will help you. – Kin 18 hours ago

@Kin - thanks for the interesting article. However, shifting to Guids is not an option now (it would involve a complete refactoring of the tables). Also, dealing with integers is much easier (human-readable) than dealing with Guids. – Alexei 17 hours ago
You might find these posts of relevance, as some logical, physical and practical aspects are discussed with regards to the use of columns with system-generated surrogate values. – MDCCL 17 hours ago

Other than additional disk space (and in turn memory usage and I/O), there's not really any harm in adding an IDENTITY column even to tables that don't need one (an example of a table that doesn't need an IDENTITY column is a simple junction table, like mapping a user to his/her permissions).

I rail against blindly adding them to every single table in a blog post from 2010.

But surrogate keys do have valid use cases - just be careful not to assume that they guarantee uniqueness (which is sometimes why they get added - they should not be the only way to uniquely identify a row). If you need to use an ORM framework, and your ORM framework requires single-column integer keys even in cases when your real key is either not an integer, or not a single column, or neither, make sure that you define unique constraints/indexes for your real keys, too.
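A minimal sketch of that arrangement (names are hypothetical): the IDENTITY column keeps the ORM happy, while a unique constraint still protects the real key:

    CREATE TABLE dbo.Currency
    (
        Id           INT IDENTITY(1,1) NOT NULL
                     CONSTRAINT PK_Currency PRIMARY KEY,   -- surrogate for the ORM
        CurrencyCode CHAR(3) NOT NULL
                     CONSTRAINT UQ_Currency_Code UNIQUE    -- the real key stays unique
    );

Without the unique constraint, nothing stops two rows from carrying the same "real" identifier.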

Thanks for the quick reply. Yes, the application uses an ORM (EF). It does not require single integer column keys, but I have introduced this restriction to make some generic operations much easier (design-wise). Also, all application caches store everything in maps (dictionaries) for fast retrieval by key, and the key must be unique. Since I have chosen ints over guids, I am forced to use IDENTITY for any table that I insert into. For fixed-values tables, IDENTITY is not required. – Alexei 18 hours ago

In my experience with various databases, an integer primary key is always better than an application that has no keys defined at all, or that has keys joining half a dozen varchar columns in awkward ways that aren't logical... (sigh)

I've seen applications that switched from integer PKs to GUIDs because they needed to merge data from multiple source databases in certain cases. The developers switched all of the keys to GUIDs so that the merges could happen without fear of collisions, even on tables that weren't part of the merge (just in case those tables ever became part of a future merge).

I'd say an integer PK is not going to bite you unless you plan to merge data from separate sources, or your data might grow beyond your integer size limits - it is all fun and games until you run out of space for inserts.

I will say, though, that it can make sense to set your clustered index on a column other than your PK, if the table will be queried more frequently that way. But that's an outlier case, especially if the bulk of updates and selects are based on the PK values.
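A rough sketch of that setup (hypothetical names) - the PK stays on the surrogate, but the clustered index goes on the column most queries range over:

    CREATE TABLE dbo.OrderLine
    (
        Id        INT IDENTITY(1,1) NOT NULL
                  CONSTRAINT PK_OrderLine PRIMARY KEY NONCLUSTERED,
        OrderDate DATETIME2 NOT NULL,
        Amount    DECIMAL(18,2) NOT NULL
    );

    -- cluster on the column the heavy queries actually filter/range on
    CREATE CLUSTERED INDEX CX_OrderLine_OrderDate ON dbo.OrderLine (OrderDate);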


Putting aside:

  • The religious wars (google surrogate vs natural key)
  • The separate issue of what clustered indexes to define on your tables
  • The viability of caching all your data

Provided you're using bulk delete/update where appropriate, and have indexes to support such operations, I don't think you'll run into trouble due to the PK standard you use.
It's possible that, if you later have EF generate queries with joins etc., they won't be as efficient as they would be with a natural-key-based repository, but I don't know enough about that area to say for sure either way.
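As for the supporting indexes, a minimal sketch (reusing the hypothetical column name from the question):

    -- lets DELETE/UPDATE ... WHERE OtherThanIdCol = SomeValue locate rows
    -- without scanning the whole table
    CREATE NONCLUSTERED INDEX IX_MyTable_OtherThanIdCol
        ON dbo.MyTable (OtherThanIdCol);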

I can't think of a single case where a join on a natural key would be more efficient than a join on an integer - not many natural keys can be smaller than 4 bytes, and if they are, there can't be enough unique rows to make the difference material. – Aaron Bertrand 17 hours ago

For competent, optimisable SQL I agree, but I was referring to possible limitations of SQL generators. My only experience in this area is being asked to create extensive views with which EF could be spoon-fed, though it's possible the .net devs didn't know enough about EF, or that there were other reasons. – T.H. 17 hours ago

@AaronBertrand I would say that the only way in which they might be more efficient is if a join was not needed at all. The only places I consider the use of natural keys is with standard code lists such as ISO 4217 currency codes (which are human-recognisable), and I might use GBP, EUR, etc. as the foreign key to a primary or alternate key on the currency code table. – David Aldridge 16 hours ago

@David Of course, I was talking about cases where joins are necessary. There are a lot of cases where I don't want the natural key proliferated throughout all the related tables, because natural keys can change, and that is a painful thing. – Aaron Bertrand 16 hours ago

Hmm, I see how my answer could be misunderstood to be promoting natural foreign keys over surrogate. To be clear, I actually only mentioned them because a) I read Alexei's question as "is it a problem that we don't use natural keys?", b) Alexei's wrap-up question started with "from a DBA's perspective" and I felt I should sort of acknowledge there's more than one perspective and c) because I would think the ORM features to be used largely dictate the choice (if it actually can make a difference). I'm firmly in the surrogate foreign key camp myself. – T.H. 14 hours ago

From my experience, the main and overwhelming reason to use a separate ID for every table is the following:

In almost every case where my customer swore a blood oath in the conception phase that some external, "natural" field XYZBLARGH_ID would stay unique forever, never change for a given entity, and never be re-used, cases eventually appeared where the primary key properties were broken. It just does not work out that way.

Then, from a DBA point of view, the things that make a DB slow or bloated are certainly not 4 bytes (or whatever) per row, but things like wrong or missing indexes, forgotten table/index reorganizations, wrong RAM/tablespace tuning parameters, neglecting to use bind variables, and so on. Those can slow down the DB by factors of 10, 100, 10000... not an additional ID column.

So, even if there were a technical, measurable downside to having an additional 32 bits per row, the question is not whether you can optimize the ID away, but whether the ID will be essential at some point - and more likely than not, it will be. And I'm not even counting the "soft" benefits from a software development stance (like your ORM example, or the fact that it makes life easier for software developers when all IDs by design have the same datatype, and so on).

N.B.: you do not need a separate ID for n:m association tables, because for such tables the IDs of the associated entities should form the primary key. A counterexample would be a weird n:m association which allows multiple associations between the same two entities for whatever bizarre reason - those would then need their own ID column to create a PK. There are ORM libraries which cannot handle multi-column PKs, though, so that would be a reason to be lenient with the developers if they have to work with such a library.
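A sketch of such an association table (hypothetical names, assuming dbo.Users and dbo.Permission both have an int Id PK), where the two foreign keys together form the PK:

    CREATE TABLE dbo.UserPermission
    (
        UserId       INT NOT NULL
                     CONSTRAINT FK_UserPermission_User REFERENCES dbo.Users (Id),
        PermissionId INT NOT NULL
                     CONSTRAINT FK_UserPermission_Permission REFERENCES dbo.Permission (Id),
        -- the pair of foreign keys is the natural primary key; no IDENTITY needed
        CONSTRAINT PK_UserPermission PRIMARY KEY (UserId, PermissionId)
    );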

