Jim's Otec Blog

Saturday, October 28, 2006

De-duplication of files

I recently read an article on VTLs that stated that the killer app for this technology could be something they called "data de-duplication". They appeared to assume the savings would come from not backing up the same file twice. While this is totally valid within the realm of virtualized tape libraries, it misses the main benefits of de-duplication.

Organizations are rife with files stored in multiple places. E-mails are even worse, but at least most e-mails sent out to huge lists of people are relatively small. Files are typically large enough to make an impact on your storage costs. Furthermore, most files that get proliferated are ones that are read and not altered, like PowerPoint presentations and PDFs.

The duplicate file problem is one that should already be solved in most organizations by using a system to manage files. The most prevalent systems available to do this are DAM (digital asset management) systems and document management systems. They store files and associated data (metadata) in a database which can be searched. These systems can also ensure a single existence of each file and control access to files. Some systems also have workflow features to control approval and flow of documents.
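To make the "single existence of each file" idea concrete, here is a minimal sketch of how duplicates can be detected by content rather than by name. It assumes SHA-256 hashing of file contents; the function names file_digest and find_duplicates are made up for illustration, and this is not how any particular DAM or document management product does it.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group files under root by content digest.

    Returns a dict mapping digest -> sorted list of paths, keeping only
    digests shared by two or more files (that is, the duplicates).
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

Running find_duplicates on a shared drive gives a rough picture of how many copies of the same presentation or PDF are floating around, regardless of what each copy has been renamed to.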

Using these systems to minimize duplication will solve the problem at the earliest possible stage in the document's life cycle. This will eliminate the benefit, and therefore the necessity, of building these features into a backup system. A backup system should remain totally focused on backing up all data on volumes in the backup set. If a backup system is allowed to choose what to back up based on criteria, it opens the door for it to compromise itself with those choices.

Tuesday, October 17, 2006

File Naming Standards


When you expect to access a file multiple times, naming it so that it is easy to find again greatly reduces the effort involved. Success is largely determined by two attributes of the file name: uniqueness and meaningfulness. The name must be unique enough that you aren't presented with a passel of candidates to choose from when you search for it. The name must be meaningful, so that you can enter successful search criteria. I think we have all seen the downside of these attributes being handled poorly when we search for information on web search engines. Better results are achieved with well-named and well-tagged documents.

It is even more critical to describe files properly when they are used in the course of running a business, as it can cost valuable time and money to access them. This very question has been posted and discussed at: http://www.whatdoiknow.org/archives/000442.shtml

Does your business have a DAM system or document management system? This can be a great help in this respect, particularly if it supports versioning, as this can also eliminate duplicate files and provide document history for governance purposes. You will still have similar issues with naming but most DAM systems have decent search engines.

Naming standards can be very different depending on what you do. For example, if you produce catalogs or flyers as repeat business for a client and there is some reuse of assets, it probably makes most sense to inherit at least some of the name from the client, e.g. using the product SKU as part of the name. If you are constantly doing new and custom work, you are probably better off using a client-docket-page-position style naming standard and organizational hierarchy, as you might use in a legal or medical practice or a publishing environment.
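As an illustration of the client-docket-page-position style, here is a small sketch that builds and checks names of that shape. The exact layout (an upper-case client code, a five-digit docket, a three-digit page prefixed with "p", a two-digit position) is an assumption invented for this example, as are the names build_name and parse_name; a real standard should come out of your own workflow.

```python
import re

# Hypothetical layout: CLIENT-DDDDD-pPPP-NN.ext, e.g. ACME-01234-p012-03.tif
NAME_PATTERN = re.compile(
    r"^(?P<client>[A-Z0-9]+)-(?P<docket>\d{5})"
    r"-p(?P<page>\d{3})-(?P<position>\d{2})\.(?P<ext>[a-z0-9]+)$"
)


def build_name(client, docket, page, position, ext):
    """Build a file name from its parts, rejecting values that do not fit."""
    name = (
        f"{client.upper()}-{docket:05d}-p{page:03d}-{position:02d}.{ext.lower()}"
    )
    if not NAME_PATTERN.match(name):
        raise ValueError(f"parts do not fit the naming standard: {name!r}")
    return name


def parse_name(name):
    """Split a conforming name into its parts, or return None if it does not conform."""
    m = NAME_PATTERN.match(name)
    if m is None:
        return None
    parts = m.groupdict()
    return {
        "client": parts["client"],
        "docket": int(parts["docket"]),
        "page": int(parts["page"]),
        "position": int(parts["position"]),
        "ext": parts["ext"],
    }
```

For example, build_name("acme", 1234, 12, 3, "TIF") gives ACME-01234-p012-03.tif, and parse_name turns that name back into its parts. Zero-padding the numbers keeps names sorting in page order, and checking names against a single pattern makes it easy to spot files that have drifted from the standard.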

Our company sells DAM systems and we actually offer a one-day consultation that we call The DAM Primer to address these issues. We call it a consultation as opposed to a course, because what makes sense as a naming standard really depends on the client's needs. We usually discuss a couple of common methods and then listen to the client to help develop a standard appropriate for their workflow. We also address metadata in this consultation. We recently did this consultation with a library, and learned almost as much as we taught.

Metadata Portability for Digital Assets

I was checking out some DAM products recently at a show and came across a product feature that was highly touted by the company but ran counter to my own opinions. The feature is the ability to have metadata embedded within an asset.

This feature is not new - MS-Office, PDF, TIFF, IPTC and other specs have metadata embedded within them. The XMP standard allows for embedding and is supported by many DAM systems. However, the metadata is usually either general, like Dublin Core, or there for a specific use, like IPTC for newspaper photographers. The idea that this is a must-have for any DAM system, and that the scope encompasses all asset types and all metadata, is an assertion that I never came across prior to this seminar.

While I like the open sharing aspect of this, I also have some serious concerns about whether this is required or even advisable in many cases. Some obvious areas of concern are:
  • Security - metadata for medical imaging or legal documents can be very confidential, to the point of confidentiality being enforced by law. We all know that encryption is a temporary barrier to data access, i.e. time and processing power can always provide access. Furthermore, most asset repositories I have been involved with have a number of levels of access to information that can't effectively be managed when embedded within an asset, without compromising openness.
  • Data integrity - what happens when an asset is created with incorrect or incomplete data, and proliferated by distribution? What then is the authoritative source of data? Is it not better to manage this type of data centrally? While an asset may remain unchanged for a long time after it is complete, the data describing it will often be updated.
  • Accessibility - a lot of metadata belongs in a database and is much more accessible in a database. It is easier to query, group and search on in a database than in files. This issue can be managed much more easily if you have a system that supports embedded metadata assets, so I consider it the lesser of the points, but it still makes more sense to me for the authoritative source for most metadata to be a database.
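To illustrate the accessibility and data integrity points, here is a minimal sketch of metadata held centrally in a database, using Python's built-in sqlite3 module. The single-table schema and the function names (open_catalog, set_field, find_assets, count_by) are assumptions made up for this example; a real DAM system will have a far richer model.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) a tiny metadata catalog: one row per asset/field pair."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        " asset TEXT NOT NULL,"
        " field TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (asset, field))"
    )
    return conn


def set_field(conn, asset, field, value):
    """Set or correct one metadata field; the asset file itself is not touched."""
    conn.execute(
        "INSERT OR REPLACE INTO metadata (asset, field, value) VALUES (?, ?, ?)",
        (asset, field, value),
    )
    conn.commit()


def find_assets(conn, field, value):
    """Return the assets whose given field has the given value."""
    rows = conn.execute(
        "SELECT asset FROM metadata WHERE field = ? AND value = ? ORDER BY asset",
        (field, value),
    )
    return [row[0] for row in rows]


def count_by(conn, field):
    """Count assets per distinct value of a field."""
    rows = conn.execute(
        "SELECT value, COUNT(*) FROM metadata WHERE field = ?"
        " GROUP BY value ORDER BY value",
        (field,),
    )
    return dict(rows.fetchall())
```

Correcting a wrong value is a single call to set_field against the central record, no matter how many copies of the asset have already been distributed, and searching or grouping is one query rather than opening every file to read what is embedded in it.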
There are many aspects of XMP that I find attractive, but I just thought I'd throw this one out for discussion as I would like to see what others think.