Dangers of Document Metadata
Document metadata comes in many forms. Below is a list of the types of metadata found in Microsoft Office documents and the risks that each type of metadata poses to a corporation.
Track Changes and Document Revisions
Microsoft Word, Microsoft Excel and Microsoft PowerPoint documents. The Track
Changes feature tracks changes (inserted, deleted, and moved text) made to a
document during a review. As changes are made to a document using Track Changes,
a new revision of the document is kept by the application. This revision history
exists, even after changes to the document have been accepted or rejected.
Risks:
Track Changes shows the history of changes to the document. If Track Changes is
left on but the on-screen highlighting is turned off, every change made to the
document still remains. This is like keeping a record of every edit made to the
document, one that subsequent reviewers can view. Thus, even though the tracked
changes are not visible, they still travel with the document and, in some
circumstances, can be sent to and seen by an unintended party with potentially
disastrous consequences.
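As a rough illustration of how tracked changes stay in the file, the sketch
below counts insertion and deletion markup in a Word document body. It assumes
the later Office Open XML (.docx) packaging, not the binary formats current when
this list was compiled, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by .docx files (assumed format).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_tracked_changes(document_xml):
    """Count tracked insertions and deletions in word/document.xml content."""
    root = ET.fromstring(document_xml)
    return {
        "insertions": len(list(root.iter(f"{{{W_NS}}}ins"))),
        "deletions": len(list(root.iter(f"{{{W_NS}}}del"))),
    }


def count_tracked_changes_in_file(path):
    """Convenience wrapper: read word/document.xml from a .docx archive."""
    with zipfile.ZipFile(path) as archive:
        return count_tracked_changes(archive.read("word/document.xml"))


# Hand-written sample: one tracked insertion and one tracked deletion.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:ins w:author="A"><w:r><w:t>new</w:t></w:r></w:ins>'
    '<w:del w:author="B"><w:r><w:delText>old price</w:delText></w:r></w:del>'
    "</w:p></w:body></w:document>"
)
print(count_tracked_changes(SAMPLE))  # {'insertions': 1, 'deletions': 1}
```

A non-zero count means the revision markup, including the deleted wording, is
still inside the file regardless of what is shown on screen.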
Comments
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Comments are
notes and suggestions that are added to a document via the comment feature to
help facilitate an online review.
Risks:
Unless intentionally removed, comments, like hidden text, can disclose sensitive
information to external parties because comment metadata travels with the
document. Microsoft Excel and Microsoft PowerPoint documents are especially
susceptible to this risk, as there is no internal mechanism built into these
applications to warn a user that comments are embedded.
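To show that comments are ordinary data inside the file, here is a minimal
sketch that lists comment authors and text. It covers Word only and assumes the
later Office Open XML layout, where comments normally live in a part named
word/comments.xml; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace in Clark notation, e.g. "{...}comment".
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def list_comments(comments_xml):
    """Return (author, text) pairs from the content of word/comments.xml."""
    root = ET.fromstring(comments_xml)
    pairs = []
    for comment in root.iter(W + "comment"):
        author = comment.get(W + "author", "")
        text = "".join(node.text or "" for node in comment.iter(W + "t"))
        pairs.append((author, text))
    return pairs


# Hand-written sample mimicking a single reviewer comment.
SAMPLE = (
    f'<w:comments xmlns:w="{W_NS}">'
    '<w:comment w:id="0" w:author="J. Smith">'
    "<w:p><w:r><w:t>Confirm discount with legal</w:t></w:r></w:p>"
    "</w:comment></w:comments>"
)
for author, text in list_comments(SAMPLE):
    print(f"{author}: {text}")  # J. Smith: Confirm discount with legal
```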
Document Properties
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Document
properties are details about a file that help identify it and include
descriptive title, subject, author, manager, company, category, keywords,
comments, and hyperlink base. Document properties display information about a
file to help organize the files so that they can be easily found at a later
date.
Risks:
The names of authors and the name of the company can display sensitive
information about a corporation. It is possible that if a document has been sent
outside your own corporation, the author name and company name contained in the
built-in properties could be a name other than your own. In addition, if
documents are re-purposed or used as a template for a new document, information
that is specific to a previous client such as pricing, terms, or the client's
name can be stored as hidden information within the new document.
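The built-in properties can be inspected without opening the document in Office.
The sketch below reads a few of them, assuming the later Office Open XML format
in which they are stored in a part named docProps/core.xml; the field labels and
function name are illustrative choices, not part of any Office API.

```python
import xml.etree.ElementTree as ET

CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC_NS = "http://purl.org/dc/elements/1.1/"
NAMESPACES = {"cp": CP_NS, "dc": DC_NS}

# Report label -> element name inside docProps/core.xml.
FIELDS = {
    "title": "dc:title",
    "author": "dc:creator",
    "last_saved_by": "cp:lastModifiedBy",
    "revision": "cp:revision",
}


def read_core_properties(core_xml):
    """Return selected built-in properties; missing ones become ''."""
    root = ET.fromstring(core_xml)
    return {
        label: root.findtext(tag, default="", namespaces=NAMESPACES)
        for label, tag in FIELDS.items()
    }


# Hand-written sample: a re-purposed document still naming its earlier author.
SAMPLE = (
    f'<cp:coreProperties xmlns:cp="{CP_NS}" xmlns:dc="{DC_NS}">'
    "<dc:title>Proposal for Client B</dc:title>"
    "<dc:creator>Previous Author</dc:creator>"
    "<cp:lastModifiedBy>Someone Else</cp:lastModifiedBy>"
    "</cp:coreProperties>"
)
print(read_core_properties(SAMPLE))
```

For a real file, the XML would first be read from the archive, for example with
zipfile.ZipFile(path).read("docProps/core.xml").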
Document Statistics and File Dates
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Document
statistics include information on when the document was created, when it was
modified, when it was accessed, and when it was printed. In addition, document
statistics display the name of the person it was last saved by, the revision
number, and the total editing time. Other statistics can include the number of
pages, paragraphs, lines, words, and characters.
Risks:
Document statistics can create embarrassing situations.
For example, the "last saved by" metadata shows the last person who edited the
document. Repurposing previous documents can reveal a history that you may not
want to share with another person or organization.
Document Reviewers
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Document
reviewers consist of a list of users who have added or accepted any tracked
changes. When the names of reviewers are removed, but not the tracked changes,
the revisions remain with the document. However, the user name associated with
each revision will be removed. It is recommended that the names of the document
reviewers be removed when removing tracked changes.
Risks:
The risk from the Document Reviewers metadata is that it can expose who has
previously reviewed the document and who has suggested changes.
Custom Properties
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Custom
Properties include any property fields added manually to a document or by
various programs to help manage and track files. Common types of custom
properties used to identify specific data are DocumentID, department and status.
Risks:
Custom properties are normally specific to an organization and may represent
proprietary information or competitive business practices. The potential risk
arises because these properties make it easy to see the history of the document
and so reveal internal practices.
Hidden Text
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Hidden text
consists of text blocks that have been formatted as hidden. Unless specifically
selected to be viewed in Microsoft Word, hidden text is not displayed within the
document.
Risks:
Hidden text can contain notes that are particular to a document. If it is not
cleansed before the document is shared, the hidden text can be viewed by
unintended parties.
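A minimal sketch of how hidden text might be located, assuming a Word document
in the later Office Open XML format, where the hidden attribute appears as a
w:vanish element in a run's properties. It checks direct run formatting only and
ignores hidden formatting inherited from styles; the function name is
hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def hidden_text_runs(document_xml):
    """Return the text of runs directly formatted as hidden (w:vanish)."""
    root = ET.fromstring(document_xml)
    found = []
    for run in root.iter(W + "r"):
        props = run.find(W + "rPr")
        vanish = props.find(W + "vanish") if props is not None else None
        if vanish is None:
            continue
        # An explicit false value switches the property back off.
        if vanish.get(W + "val", "true").lower() in ("0", "false", "off"):
            continue
        found.append("".join(n.text or "" for n in run.iter(W + "t")))
    return found


# Hand-written sample: one visible run, one hidden run, one explicitly shown.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    "<w:r><w:t>Visible sentence.</w:t></w:r>"
    "<w:r><w:rPr><w:vanish/></w:rPr><w:t>Internal note</w:t></w:r>"
    '<w:r><w:rPr><w:vanish w:val="0"/></w:rPr><w:t>Not hidden</w:t></w:r>'
    "</w:p></w:body></w:document>"
)
print(hidden_text_runs(SAMPLE))  # ['Internal note']
```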
Headers and Footers
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Headers and
footers are areas in the top and bottom margins of each page in a document. Text
or graphics can be inserted in headers and footers—for example, page numbers,
the date, a company logo, the document's title or file name, or the author's
name—that are printed at the top or bottom of each page in a document.
Risks:
Custom headers and footers can contain details such as the filename, the path,
the date and time the document was modified, or other information that makes a
file easy to retrieve and edit. Unfortunately, the information contained in
headers and footers is often overlooked when the document is shared. Failure to
remove it can reveal confidential information.
Footnotes
Microsoft Word documents only. Footnotes attributed to content are embedded as
metadata into Microsoft Word documents.
Risks:
Footnotes may expose private, internal directions about how the document is used
in the organization.
White Text
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. White text
consists of blocks of text that have been formatted with a white font color on a
white background. The text appears invisible when viewed or printed and can be
used to hide information in a document. This method of blocking out text is
often called redaction, although the text itself remains in the file.
Risks:
White text is commonly used when documents are posted to the Internet so that
they can be more readily found by search engines, and to hide confidential
information in redacted documents. However, white text can also be viewed by
external users. Depending upon what was actually written as white text, the
information can be very damaging. White text can also be used for particular
field codes such as the "include text" field code, which can point to a file
location. If this file location code is embedded in a document, users can
unknowingly update the code and potentially expose the document to a hacker.
Small Text
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Any text block
contained in a document that is less than five (5) points is considered small
text. The text is so small that it will not be visible when viewed or printed
and can be used to hide information in a document.
Risks:
Like white text, small text is commonly used to put information in documents so
they can be found by search engines. Small text can also include sensitive
information that was not meant to be distributed externally.
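White text and small text can both be found by looking at run formatting. The
sketch below is one possible check for a Word document, assuming the later
Office Open XML format, in which w:color holds a hex RGB value and w:sz holds
the font size in half-points (so five points is the value 10). It inspects
direct formatting only and does not examine the background color; the function
name and the five-point default are taken from the definitions above, not from
any Office API.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def flag_runs(document_xml, min_points=5):
    """Return (text, reasons) for runs that are white or below min_points."""
    root = ET.fromstring(document_xml)
    flagged = []
    for run in root.iter(W + "r"):
        props = run.find(W + "rPr")
        if props is None:
            continue
        reasons = []
        color = props.find(W + "color")
        if color is not None and color.get(W + "val", "").upper() == "FFFFFF":
            reasons.append("white")
        size = props.find(W + "sz")
        if size is not None:
            half_points = size.get(W + "val", "")
            # w:sz is expressed in half-points, hence the factor of two.
            if half_points.isdigit() and int(half_points) < 2 * min_points:
                reasons.append("small")
        if reasons:
            text = "".join(n.text or "" for n in run.iter(W + "t"))
            flagged.append((text, reasons))
    return flagged


# Hand-written sample: normal, white, 4-point, and 12-point runs.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    "<w:r><w:t>Normal text</w:t></w:r>"
    '<w:r><w:rPr><w:color w:val="FFFFFF"/></w:rPr><w:t>keyword list</w:t></w:r>'
    '<w:r><w:rPr><w:sz w:val="8"/></w:rPr><w:t>tiny note</w:t></w:r>'
    '<w:r><w:rPr><w:sz w:val="24"/></w:rPr><w:t>body copy</w:t></w:r>'
    "</w:p></w:body></w:document>"
)
print(flag_runs(SAMPLE))
```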
Macros
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. If a task is
repeated in Microsoft Word, Excel or PowerPoint, it can be automated using a
macro. A macro is a series of commands and instructions that are grouped
together as a single command to accomplish a task automatically.
Risks:
There are several reasons to strip out custom macros. First, macros can be set
for templates that contain pre-populated data, and the information in these
templates may not be meant for external audiences. Second, macros can be linked
to internal databases or intranets, which exposes the internal file naming
structure; most corporations do not want this information outside their
firewall. Lastly, macros are often quite complex and, if developed in-house, may
represent the company's intellectual property. If macros are included in the
document, this information is freely shared with any outside party.
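One simple way to check whether a file carries macros before it leaves the
organization is sketched below. It assumes the later Office Open XML packaging,
where the VBA project is conventionally stored in a part named vbaProject.bin
(for example word/vbaProject.bin); the function names are hypothetical and the
sample archive only mimics the part layout.

```python
import io
import zipfile


def macro_parts(source):
    """List VBA project parts in an Office Open XML archive.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return [
            name for name in archive.namelist()
            if name.lower().endswith("vbaproject.bin")
        ]


def build_sample(with_macro):
    """Build a tiny in-memory archive for demonstration purposes only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<document/>")
        if with_macro:
            archive.writestr("word/vbaProject.bin", b"placeholder")
    return buffer


print(macro_parts(build_sample(with_macro=True)))   # ['word/vbaProject.bin']
print(macro_parts(build_sample(with_macro=False)))  # []
```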
Previous Versions
Microsoft Word documents only. Previous versions show the number of times that a
document has been versioned over its lifetime. This function enables Microsoft
Word to save prior versions of a document as a part of the electronic file.
Risks:
The risk associated with previous versions is that a recipient can access any of
the previous versions that have been saved. Therefore, the party reviewing the
document can go back to any version and see what was changed in the document
lifecycle. This metadata, while useful in some instances, can disclose sensitive
information.
Routing Slips
Microsoft Word, Microsoft Excel, Microsoft PowerPoint documents. Routing slips
are used to create a distribution list of reviewers in a particular order.
Routing slips are manually created by adding in recipients' email addresses.
When a file is routed, it is sent as an attachment in an email message.
Risks:
Routing slips reveal the names and email addresses of the people to whom the
document was sent for review. This may be information that should stay
confidential rather than be distributed externally. For example, if a document
whose routing slip contains email addresses is published to the Internet, those
addresses can be displayed for all to see.
Fast Saves
Microsoft Word documents only. Fast saves is an option in Microsoft Word that
saves just the changes that were made to a document, resulting in the history of
the changes being saved with the document file. Turning fast saves off and
saving the document will remove the changes and store only the final version of
the document.
Risks:
Like other metadata, changes saved during a fast save can expose sensitive
information to external parties when viewed using a text or hex editor. Deleted
text can still exist in the electronic file. According to the Gartner Group's
Research Note on Metadata in Office, "users can easily forget that metadata
exists when they send the document to someone else. Some metadata is never
visible, such as pieces deleted by users but not really deleted by Microsoft
Office when operating with fast save turned on."
Hidden Slides
Microsoft PowerPoint documents only. Hidden slides are slides that are hidden so
that they are not shown during a slide show.
Risks:
A master Microsoft PowerPoint slide deck may contain some slides that are used
as backup or that are for internal use only. To prevent these slides from being
disclosed accidentally, it is best to strip out any hidden slides before sending
the slide deck out externally.
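Hidden slides remain in the file and can be identified mechanically. The sketch
below assumes the later Office Open XML (.pptx) format, where each slide is a
part such as ppt/slides/slide1.xml and a hidden slide carries show="0" on its
root element; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"


def is_hidden_slide(slide_xml):
    """Return True if a slide part's root element marks it as hidden."""
    root = ET.fromstring(slide_xml)
    # The show attribute defaults to true when it is absent.
    return root.get("show", "1").lower() in ("0", "false")


# Hand-written samples: one hidden slide and one ordinary slide.
HIDDEN = f'<p:sld xmlns:p="{P_NS}" show="0"><p:cSld/></p:sld>'
VISIBLE = f'<p:sld xmlns:p="{P_NS}"><p:cSld/></p:sld>'

print(is_hidden_slide(HIDDEN))   # True
print(is_hidden_slide(VISIBLE))  # False
```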
Hyperlinks
Microsoft Word and Microsoft Excel documents only. Documents can contain
hyperlinks to other documents or Web pages, which are displayed as blue
underlined text. Hyperlinks in Microsoft Excel files can take several forms: a
link to a cell in another Microsoft Excel document, a link to a named reference
in another Microsoft Excel document, a link to another document, an OLE link
that inserts another document as an icon, and an OLE link that inserts another
document as text.
Risks:
Hyperlinks can point to locations that a corporation may not wish to disclose,
such as files on a computer's local file system, in a corporation's internal
database, or on an intranet. Disclosing the file path, or the location where the
files are stored, can invite potential hackers to gather sensitive corporate
information.
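To illustrate how link targets can leak file locations, the sketch below lists
hyperlink targets and flags those that look like local or network paths. It
assumes the later Office Open XML format (Transitional namespaces), in which a
Word document's external links are recorded in word/_rels/document.xml.rels. The
function names are hypothetical, and looks_internal is a rough heuristic, not a
complete test.

```python
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/hyperlink"
)


def hyperlink_targets(rels_xml):
    """Return the Target of every hyperlink relationship in a .rels part."""
    root = ET.fromstring(rels_xml)
    return [
        rel.get("Target", "")
        for rel in root.iter(f"{{{REL_NS}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    ]


def looks_internal(target):
    """Heuristic: file URLs, UNC paths, and drive-letter paths."""
    drive_path = (
        len(target) > 2
        and target[0].isalpha()
        and target[1] == ":"
        and target[2] in "\\/"
    )
    return (
        target.lower().startswith("file:")
        or target.startswith("\\\\")
        or drive_path
    )


# Hand-written sample: one public link and one link to a local file.
SAMPLE = (
    f'<Relationships xmlns="{REL_NS}">'
    f'<Relationship Id="rId1" Type="{HYPERLINK_TYPE}" '
    'Target="https://www.example.com/" TargetMode="External"/>'
    f'<Relationship Id="rId2" Type="{HYPERLINK_TYPE}" '
    'Target="file:///C:/Projects/ClientA/pricing.xls" TargetMode="External"/>'
    "</Relationships>"
)
print([t for t in hyperlink_targets(SAMPLE) if looks_internal(t)])
```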
Source: metadatarisk.org
Accessed: 22/9/05