Metadata and Practical Examples of How to Handle It
Jon Lu - August 19, 2021
1. INTRODUCTION
1.1. METADATA
Wikipedia defines metadata as "data that provides information about other data". Overall, there are six distinct types of metadata: descriptive, legal, administrative, reference, statistical, and structural.
Of these, one of the most important is descriptive metadata, which describes a resource and is used for discovery and identification through elements such as title, abstract, author, and keywords.
Metadata is often used to enhance data reuse, and it plays an important role in data discoverability and in describing relationships between data.
As an example, the OpenDataSoft company describes metadata as answering the following questions:
- What - When - Where - Who - How - Which - Why
1.2. BACKGROUND
In 2008, Larry Pesce published a SANS whitepaper named “Document Metadata, the Silent Killer...”.
His paper remains by far the most comprehensive guide for anyone looking to understand the risks of exposing unwanted metadata in published resources. In addition, the document outlines several techniques and tools that can still be used.
This blog post does not go too deep in that direction, though. Instead, we will explore a couple of technical solutions that any business can consider when limiting information exposure through its public documents. Metadata exposure is often underrated as a finding and is usually missing completely from an AppSec pentesting engagement report. However, when this issue is detailed, it often looks like this.
Today, more than 70% of publicly exposed documents are in PDF format, so we will focus only on this format.
As a client, you might ask: is there an easy way to deal with this matter without breaking the bank? Well, yes. There are a couple of options, which we detail below.
2. PRACTICAL SOLUTIONS
2.1. Using “Bash”
In 2017, Josh Lemon released a whitepaper detailing a process for handling PDF metadata exposure. Assuming you are using Ubuntu Linux with qpdf already installed, you can replicate all his steps and build up a simple automated process. Here's a potential example script.
Cons :: It seems the changes in the final output can be reversed under specific circumstances. While it does its job, we don’t consider this solution a good fit for corporations or enterprise-grade businesses.
2.2. Using Python PyPDF2
Luckily, Python has various libraries that allow PDF manipulation: PyPDF2, ReportLab, pdfkit, etc.
All of them are great libraries, but we found PyPDF2 friendly enough for the goal of changing PDF metadata.
For the sake of this article, we will randomly pick a CRM company, like HubSpot. They are one of the major players in their industry, with many public documents published.
We picked a file named “Introduction to SEO eBook” through some quick search engine wizardry. Good content, by the way, and the people who wrote it did a great job.
Checking the metadata of the file, we got the following:
To summarize, the original file dates from 2011, was initially written in Microsoft Word 2007, and was then exported to PDF by sgoliger, listed as its Author. The metadata exposes timestamps from 2011 and an obsolete 2007 MS Office version; beyond that, the document is a good example of how much information is stored inside the metadata fields.
Hint :: If the Producer field returns “Skia/PDF m83”, the PDF was exported using Google Docs. More details can be found in the Reference section.
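To make this kind of review repeatable, the checks described above can be sketched in a few lines of Python. This is only an illustration: the list of sensitive fields is our own assumption, the sample values are loosely based on the file discussed above, and only the Skia/PDF entry in the hint table comes from this post.

```python
# Info-dictionary fields that commonly leak internal details.
# This selection is an assumption for illustration, not an official list.
SENSITIVE_FIELDS = ("Author", "Creator", "Producer", "CreationDate", "ModDate")

# Producer substrings mapped to the tool they suggest.
# Only the Skia/PDF entry comes from the hint above.
PRODUCER_HINTS = {"Skia/PDF": "Google Docs"}


def audit_metadata(info):
    """Return human-readable findings for a dict of PDF metadata fields."""
    findings = []
    for field in SENSITIVE_FIELDS:
        value = info.get(field, "").strip()
        if value:
            findings.append(f"{field} is exposed: {value}")
    producer = info.get("Producer", "")
    for marker, tool in PRODUCER_HINTS.items():
        if marker in producer:
            findings.append(f"Producer suggests an export from {tool}")
    return findings


if __name__ == "__main__":
    # Illustrative values only, not a dump of the real file.
    sample = {"Author": "sgoliger", "Creator": "Microsoft Office Word 2007"}
    for finding in audit_metadata(sample):
        print(finding)
```

The function takes a plain dictionary, so it can be fed from whatever tool you use to extract the metadata in the first place.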
Ideally, any business should sanitize all its documents before publishing. The internet is a vast ocean, and while you can still take down and remove some of your content, the process is complicated and time-consuming.
Altering a PDF's metadata through a Python script can be achieved through a snippet like this:
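A minimal sketch of such a snippet follows; it is not necessarily the script we originally used. It assumes PyPDF2 2.x or later, where the classes are named PdfReader and PdfWriter (the older 1.x releases call them PdfFileReader and PdfFileWriter, with addPage and addMetadata). The helper names and the neutral values are hypothetical.

```python
def neutral_metadata(title, company):
    """Build Info-dictionary entries that reveal nothing about people or tools."""
    return {
        "/Title": title,
        "/Author": company,
        "/Creator": company,
        "/Producer": company,
    }


def rewrite_metadata(src_path, dst_path, metadata):
    """Copy the pages of src_path into dst_path with `metadata` attached."""
    # Imported here so neutral_metadata() stays usable where PyPDF2 is absent.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    # A fresh writer does not inherit the source Info dictionary, so only
    # the entries passed in here (plus any library defaults) are written.
    writer.add_metadata(metadata)
    with open(dst_path, "wb") as handle:
        writer.write(handle)
```

A call such as rewrite_metadata("input.pdf", "clean.pdf", neutral_metadata("Introduction to SEO", "Example Corp")) would then produce the sanitized copy; both file names and the company name are placeholders. It is worth checking the result with the same tool you used to inspect the original.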
Cons :: We tested this option quite intensively and found that the results were not consistent. Also, the PyPDF2 project does not seem to be actively maintained, with a plethora of forks and derivatives.
From a supply chain security perspective, adopting this solution might not be the best call, so it is up to you whether to dig deeper or not.
2.3. Python and PDFTK
Back in childhood, we learned that every story has three heroes, so here is the third potential practical solution. We will use Python again, simply because it is much easier to understand and deal with. There are other ways of doing it, and you can use whatever makes you comfortable to get the job done well.
This time, we will not import any Python PDF libraries but invoke an external program, pdftk (on Ubuntu 18.04+, “sudo apt install pdftk” should do the trick).
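One way to do this, sketched below under a few assumptions, is to write a small info file in the InfoBegin/InfoKey/InfoValue layout that pdftk's dump_data prints and feed it back through update_info. The function names and the replacement values are hypothetical, and the sketch overwrites fields with neutral values rather than trying to delete them.

```python
import os
import subprocess
import tempfile


def build_info_text(fields):
    """Render fields in the layout that pdftk's update_info reads."""
    lines = []
    for key, value in fields.items():
        lines += ["InfoBegin", f"InfoKey: {key}", f"InfoValue: {value}"]
    return "\n".join(lines) + "\n"


def pdftk_command(src_path, info_path, dst_path):
    """Return the argument list for: pdftk SRC update_info INFO output DST."""
    return ["pdftk", src_path, "update_info", info_path, "output", dst_path]


def sanitize_with_pdftk(src_path, dst_path, fields):
    """Overwrite the given Info fields of src_path, writing dst_path."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write(build_info_text(fields))
        info_path = handle.name
    try:
        # check=True raises CalledProcessError if pdftk exits with an error.
        subprocess.run(pdftk_command(src_path, info_path, dst_path), check=True)
    finally:
        os.remove(info_path)
```

For example, sanitize_with_pdftk("input.pdf", "clean.pdf", {"Author": "Example Corp", "Creator": "Example Corp"}) would replace those two fields. As far as we know, update_info only changes the Info dictionary and leaves any XMP metadata stream untouched, so that part may need separate handling.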
Cons :: We cannot pinpoint any at this point. However, additional testing might be required. What worked for us does not necessarily work for you. 😊
Q. Is there any way to automate this process even further?
A. Yes. It can be done, and one option on the table is GitHub’s “Actions” feature. We performed some testing, and the execution time for a list of 100 PDF files was under one minute. With 2000 minutes of free execution time available, that is more than enough to automate the process.
Note :: A barebone GitHub Actions YAML file skeleton might look like this. All you need to do is replace the missing bits with your code, and it should be ready to roll.
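As for the code that an automated job would run, a batch driver could be as simple as the following sketch. The function names and the directory layout are assumptions; the sanitize argument stands for whichever of the approaches above you settle on, as long as it accepts a source and a destination path.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every PDF under root (case-insensitive extension), sorted."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def sanitize_tree(root, out_dir, sanitize):
    """Call sanitize(src, dst) for each PDF under root; return the outputs."""
    root = Path(root)
    out_dir = Path(out_dir)
    written = []
    for src in find_pdfs(root):
        # Mirror the source layout below out_dir.
        dst = out_dir / src.relative_to(root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        sanitize(str(src), str(dst))
        written.append(dst)
    return written
```

Keeping the sanitizing step as a parameter means the same driver can be reused if you later swap one tool for another.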
3. CONCLUSION
As mentioned before, metadata provides a substantial set of information that could be used to develop further attack vectors. Metadata can also be used beyond the scope of a cybersecurity attack, for example for competitive intelligence. However, this aspect might be covered in later parts.
This blog post is aimed at beginners. It does not present any novel research and only serves as an introduction, covering the basics and offering simple, practical free solutions to address the metadata information exposure.