Searching contents of an embedded PDF file

SharePoint can of course index the content of a PDF file if you install an iFilter. Several PDF iFilters are available; the best-known ones are from Adobe (free) and Foxit (paid but faster).

So let’s say you have a SharePoint box with a PDF iFilter installed.
You upload a PDF file, do an incremental crawl of your content source, and search for a word inside the PDF. Result: the PDF is found.

Now you create a Word document (let’s say Word 2010) and embed the same PDF file inside it. You do another incremental crawl and search for a word inside the PDF.
You would think that the embedded PDF is found, right? After all, SharePoint can ‘read’ the contents of both MS Word and PDF files.
The actual result, however, is that the embedded PDF is not found. Not even when you search on the title of the embedded PDF file.

So what is happening here?

Although you do not actually see this, the contents of Office files (like MS Word) are also indexed using an iFilter, in this case one for Office files. This iFilter is included by default; you never had to install it separately, so you might not even have been aware of it.
When SharePoint indexes your MS Word document, it reads the content and then it encounters the embedded file.

The Office iFilter now goes “hey, what’s this?”, but since the Office iFilter cannot read PDF files, it simply skips the entire embedded file.
Only if the embedded file is also a file that the Office iFilter can read (for example if you embed a PowerPoint file inside a Word file) will the contents of the embedded file be read.
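
If you want to check which objects are embedded in a document before blaming the crawler, here is a minimal Python sketch. It assumes the document is saved in the Open XML format (.docx), which is a ZIP package that normally stores embedded objects under word/embeddings/. The function name list_embedded_parts is my own and not part of any SharePoint or Office API. An embedded PDF typically shows up there as an OLE binary part (something like oleObject1.bin) rather than as a .pdf file, while an embedded PowerPoint file keeps an Office extension.

```python
import io
import zipfile


def list_embedded_parts(docx_bytes: bytes) -> list[str]:
    """Return the names of the embedded-object parts inside a .docx package.

    Assumption: embedded objects live under 'word/embeddings/', which is the
    usual location in Word's Open XML packages.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(
            name
            for name in package.namelist()
            if name.startswith("word/embeddings/")
        )
```

To use it, read the file with open("report.docx", "rb").read() (the file name is just an example) and pass the bytes to list_embedded_parts.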

What you would have liked the Office iFilter to do is:

  • Examine the embedded object to determine which other iFilter to call to read the contents
  • When the embedded object has been read, go back to the rest of the container document

But that kind of intelligence is a bit much to ask, since there is no limit to the kinds of objects you might embed in an Office file (although PDF support would have been nice, considering how widely that format is used).

It’s logical if you think about it, although unpleasant all the same.
