Searching contents of an embedded PDF file

SharePoint can of course index the content of a PDF file if you install an iFilter. Several PDF iFilters are available; the best-known ones are from Adobe (free) and Foxit (paid but faster).

So let’s say you have a SharePoint box with a PDF iFilter installed.
You upload a PDF file, do an incremental crawl of your content source, and search for a word inside the PDF. Result: the PDF is found.

Now you create a Word document (let’s say Word 2010) and embed the same PDF file inside it. Do another incremental crawl, and search for a word inside the PDF.
You would think that the embedded PDF is found, right? After all, SharePoint can ‘read’ the contents of both MS Word files and PDF files.
The actual result, however, is that it is not found, not even when you search on the title of the embedded PDF file.

So what is happening here?

Although you do not see it happening, the contents of Office files (like MS Word) are also indexed through an iFilter, in this case one for Office files. This iFilter is included by default; you never had to install it separately, so you might not even have been aware of it.
When SharePoint indexes your MS Word document, it reads the content and then it encounters the embedded file.

The Office iFilter now goes “hey, what’s this?”, but since the Office iFilter cannot read PDF files it simply skips the entire embedded file.
Only if the embedded file is also a file that the Office iFilter can read (for example if you embed a PowerPoint file inside a Word file) will the contents of the embedded file be read.
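
If you want to see what the Office iFilter runs into, you can look inside the container yourself. The following is only a minimal sketch: it assumes a .docx file (which is an ordinary ZIP package) and assumes that Word stored the embedded objects under word/embeddings/, which is where they normally end up. The function name list_embedded_parts and the part names in the demo are made up for illustration.

```python
import io
import zipfile


def list_embedded_parts(docx_file):
    """Return the names of the parts stored under word/embeddings/ (assumed location)."""
    with zipfile.ZipFile(docx_file) as package:
        return sorted(
            name
            for name in package.namelist()
            if name.startswith("word/embeddings/") and not name.endswith("/")
        )


# Demo with an in-memory stand-in for a real document (illustrative part names).
fake_docx = io.BytesIO()
with zipfile.ZipFile(fake_docx, "w") as package:
    package.writestr("word/document.xml", "<w:document/>")
    package.writestr("word/embeddings/oleObject1.bin", b"embedded PDF wrapped as OLE object")
    package.writestr("word/embeddings/Microsoft_PowerPoint_Presentation1.pptx", b"embedded deck")
fake_docx.seek(0)

parts = list_embedded_parts(fake_docx)
for part in parts:
    print(part)
```

For a real file you would call list_embedded_parts("mydocument.docx"). An embedded Office file typically shows up with its own extension, while other objects such as a PDF tend to appear as opaque .bin parts.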

What you would have liked the Office iFilter to do is:

  • Examine the embedded object to determine which other iFilter to call to read the contents
  • When the embedded object has been read, go back to the rest of the container document
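
The two steps above could be sketched roughly like this. It is a toy model in Python, not the real iFilter interface: the Document class, the extractor registry and the function extract_text are all hypothetical names, and each "reader" simply returns text that is already there.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    extension: str                                 # e.g. ".docx" or ".pdf"
    text: str                                      # stand-in for the real file contents
    embedded: list = field(default_factory=list)   # embedded Document objects


def read_text(doc):
    """Hypothetical reader: a real filter would parse the file format here."""
    return doc.text


def extract_text(doc, extractors):
    """Pick a reader by extension, read the document, then visit its embedded objects."""
    reader = extractors.get(doc.extension)
    if reader is None:
        return []  # no reader registered for this type: the object is skipped
    chunks = [reader(doc)]
    for child in doc.embedded:
        # Step 1: dispatch the embedded object to whichever reader fits it.
        chunks.extend(extract_text(child, extractors))
        # Step 2: the loop then continues with the rest of the container.
    return chunks


word_doc = Document(".docx", "word body", [Document(".pdf", "pdf body")])

office_only = {".docx": read_text, ".pptx": read_text}
with_pdf = dict(office_only, **{".pdf": read_text})

print(extract_text(word_doc, office_only))  # embedded PDF is skipped
print(extract_text(word_doc, with_pdf))     # embedded PDF is read as well
```

With only Office readers registered, the embedded PDF is skipped, which mirrors what happens today; with a PDF reader added to the registry, its text is picked up too.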

But that kind of intelligence is a bit much to ask, since there is no limit to the kind of objects you might embed in an Office file (although PDF support would have been nice considering how widely that format is used).

It’s logical if you think about it, though unpleasant all the same.

Audit report on Records Management events

Let’s say you use Records Management in SharePoint 2010. If you’re using records then you are probably also sensitive about proper auditing, and you might want to see who did what in the area of (un)declaring records.

Unfortunately, even if you enable auditing for all possible events on a Site Collection and then view the list of available Audit Reports (remember to enable the Reporting feature, otherwise you will not get the Audit Log Reports link), there is no standard report about just Records Management. Even worse, it appears that none of the standard reports include those Records Management events in any way.

So how do you get an audit report on these events then? The answer is through the custom report option.
Start by selecting the option to Run a custom report.

[Screenshot: custom report selection]

In the options screen that now appears, you need to select the custom events way down at the bottom (why Records Management events are custom events in SharePoint 2010 is a mystery to me; this is out-of-the-box functionality).

Run your report and save the resulting Excel file, et voilà! Your Records Management events are listed in the workbook as events of type Custom, but with Source Name “Records Management”.

[Screenshot: RM events in the audit report]
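
If the report is long, you may want to pull out just those rows. The following is a minimal sketch, assuming you saved the report sheet as CSV and that the relevant columns are called "Event" and "Source Name"; check the headers in your own export, because these names are assumptions. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def records_management_events(rows, event_col="Event", source_col="Source Name"):
    """Keep only rows of type Custom whose source is Records Management."""
    return [
        row
        for row in rows
        if row.get(event_col) == "Custom"
        and row.get(source_col) == "Records Management"
    ]


# Made-up sample standing in for the exported report (column names are assumptions).
sample = io.StringIO(
    "Event,Source Name,User\n"
    "Custom,Records Management,user1\n"
    "View,,user2\n"
    "Custom,Something Else,user3\n"
)

rows = list(csv.DictReader(sample))
matches = records_management_events(rows)
for row in matches:
    print(row["User"], row["Event"], row["Source Name"])
```

For a real export you would open the CSV file with open(path, newline="") and pass csv.DictReader(file) to the function instead of the sample.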