Convention for EPiServer Find to ignore large files

EPiServer Find has a limit on the size of files it will index (last I heard this was 50 MB). Nothing much happens if you try to index a larger file, except that you get an error message from the indexing job. But if you have some very large files, you might run into performance problems on the server. It can therefore be a good idea to add a convention that ignores large files.

In this case we will save the file's size on the media content and then check this property in the convention.

First we add a property to the media type. In this case we have a media type named GenericMedia that inherits from MediaData.

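using EPiServer.Core;
using EPiServer.DataAnnotations;

[ContentType(GUID = "EE3BD195-7CB0-4756-AB5F-E5E223CD9820")]
public class GenericMedia : MediaData
{
    // File size in kilobytes, set when the media content is created.
    public virtual int FileSizeInKb { get; set; }
}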

 

After that, we create a startup module named ContentInitialization that listens to the CreatingContent event. In the event handler, we find out how big the file is and save it to the property we just created.

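using System.IO;
using EPiServer;
using EPiServer.Core;
using EPiServer.Framework;
using EPiServer.Framework.Blobs;
using EPiServer.Framework.Initialization;
using EPiServer.ServiceLocation;

[InitializableModule]
[ModuleDependency(typeof(EPiServer.Web.InitializationModule))]
public class ContentInitialization : IInitializableModule
{
    public void Initialize(InitializationEngine context)
    {
        var eventRegistry = ServiceLocator.Current.GetInstance<IContentEvents>();
        eventRegistry.CreatingContent += OnCreatingContent;
    }

    private static void OnCreatingContent(object sender, ContentEventArgs e)
    {
        var content = e.Content as GenericMedia;
        if (content == null) return;

        // Only file-based blobs expose a path on disk; other blob types are skipped.
        var blob = content.BinaryData as FileBlob;
        if (blob == null) return;

        var length = new FileInfo(blob.FilePath).Length;
        content.FileSizeInKb = (int)(length / 1024);
    }

    public void Preload(string[] parameters) {}

    public void Uninitialize(InitializationEngine context)
    {
        // Detach the handler that was attached in Initialize.
        var eventRegistry = ServiceLocator.Current.GetInstance<IContentEvents>();
        eventRegistry.CreatingContent -= OnCreatingContent;
    }
}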
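The size conversion in the handler can also be pulled out into a small helper, which makes it testable without a running site. This is only a sketch: `FileSizeHelper` and `ToKilobytes` are hypothetical names, not part of EPiServer. It rounds down to whole kilobytes like the handler does, and additionally clamps the result so that an extremely large file (around 2 TB and up) cannot overflow the int property.

```csharp
using System;

// Hypothetical helper, not part of EPiServer.
public static class FileSizeHelper
{
    // Converts a size in bytes to whole kilobytes (1 KB = 1024 bytes), rounding down.
    // The result is clamped to int.MaxValue so the cast cannot overflow.
    public static int ToKilobytes(long lengthInBytes)
    {
        if (lengthInBytes < 0)
            throw new ArgumentOutOfRangeException(nameof(lengthInBytes));

        long kilobytes = lengthInBytes / 1024;
        return (int)Math.Min(kilobytes, int.MaxValue);
    }
}
```

With a helper like this, the handler could set `content.FileSizeInKb = FileSizeHelper.ToKilobytes(length);`.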

 

Now the only thing missing is the convention. For this we create another module named FindInitialization and add the convention there. Splitting the startup code into several modules helps readability, and the modules may also have different dependencies on other modules. In this case we only let files smaller than 20,000 KB be indexed.

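using EPiServer.Find.Cms;
using EPiServer.Find.Cms.Conventions;
using EPiServer.Framework;
using EPiServer.Framework.Initialization;

[ModuleDependency(typeof(EPiServer.Find.Cms.Module.IndexingModule))]
public class FindInitialization : IInitializableModule
{
    public void Initialize(InitializationEngine context)
    {
        // Index only media whose stored size is below 20,000 KB.
        ContentIndexer.Instance.Conventions
            .ForInstancesOf<GenericMedia>()
            .ShouldIndex(x => x.FileSizeInKb < 20000);
    }

    public void Uninitialize(InitializationEngine context) {}
    public void Preload(string[] parameters) {}
}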
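The size check itself can be kept in one place with a named limit instead of a magic number in the lambda. A minimal sketch, where `IndexingRules`, `MaxIndexableKb` and `IsSmallEnoughToIndex` are hypothetical names chosen for illustration:

```csharp
// Hypothetical helper, not part of EPiServer Find.
public static class IndexingRules
{
    // Upper limit (exclusive) in kilobytes, matching the 20,000 KB used in the convention.
    public const int MaxIndexableKb = 20000;

    // True when a file of the given size should be sent to the index.
    public static bool IsSmallEnoughToIndex(int fileSizeInKb)
    {
        return fileSizeInKb < MaxIndexableKb;
    }
}
```

The convention would then read `ShouldIndex(x => IndexingRules.IsSmallEnoughToIndex(x.FileSizeInKb))`. Note that media created before the FileSizeInKb property was added will presumably still hold the default value 0, so those files pass the check regardless of their real size.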

Some tips if you run into problems

The scheduled job fails with "Thread was being aborted"
Make sure that there is no resource limit on the application pool and that the server does not run out of memory. When indexing a lot of content, the server will use a lot of CPU and memory.

EPiServer Find tells me that “The remote server returned an error: (413) Request Entity Too Large. Request entity too large.” but I have no large files.
When the scheduled job runs, it sends content in batches, and it seems that a batch cannot be larger than 50 MB(?). You can set the size of a batch with:

ContentIndexer.Instance.MediaBatchSize = somesize; // Not sure if this is in use anymore though
ContentIndexer.Instance.ContentBatchSize = somesize;

But if you have a decent amount of content, a small batch size means many more requests, and the job can then fail because of too many requests.
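To get a feel for the trade-off, you can estimate how many requests a given batch size leads to. This is a rough sketch that assumes each batch is sent as one request; `BatchMath` and `RequestCount` are hypothetical names, not part of EPiServer Find.

```csharp
using System;

// Hypothetical helper, not part of EPiServer Find.
public static class BatchMath
{
    // Number of batches needed to send itemCount items when each batch
    // holds at most batchSize items (ceiling division).
    public static int RequestCount(int itemCount, int batchSize)
    {
        if (itemCount < 0)
            throw new ArgumentOutOfRangeException(nameof(itemCount));
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        // Use long arithmetic so the addition cannot overflow.
        return (int)(((long)itemCount + batchSize - 1) / batchSize);
    }
}
```

For example, 100,000 content items take 1,000 batches at a batch size of 100, but 10,000 batches at a batch size of 10.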