I have code that splits a PDF into pages and converts each page into a JPEG held in a MemoryStream. I have the split portion working: it takes about a second or less and creates 100 PDF streams. However, once I get to the point where I'm turning the PDFs into images, performance slows to a snail's pace. ImageMagick uses Ghostscript to perform this conversion. My theory is that every call out to Ghostscript incurs setup overhead. I'm wondering if there is a way to make batch calls. As I understand it, MagickImageCollection can only take one page at a time, which is why I convert each page in a separate method.

I'm open to using a different tool to split the images or convert them. I'm looking into BlackIce but I'm waiting to hear about our license.

namespace PDFTools;

using ImageMagick;
using iText.Kernel.Pdf;
using iText.Layout;

public class PDFUtilities(string temporaryDirectory)
{
    private readonly string TemporaryDirectory = temporaryDirectory;

    public async Task<List<byte[]>> ConvertPdfToImageAsync(Stream stream)
    {
        List<byte[]> results = new List<byte[]>();
        MagickNET.SetTempDirectory(this.TemporaryDirectory);
        List<MemoryStream> pdfPages = this.SplitPdf(stream);

        var tasks = pdfPages.Select((pdfPage, index) => new OrderedTask
        {
            Index = index,
            Task = this.ConvertPageToImageStreamAsync(pdfPage)
        }).ToList();

        _ = await Task.WhenAll(tasks.Select(static t => t.Task));
        OrderedTask[] orderedTask = tasks
            .OrderBy(static s => s.Index)
            .ToArray();

        foreach (OrderedTask task in orderedTask)
        {
            MemoryStream ms = await task.Task;
            byte[] bytes = ms.ToArray();
            results.Add(bytes);
        }
        return results;
    }


    private async Task<MemoryStream> ConvertPageToImageStreamAsync(MemoryStream file)
    {
        MemoryStream outputStream = new MemoryStream();
        // Each input stream holds a single-page PDF. When I tried passing
        // multiple PDFs at once, I only got the last image back.
        using (MagickImageCollection images = new MagickImageCollection())
        {
            await images.ReadAsync(file);

            foreach (MagickImage image in images)
            {
                image.Quality = 100;
                await image.WriteAsync(outputStream, MagickFormat.Jpg);
            }
        }

        outputStream.Position = 0;
        file.Close();
        return outputStream;
    }

    private List<MemoryStream> SplitPdf(Stream stream)
    {
        List<MemoryStream> pdfPages = new List<MemoryStream>();

        using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(stream)))
        {
            for (int pageNumber = 1; pageNumber <= pdfDocument.GetNumberOfPages(); pageNumber++)
            {
                using (MemoryStream tempStream = new MemoryStream())
                {
                    using (PdfWriter writer = new PdfWriter(tempStream))
                    {
                        using (PdfDocument newPdf = new PdfDocument(writer))
                        {
                            _ = pdfDocument.CopyPagesTo(pageNumber, pageNumber, newPdf);
                        }
                    }

                    MemoryStream outputStream = new MemoryStream(tempStream.ToArray());
                    pdfPages.Add(outputStream);
                }
            }
        }

        return pdfPages;
    }
}

internal class OrderedTask
{
    required public int Index { get; set; }

    required public Task<MemoryStream> Task { get; set; }
}
  • Why don't you use PyMuPDF, which is a very fast PDF-to-image conversion tool? Commented Mar 28 at 4:26
  • ImageMagick generates pretty low-quality PDF images; you should try Ghostscript.NET. Commented Mar 28 at 11:49
  • @KJ It may be that it's just "slow" at two seconds. The requirements are nonexistent, so I'm trying to convert a 100 MB PDF to PNG; I split it, the pieces are 5 MB each, and each takes around 1 second. It looks like everyone posting here only knows about tools I've already found, so maybe there is no good answer. I think maybe it's just slow by nature. Commented Mar 28 at 14:17

1 Answer


I made a NuGet package that uses PDFium to do exactly what you want, except that it returns a list of bitmaps; it is trivial to call Save() on each one to write JPEGs. I use it in production software to convert PDFs into images for viewing within a WinUI 3 application. A 100-page document at 300 dpi would probably take around 7 or 8 seconds. I'm not sure what speed is acceptable for you.

The NuGet package is PdfToBitmapList.

You can pass the PDF's file path, a MemoryStream, or a byte array.

Use it like this:

var imageList = Pdf2Bmp.Split(pdfFilePath);

There is an option to save the images to disk by passing in a save location:

var imageList = Pdf2Bmp.Split(pdfFilePath, "C:\\TempDirectory");

This is very useful for larger PDFs, as holding all those images in memory can eat up your RAM very fast. Instead of returning a list of Bitmap images, it returns a list of strings containing the paths where the images were saved on disk. This requires manual cleanup after the conversion, so keep that in mind or the save location will fill up with leftover images.
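
For the in-memory case, the Save() step mentioned above could look roughly like the sketch below. It assumes the returned list holds System.Drawing.Bitmap objects (an assumption; check the package's actual return type), and ToJpegBytes is a hypothetical helper name, not part of PdfToBitmapList. It produces the same List<byte[]> shape as the question's ConvertPdfToImageAsync.

```csharp
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

public static class BitmapJpegExtensions
{
    // Hypothetical helper (not part of PdfToBitmapList): encodes each bitmap
    // as a JPEG and returns the encoded bytes in the same order as the input.
    // Assumes the pages are System.Drawing.Bitmap instances.
    public static List<byte[]> ToJpegBytes(this IEnumerable<Bitmap> pages)
    {
        var results = new List<byte[]>();

        foreach (Bitmap page in pages)
        {
            // Image.Save(Stream, ImageFormat) writes the encoded image to the stream.
            using var buffer = new MemoryStream();
            page.Save(buffer, ImageFormat.Jpeg);
            results.Add(buffer.ToArray());
        }

        // The caller still owns the bitmaps and should dispose them when done.
        return results;
    }
}
```

Under the same assumption, usage would be something like `List<byte[]> jpegs = Pdf2Bmp.Split(pdfFilePath).ToJpegBytes();`. Note that System.Drawing is only supported on Windows in recent .NET versions.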
