
How to Compress Scanned PDF Documents: OCR and Resolution Tuning

2026-06-05
Engineering Digest

Discover the engineering principles behind compressing scanned PDFs and reducing the file size of OCR documents. Learn how resolution tuning, image downsampling, and structural cleanups execute locally in WebAssembly to preserve data privacy.

Scanned PDFs are essentially image wrappers. OCR overlays an invisible search layer, which can dramatically bloat the file format structure.
Downsampling scanner resolution from 600 DPI to 150 DPI using Lanczos-3 interpolation reduces file sizes by up to 90% while maintaining textual clarity.
Color space reduction to Grayscale (8-bit) or Monochrome (1-bit) strips redundant color channels, reducing the memory size of image resources.
MojoDocs uses client-side WebAssembly to execute the entire optimization pipeline within browser RAM, maintaining 100% data sovereignty.
Scanned documents make up a large share of the digital files processed daily by businesses, government departments, law firms, and individuals. Whether you are archiving legal records, managing digital receipts, or uploading identification proofs to online portals, digitization often leaves you with bloated, unwieldy files. Physical pages are scanned as raster images, wrapped in PDF structures, and then often run through Optical Character Recognition (OCR) to make them searchable. The result is a hybrid document containing high-resolution images and hidden text layers, which can dramatically expand the file's footprint.

This guide walks through the engineering principles behind compressing scanned PDF files, reducing the size of OCR PDFs, and optimizing scanner workflows. At MojoDocs, we build local-first tools designed to give users control over their data. Using WebAssembly (WASM), we compile PDF compression and image downsampling libraries for the browser runtime. When you use the MojoDocs PDF Compressor, your files are processed entirely inside your device's random-access memory (RAM), within the browser sandbox. The document never leaves your machine.

The sections below cover how scanned PDFs are structured, how OCR adds further layers of data, how resolution downsampling works mathematically, and the security, privacy, and economic benefits of local-first file processing.

1. The Architecture of Scanned & OCR PDFs

To optimize a document, it helps to understand its internal structure. A scanned PDF differs fundamentally from a native PDF exported from an application such as Microsoft Word or Google Docs. A native PDF contains direct vector instructions that draw fonts, lines, and shapes using exact coordinates. For example, rendering the word "Agreement" in Helvetica only requires a few bytes of text coordinates and a reference to a font resource.

In contrast, a scanned PDF is essentially an image wrapper. When a physical document is fed into a flatbed scanner or photographed with a mobile camera app, the hardware captures a grid of pixels. The scanning software then wraps this high-resolution raster image inside a PDF object container. Inside the PDF file structure, which is defined by the ISO 32000-1 specification, this image is stored as an independent stream object called an External Object, or /XObject, with a subtype of /Image. Below is a representation of an unoptimized image object inside a PDF, stored with lossless compression only:
5 0 obj
<<
  /Type /XObject
  /Subtype /Image
  /Width 2550
  /Height 3300
  /ColorSpace /DeviceRGB
  /BitsPerComponent 8
  /Filter /FlateDecode
  /Length 2524500
>>
stream
... [Binary Image Data] ...
endstream
endobj
In this example, the image has a width of 2,550 pixels and a height of 3,300 pixels, which matches a standard Letter-sized page scanned at 300 DPI. The /ColorSpace is set to /DeviceRGB, representing full red-green-blue color. The /BitsPerComponent is set to 8, meaning each color channel uses 8 bits, resulting in 24 bits of color data per pixel. The image is compressed using the /FlateDecode filter (a lossless ZIP-equivalent algorithm). Because the physical page contains dust, paper texture, and scanning noise, lossless algorithms cannot find repeating patterns easily, leaving the stream block massive. An OCR PDF goes a step further. It takes this image wrapper and adds a transparent text layer on top. The OCR engine analyzes the image, identifies text characters, and inserts matching invisible text blocks directly over the corresponding pixels. This enables search and selection features, but it also inserts coordinate transformations, font descriptions, and text drawing instructions into the page's /Contents stream, adding metadata overhead. Furthermore, many OCR packages embed complex font subsets inside the PDF. TrueType or OpenType fonts, which contain the vector outlines for each glyph, can be highly complex. Even if the font is subsetted (meaning only the letters used in the document are kept), the descriptor dictionary and the stream of the font program can add significant overhead. When you process dozens of pages, these individual font files and coordinate dictionaries combine to swell the total file size. Standard optimization requires cleaning these redundant blocks, removing duplicate font resources, and ensuring the layout structures are stripped of unnecessary properties.
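The arithmetic behind that example can be checked with a short script. This is a minimal sketch, not a PDF parser: it assumes a flat, plain-text dictionary like the one shown above, and the function name `describe_image` and the 8.5-inch page width are illustrative assumptions rather than anything MojoDocs exposes.

```python
import re

# The image dictionary from the example above, as plain text.
IMAGE_DICT = """
<<
  /Type /XObject
  /Subtype /Image
  /Width 2550
  /Height 3300
  /ColorSpace /DeviceRGB
  /BitsPerComponent 8
  /Filter /FlateDecode
  /Length 2524500
>>
"""

# Channels per pixel for the two device color spaces discussed here.
CHANNELS = {"DeviceRGB": 3, "DeviceGray": 1}


def describe_image(dict_text: str, page_width_in: float = 8.5) -> dict:
    """Read a few entries from a flat image dictionary and derive sizes.

    page_width_in is an assumption (US Letter); a real tool would read
    the page's /MediaBox instead.
    """
    def number(key: str) -> int:
        return int(re.search(rf"/{key}\s+(\d+)", dict_text).group(1))

    def name(key: str) -> str:
        return re.search(rf"/{key}\s*/(\w+)", dict_text).group(1)

    width, height = number("Width"), number("Height")
    bits = number("BitsPerComponent")
    channels = CHANNELS[name("ColorSpace")]
    # Each image row is padded to a whole number of bytes.
    row_bytes = (width * channels * bits + 7) // 8
    return {
        "dpi": width / page_width_in,
        "raw_bytes": row_bytes * height,
        "stored_bytes": number("Length"),
    }


if __name__ == "__main__":
    info = describe_image(IMAGE_DICT)
    print(info)
```

For the dictionary above, the decoded image is 2,550 × 3,300 × 3 = 25,245,000 bytes, so the /Length of 2,524,500 bytes corresponds to a 10:1 lossless ratio, and the page still carries roughly 2.4 MB for a single image.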

2. The Causes of File Bloat in Scanned Documents

Why do scanned and OCR PDFs grow so large? Several key factors contribute to document bloat:
1. Resolution Overload: The resolution of a scan is measured in Dots Per Inch (DPI). A higher DPI captures finer details, but the pixel count grows with the square of the DPI. A Letter-sized page scanned at 150 DPI contains about 2.1 million pixels. At 300 DPI, it contains about 8.4 million pixels. At 600 DPI, it contains about 33.7 million pixels. Scanning at 600 DPI creates 16 times as many pixels as scanning at 150 DPI, which is often unnecessary for general utility documents.
2. Excess Color Depth: Scanners often default to 24-bit RGB color mode, even when capturing black-and-white text. Capturing color data for off-white backgrounds, ink bleeds, and scanner noise adds significant overhead. If a document only contains black text, storing three color channels is wasteful.
3. Scanning Artifacts and Noise: Physical paper is rarely uniform. Creases, dust, highlights, and ink bleeds are captured by high-resolution sensors as image noise. Lossless compression algorithms struggle with this high-frequency noise, leading to larger file sizes.
4. Unoptimized OCR Structures: OCR engines often write coordinate positions with high decimal precision (e.g., 54.12345 680.67890 Td). This level of precision is unnecessary for document viewing. Additionally, the OCR engine may embed a full TrueType font file inside the PDF for the invisible text, adding further overhead.
5. Incremental Update Overhead: When a PDF is edited or signed, many editing tools do not rewrite the entire file. Instead, they append changes to the end of the file and update the cross-reference table to point to the new objects, while the old, superseded versions of those objects remain inside the file. Over time, these incremental additions accumulate. A proper compressor must rewrite the entire object tree, purging unused historical data.
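The first two factors are easy to quantify. The sketch below computes pixel counts and raw (pre-compression) image sizes for one page at several resolutions and color depths. The US Letter page size of 8.5 × 11 inches and the function name `raw_scan_size` are assumptions for illustration.

```python
def raw_scan_size(dpi: int, bits_per_pixel: int, page_in=(8.5, 11.0)):
    """Return (pixel_count, raw_bytes) for one scanned page.

    page_in defaults to US Letter, an assumption for this example.
    """
    width_px = round(page_in[0] * dpi)
    height_px = round(page_in[1] * dpi)
    # Rows are padded to whole bytes, which matters for 1-bit images.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return width_px * height_px, row_bytes * height_px


if __name__ == "__main__":
    for dpi in (150, 300, 600):
        for label, bpp in (("RGB", 24), ("Gray", 8), ("Mono", 1)):
            pixels, size = raw_scan_size(dpi, bpp)
            print(f"{dpi:>3} DPI {label:<4} {pixels:>11,} px {size:>12,} bytes")
```

Doubling the DPI quadruples the pixel count, so 600 DPI yields exactly 16 times the pixels of 150 DPI, while dropping from 24-bit RGB to 8-bit grayscale cuts the raw data to one third.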

3. Resolution Tuning and Resampling: The Mathematical Core

To reduce the size of OCR PDF documents, MojoDocs downsamples the high-resolution images within the /XObject streams. Downsampling reduces the width and height of the image grid, but maintaining text readability requires high-quality resampling. Simple methods like nearest-neighbor or bilinear interpolation are fast, but they tend to introduce jagged artifacts or blur fine text. Bilinear interpolation averages nearby pixels, softening the contrast between black ink and white paper, which makes small text hard to read. To prevent this, MojoDocs uses Lanczos-3 resampling. The Lanczos filter uses a windowed, multi-lobed sinc function as a kernel to interpolate pixel values. The formula for the Lanczos kernel is:
L(x) = sinc(x) * sinc(x / a)  for -a < x < a
     = 0                      otherwise
Here sinc(x) is the normalized sinc function, sin(πx) / (πx), with sinc(0) = 1. For Lanczos-3, the parameter a is set to 3. In two dimensions this three-lobed kernel spans a 6 × 6 neighborhood, so each output pixel draws on 36 source pixels at unit scale; when downsampling, the kernel is stretched by the scale factor and covers proportionally more source pixels. This preserves sharp contrast transitions along text boundaries, keeping characters legible even when resolution is reduced.

We also apply color space conversion to reduce the data footprint. Converting a 24-bit RGB scan to 8-bit grayscale removes the color channels while preserving luminance. This reduces the raw image data size by about 67% before compression. The conversion calculates luminance using the ITU-R BT.601 weights: Y = 0.299R + 0.587G + 0.114B. For documents that do not require color, this conversion dramatically reduces file size.

For color and grayscale scans, MojoDocs uses JPEG compression (the /DCTDecode filter) with a quality factor of 75. This quality level balances size reduction and visual clarity, typically achieving a compression ratio of around 10:1 on image streams while keeping text sharp and legible for human readers and OCR engines.

Additionally, we can apply monochrome binarization for purely textual documents. This method converts pixels into pure black or pure white (1 bit per pixel), reducing the raw image data size by about 96% compared to 24-bit RGB. MojoDocs uses an adaptive thresholding algorithm that calculates the local mean and contrast of the surrounding pixels, so that thin characters are not lost and dark backgrounds are flattened.
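To make the kernel and the luminance formula concrete, here is a minimal pure-Python sketch. It resamples a single row of 8-bit samples by an integer factor, stretching the kernel by that factor so it also acts as a low-pass filter. The function names, the edge-clamping strategy, and the integer-factor restriction are assumptions of this example; it is not MojoDocs' WebAssembly implementation.

```python
import math


def sinc(x: float) -> float:
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def lanczos(x: float, a: int = 3) -> float:
    """Lanczos kernel: sinc(x) * sinc(x / a) inside (-a, a), else 0."""
    return sinc(x) * sinc(x / a) if -a < x < a else 0.0


def downsample_row(row, factor: int, a: int = 3):
    """Downsample one row of 0..255 samples by an integer factor."""
    out = []
    for i in range(len(row) // factor):
        # Center of output pixel i, expressed in source coordinates.
        center = (i + 0.5) * factor - 0.5
        total = weight_sum = 0.0
        first = math.floor(center - a * factor)
        last = math.ceil(center + a * factor)
        for j in range(first, last + 1):
            # Stretch the kernel by the downsampling factor.
            w = lanczos((j - center) / factor, a)
            # Clamp to the row edges (one of several border strategies).
            sample = row[min(max(j, 0), len(row) - 1)]
            total += w * sample
            weight_sum += w
        # Normalize, then clamp: the negative lobes can overshoot.
        out.append(min(255, max(0, round(total / weight_sum))))
    return out


def luma_bt601(r: int, g: int, b: int) -> int:
    """Grayscale value from 8-bit RGB using the BT.601 weights."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


if __name__ == "__main__":
    edge = [0] * 16 + [255] * 16   # a hard black-to-white text edge
    print(downsample_row(edge, 4))
    print(luma_bt601(255, 0, 0))
```

A two-dimensional resize applies the same row operation horizontally and then vertically, which is why the kernel is usually described as separable.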

4. Optimizing the OCR Text Layer

In addition to image downsampling, optimizing the OCR text layer is necessary to compress scanned PDF documents effectively. A poorly structured OCR layer can add significant overhead. MojoDocs optimizes the OCR layer by rounding text coordinate values to two decimal places (e.g., 72.35 750.12 Td instead of 72.34567 750.12345 Td). This reduces the size of the text stream by up to 15% without affecting search accuracy.

We also replace fully embedded font files with subsetted versions or with the standard PDF core fonts (like Helvetica or Times-Roman), which need only basic font descriptors. Since the search layer is invisible, it does not need a complex embedded font to render properly; a simple reference to a standard font is sufficient, which reduces file overhead.

Finally, MojoDocs consolidates fragmented text streams. OCR engines often write separate content streams for each line or paragraph. MojoDocs merges these into a single, continuous stream per page, reducing the per-object dictionary overhead. We also look at the structure of the /ToUnicode mapping tables. PDF readers use these tables to map glyph indices back to Unicode character codes, enabling copy-paste functionality. Many OCR generators produce redundant, loosely formatted maps. MojoDocs reorganizes these maps into consolidated ranges, reducing the byte footprint of the mapping streams by up to 40% while preserving copy-paste accuracy.
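The coordinate rounding step can be illustrated with a regular expression over decoded content-stream text. This is a simplified sketch: the function name `round_operands` is hypothetical, and it assumes the text contains no decimal numbers inside literal string operands. A production implementation would tokenize the stream and skip strings.

```python
import re

# Matches decimal numbers such as 72.34567 or -0.5 (integers are left alone).
DECIMAL = re.compile(r"-?\d+\.\d+")


def round_operands(stream_text: str, places: int = 2) -> str:
    """Round every decimal operand to a fixed number of places.

    Assumes stream_text is already decoded and contains no decimals
    inside (literal string) operands.
    """
    def shorten(match: re.Match) -> str:
        text = f"{float(match.group()):.{places}f}"
        if "." in text:
            # Drop trailing zeros: 1.50 -> 1.5, 2.00 -> 2
            text = text.rstrip("0").rstrip(".")
        return "0" if text == "-0" else text

    return DECIMAL.sub(shorten, stream_text)


if __name__ == "__main__":
    print(round_operands("72.34567 750.12345 Td"))  # 72.35 750.12 Td
```

Stripping trailing zeros after rounding saves a few more bytes per operand, which adds up across the thousands of positioning operators in a multi-page OCR layer.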

5. Security & Privacy: The Danger of Cloud PDF Reducers

Using online cloud tools to compress scanned PDF documents introduces significant security risks, especially when handling personal identity documents. In India, identity verification commonly requires scanned copies of documents like Aadhaar cards (issued by UIDAI), PAN cards (issued by the Income Tax Department), passports (issued by MEA), and driving licences or registration certificates (issued under MoRTH and managed through Parivahan). These files contain sensitive personal information, including names, dates of birth, addresses, signature files, and biometric hashes. When you upload a PDF containing a scanned Aadhaar or PAN card to an online compressor, the file is processed and stored on a remote server. While many platforms state that they delete files within 24 hours, users have no way to verify this. If the cloud provider experiences a security breach, your personal documents could be exposed, putting you at risk of identity theft and financial fraud.

In India, many citizens also use local Xerox shops, cyber cafes, or print services like Blinkit, Zepto, or Swiggy Instamart. In these environments, files are often sent via WhatsApp Web or email to a shared computer, downloaded to a public desktop, and compressed using online tools. Once the task is complete, the files often remain on the computer, exposing your private information to subsequent customers. MojoDocs reduces this risk by processing files locally within the browser, so the compressor itself leaves no copies behind on the computer.

Additionally, the Digital Personal Data Protection (DPDP) Act of 2023 sets strict standards for how businesses handle citizens' personal data. Under this law, companies are responsible for securing the data they process. Sharing client documents (such as tax statements, land registry papers, or identity scans) with unauthorized third-party cloud services can expose businesses to significant legal liability and financial penalties. Using a local-first tool like MojoDocs supports compliance with the DPDP Act by keeping all data processing within the local environment. By processing files entirely on-device, you remove the risk of third-party interception. Even if you are using a compromised connection or a public Wi-Fi network, the document is never transmitted; it stays sandboxed within your local browser memory.

6. Local-First WebAssembly Architecture: MojoDocs' Solution

MojoDocs solves the privacy and performance issues of document compression by processing files locally using WebAssembly (WASM). This architecture allows you to run high-performance C++ and Rust processing engines directly inside your web browser. When you use MojoDocs, the application downloads the WASM engine once and runs it inside a secure sandbox in your browser. The file processing happens entirely within your system's RAM. Because the files are never uploaded to a server, you can compress files safely and privately.

The Flight Mode Verification

1. Open MojoDocs.
2. Turn off Wi-Fi/Internet.
3. Process the file.
4. It completes instantly without any data leaving your device.

This verification demonstrates that MojoDocs runs offline. You can load the tool, disconnect your device from the internet, and compress a PDF. The task will complete successfully, proving that no network transmission was needed to process the file. For a more detailed audit, you can monitor network activity using your browser's built-in developer tools:
1. Navigate to the MojoDocs PDF Compressor.
2. Open the Developer Tools by pressing F12 (or Cmd + Option + I on macOS) and select the Network tab.
3. Select or drag a PDF into the browser window.
4. Click Compress PDF and watch the Network tab.
You will see that no files are uploaded and no external API requests are made. The compression runs entirely on your device's CPU.

This browser-side design also prevents UI freezes. We isolate the WebAssembly engine within a dedicated Web Worker thread. This thread handles the binary decompression, downsampling, and packaging, while the browser's main thread remains free to render progress updates and respond to user inputs.

7. The Economics of Document Compression

Shifting document processing to the client side also offers significant economic benefits. Let's compare the costs of MojoDocs against subscription-based cloud services and physical copy centers in India.
| Method | Cost | Privacy |
|---|---|---|
| Adobe Acrobat Pro License | ~₹1,593 per month (~₹19,116 per year) | High (Processes files locally, but requires subscription) |
| Cloud SaaS Compressors | ~₹450 to ₹750 per month (~₹5,400 to ₹9,000 per year) | Low (Requires file uploads to third-party servers) |
| Local Cyber Cafe / Copy Center | ₹10 to ₹20 per page scan/processing fee | Low (Files left on shared public desktops) |
| MojoDocs PDF Compressor | ₹0 (Free, unlimited files) | Maximum (100% local WebAssembly processing) |
For a small legal firm, independent accounting practice, or consultancy team with 10 members, switching to MojoDocs for routine compression and document merging can save about ₹1,91,000 per year in licensing fees (10 seats at roughly ₹19,116 each). For students, job seekers, and applicants who only need to prepare files for portals like UPSC, JEE, or NEET, avoiding subscription fees provides immediate financial relief.

Additionally, client-side tools save bandwidth. Uploading a 60MB file to a cloud service on a mobile connection takes time and consumes data. With MojoDocs, the file is processed locally, saving mobile data and typically completing the task in seconds instead of minutes. Because no servers process your documents, we don't pay for cloud computation or document storage. This architecture keeps MojoDocs free and accessible to everyone.
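The figures in this section follow from simple arithmetic, sketched below. The licence price and team size come from the comparison above; the 5 Mbit/s mobile uplink speed is purely an assumption chosen for illustration, and real connections vary widely.

```python
def annual_license_cost(monthly_inr: int, seats: int) -> int:
    """Yearly subscription cost for a team, in rupees."""
    return monthly_inr * 12 * seats


def upload_seconds(file_mb: float, uplink_mbit_per_s: float) -> float:
    """Idealized upload time, ignoring protocol overhead and latency."""
    return file_mb * 8 / uplink_mbit_per_s


if __name__ == "__main__":
    # 10 seats at ~Rs 1,593 per month each
    print(annual_license_cost(1593, 10))   # 191160
    # 60 MB over an assumed 5 Mbit/s uplink
    print(upload_seconds(60, 5))           # 96.0
```

Under that assumed uplink speed, the upload alone takes over a minute and a half before any server-side processing or download begins.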

8. Step-by-Step Guide to Optimizing Scanned PDFs

To get the best results when you compress scanned PDF files, follow these steps to optimize your scanner settings and compress the final document.

Step 1: Set Scanner Settings
Before scanning a document, adjust your scanner settings to balance quality and file size:
* Resolution: Choose 150 DPI for normal documents or 200 DPI for text with small fonts. Avoid 300 or 600 DPI unless you need high-quality print outputs.
* Color Mode: Use Grayscale for standard documents or Monochrome for pure black-and-white text. Use full color only if the document contains images or color branding.
* Contrast: Increase the contrast slightly to sharpen text edges and clean up the page background.

Step 2: Run OCR Processing
Run your OCR engine to make the text searchable. Ensure that the OCR software is set to embed the text as an "Invisible Text Layer" or "Searchable Image" under the raster image. This setup maintains the original scanned appearance while enabling text search.

Step 3: Compress with MojoDocs
Open the MojoDocs PDF Compressor. Drag and drop your file into the work area. Choose your compression level: "Medium" downsamples images to 150 DPI and applies grayscale conversion, while "High" applies 1-bit monochrome binarization for maximum file savings. Click Compress PDF, and download the finished file in seconds.

Step 4: Audit and Verify
Confirm the final file size and layout quality. Zoom in on small characters to ensure they remain clear and legible. Open the Network tab of the browser's developer tools during compression to verify that no network requests were sent and that your document stayed on your device.
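When an existing scan is downsampled to one of the resolutions recommended above, the target pixel dimensions follow directly from the ratio of the two DPI values. A minimal sketch, with the function name `target_dimensions` chosen for illustration:

```python
def target_dimensions(width_px: int, height_px: int,
                      source_dpi: int, target_dpi: int):
    """Pixel dimensions after resampling from source_dpi to target_dpi.

    Never upsamples: if the scan is already at or below the target
    resolution, the original dimensions are returned unchanged.
    """
    if target_dpi >= source_dpi:
        return width_px, height_px
    new_w = max(1, round(width_px * target_dpi / source_dpi))
    new_h = max(1, round(height_px * target_dpi / source_dpi))
    return new_w, new_h


if __name__ == "__main__":
    # A Letter page scanned at 600 DPI, reduced to 150 DPI.
    print(target_dimensions(5100, 6600, 600, 150))  # (1275, 1650)
```

Refusing to upsample matters in practice: enlarging a low-resolution scan adds bytes without adding detail.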

9. Advanced Document Tuning Techniques

To further optimize your scanned documents, consider implementing advanced tuning techniques. First, clean up the digital image pages before running compression. If your scan contains dark borders or shadow artifacts near the binding edge (common when scanning books or folded pages), use a cropping tool to remove these areas. These dark regions consist of complex noise patterns that are difficult to compress. Removing them can reduce file sizes significantly. Second, consider standardizing your document page dimensions. Scanned pages sometimes have slightly different heights or widths due to scanner feeding variances. Standardizing the page sizes using a layout tool simplifies the PDF geometry structure, making the file easier to compress and parse. Third, verify your image interpolation settings. If you downsample a grayscale image, using a mild unsharp mask filter after downsampling can enhance contrast along text boundaries, ensuring readability remains high even at lower resolutions. Finally, understand the target requirements of the portals you are uploading to. For example, many Indian government portals require files to be under 100KB, 200KB, or 500KB. Knowing these targets allows you to select the appropriate compression level in MojoDocs. If your file is still too large, try converting it to grayscale or monochrome to meet the limit. By matching the processing options to your target portal requirements, you can achieve the best compression ratio while maintaining the legibility of your documents.
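Matching a compression level to a portal's size limit amounts to picking the least aggressive setting whose output fits. The sketch below assumes you have already compressed the file at each level and measured the results. The sizes in the example are hypothetical, and treating 1 KB as 1,024 bytes is an assumption, since some portals count 1,000 bytes.

```python
def pick_preset(results, limit_kb: int, bytes_per_kb: int = 1024):
    """Return the first preset whose output fits under the limit.

    results: (preset_name, size_in_bytes) pairs ordered from least to
    most aggressive, so the gentlest acceptable setting wins.
    Returns None when nothing fits.
    """
    for name, size in results:
        if size <= limit_kb * bytes_per_kb:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical measurements for one scanned page.
    measured = [("Original", 2_400_000), ("Medium", 310_000), ("High", 95_000)]
    for limit in (500, 200, 100, 50):
        print(limit, "KB ->", pick_preset(measured, limit))
```

A result of None signals that compression alone will not meet the limit, at which point cropping borders or rescanning at a lower resolution is the next step.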

10. Visual Artifacts and Quantization Filtering

Let's analyze the visual artifacts that can arise from aggressive document compression. When you compress scanned PDF pages using lossy filters, compression artifacts can appear around high-contrast edges. In JPEG compression, these are called DCT ringing artifacts. They manifest as small halos or pixel speckles around the letters. To minimize these artifacts, the MojoDocs compressor uses an adaptive quantization matrix. When it detects that a block of pixels contains text structures, it applies a lower quantization step to the high-frequency coefficients, preserving character edges while applying higher compression to uniform white or gray background areas. This selective compression keeps the document legible even at high compression ratios.

Another consideration is font embedding. Many OCR engines embed fonts without subsetting them. Subsetting is the process of generating a new font file containing only the characters used in the document. For example, if a document only uses the characters 'a', 'b', and 'c', the subsetted font will only contain glyph data for those three letters. MojoDocs reads the document's content streams, builds a list of used character codes, and removes unused glyphs from the embedded font programs. This font optimization step can reduce the size of the embedded font objects from several megabytes to just a few kilobytes.
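The per-block adaptive scheme described above is specific to MojoDocs and is not reproduced here. As a simpler illustration of how a quality factor translates into quantization steps, the sketch below follows the scaling convention used by the libjpeg reference encoder, applied to the example luminance table from Annex K of the JPEG standard. The function name `scale_quant_table` is chosen for this example.

```python
# Example luminance quantization table from JPEG Annex K (row-major order).
BASE_LUMINANCE = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]


def scale_quant_table(quality: int, base=BASE_LUMINANCE):
    """Scale a base table by a 1..100 quality factor (libjpeg convention).

    Smaller steps keep more high-frequency detail (sharper text edges);
    larger steps discard more and compress harder.
    """
    quality = min(100, max(1, quality))
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    # Integer rounding, with every step clamped to the 1..255 range.
    return [min(255, max(1, (q * scale + 50) // 100)) for q in base]


if __name__ == "__main__":
    table = scale_quant_table(75)
    for row in range(8):
        print(table[row * 8:(row + 1) * 8])
```

At quality 75 every step is roughly half its base value, and the entries toward the bottom right of the table, which govern the highest frequencies, are the ones an edge-aware encoder would most want to keep small around text.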