By Divye Jain, Engineer
Tue Mar 10 2026

From 60 Seconds to 50ms: The Architecture Trick Behind Instant PDF Search


At finrep.ai, our users live inside long, dense PDFs – annual reports, regulatory filings, investor decks – documents so big people don’t skim them, they study them. For analysts and investors, fast search isn’t a “nice-to-have,” it’s essential. Initially, our in-browser PDF viewer was painfully slow: a single search in a 100-page report would freeze the browser for 15–20 seconds while it rendered every page’s text. This dragged analysts out of their flow and even undermined trust in our product. (In today’s world, “search” is fundamental; users expect results to be instant.)

Business impact: Slow search frustrates users and wastes time. For example, one financial tool’s guide notes that automating PDF search can “reduce hours spent manually searching through documents,” boosting accuracy and speed. By contrast, a 15-second delay per query makes people give up or doubt the tool.

The Naïve Approach and Why It Fails

Most PDF viewers (and tutorials) implement search by rendering each page’s text layer and then reading the text from it. A typical search function might do:

function searchPDF(query) {
  for (let page = 1; page <= totalPages; page++) {
    renderTextLayer(page);           // Draws all text on screen
    const text = extractText(page);  // Reads text from that drawn layer
    if (text.includes(query)) {
      highlightText(page);
    }
  }
}
  • renderTextLayer(page): Builds a visual overlay of every character (positioning each letter with canvas.measureText and DOM elements) so the text can be selected or highlighted.
  • extractText(page): Reads text strings from that rendered layer.
  • Search logic: We then search those strings for the query.

This works on small PDFs, but scales terribly. Rendering a text layer is expensive – the browser computes coordinates, styles, and DOM nodes for every character – and calling renderTextLayer() on hundreds of pages blocks the UI, often costing hundreds of milliseconds per page. For a 200-page 10-K report, a single search could take minutes. It’s like printing the entire filing just to find one word.

Result: The UI freezes and the browser hangs during each search, a dealbreaker for analysts. This mistake is everywhere because most PDF.js tutorials use it: it “just works” on small files, so nobody notices until a real user uploads a 300+ page filing.

The Key Idea: Separate Search from Display

We realized: You don’t need to draw every page to search its text. PDF.js provides two different operations:

  • page.getTextContent(): Retrieves the raw text content of a page without drawing it. This is a lightweight in-memory operation that returns the page’s words and positions.
  • renderTextLayer(): Takes text content and creates the visual overlay of positioned characters on the screen.

Smart approach: Use getTextContent() to extract all text upfront (in the background), but only call renderTextLayer() for the page currently in view (when the user scrolls to it). In other words, we build an in-memory index of the document’s text, then search that index instantly. We avoid re-rendering pages that the user isn’t looking at.

This is like searching a text file vs. re-printing the file to find something. By indexing the text once, every search is fast, even on huge documents.
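The idea in miniature (the function names here are illustrative, and the page text is stubbed in as plain strings – in the real viewer it comes from getTextContent()):

```javascript
// Build the lowercase index once, then search it as many times as you like.
// `pages` maps page numbers to raw text.
function buildIndex(pages) {
  const index = {};
  for (const [pageNum, text] of Object.entries(pages)) {
    index[pageNum] = text.toLowerCase(); // lowercase once, up front
  }
  return index;
}

function searchIndex(index, query) {
  const q = query.toLowerCase();
  return Object.keys(index)
    .filter(pageNum => index[pageNum].includes(q))
    .map(Number); // page numbers containing the query
}
```

Every search after the one-time buildIndex() call is a plain substring scan over strings already in memory – no DOM, no rendering.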

Implementation Steps

We updated the PDF viewer as follows.

Extract all text on load (background indexing)

When a PDF is opened, we loop through all pages (in small batches) and call page.getTextContent().

const pageTexts = {};  // pageNum → TextContent from getTextContent()

async function extractAllPagesText(totalPages) {
  const BATCH_SIZE = 5;  // small batches keep the main thread responsive
  for (let i = 1; i <= totalPages; i += BATCH_SIZE) {
    const batch = Array.from({ length: BATCH_SIZE }, (_, k) => i + k)
      .filter(n => n <= totalPages);
    await Promise.all(batch.map(async pageNum => {
      const page = await pdf.getPage(pageNum);  // pdf: the loaded PDFDocumentProxy
      pageTexts[pageNum] = await page.getTextContent();
    }));
  }
}

extractAllPagesText(pdf.numPages);

This runs quietly in the background (about 1–3 seconds for a 200-page doc). The user can start reading page 1 immediately while pages 2–200 are being indexed.
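The batch arithmetic above is the kind of thing that hides off-by-one bugs, so it helps to see it as a pure helper (illustrative, not part of PDF.js):

```javascript
// Split page numbers 1..total into consecutive batches of `size` –
// the same grouping the extraction loop above awaits one at a time.
function pageBatches(total, size) {
  const out = [];
  for (let i = 1; i <= total; i += size) {
    const len = Math.min(size, total - i + 1);
    out.push(Array.from({ length: len }, (_, k) => i + k));
  }
  return out;
}
```

The last batch simply comes out shorter when the page count isn’t a multiple of the batch size.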

Search the in-memory index

Now, when the user types a query, we simply scan the stored text.

function searchInDocument(query) {
  if (!query.trim()) return [];
  const lowerQuery = query.toLowerCase();
  const results = [];
  for (let page = 1; page <= totalPages; page++) {
    const textContent = pageTexts[page];
    if (!textContent) continue;  // page not extracted yet – skip for now
    // Join the page's text items; this joined string could also be cached
    // per page to avoid re-joining on every keystroke.
    const pageText = textContent.items.map(item => item.str).join(' ').toLowerCase();
    if (pageText.includes(lowerQuery)) {
      results.push(page);
    }
  }
  return results;  // array of page numbers containing the query
}

This is pure string matching. In practice, searching a 200-page document this way takes under 50ms. No rendering, no waiting.
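The same index supports richer results with nothing but string operations. For example, a hypothetical helper (not in our viewer above) that counts hits per page, handy for a “page 12 (3 matches)” style result list:

```javascript
// Count non-overlapping occurrences of `query` in a page's lowercased text.
function countMatches(pageText, query) {
  let count = 0;
  let idx = pageText.indexOf(query);
  while (idx !== -1) {
    count++;
    idx = pageText.indexOf(query, idx + query.length);
  }
  return count;
}
```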

Render text layers only as needed

When showing pages (or highlighting search hits), we only call renderTextLayer() on the current page (and maybe immediate neighbors).

const alreadyRendered = new Set();  // pages that already have a text layer

function renderVisibleTextLayers(currentPage) {
  const pagesToRender = [currentPage - 1, currentPage, currentPage + 1]
    .filter(n => n >= 1 && n <= totalPages);
  pagesToRender.forEach(pageNum => {
    if (!alreadyRendered.has(pageNum)) {
      renderTextLayer(pageNum);
      alreadyRendered.add(pageNum);
    }
  });
}

renderVisibleTextLayers(currentPage);

Even on a 500-page filing, only the pages the user actually visits ever get a text layer – three at any one scroll position – while the rest remain un-rendered.
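If memory on very long sessions is a concern, an optional extension (not something our viewer above does – the names here are illustrative) is to tear down layers far from the current page:

```javascript
// Pages whose rendered text layer is far from the current page and could be
// removed to bound memory. `renderedPages` is a Set of page numbers.
function layersToEvict(renderedPages, currentPage, keepRadius = 2) {
  return [...renderedPages].filter(p => Math.abs(p - currentPage) > keepRadius);
}
```

The caller would remove each returned page’s text-layer DOM nodes and delete it from the rendered set, so re-visiting that page simply re-renders it.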

Debounce search input

We wait ~300ms after the user stops typing before running the search (common UX pattern). For instance, in React:

useEffect(() => {
  const timer = setTimeout(() => {
    setSearchResults(searchInDocument(searchQuery));
  }, 300);
  return () => clearTimeout(timer);
}, [searchQuery]);

This means rapid typing won’t trigger dozens of searches. It just waits for the user to pause, then searches.
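Outside React, the same pattern is a ten-line generic debounce (a sketch, not tied to any library):

```javascript
// Return a wrapped function that only fires after `delayMs` of silence;
// each new call cancels the previously scheduled one.
function debounce(fn, delayMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

// Hypothetical wiring:
// searchInput.addEventListener('input',
//   debounce(e => showResults(searchInDocument(e.target.value)), 300));
```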

Performance: Night and Day

The improvement is dramatic. In one example:

  • Old way: Search 100-page PDF took ~30–60 seconds (UI frozen).
  • New way: Same search takes ~50 milliseconds (instant).

Even a 400-page document goes from minutes to ~200ms. The only delay is the initial 1–3 second extraction, which happens while the user reads page 1. After that, every search is under 0.2 seconds – essentially real-time.

Result: Users get results before they’ve even finished typing. Searching a 300-page annual report for “gross margin” feels instant. The slow lag is gone.

Why Free Viewers Don’t Do This (And Why Paid Ones Do)

  • Tutorials are naïve: Most PDF.js examples focus on getting it working, not speed. They often show “render then extract” code. On small test files, it works, so it ships that way. Only real usage finds the problem.
  • PDF.js is low-level: The library gives you both getTextContent() and renderTextLayer(), but it doesn’t say “use text content for search”. Developers have to know the distinction; many don’t.
  • Paid SDKs solve it: PDF tools like PSPDFKit (now Nutrient), Apryse, and Foxit preprocess documents (often with full-text indexes) so search is millisecond-fast. But those SDKs cost thousands per year. (For example, Nutrient’s starter plans run in the ~$2,500/year range, with enterprise deals much higher.) If you only need fast text search, you don’t need to buy an entire suite.
  • Scanned PDFs: This approach only works with text PDFs. If a PDF is a scan (just images), there’s no text to get – you’d need OCR. Browser-based OCR is slow and imprecise. This is why some paid solutions justify their cost: they handle OCR and complex layouts that free tools skip.
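A cheap way to at least flag that limitation to users is a heuristic check on the extracted text (a sketch – `textContent` has the shape page.getTextContent() resolves to):

```javascript
// Heuristic: a page with no non-whitespace text items is probably a scanned
// image, so text search would silently miss it without OCR.
function looksScanned(textContent) {
  return !textContent.items.some(item => item.str.trim().length > 0);
}
```

Running this over pageTexts after extraction lets the viewer warn “this document appears to be scanned; search may miss content” instead of quietly returning zero results.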

The Business Takeaway

Fast search is not just a technical improvement – it directly impacts users and business. With the new approach:

  • User satisfaction: Analysts find info immediately, staying in flow. They trust the tool.
  • Product credibility: It feels like an enterprise-grade application, not a slow prototype.
  • Time saved: Automating search saves hours of manual scanning. In contrast, if the tool is slow, users give up or switch away, hurting engagement and retention.

Conclusion: Search Should Be Invisible

Separate the data layer from the display layer. Extract all text once (data layer) and search it. Render only what you need (display layer). Never block the UI with unnecessary rendering.

At finrep.ai, people spend serious time in these documents. Search must be invisible – it should “just work” under the hood. By splitting extraction from rendering, we achieved instant search without any extra services or huge costs. The difference between a slow PDF viewer and a fast one often comes down to understanding this detail.

With this solution, our users get the information they need before they even realize they searched.

Remember: for any PDF viewer, perform text extraction upfront and do pure string search on that data. Then render highlights only on demand. It’s a small change in code that delivers huge benefits – keeping users engaged, saving them hours, and making your product feel fast and reliable.
