JavaScript Text Summarizer

Question

I wrote this partly to help me read, edit, and summarize my own work, but mainly to get through the mounds of papers I was asked to read as a graduate student.

This is a summarization algorithm based on what I learned in first or second grade. I tried to simply get the "main idea" of each paragraph by taking its first sentence -- I have since updated the program to allow the first two sentences. I have also modified the program so that terms such as Theorem/Definition/Corollary followed by a period are not treated as complete sentences. I am not sure how well it parses text copied from a PDF compared with, for instance, plain text.

Now that I'm actually sharing this, I'm thinking about things I would want the summarizer to ignore, such as comments in LaTeX and math between \[ and \].

<pre>
    <!DOCTYPE html>
    <html>
    <head>
    <title>Simple Text Summarizer with PDF Support</title>
    <meta charset="UTF-8">
    <meta name="description" content="A simple web-based summarizer that extracts the first sentence from each paragraph. Includes special handling for text copied from PDF files.">
    <meta name="keywords" content="text summarizer, PDF summarizer, paragraph summarizer, first sentence summary, JavaScript tool, text analysis">
    <meta name="author" content="Your Name">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    </head>
    <body>
    <p id="intro">
    This is a naive way of summarizing large blocks of text. It is based on the elementary school idea that the main idea of a text can be found in the first sentence of the first paragraph. So given a long text, it will output the first sentence of each paragraph as the summary.
    </p>
    <textarea name="indata" id="indata" rows="20" cols="50"></textarea><br>
    <label>
    <input type="checkbox" id="isPDF"> PDF-style input (lines joined into paragraphs)
    </label><br>
    <label for="sentenceCount">Sentences per paragraph:</label>
    <select id="sentenceCount">
    <option value="1">1</option>
    <option value="2" selected>2</option>
    </select><br>
    <input type="button" onclick="summarize(document.getElementById('indata').value)" value="Summarize"><br><br>
    <p name="outdata" id="outdata"></p>
    </body>
    </html>
</pre>

    function summarize(text) {
      const isPDF = document.getElementById("isPDF").checked;
      const sentenceCount = parseInt(document.getElementById("sentenceCount").value);
      let paras;

      if (isPDF) {
        const lines = text.split('\n');
        paras = [];
        let paragraph = '';
        for (let i = 0; i < lines.length; i++) {
          if (lines[i].trim() === '') {
            if (paragraph.trim().length > 0) {
              paras.push(paragraph.trim());
              paragraph = '';
            }
          } else {
            paragraph += lines[i].trim() + ' ';
          }
        }
        if (paragraph.trim().length > 0) {
          paras.push(paragraph.trim());
        }
      } else {
        paras = text.split('\n');
      }

      const termRegex = /\b(lemma|theorem|conjecture|proposition|remark)\s*\d*\.?$/i;

      let out = `<b>Summary (${sentenceCount} sentence${sentenceCount > 1 ? "s" : ""} per paragraph)</b><br><br>`;

      for (let i = 0; i < paras.length; i++) {
        if (paras[i].length > 1) {
          let rawSentences = paras[i].split(/(?<=\.)\s+/);
          let sentences = [];

          for (let j = 0; j < rawSentences.length; j++) {
            let current = rawSentences[j].trim();

            if (termRegex.test(current) && j + 1 < rawSentences.length) {
              // Merge with next sentence if it ends in formal term
              current += ' ' + rawSentences[j + 1].trim();
              j++; // Skip the next sentence
            }

            sentences.push(current);
          }

          let snippet = sentences.slice(0, sentenceCount).join(' ');
          out += snippet + "<br><br>";
        }
      }
    
      document.getElementById("outdata").innerHTML = out;
    }

    body {
        font-family: sans-serif;
        margin: 20px;
    }
    textarea {
        width: 80%;
        height: 150px;
        margin-bottom: 10px;
        padding: 10px;
    }
    button {
        padding: 10px 20px;
        background-color: #007bff;
        color: white;
        border: none;
        cursor: pointer;
    }
    .outdata {
        margin-top: 20px;
        border: 1px solid #ccc;
        padding: 10px;
    }

Are you sure you have <html> inside <pre>? That's a "mistake" similar to the one in your previous question: codereview.stackexchange.com/q/297868/36647
– Thomas Weller, Commented Aug 12 at 18:29
Somehow I think that an LLM would be perfectly able to summarize text without having to go through all the possible caveats. Otherwise you may want to pre-process the text with something like pandoc, which can take just about any format and return plain text (or just about anything else).
– Maarten Bodewes, Commented Aug 12 at 18:35
I've found that LLMs are hit and miss. Sometimes they give a good overview of the article, sometimes they entirely miss the point. But this was written around 2007-ish, so pre-LLM. It serves a dual purpose now: when I ask an LLM for a bulleted list (summary) of each paragraph, I don't always get that because the text is too long. Here, there is no maximum text length other than what your own browser can handle.
– Charles G, Commented Aug 12 at 18:42
@CharlesG Oh, OK. I've got a Plus subscription to ChatGPT and nowadays I often find that it is too long-winded rather than not extensive enough. But maybe that's because I've got more tokens or something. It would be easy though to create a GPT that simply gets 1 line if that summarizes the paragraph, 2 lines where required to create a better summary and 3 if that's really, really needed for the other two to make sense, for instance. The trick with AI is that you need to create a very good prompt to get OK(-ish) results. But that's not on topic I guess :P
– Maarten Bodewes, Commented Aug 12 at 19:46

Maarten Bodewes · Accepted Answer · 2025-08-12 20:43:33Z

Best tool for the job

First of all: use the best tool for the job. Running this in a browser may not be the best idea if the files can grow pretty large. It is much better to stream the files from the local filesystem. If you want to stay with JavaScript, you can still use e.g. Node.js. From what I've seen of your code, it is an embarrassingly linear process, so streaming should be fine.

Then you can just read the input line by line (with a different line extractor for each file type, if you haven't "normalized" it to plain text), check whether the sentences suit your requirements, and then add them to your summary file.
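A minimal sketch of that streaming shape, assuming paragraphs are separated by blank lines and reusing the question's rule for splitting sentences. The names `paragraphsFromLines`, `firstSentences` and `summarizeLines` are made up for this example; only one paragraph is held in memory at a time.

```javascript
// In Node.js the lines could come from the built-in readline module, e.g.
//   readline.createInterface({ input: fs.createReadStream(path), crlfDelay: Infinity })
// which is async iterable. Any iterable of strings works as well.

// Groups an (async) iterable of lines into paragraphs separated by blank lines.
async function* paragraphsFromLines(lines) {
  let parts = [];
  for await (const line of lines) {
    const trimmed = line.trim();
    if (trimmed === '') {
      if (parts.length > 0) {
        yield parts.join(' ');
        parts = [];
      }
    } else {
      parts.push(trimmed);
    }
  }
  if (parts.length > 0) {
    yield parts.join(' ');
  }
}

// Returns the first `count` sentences of one paragraph.
// A sentence ends at a dot followed by whitespace (the rule used in the question).
function firstSentences(paragraph, count) {
  return paragraph.split(/(?<=\.)\s+/).slice(0, count).join(' ');
}

// Collects one snippet per paragraph. A real tool would probably write each
// snippet to the summary file right away instead of collecting them.
async function summarizeLines(lines, count) {
  const snippets = [];
  for await (const paragraph of paragraphsFromLines(lines)) {
    snippets.push(firstSentences(paragraph, count));
  }
  return snippets;
}
```

Because `for await` also accepts ordinary iterables, the same functions can be fed an array of lines in a test and a file stream in production.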

I'll however assume you've already got the text loaded.

The copy to memory issue

Lines

const lines = text.split('\n');

This makes me somewhat angry, as this will copy the entire file into memory. It is already bad enough that text contains everything, but there is definitely no need to copy the entire text to go from line to line.

For instance you could do:

function* linesFromText(text) {
  const newlinePattern = /([^\r\n]*)(?:\r?\n|$)/gy;
  let match;
  while ((match = newlinePattern.exec(text)) 
         && (match[0].length || newlinePattern.lastIndex < text.length)) {
    yield match[1];
  }
}

// usage
for (const line of linesFromText(text)) {
  // process each line
}

However, after that you still copy all the text from the PDF into paragraphs, so that alone won't avoid the copy.

I'd recommend another way of doing this, defining an interface which you can then implement for each content type:

class ParagraphSentenceRetriever {
  /** @param {string[]} out */ // mutates `out`
  async retrieveSentencesFromNextParagraph(out) {
    throw new Error('abstract');
  }
}
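As an illustration, one possible implementation for text that is already in memory. Several things here are assumptions rather than part of the interface above: the method resolves to `false` once no paragraph is left, paragraphs are separated by blank lines, and `PlainTextRetriever` and `collectSummary` are hypothetical names.

```javascript
class ParagraphSentenceRetriever {
  /** @param {string[]} out receives the sentences of the next paragraph */
  async retrieveSentencesFromNextParagraph(out) {
    throw new Error('abstract');
  }
}

// Hypothetical implementation for plain text held in a string.
class PlainTextRetriever extends ParagraphSentenceRetriever {
  constructor(text) {
    super();
    this.text = text;
    this.position = 0; // index at which the search for the next paragraph starts
  }

  // Assumption of this sketch: resolves to true if a paragraph was found, false otherwise.
  async retrieveSentencesFromNextParagraph(out) {
    // Matches from the next non-whitespace character up to a blank line or the end of the text.
    const paragraphPattern = /\S[\s\S]*?(?=\r?\n[ \t]*\r?\n|$)/g;
    paragraphPattern.lastIndex = this.position;
    const match = paragraphPattern.exec(this.text);
    if (match === null) {
      return false;
    }
    this.position = paragraphPattern.lastIndex;
    // Join wrapped lines, then split after each dot that is followed by whitespace.
    const paragraph = match[0].trim().replace(/\s+/g, ' ');
    for (const sentence of paragraph.split(/(?<=\.)\s+/)) {
      out.push(sentence);
    }
    return true;
  }
}

// Builds the summary from any retriever, whatever the underlying content type is.
async function collectSummary(retriever, sentencesPerParagraph) {
  const summary = [];
  const sentences = [];
  while (await retriever.retrieveSentencesFromNextParagraph(sentences)) {
    sentences.length = Math.min(sentences.length, sentencesPerParagraph);
    summary.push(sentences.join(' '));
    sentences.length = 0; // reuse the same array for the next paragraph
  }
  return summary;
}
```

A PDF or LaTeX retriever would then only need to provide its own way of finding the next paragraph, while `collectSummary` stays the same.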

Sentences

Of course, just as with the lines, I'd prefer searching over splitting a potentially large text. You first find the bounds of the line, then start looking for sentences within that line, using indices into the text. Only once you want to create your summary does it make sense to copy.
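A rough sketch of that index-based approach, assuming every line is one paragraph (the non-PDF mode of the question) and that a sentence ends at a dot followed by whitespace. `endOfFirstSentences` and `summarizeByIndex` are hypothetical names; the only copy is the final `slice` of each snippet.

```javascript
// Returns the index just past the first `count` sentences inside text[start, end).
// Nothing is copied here; only indices are computed.
function endOfFirstSentences(text, start, end, count) {
  const sentenceEnd = /\.(?=\s)/g; // a dot that is followed by whitespace
  sentenceEnd.lastIndex = start;
  let cut = start;
  for (let found = 0; found < count; found++) {
    const match = sentenceEnd.exec(text);
    if (match === null || match.index >= end) {
      return end; // fewer than `count` sentences: take the whole paragraph
    }
    cut = match.index + 1; // keep the dot itself
  }
  return cut;
}

// Walks the text line by line using indexOf and copies only the snippets.
function summarizeByIndex(text, count) {
  const snippets = [];
  let start = 0;
  while (start < text.length) {
    let end = text.indexOf('\n', start);
    if (end === -1) {
      end = text.length;
    }
    const cut = endOfFirstSentences(text, start, end, count);
    const snippet = text.slice(start, cut).trim(); // the copy happens here
    if (snippet.length > 0) {
      snippets.push(snippet);
    }
    start = end + 1;
  }
  return snippets;
}
```

The merging of terms such as "Theorem 1." with the following sentence is left out to keep the sketch short.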

Smaller remarks

const sentenceCount = parseInt(document.getElementById("sentenceCount").value);

Other developers - including your future self - don't know the purpose of your application, so sentenceCount immediately wrong-foots me: it reads like the number of sentences found in the text. The reader has to know that it is the requested number of sentences per paragraph.

let rawSentences = paras[i].split(/(?<=\.)\s+/);

I'd always precede a regular expression with a comment that tells what it does; otherwise the reader is left with only how it does it. For example:

// splits sentences by a dot followed by at least one whitespace character.

Even though the "term" regex is somewhat more intuitive, I'd still comment it as it may get more complex.

const termRegex = /\b(lemma|theorem|conjecture|proposition|remark)\s*\d*\.?$/i;
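A commented version could look like this; the wording of the comment is only a suggestion based on how the question's code uses the expression.

```javascript
// Matches a fragment that ends in a formal term, optionally followed by a
// number and a dot, e.g. "See Theorem 2." or "by the previous lemma".
// Such a fragment is merged with the fragment that follows it.
const termRegex = /\b(lemma|theorem|conjecture|proposition|remark)\s*\d*\.?$/i;
```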

if (paras[i].length > 1) {

Always avoid nesting too many scopes in your functions. Here you could simply have

// same check as the original `> 1`, inverted into a guard
if (paras[i].length <= 1) {
  continue;
}

Of course, if you'd search for lines / paragraphs and sentences this issue would probably not pop up.

let snippet = sentences.slice(0, sentenceCount).join(' ');

Let's not. This creates another copy during slice. In this case a simple loop would be sufficient. If you'd use the method previously indicated you could just build the snippet after searching and finding the lines of course.

let snippet = '';
// k rather than i, so the index of the enclosing paragraph loop is not shadowed
for (let k = 0; k < sentenceCount && k < sentences.length; k++) {
  if (k > 0) snippet += ' ';
  snippet += sentences[k];
}

const isPDF = document.getElementById("isPDF").checked;

Don't mix UI and functionality. You may want to be able to test your functionality without having to go through the UI.
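One way to draw that line, as a sketch: a pure function that takes the text and the options as arguments, plus a thin wrapper that is the only code touching the DOM. `summarizeText`, `summarizeFromPage` and the option names are hypothetical; the splitting is kept as in the question for brevity, and the merging of terms such as "Theorem 1." is left out.

```javascript
// Pure function: no DOM access, so it can be called from a unit test or from Node.js.
function summarizeText(text, { joinLines = false, sentencesPerParagraph = 1 } = {}) {
  // With joinLines, blank lines separate paragraphs (the "PDF-style" mode);
  // otherwise every line is a paragraph, as in the original code.
  const paragraphs = joinLines
    ? text.split(/\n\s*\n/).map((p) => p.replace(/\s+/g, ' ').trim())
    : text.split('\n').map((p) => p.trim());

  const snippets = [];
  for (const paragraph of paragraphs) {
    if (paragraph.length <= 1) {
      continue;
    }
    // Split after each dot that is followed by whitespace.
    const sentences = paragraph.split(/(?<=\.)\s+/);
    snippets.push(sentences.slice(0, sentencesPerParagraph).join(' '));
  }
  return snippets;
}

// Thin UI wrapper: reads the controls, calls the pure function, shows the result.
// The element ids are the ones used in the question's HTML.
function summarizeFromPage() {
  const snippets = summarizeText(document.getElementById('indata').value, {
    joinLines: document.getElementById('isPDF').checked,
    sentencesPerParagraph: parseInt(document.getElementById('sentenceCount').value, 10),
  });
  // Assigning to innerText inserts the text as plain text, with line breaks rendered as <br>.
  document.getElementById('outdata').innerText = snippets.join('\n\n');
}
```

The button would then call `summarizeFromPage()`, while tests call `summarizeText` directly with plain strings.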

"const lines = text.split('\n'); will copy the entire file into memory." - uh, no. The text string already is in memory. Splitting it into slices is hardly a problem. Your iterator version won't perform much better if the entire text is processed anyway. — Bergi
– Bergi, Commented Aug 13 at 12:03
@Bergi That's an engine optimization, it's not in the contract of string or the method. Safari seems to have (had?) some issues with substring handling. Copying it into paragraphs later on certainly will create a copy though, as it also changes the layout by inserting spaces etc. But OK, yeah, the engines in major browsers do keep references & indices into the original string.
– Maarten Bodewes, Commented Aug 13 at 16:04
"For instance you could do:" So you've substituted exec() for split()? No change. You can just use iterator helpers and operate on the original string. — guest271314
– guest271314, Commented Aug 31 at 18:35
