I wrote this partly to help me read (and edit or summarize) my own work, but mainly to get through the mounds of papers I was asked to read as a graduate student.
This is a summarization algorithm based on what I learned in first or second grade. I originally tried to get the "main idea" of each paragraph by taking just its first sentence; I have since updated the program to allow the first two sentences. I have also modified it so that labels such as Theorem, Definition, or Corollary are not treated as sentences on their own. I am not sure how well it parses text copied from a PDF compared with, for instance, flat text.
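A minimal sketch of that idea as a DOM-free function, which makes it easier to try outside the browser. The names `TERM_REGEX`, `splitSentences`, and `summarizeParagraphs` are hypothetical and only for illustration; the label list assumes Definition and Corollary should be handled the same way as the other labels, as described above.

```javascript
// Hypothetical helper names; this is an illustrative sketch, not the posted program.
// Matches a fragment that ends in a formal label such as "Theorem 2." or "Lemma".
const TERM_REGEX =
  /\b(lemma|theorem|definition|corollary|conjecture|proposition|remark)\s*\d*\.?$/i;

// Split a paragraph into sentences at whitespace that follows a period,
// merging a bare label ("Theorem 1.") with the sentence after it.
function splitSentences(paragraph) {
  const raw = paragraph.trim().split(/(?<=\.)\s+/);
  const sentences = [];
  for (let j = 0; j < raw.length; j++) {
    let current = raw[j].trim();
    if (TERM_REGEX.test(current) && j + 1 < raw.length) {
      current += " " + raw[j + 1].trim();
      j++; // the next fragment has been consumed
    }
    if (current.length > 0) {
      sentences.push(current);
    }
  }
  return sentences;
}

// Keep the first `sentenceCount` sentences of every non-empty paragraph.
function summarizeParagraphs(paragraphs, sentenceCount = 1) {
  return paragraphs
    .map((p) => splitSentences(p).slice(0, sentenceCount).join(" "))
    .filter((s) => s.length > 0);
}
```

For example, `summarizeParagraphs(["Theorem 1. Every bounded sequence has a limit point. The proof is short."], 1)` keeps the label together with the statement that follows it instead of returning "Theorem 1." alone.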
Now that I'm actually sharing this, I'm thinking about things I would want the summarizer to ignore, such as comments in LaTeX and display math between \[ and \].
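One possible way to do that preprocessing is sketched here. `stripLatex` is a hypothetical helper, not part of the posted program, and it makes simplifying assumptions: a comment is any unescaped `%` through the end of its line, and display math is anything between `\[` and `\]`. It does not understand verbatim environments, a `%` that directly follows `\\`, or other math delimiters.

```javascript
// Hypothetical preprocessing step; an illustrative sketch under the assumptions above.
function stripLatex(text) {
  return text
    // Remove lines that contain only a comment, including their line break,
    // so they do not turn into blank lines (which would split a paragraph).
    .replace(/^[ \t]*%.*(?:\r?\n|$)/gm, "")
    // Remove trailing comments: an unescaped % through the end of the line.
    .replace(/[ \t]*(?<!\\)%.*$/gm, "")
    // Remove display math between \[ and \], possibly spanning several lines.
    .replace(/\\\[[\s\S]*?\\\]/g, " ")
    // Collapse the runs of spaces left behind by the removals.
    .replace(/[ \t]{2,}/g, " ");
}
```

The result could then be passed to the summarizer in place of the raw text. Comments are removed before math so that a stray `\[` inside a comment cannot swallow the text after it.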
<pre>
<!DOCTYPE html>
<html>
  <head>
    <title>Simple Text Summarizer with PDF Support</title>
    <meta charset="UTF-8">
    <meta name="description" content="A simple web-based summarizer that extracts the first one or two sentences from each paragraph. Includes special handling for text copied from PDF files.">
    <meta name="keywords" content="text summarizer, PDF summarizer, paragraph summarizer, first sentence summary, JavaScript tool, text analysis">
    <meta name="author" content="Your Name">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
  </head>
  <body>
    <p id="intro">
      This is a naive way of summarizing large blocks of text. It is based on the elementary school idea that the main idea of a paragraph can be found in its first sentence. So given a long text, it outputs the first sentence (or the first two sentences) of each paragraph as the summary.
    </p>
    <textarea name="indata" id="indata" rows="20" cols="50"></textarea><br>
    <label>
      <input type="checkbox" id="isPDF"> PDF-style input (lines joined into paragraphs)
    </label><br>
    <label for="sentenceCount">Sentences per paragraph:</label>
    <select id="sentenceCount">
      <option value="1">1</option>
      <option value="2" selected>2</option>
    </select><br>
    <input type="button" onclick="summarize(document.getElementById('indata').value)" value="Summarize"><br><br>
    <p name="outdata" id="outdata"></p>
  </body>
</html>
</pre>
function summarize(text) {
  const isPDF = document.getElementById("isPDF").checked;
  const sentenceCount = parseInt(document.getElementById("sentenceCount").value, 10);
  let paras;
  if (isPDF) {
    // PDF-style input: join consecutive non-empty lines into one paragraph;
    // a blank line ends the current paragraph.
    const lines = text.split('\n');
    paras = [];
    let paragraph = '';
    for (let i = 0; i < lines.length; i++) {
      if (lines[i].trim() === '') {
        if (paragraph.trim().length > 0) {
          paras.push(paragraph.trim());
          paragraph = '';
        }
      } else {
        paragraph += lines[i].trim() + ' ';
      }
    }
    if (paragraph.trim().length > 0) {
      paras.push(paragraph.trim());
    }
  } else {
    // Flat text: every line is treated as its own paragraph.
    paras = text.split('\n');
  }
  // Matches a fragment that ends in a formal label such as "Theorem 2."
  const termRegex = /\b(lemma|theorem|definition|corollary|conjecture|proposition|remark)\s*\d*\.?$/i;
  let out = `<b>Summary (${sentenceCount} sentence${sentenceCount > 1 ? "s" : ""} per paragraph)</b><br><br>`;
  for (let i = 0; i < paras.length; i++) {
    if (paras[i].length > 1) {
      // Split at whitespace that follows a period.
      let rawSentences = paras[i].split(/(?<=\.)\s+/);
      let sentences = [];
      for (let j = 0; j < rawSentences.length; j++) {
        let current = rawSentences[j].trim();
        if (termRegex.test(current) && j + 1 < rawSentences.length) {
          // Merge with the next sentence if this one ends in a formal term
          current += ' ' + rawSentences[j + 1].trim();
          j++; // Skip the next sentence
        }
        sentences.push(current);
      }
      let snippet = sentences.slice(0, sentenceCount).join(' ');
      out += snippet + "<br><br>";
    }
  }
  document.getElementById("outdata").innerHTML = out;
}
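One caveat about the function above: the pasted text is written to `innerHTML` as-is, so a paragraph containing `<` or `&` (common in mathematical text, for example `a<b`) may be parsed as markup instead of being displayed. A minimal sketch of one way around this is to escape each snippet first; `escapeHtml` is a hypothetical helper, not part of the posted code.

```javascript
// Hypothetical helper: replace the characters that are special in HTML
// with their entity equivalents so the text is displayed literally.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;") // must come first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

With this in place, the line that appends to `out` could become `out += escapeHtml(snippet) + "<br><br>";`, leaving the `<b>` and `<br>` tags that the function itself generates untouched.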
body {
  font-family: sans-serif;
  margin: 20px;
}
textarea {
  width: 80%;
  height: 150px;
  margin-bottom: 10px;
  padding: 10px;
}
input[type="button"] {
  padding: 10px 20px;
  background-color: #007bff;
  color: white;
  border: none;
  cursor: pointer;
}
#outdata {
  margin-top: 20px;
  border: 1px solid #ccc;
  padding: 10px;
}
<html> inside <pre>? That's a similar "mistake" as in your previous question: codereview.stackexchange.com/q/297868/36647
pandoc, which can take about any format and return plain text (or about anything else).