Getting the words

Here we will look at how to break up and filter our words in preparation for scoring.

The first thing we need is some text to play with. For this, I am going to use the text from www.lipsum.com which explains what lorem ipsum is. I will store it in a template literal so that I can just paste it in without having to worry about line breaks or anything like that.

const text = `

What is Lorem Ipsum?
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Why do we use it?
It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).

Where does it come from?
Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.

The standard chunk of Lorem Ipsum used since the 1500s is reproduced below for those interested. Sections 1.10.32 and 1.10.33 from "de Finibus Bonorum et Malorum" by Cicero are also reproduced in their exact original form, accompanied by English versions from the 1914 translation by H. Rackham.

Where can I get some?
There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc.

`;

Our goal is to take this text and extract a five-sentence excerpt from it that gives us the gist of the content without needing to read the whole thing. To do this, word frequency will play a big part in determining the "value" of a word (for example, "lorem" and "ipsum" will likely occur a lot, and are therefore assumed to be important). Unfortunately, stopwords also occur a lot. We don't want to rank a sentence highly just because it contains a lot of words like "if", "the", and "and", so we will need to disregard these. You can use this pre-prepared list to identify stopwords.

const stopwords = ['i','me','my','myself','we','our','ours','ourselves','you','your','yours','yourself','yourselves','he','him','his','himself','she','her','hers','herself','it','its','itself','they','them','their','theirs','themselves','what','which','who','whom','this','that','these','those','am','is','are','was','were','be','been','being','have','has','had','having','do','does','did','doing','a','an','the','and','but','if','or','because','as','until','while','of','at','by','for','with','about','against','between','into','through','during','before','after','above','below','to','from','up','down','in','out','on','off','over','under','again','further','then','once','here','there','when','where','why','how','all','any','both','each','few','more','most','other','some','such','no','nor','not','only','own','same','so','than','too','very','s','t','can','will','just','don','should','now'];
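
A quick way to see the stopword list in action is to filter a few words against it. The sketch below uses a small subset of the list so that it runs on its own; sampleStopwords, sampleWords, keptWords and isStopword are illustrative names, not part of the code used later. It lowercases each word before the lookup, on the assumption that this is wanted because every entry in the list is lowercase.

```javascript
// A small subset of the stopword list above, so this snippet runs on its own.
const sampleStopwords = ['it', 'is', 'a', 'the', 'and', 'of'];

// Illustrative helper: the list is all lowercase, so normalise before looking up.
function isStopword(word) {
    return sampleStopwords.includes(word.toLowerCase());
}

const sampleWords = ['It', 'is', 'a', 'long', 'established', 'fact'];
const keptWords = sampleWords.filter((word) => !isStopword(word));

console.log(keptWords); // [ 'long', 'established', 'fact' ]
```

Only the words that carry some meaning survive the filter, which is the effect we want before counting frequencies.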

For this exercise, I recommend making an account on repl.it and using that, but running it in Node on your own device works too.

To get started, we need to break our text down into sentences and then words, and filter out any stopwords. To get the sentences, we can split the text on common sentence terminators. For the text we are targeting, we are going to assume that a sentence ends with one of ['.', '!', '?'] followed by whitespace (namely a space or a line break).

One fairly simple way to do this is with Regular Expressions (Regex). Regular expressions let us match sequences of text based on a pattern. For the purposes described above, I am going to use this pattern:

/(.+?[.?!][\s\r])/mg

If you are not familiar with Regex, this might look confusing, but it should work for now. In short, the pattern lazily matches a run of characters up to the first '.', '?' or '!' that is followed by a whitespace character. For a primer on regular expressions, check out this link. Regexr is also a great resource for playing around and experimenting with different patterns. I use it every time I work with Regex. Check it out here.

To extract the sentences, we can call the String method match with our pattern. Because the pattern has the g flag, this returns an array of all matching segments. It will look something like this:

const sentences = text.match(/(.+?[.?!][\s\r])/mg);
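
To see what match returns, here is a minimal sketch that runs the same pattern over a short made-up string. The sample text and the splitSentences helper name are assumptions for illustration only. Note that each match keeps its trailing whitespace character, and that match returns null rather than an empty array when nothing matches.

```javascript
// Hypothetical helper wrapping the pattern used above.
function splitSentences(input) {
    // match returns null when nothing matches, so fall back to an empty array
    return input.match(/(.+?[.?!][\s\r])/mg) || [];
}

// Made-up sample text; the trailing line break lets the last sentence match.
const sample = 'Is this a test? Yes it is. It works!\n';
const sampleSentences = splitSentences(sample);

console.log(sampleSentences.length); // 3
console.log(sampleSentences.map((s) => s.trim()));
// [ 'Is this a test?', 'Yes it is.', 'It works!' ]
```

A final sentence with no whitespace after its terminator would not be matched by this pattern, which is one reason the text above is stored with a trailing line break.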

We can use this array of sentences to calculate our "word stems", which will be an index of how often each word appears in the overall text (we will use it to score how "important" a given sentence is). Before we can calculate this, though, we need to split the sentences into words, remove stopwords, and clean up any punctuation. Here is how I did it.

const stems = {}; // make an object to store the stems

sentences.forEach((sentence) => {
    const words = sentence.split(' ');
    words
    .map((word) => {
        const cleanedWord = word.match(/([a-zA-Z]+)/);
        // keep the first run of letters (drops punctuation) and lowercase it,
        // so that "The" matches the stopword "the" and "Lorem" counts as "lorem"
        return cleanedWord && cleanedWord[0].toLowerCase();
    })
    .filter((word) => {
        return word && !stopwords.includes(word); // make sure it isn't a stopword or null
    })
    .forEach((word) => {
        if (stems[word]) {
            stems[word] += 1; // if there is already a stem, increment
        } else {
            stems[word] = 1; // if it is a new word, record it
        }
    });
});

After this code runs, stems will be an object keyed by each remaining word, where each value is the number of times that word occurred in the text.
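
To sanity-check the result, it can help to list the most frequent words. This is a minimal sketch and not part of the scoring itself; topWords is a hypothetical helper, and exampleStems is made-up data in the same shape as stems rather than real counts from the text.

```javascript
// Hypothetical helper: return the n most frequent [word, count] pairs.
function topWords(counts, n) {
    return Object.entries(counts)
        .sort((a, b) => b[1] - a[1]) // highest count first
        .slice(0, n);
}

// Made-up counts in the same shape as the stems object.
const exampleStems = { text: 2, lorem: 5, dummy: 1, ipsum: 4 };

console.log(topWords(exampleStems, 2)); // [ [ 'lorem', 5 ], [ 'ipsum', 4 ] ]
```

Calling topWords(stems, 10) after the loop above should give a rough feel for which words will dominate the scores.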

Next up we will look at how to use this object to assign scores to our sentences.
