Kyo Suayan | suayan.com | Extracting Keywords from Text

Extracting Keywords from Text

Saturday, March 18th 2023

There are several ways to extract the most important keywords from a chunk of text. One common method combines natural language processing techniques such as tokenization, part-of-speech tagging, and frequency analysis. Here's an example that uses the natural library for Node.js, together with the stopword package, to extract the most important keywords from a text. (NLTK, the Natural Language Toolkit, is a Python library; natural fills a similar role in JavaScript.)

import natural from 'natural';
import { removeStopwords, eng } from 'stopword';

const DEFAULT_COUNT = 5;
const DEFAULT_MIN_COUNT = 3;
const DEFAULT_MIN_LENGTH = 4;

const tokenizer = new natural.WordTokenizer();
const importantTags = new Set(['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']);
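const language = 'EN';
// Fallback tags for words missing from the lexicon. 'NN' (not 'N') is used so
// that unknown lowercase words still match an entry in importantTags.
const defaultCategory = 'NN';
const defaultCategoryCapitalized = 'NNP';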

const lexicon = new natural.Lexicon(language, defaultCategory, defaultCategoryCapitalized);
const ruleSet = new natural.RuleSet('EN');
const tagger = new natural.BrillPOSTagger(lexicon, ruleSet);

const toObjectList = (frequencyCounts) => {
  return Object.keys(frequencyCounts).map((key) => {
    return {
      word: key,
      count: frequencyCounts[key]
    };
  });
};

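/**
 * Find the most frequent nouns and verbs in a text using natural.
 * @param {Object} options
 * @param {string} options.text - Text to analyze.
 * @param {number} [options.count] - Maximum number of keywords to return.
 * @param {number} [options.minCount] - Minimum occurrences for a word to qualify.
 * @param {number} [options.minLength] - Minimum word length in characters.
 * @returns {{word: string, count: number}[]} Keywords sorted by descending count.
 */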
export const analyzeTopKeywords = ({
  text,
  count = DEFAULT_COUNT,
  minCount = DEFAULT_MIN_COUNT,
  minLength = DEFAULT_MIN_LENGTH
}) => {
  // Split into word tokens, then drop common English stop words.
  const tokens = tokenizer.tokenize(text);
  const filteredTokens = removeStopwords(tokens, eng);
  // Tag each token and keep only nouns and verbs.
  const taggedWords = tagger.tag(filteredTokens).taggedWords;
  const importantTokens = taggedWords
    .filter((token) => importantTags.has(token.tag))
    .map((token) => token.token);
  // Count case-insensitive occurrences of each remaining word.
  const frequencyCounts = importantTokens.reduce((counts, token) => {
    const key = token.toLowerCase();
    counts[key] = (counts[key] || 0) + 1;
    return counts;
  }, {});

  // Sort descending by count, return tokens that match
  // a minimum occurrence count, and minimum word length.
  const sortedKeywords = toObjectList(frequencyCounts)
    .sort((a, b) => b.count - a.count)
    .filter((item) => item.count >= minCount && item.word.length >= minLength);

  const topKeywords = sortedKeywords.slice(0, count);
  return topKeywords;
};

This code uses the natural library to tokenize the text into individual words, removes common stop words with the stopword library, tags each remaining word with its part of speech, keeps only nouns and verbs, and counts how often each one occurs. Finally, it returns up to count keywords (5 by default), sorted by frequency, keeping only words that occur at least 3 times and are at least 4 characters long.
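To make the counting, filtering, and ranking steps concrete, here is a minimal, dependency-free sketch of that final stage. It skips tokenization and tagging and starts from a hand-written token list. The helper name rankKeywords and the sample tokens are made up for illustration and are not part of the module above.

```javascript
// Hypothetical helper: mirrors the count -> filter -> sort -> slice steps,
// but takes tokens that have already been tagged and filtered.
const rankKeywords = (tokens, { count = 5, minCount = 3, minLength = 4 } = {}) => {
  const frequencyCounts = new Map();
  for (const token of tokens) {
    const key = token.toLowerCase();
    frequencyCounts.set(key, (frequencyCounts.get(key) || 0) + 1);
  }
  return [...frequencyCounts]
    .map(([word, n]) => ({ word, count: n }))
    .filter((item) => item.count >= minCount && item.word.length >= minLength)
    .sort((a, b) => b.count - a.count)
    .slice(0, count);
};

// Made-up sample tokens standing in for the tagger's output.
const sampleTokens = [
  'Parser', 'parser', 'tokens', 'parser', 'tokens', 'tokens',
  'tokens', 'API', 'API', 'API', 'grammar'
];

console.log(rankKeywords(sampleTokens));
// [ { word: 'tokens', count: 4 }, { word: 'parser', count: 3 } ]
```

Here 'API' occurs three times but is dropped for being shorter than four characters, and 'grammar' is long enough but occurs only once. Both thresholds must pass for a word to be returned.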

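Note that this method is not perfect and will not always extract the most relevant keywords, but it provides a good starting point for keyword extraction. It ranks words purely by how often they appear, so a short text may have no word that reaches the minimum count. The tagger also sees the text after stop words have been removed, which gives it less context than the full sentence would; tagging first and removing stop words afterwards may give better tags. You may need to experiment with different techniques and adjust the parameters to get the best results for your specific use case.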
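As one example of adjusting the parameters, the default minimum count of 3 can leave a short text with no keywords at all. The sketch below shows one possible workaround, assuming it is acceptable to relax the threshold: it lowers minCount one step at a time until something is returned. Both analyzeWithFallback and naiveAnalyze are hypothetical names, and naiveAnalyze is only a stand-in so the example runs without any libraries.

```javascript
// Hypothetical wrapper: lowers minCount one step at a time until the
// analyzer returns at least one keyword (or minCount reaches 1).
// `analyze` is any function taking the same options object as analyzeTopKeywords.
const analyzeWithFallback = (analyze, options) => {
  const startMinCount = options.minCount ?? 3;
  for (let minCount = startMinCount; minCount >= 1; minCount--) {
    const keywords = analyze({ ...options, minCount });
    if (keywords.length > 0) {
      return { minCount, keywords };
    }
  }
  return { minCount: 1, keywords: [] };
};

// Stand-in analyzer for this demo only: whitespace split plus counting,
// with no stop-word removal and no part-of-speech tagging.
const naiveAnalyze = ({ text, count = 5, minCount = 3, minLength = 4 }) => {
  const counts = {};
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    counts[word] = (counts[word] || 0) + 1;
  }
  return Object.entries(counts)
    .map(([word, n]) => ({ word, count: n }))
    .filter((item) => item.count >= minCount && item.word.length >= minLength)
    .sort((a, b) => b.count - a.count)
    .slice(0, count);
};

const shortText = 'keyword extraction needs enough text before keyword counts mean much';
console.log(analyzeWithFallback(naiveAnalyze, { text: shortText }));
// { minCount: 2, keywords: [ { word: 'keyword', count: 2 } ] }
```

With the function from this post, the call would be analyzeWithFallback(analyzeTopKeywords, { text }). Returning the minCount that was actually used lets the caller see how far the threshold had to be relaxed.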
