Extracting Keywords from Text

There are several approaches to extracting the most important keywords from a chunk of text. One common method combines natural language processing techniques such as tokenization, part-of-speech tagging, and frequency analysis. Here's an example of how you can use the natural library (an NLP toolkit for Node.js) together with the stopword package in JavaScript to extract the most important keywords from a text:

import natural from 'natural';
import { removeStopwords, eng } from 'stopword';
 
const DEFAULT_COUNT = 5;
const DEFAULT_MIN_COUNT = 3;
const DEFAULT_MIN_LENGTH = 4;
 
const tokenizer = new natural.WordTokenizer();
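// Noun and verb tags. 'N' is included because it is the default category
// passed to the lexicon below, so words missing from the lexicon are kept.
const importantTags = new Set(['N', 'NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']);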
const language = 'EN';
const defaultCategory = 'N';
const defaultCategoryCapitalized = 'NNP';
 
const lexicon = new natural.Lexicon(language, defaultCategory, defaultCategoryCapitalized);
const ruleSet = new natural.RuleSet('EN');
const tagger = new natural.BrillPOSTagger(lexicon, ruleSet);
 
const toObjectList = (frequencyCounts) => {
  return Object.keys(frequencyCounts).map((key) => {
    return {
      word: key,
      count: frequencyCounts[key]
    };
  });
};
 
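/**
 * Analyze text for the top keywords using the natural library.
 * @param {Object} options
 * @param {string} options.text - Text to analyze.
 * @param {number} [options.count] - Maximum number of keywords to return.
 * @param {number} [options.minCount] - Minimum occurrences for a word to qualify.
 * @param {number} [options.minLength] - Minimum word length in characters.
 * @returns {{word: string, count: number}[]} Keywords sorted by descending count.
 */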
export const analyzeTopKeywords = ({
  text,
  count = DEFAULT_COUNT,
  minCount = DEFAULT_MIN_COUNT,
  minLength = DEFAULT_MIN_LENGTH
}) => {
  const tokens = tokenizer.tokenize(text);
  const filteredTokens = removeStopwords(tokens, eng);
  const taggedWords = tagger.tag(filteredTokens).taggedWords;
  const importantTokens = taggedWords.filter((token) => importantTags.has(token.tag)).map((token) => token.token);
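  // Count the frequency of each remaining word, case-insensitively.
  // A null-prototype object keeps words such as "constructor" from
  // colliding with properties inherited from Object.prototype.
  const frequencyCounts = importantTokens.reduce((counts, token) => {
    const key = token.toLowerCase();
    counts[key] = (counts[key] || 0) + 1;
    return counts;
  }, Object.create(null));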
 
  // Sort descending by count, return tokens that match
  // a minimum occurrence count, and minimum word length.
  const sortedKeywords = toObjectList(frequencyCounts)
    .sort((a, b) => b.count - a.count)
    .filter((item) => item.count >= minCount && item.word.length >= minLength);
 
  const topKeywords = sortedKeywords.slice(0, count);
  return topKeywords;
};

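This code uses the natural library to tokenize the text into individual words, removes common stop words with the stopword library, tags each remaining word with its part of speech, and keeps only the nouns and verbs. It then counts how often each remaining word occurs (case-insensitively), sorts the words by frequency, and drops any that occur fewer than minCount times (3 by default) or are shorter than minLength characters (4 by default). Finally, it returns up to count keywords (5 by default) as a list of { word, count } objects.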
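
To see how the counting and ranking stage behaves on its own, here is a minimal, dependency-free sketch. The helper name rankKeywords and the sample tokens are hypothetical and not part of the natural or stopword libraries; the sketch assumes the tokens have already been tokenized, stop-word filtered, and part-of-speech filtered. It uses a Map for the counts, which is one way to avoid surprises with words that match built-in object property names.

```javascript
// Hypothetical helper (not part of natural): counts tokens case-insensitively
// and returns the most frequent ones that pass both thresholds.
const rankKeywords = (tokens, { count = 5, minCount = 3, minLength = 4 } = {}) => {
  const frequencyCounts = new Map();
  for (const token of tokens) {
    const key = token.toLowerCase();
    frequencyCounts.set(key, (frequencyCounts.get(key) ?? 0) + 1);
  }

  return [...frequencyCounts]
    .map(([word, occurrences]) => ({ word, count: occurrences }))
    .filter((item) => item.count >= minCount && item.word.length >= minLength)
    .sort((a, b) => b.count - a.count) // descending by frequency
    .slice(0, count);
};

// Made-up tokens standing in for the output of the tagging step.
const sampleTokens = [
  'Parser', 'parser', 'token', 'parser', 'token', 'token',
  'token', 'API', 'api', 'api', 'grammar'
];

// Defaults: "api" is shorter than 4 characters and "grammar" occurs only once,
// so only "token" (4) and "parser" (3) remain.
console.log(rankKeywords(sampleTokens));

// Relaxed thresholds let all four words through.
console.log(rankKeywords(sampleTokens, { minCount: 1, minLength: 3 }));
```

The two calls show how much the thresholds shape the result: with the defaults, a word must be both frequent and reasonably long to be reported at all.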

Note that this method is not perfect and may not always extract the most relevant keywords, but it provides a good starting point for keyword extraction. You may need to experiment with different techniques and adjust the parameters to get the best results for your specific use case.
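
One parameter adjustment worth trying: with a minimum count of 3, a short text can easily yield no keywords at all. The sketch below lowers the threshold step by step until something survives. The helper selectWithFallback, its minResults option, and the sample counts are hypothetical illustrations, not part of the code above or of any library; it assumes the input is a list of { word, count } objects like the one the frequency step produces.

```javascript
// Hypothetical helper: relaxes minCount one step at a time until at least
// minResults keywords remain, never going below a threshold of 1.
const selectWithFallback = (
  entries,
  { count = 5, minCount = 3, minLength = 4, minResults = 1 } = {}
) => {
  // Apply the length filter once and sort descending by frequency.
  const candidates = entries
    .filter((item) => item.word.length >= minLength)
    .sort((a, b) => b.count - a.count);

  for (let threshold = minCount; threshold >= 1; threshold -= 1) {
    const matches = candidates.filter((item) => item.count >= threshold);
    if (matches.length >= minResults || threshold === 1) {
      return { threshold, keywords: matches.slice(0, count) };
    }
  }
  // Only reached when minCount is below 1.
  return { threshold: minCount, keywords: [] };
};

// Made-up counts for a short text in which no word occurs three times.
const shortTextCounts = [
  { word: 'tagger', count: 2 },
  { word: 'lexicon', count: 1 },
  { word: 'api', count: 2 }
];

const fallbackResult = selectWithFallback(shortTextCounts);
console.log(fallbackResult.threshold, fallbackResult.keywords);
```

Returning the threshold that was actually used lets the caller decide whether keywords found at a lower bar are trustworthy enough for the use case.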
