Extracting Keywords from Text

Saturday, March 18th 2023

There are several approaches to extracting the most important keywords from a chunk of text. One common method combines natural language processing techniques such as tokenization, part-of-speech tagging, and frequency analysis. Here's an example that uses the natural library (an NLP toolkit for Node.js, not Python's NLTK) together with the stopword package to extract the most important keywords from a text:

    import natural from 'natural';
    import { removeStopwords, eng } from 'stopword';

    const DEFAULT_COUNT = 5;
    const DEFAULT_MIN_COUNT = 3;
    const DEFAULT_MIN_LENGTH = 4;

    const tokenizer = new natural.WordTokenizer();

    // Penn Treebank tags to keep: nouns (NN*) and verbs (VB*).
    const importantTags = new Set(['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']);

    const language = 'EN';
    const defaultCategory = 'N';
    const defaultCategoryCapitalized = 'NNP';

    const lexicon = new natural.Lexicon(language, defaultCategory, defaultCategoryCapitalized);
    const ruleSet = new natural.RuleSet('EN');
    const tagger = new natural.BrillPOSTagger(lexicon, ruleSet);

    // Convert a { word: count } map into a list of { word, count } objects.
    const toObjectList = (frequencyCounts) => {
      return Object.keys(frequencyCounts).map((key) => {
        return { word: key, count: frequencyCounts[key] };
      });
    };

    /**
     * Analyze text for the top keywords using the natural library.
     * @param {Object} options
     * @param {string} options.text - The text to analyze.
     * @param {number} [options.count] - Maximum number of keywords to return.
     * @param {number} [options.minCount] - Minimum number of occurrences for a keyword.
     * @param {number} [options.minLength] - Minimum keyword length in characters.
     * @returns {Array<{word: string, count: number}>} Keywords, most frequent first.
     */
    export const analyzeTopKeywords = ({
      text,
      count = DEFAULT_COUNT,
      minCount = DEFAULT_MIN_COUNT,
      minLength = DEFAULT_MIN_LENGTH
    }) => {
      const tokens = tokenizer.tokenize(text);
      const filteredTokens = removeStopwords(tokens, eng);
      const taggedWords = tagger.tag(filteredTokens).taggedWords;

      const importantTokens = taggedWords
        .filter((token) => importantTags.has(token.tag))
        .map((token) => token.token);

      // Count the frequency of each remaining word (case-insensitively).
      const frequencyCounts = importantTokens.reduce((counts, token) => {
        const key = token.toLowerCase();
        counts[key] = (counts[key] || 0) + 1;
        return counts;
      }, {});

      // Sort descending by count, return tokens that match
      // a minimum occurrence count, and minimum word length.
      const sortedKeywords = toObjectList(frequencyCounts)
        .sort((a, b) => b.count - a.count)
        .filter((item) => item.count >= minCount && item.word.length >= minLength);

      const topKeywords = sortedKeywords.slice(0, count);
      return topKeywords;
    };

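This code uses the natural library to tokenize the text into individual words, removes common stop words with the stopword library, tags each remaining token with its part of speech, keeps only nouns and verbs (the NN* and VB* tags), counts how often each remaining word occurs (ignoring case), and sorts the words by frequency. Finally, it returns up to count keywords (5 by default) as { word, count } objects, dropping any word that occurs fewer than minCount times (3 by default) or is shorter than minLength characters (4 by default).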
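To see how the count and length thresholds shape the result, here is a minimal, dependency-free sketch of only the counting, filtering, and sorting step. The helper name rankByFrequency and the sample tokens are made up for illustration; the input is assumed to be already tokenized, stop-word filtered, and part-of-speech filtered.

```javascript
// Hypothetical helper (not part of natural or stopword): mirrors only the
// frequency-count, threshold-filter, and sort step of the example above.
function rankByFrequency(tokens, { count = 5, minCount = 3, minLength = 4 } = {}) {
  const counts = new Map();
  for (const token of tokens) {
    const key = token.toLowerCase(); // count case-insensitively
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([word, n]) => ({ word, count: n }))
    .filter((item) => item.count >= minCount && item.word.length >= minLength)
    .sort((a, b) => b.count - a.count)
    .slice(0, count);
}

// Made-up sample input, standing in for already-filtered tokens.
const sampleTokens = [
  'Parser', 'parser', 'parser', 'parser',
  'token', 'token', 'token',
  'tree', 'tree',
  'AST', 'ast', 'ast', 'ast', 'ast',
];

// Default thresholds: 'tree' occurs only twice and 'ast' has only 3 letters,
// so both are dropped.
console.log(rankByFrequency(sampleTokens));
// → [ { word: 'parser', count: 4 }, { word: 'token', count: 3 } ]

// Looser thresholds let the short and the rarer word through.
console.log(rankByFrequency(sampleTokens, { minCount: 2, minLength: 3 }));
// → ast (5), parser (4), token (3), tree (2)
```

With the defaults, a short text can easily return an empty list, because few words occur three or more times; lowering minCount is usually the first parameter to try.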

Note that this method is not perfect and may not always extract the most relevant keywords, but it provides a good starting point for keyword extraction. You may need to experiment with different techniques and adjust the parameters to get the best results for your specific use case.
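As one example of a different technique to experiment with, the sketch below swaps raw frequency for TF-IDF scoring, which down-weights words that also appear in other documents. This is not what the code above does: the helper names, the naive regex tokenizer, and the tiny reference corpus are all assumptions made for illustration, and it skips stop-word removal and part-of-speech tagging entirely.

```javascript
// Hypothetical tokenizer: lowercases and splits on anything that is not a-z.
// A real tokenizer (such as the one from natural) would handle more cases.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z]+/).filter(Boolean);
}

// Hypothetical helper: scores each word of `target` by
// (term frequency in target) * log(total docs / docs containing the word).
function tfidfKeywords(target, referenceDocs, count = 5) {
  const docSets = [target, ...referenceDocs].map((doc) => new Set(tokenize(doc)));
  const targetTokens = tokenize(target);

  const termCounts = new Map();
  for (const token of targetTokens) {
    termCounts.set(token, (termCounts.get(token) || 0) + 1);
  }

  return [...termCounts.entries()]
    .map(([word, n]) => {
      // At least 1, because the target itself is part of docSets.
      const docsWithWord = docSets.filter((set) => set.has(word)).length;
      const idf = Math.log(docSets.length / docsWithWord);
      return { word, score: (n / targetTokens.length) * idf };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, count);
}

// Made-up example data.
const target = 'The parser builds a tree. The parser reads tokens.';
const referenceDocs = ['The cat sat on a mat.', 'The dog reads a book.'];

// 'parser' ranks first: it occurs twice and appears in no reference document.
// 'the' and 'a' appear in every document, so their score is 0.
console.log(tfidfKeywords(target, referenceDocs, 3));
```

A word that appears in every document scores zero, so this approach filters common words without a stop-word list, at the cost of needing a representative set of reference documents.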