Kyo Suayan | suayan.com | Extracting Text from epub Files

Extracting Text from epub Files

Saturday, March 18th 2023

You can use the epub module in Node.js to read the chapters of an epub document as HTML, and then convert that HTML to markdown using the turndown module. Here's an example code snippet:

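const EPub = require('epub');
const TurndownService = require('turndown');

// Open the epub file
const book = new EPub('/path/to/book.epub');

// Report parsing problems (missing file, invalid archive, ...)
book.on('error', function (err) {
  console.error('Failed to parse epub:', err);
});

// Wait for the book to be parsed
book.on('end', function () {
  const converter = new TurndownService();
  const parts = [];
  let remaining = book.flow.length;

  // getChapter takes a single chapter id, so fetch each chapter in reading order
  book.flow.forEach(function (chapter, index) {
    book.getChapter(chapter.id, function (err, html) {
      if (err) {
        console.error('Error reading chapter ' + chapter.id + ':', err);
      } else {
        // Convert the chapter HTML to markdown
        parts[index] = converter.turndown(html);
      }

      remaining -= 1;
      if (remaining === 0) {
        // Do something with the markdown...
        console.log(parts.filter(Boolean).join('\n\n'));
      }
    });
  });
});

book.parse();

In this example, we first open the epub file using the EPub constructor and wait for it to be parsed by listening for the 'end' event. Because getChapter takes a single chapter id, we loop over book.flow (the chapters in reading order), fetch each chapter's HTML, and convert it to markdown with turndown. Once every chapter has been processed, we can do something with the combined markdown, such as printing it to the console.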
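
If you prefer promises over callbacks, the per-chapter calls can be wrapped along these lines. This is only a sketch: getChapterAsync and getAllChapters are hypothetical helpers (they are not part of the epub module), and they assume the book has already been parsed and exposes flow and getChapter(id, callback) as used above.

```javascript
// Hypothetical helpers, not part of the epub module.
// Assumes `book` has already been parsed (the 'end' event has fired) and
// exposes `flow` (an array of { id, ... }) and `getChapter(id, callback)`.
function getChapterAsync(book, id) {
  return new Promise(function (resolve, reject) {
    book.getChapter(id, function (err, html) {
      if (err) {
        reject(err);
      } else {
        resolve(html);
      }
    });
  });
}

// Resolves to an array of { id, html } objects in reading order,
// regardless of the order in which the callbacks fire.
function getAllChapters(book) {
  return Promise.all(
    book.flow.map(function (chapter) {
      return getChapterAsync(book, chapter.id).then(function (html) {
        return { id: chapter.id, html: html };
      });
    })
  );
}
```

Inside the 'end' handler you could then call getAllChapters(book).then(...) and pass each html value to turndown. Note that Promise.all rejects as soon as one chapter fails, so this variant stops on the first error instead of skipping the bad chapter.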

Note that this is a simple example and may not work perfectly for all epub files. To fine-tune the output, you can skip entries in book.flow that you don't want (cover pages, copyright notices, and so on), or configure turndown, for example with its remove and addRule methods, to drop or reformat specific elements.

Extract chapter metadata

// Run this inside the 'end' handler, once book.flow has been populated
const chapters = book.flow.map((chapter) => {
  return {
    // title may be undefined for entries not listed in the table of contents
    title: chapter.title,
    id: chapter.id,
    href: chapter.href
  };
});

console.log(chapters);

// Or extract for one chapter only:
const firstChapter = book.flow[0].id;

book.getChapter(firstChapter, function (err, html) {
  if (err) {
    console.log("Error:", err);
  } else {
    console.log("Chapter HTML:", html);
  }
});
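
Some entries in flow may come back without a title, so a fallback label can be handy when listing chapters. Here is a minimal sketch; summarizeChapters and the "Untitled" label are hypothetical, and the function works on any array shaped like flow.

```javascript
// Hypothetical helper: builds a display list from a flow-like array.
// Each entry is expected to have `id` and `href`, and optionally `title`.
function summarizeChapters(flow) {
  return flow.map(function (chapter, index) {
    return {
      order: index + 1, // 1-based position in reading order
      id: chapter.id,
      href: chapter.href,
      // Fall back to the id when the entry has no title
      title: chapter.title || 'Untitled (' + chapter.id + ')'
    };
  });
}
```

Once the book has been parsed, console.log(summarizeChapters(book.flow)) prints one entry per chapter.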
