Apify SDK

Saturday, March 18th 2023

Apify SDK is an open-source web scraping and automation library for Node.js that allows you to write and run web scrapers, automate web browsers, and deploy crawlers to the Apify cloud platform. It provides a simple, easy-to-use API for crawling and automating web applications.

Some of the key features of Apify SDK include:

  1. Web Scraping: Apify SDK allows you to easily scrape data from websites using its powerful web crawling engine. You can use pre-built scrapers, or build your own custom scrapers using the SDK's API.

  2. Web Automation: Apify SDK provides a powerful automation framework that allows you to automate web browsers to perform complex actions, such as filling out forms, clicking buttons, and navigating through pages.

  3. Proxy Management: The SDK allows you to easily manage and rotate proxies, which can help you to scrape data at scale without getting blocked by websites.

  4. Cloud Deployment: You can easily deploy your crawlers to the Apify cloud, which provides scalable computing resources and ensures that your crawlers keep running even if your local machine is turned off.

  5. Data Storage: Apify SDK provides a powerful data storage and management system, which allows you to store and manage your scraped data in a structured way.

Apify SDK is a powerful tool for web scraping and automation, and is used by developers and data scientists across a wide range of industries, including e-commerce, finance, marketing, and more.

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // Create a request list holding the starting URL
    const requestList = await Apify.openRequestList('start-urls', [
        'https://www.example.com',
    ]);

    // Initialize the crawler configuration
    const crawler = new Apify.CheerioCrawler({
        requestList,
        // Define a function to be called for each page crawled
        handlePageFunction: async ({ request, $ }) => {
            // Extract information from the page using Cheerio
            const title = $('title').text();
            // Log the information to the console
            console.log(`Title of page ${request.url}: ${title}`);
        },
    });

    // Start the crawler
    await crawler.run();
});
```

In this example, we create a CheerioCrawler instance to crawl the website, and define a handlePageFunction function that extracts information from each page using Cheerio. The function logs the title of each page to the console. We then start the crawler using the run method.

Note that this is just a simple example, and there are many more features and options available in Apify SDK for more complex web scraping tasks.

Request throttling

Apify supports request throttling and provides configuration options for limiting the scope of a crawl.

Request throttling is configured through the crawler options. The maxConcurrency option caps how many pages are processed in parallel (setting it to 1 makes requests strictly sequential), and an explicit delay between requests can be added inside handlePageFunction using Apify.utils.sleep().

The crawlers do not expose a dedicated maxDepth option, but you can bound a crawl by specifying the maximum number of pages that should be visited. This is done with the maxRequestsPerCrawl option, which is available on all crawler classes, including PuppeteerCrawler.

Here's an example that demonstrates how to use Apify SDK to set request throttling and crawl depth:

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // Create a request list holding the starting URL
    const requestList = await Apify.openRequestList('start-urls', [
        'https://www.example.com',
    ]);

    // Create a new Puppeteer crawler
    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        // Process one page at a time
        maxConcurrency: 1,
        // Stop after visiting at most 50 pages
        maxRequestsPerCrawl: 50,
        handlePageFunction: async ({ page, request }) => {
            // Process the page here

            // Wait for 1 second before finishing, throttling the crawl
            await Apify.utils.sleep(1000);
        },
    });

    // Start the crawler
    await crawler.run();
});
```

In this example, we open a RequestList containing the starting URL and pass it to a new PuppeteerCrawler together with a handlePageFunction callback that processes each page.

Setting maxConcurrency to 1 makes the crawler process requests sequentially, and the Apify.utils.sleep(1000) call at the end of the page handler adds a one-second pause between them. The maxRequestsPerCrawl option caps the crawl at 50 pages in total.

Finally, we start the crawler by calling the run method on the crawler object.

Other crawler types supported by Apify

Apify supports several types of crawlers:

  1. Basic Crawler: A low-level, general-purpose crawler; you supply your own function that fetches and processes each request.

  2. Cheerio Crawler: A web crawler that downloads pages with plain HTTP requests and uses the Cheerio library to parse HTML and XML documents. It is fast and lightweight, but does not execute JavaScript.

  3. Puppeteer Crawler: A web crawler that uses the Puppeteer library to control a headless Chrome or Chromium browser.

  4. Playwright Crawler: A web crawler that uses the Playwright library to control headless Chromium, Firefox, or WebKit browsers.

Basic Crawler

The Basic Crawler (the BasicCrawler class) is the simplest and most versatile of the crawlers provided by Apify. It offers the following advantages:

  1. Easy to set up: The Basic Crawler requires minimal configuration; you only need to supply a handleRequestFunction that fetches and processes each URL.

  2. Flexibility: Because it makes no assumptions about how pages are downloaded, you can customize the crawler to suit your specific needs, such as the crawling rate, the maximum number of requests per crawl, and more.

  3. Fast and efficient: With no browser overhead, the Basic Crawler can process large numbers of pages quickly.

  4. Easy to use: The Basic Crawler has a simple and intuitive API, making it easy to integrate with other tools and applications.

  5. Scalable: The Basic Crawler is built on the same AutoscaledPool as the other crawlers, automatically scaling concurrency up or down based on available system resources.