Crawling web pages with Javascript

Posted Jun 28, 2020 · 12 min read

Author: Shenesh Perera

Translation: crazy technical house

Original: https://www.scrapingbee.com/b...

Reprinting without permission is strictly prohibited

Web Scraping with Javascript

This article explains how to use Node.js to efficiently crawl data from the Web.

Prerequisites

This article is mainly aimed at programmers with some JavaScript experience. If you have a deep understanding of web scraping, but are not familiar with JavaScript, then this article can still help you.

  • Working knowledge of JavaScript
  • Experience using DevTools to extract element selectors
  • Some experience with ES6 (optional)

You will learn

Through this article you will learn:

  • Get a better understanding of Node.js
  • Use multiple HTTP clients to assist in the web scraping process
  • Scrape the web using multiple tried-and-tested libraries

Understanding Node.js

JavaScript is a simple, modern programming language that was originally created to add dynamic effects to web pages in the browser. When a website is loaded, its JavaScript code is run by the browser's JavaScript engine. For JavaScript to interact with the browser, the browser also provides a runtime environment (document, window, etc.).

This means that JavaScript, by itself, cannot directly interact with or manipulate computer resources. A web server, for example, must be able to interact with the file system so that it can read and write files.

Node.js enables JavaScript to run not only on the client side but also on the server side. To make this possible, its creator, Ryan Dahl, took the V8 JavaScript engine from the Google Chrome browser and embedded it in a Node program developed in C++. So Node.js is a runtime environment that allows JavaScript code to run on a server.

In contrast to other languages (such as C or C++) that handle concurrency through multiple threads, Node.js utilizes a single main thread and performs tasks in a non-blocking manner with the help of an event loop.
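
For example, here is a minimal sketch of this non-blocking behavior (it assumes a file named example.txt exists in the current directory): the file read is handed off to the operating system, and its callback runs later via the event loop while the main thread keeps going.

const fs = require('fs');

console.log('Start reading file...');

// The callback runs later, once the event loop delivers the result
fs.readFile('example.txt', 'utf8', (err, data) => {
  if (err) {
    console.error(err);
    return;
  }
  console.log('File contents:', data);
});

// This line prints before the file contents arrive
console.log('Reached the end of the script');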

Creating a simple web server is as easy as this:

const http = require('http');
const PORT = 3000;

const server = http.createServer((req, res) => {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  res.end('Hello World');
});

server.listen(PORT, () => {
  console.log(`Server running at PORT:${PORT}/`);
});

If you have installed Node.js, you can try to run the above code. Node.js is very suitable for I/O intensive programs.

HTTP clients: accessing the web

HTTP clients are tools that send requests to a server and then receive the server's responses. All of the tools discussed below use an HTTP client under the hood to access the websites you want to scrape.

Request

Request is one of the most widely used HTTP clients in the Javascript ecosystem, but the author of the Request library has officially declared it deprecated. However, this does not mean that it is unavailable. Quite a few libraries are still using it and it is very easy to use. Using Request to make an HTTP request is very simple:

const request = require('request')
request('https://www.reddit.com/r/programming.json', function (
  error,
  response,
  body
) {
  console.error('error:', error)
  console.log('body:', body)
})

You can find the Request library on GitHub, and installing it is very simple. You can also find the deprecation notice and its meaning at https://github.com/request/re... .

Axios

Axios is a promise-based HTTP client that runs in both the browser and Node.js. If you use TypeScript, Axios has you covered with built-in types. Making an HTTP request with Axios is very simple; it ships with Promise support by default instead of using callbacks like Request:

const axios = require('axios')

axios
    .get('https://www.reddit.com/r/programming.json')
    .then((response) => {
        console.log(response)
    })
    .catch((error) => {
        console.error(error)
    });

If you prefer the async/await syntactic sugar over the Promise API, you can use that too, but since top-level await was still at stage 3 at the time of writing, we have to use an async function instead:

async function getForum() {
    try {
        const response = await axios.get(
            'https://www.reddit.com/r/programming.json'
       )
        console.log(response)
    } catch(error) {
        console.error(error)
    }
}

All you have to do is call getForum! The Axios library can be found on https://github.com/axios/axios .

Superagent

Like Axios, Superagent is another powerful HTTP client that supports promises and the async/await syntactic sugar. It has a fairly simple API like Axios, but Superagent is less popular because it has more dependencies.

Making an HTTP request with Superagent using callbacks, promises, or async/await looks like this:

const superagent = require("superagent")
const forumURL = "https://www.reddit.com/r/programming.json"

//callbacks
superagent
    .get(forumURL)
    .end((error, response) => {
        console.log(response)
    })

//promises
superagent
    .get(forumURL)
    .then((response) => {
        console.log(response)
    })
    .catch((error) => {
        console.error(error)
    })

//promises with async/await
async function getForum() {
    try {
        const response = await superagent.get(forumURL)
        console.log(response)
    } catch(error) {
        console.error(error)
    }
}

You can find Superagent at https://github.com/visionmedi... .

Regular expressions: the hard way

The simplest way to scrape the web without any dependencies is to apply a bunch of regular expressions to the HTML string you receive after querying a web page with an HTTP client. But regular expressions are not that flexible, and plenty of people, professionals and amateurs alike, struggle to write correct ones.

Let's give it a try. Suppose there is a label containing a username, and we want that username; this is similar to what you would have to do if you relied on regular expressions:

const htmlString = '<label>Username:John Doe</label>'
const result = htmlString.match(/<label>(.+)<\/label>/)

console.log(result[1], result[1].split(":")[1])
//Username:John Doe, John Doe

In JavaScript, match() usually returns an array containing everything that matches the regular expression. The second element (at index 1) contains the textContent or innerHTML of the <label> tag, which is what we want. But the result still contains some unwanted text ("Username:") that must be removed.

As you can see, even for a very simple use case there are many steps and a lot of work to be done. This is why you should rely on an HTML parser, which we will discuss next.

Cheerio: core jQuery for traversing the DOM

Cheerio is an efficient and lightweight library that lets you use jQuery's rich and powerful API on the server side. If you have used jQuery before, Cheerio will feel familiar: it removes all the DOM inconsistencies and browser-specific features and exposes an efficient API to parse and manipulate the DOM.

const cheerio = require('cheerio')
const $ = cheerio.load('<h2 class="title">Hello world</h2>')

$('h2.title').text('Hello there!')
$('h2').addClass('welcome')

$.html()
//<h2 class="title welcome">Hello there!</h2>

As you can see, Cheerio is very similar to jQuery.

However, Cheerio works differently from a web browser, which means it cannot:

  • Render any parsed or manipulated DOM elements
  • Apply CSS or load external resources
  • Run JavaScript

Therefore, if the website or web application you are trying to scrape is heavily dependent on JavaScript (such as a single-page application), Cheerio is not the best choice, and you may have to rely on the other options discussed later.

To demonstrate the power of Cheerio, we will try to scrape the r/programming forum on Reddit and get a list of post titles.

First, install Cheerio and Axios by running the following command: npm install cheerio axios.

Then create a new file named crawler.js and copy and paste the following code:

const axios = require('axios');
const cheerio = require('cheerio');

const getPostTitles = async() => {
    try {
        const {data} = await axios.get(
            'https://old.reddit.com/r/programming/'
        );
        const $ = cheerio.load(data);
        const postTitles = [];

        $('div > p.title > a').each((_idx, el) => {
            const postTitle = $(el).text()
            postTitles.push(postTitle)
        });

        return postTitles;
    } catch(error) {
        throw error;
    }
};

getPostTitles()
.then((postTitles) => console.log(postTitles));

getPostTitles() is an asynchronous function that scrapes the old Reddit r/programming forum. First, the HTML of the website is obtained with a simple HTTP GET request using the Axios HTTP client library, and then the HTML data is fed into Cheerio via the cheerio.load() function.

Then, with the help of the browser's Dev Tools, you can get a selector that locates all the list items. If you have used jQuery, $('div > p.title > a') will look very familiar. This gets all the posts; since you only want the title of each post, you have to loop through them, which is done with the help of the each() function.

To extract the text of each title, you must obtain the DOM element with the help of Cheerio (el refers to the current element). Calling text() on each element then gives you the text.

Now open the terminal and run node crawler.js; you will see a fairly long array of titles. Although this is a very simple use case, it demonstrates the simple nature of the API provided by Cheerio.

If your use case requires executing JavaScript and loading external sources, the following options will be helpful.

JSDOM: Node's DOM

JSDOM is a pure JavaScript implementation of the Document Object Model for use in Node.js. As mentioned earlier, the DOM is not available to Node, but JSDOM is the closest thing; it more or less mimics the browser.

Since a DOM is created, you can programmatically interact with the web application or website you want to scrape, for example by simulating a click on a button. If you are familiar with DOM manipulation, using JSDOM will be very simple.

const { JSDOM } = require('jsdom')
const { document } = new JSDOM(
    '<h2 class="title">Hello world</h2>'
).window
const heading = document.querySelector('.title')
heading.textContent = 'Hello there!'
heading.classList.add('welcome')

heading.innerHTML
//<h2 class="title welcome">Hello there!</h2>

The code uses JSDOM to create a DOM, and then you can use the same methods and properties as the browser DOM to manipulate the DOM.

To demonstrate how to use JSDOM to interact with a site, we will get the first post in the Reddit r/programming forum, upvote it, and then verify that the post has been upvoted.

First, run the following command to install jsdom and axios: npm install jsdom axios

Then create a file named crawler.js and copy and paste the following code:

const {JSDOM} = require("jsdom")
const axios = require('axios')

const upvoteFirstPost = async() => {
  try {
    const {data} = await axios.get("https://old.reddit.com/r/programming/");
    const dom = new JSDOM(data, {
      runScripts: "dangerously",
      resources: "usable"
    });
    const { document } = dom.window;
    const firstPost = document.querySelector("div > div.midcol > div.arrow");
    firstPost.click();
    const isUpvoted = firstPost.classList.contains("upmod");
    const msg = isUpvoted
      ? "Post has been upvoted successfully!"
      : "The post has not been upvoted!";

    return msg;
  } catch(error) {
    throw error;
  }
};

upvoteFirstPost().then(msg => console.log(msg));

upvoteFirstPost() is an asynchronous function that fetches the first post in r/programming and upvotes it. Axios sends an HTTP GET request to get the HTML of the specified URL, and a new DOM is then created from that HTML. The JSDOM constructor takes the HTML as its first parameter and the options as its second. The two options added here do the following:

  • runScripts: When set to "dangerously", it allows the execution of event handlers and any JavaScript code. If you are not sure about the trustworthiness of the scripts that will run, it is best to set runScripts to "outside-only", which attaches the JavaScript-spec-provided globals to the window object while preventing any script inside the page from being executed (see the sketch after this list).
  • resources: When set to "usable", it allows loading any external script declared with a <script> tag (for example, the jQuery library fetched from a CDN).
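
For comparison, here is a minimal standalone sketch (not part of the article's crawler) of the safer "outside-only" mode: scripts embedded in the page are not executed, but you can still run your own code against the window via window.eval().

const { JSDOM } = require("jsdom");

// The inline <script> below is NOT executed in "outside-only" mode
const dom = new JSDOM(
  `<p id="greeting">Hello</p><script>document.title = "changed";</script>`,
  { runScripts: "outside-only" }
);

// ...but code supplied from the outside can still run against the window
dom.window.eval('document.getElementById("greeting").textContent = "Hi there"');
console.log(dom.window.document.getElementById("greeting").textContent);
// Hi there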

After the DOM is created, the same DOM methods are used to get the upvote button of the first post and click it. To verify that the click actually worked, you can check classList for a class named upmod. A message is returned depending on whether it is present.

Open the terminal and run node crawler.js; you will see a neat string indicating whether the post has been upvoted. Although this example is trivial, you can build powerful things on top of it, for example a bot that upvotes a particular user's posts.

If you find this way of working with JSDOM somewhat clunky, if your use case depends on many such manipulations in practice, or if you need to recreate many different DOMs, then the following options will be a better choice.

Puppeteer: the headless browser

As the name suggests, Puppeteer allows you to manipulate a browser programmatically, just like a puppet is manipulated by its puppeteer. It provides developers with a high-level API that controls a headless version of Chrome by default.

(Image from the Puppeteer docs.) Puppeteer is more useful than the tools above because it lets you scrape the web as if a real person were interacting with a browser. This opens up some possibilities that were not available before:

  • You can get a screenshot or generate a PDF of the page.
  • You can scrape single-page applications and generate pre-rendered content.
  • You can automate many different user interactions, such as keyboard input, form submission, navigation, etc. (a sketch of this follows below).

It can also play an important role in tasks other than web crawling, such as UI testing and assisted performance optimization.
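
To give a feel for the user-interaction side mentioned in the list above, here is a minimal sketch; the URL and selectors are hypothetical placeholders, so treat it as an illustration of the API shape rather than a working crawler.

const puppeteer = require('puppeteer')

async function submitSearchForm() {
    const browser = await puppeteer.launch()
    const page = await browser.newPage()

    // Hypothetical page and selector, used only to illustrate the API
    await page.goto('https://example.com/search')
    await page.type('input[name="q"]', 'web scraping')  // keyboard input
    await page.keyboard.press('Enter')                  // form submission
    await page.waitForNavigation()                      // navigation

    await browser.close()
}

submitSearchForm().catch(console.error)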

You will often want to take a screenshot of a website, perhaps to get a look at a competitor's product catalog, and Puppeteer can do that. First run the following command to install Puppeteer: npm install puppeteer

This downloads a bundled version of Chromium, which takes up about 180 MB to 300 MB depending on your operating system. If you want to skip this download, Puppeteer's documentation describes an environment variable for doing so.

Let's try to get a screenshot and a PDF of the r/programming forum on Reddit. Create a new file named crawler.js, then copy and paste the following code:

const puppeteer = require('puppeteer')

async function getVisual() {
    try {
        const URL = 'https://www.reddit.com/r/programming/'
        const browser = await puppeteer.launch()
        const page = await browser.newPage()

        await page.goto(URL)
        await page.screenshot({ path: 'screenshot.png' })
        await page.pdf({ path: 'page.pdf' })

        await browser.close()
    } catch(error) {
        console.error(error)
    }
}

getVisual()

getVisual() is an asynchronous function that takes a screenshot and a PDF of the page stored in the URL variable. First, a browser instance is created via puppeteer.launch(), and then a new page is created. You can think of this page as a tab in a regular browser. Then page.goto() is called with the URL as its parameter to point the previously created page at that URL. Finally, the browser instance is destroyed along with the page.

Once the page has loaded, page.screenshot() and page.pdf() are used to get the screenshot and the PDF, respectively. You can also wait for the page's JavaScript load event before performing these operations, which is highly recommended in production.
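
For instance, here is a minimal sketch (an addition for illustration, not part of the original example) that waits for network activity to settle before taking the screenshot, using Puppeteer's waitUntil option.

const puppeteer = require('puppeteer')

async function getVisualWhenLoaded() {
    const browser = await puppeteer.launch()
    const page = await browser.newPage()

    // 'networkidle2' resolves once there are no more than 2 network
    // connections for at least 500 ms, a rough proxy for "fully loaded"
    await page.goto('https://www.reddit.com/r/programming/', {
        waitUntil: 'networkidle2'
    })
    await page.screenshot({ path: 'screenshot.png', fullPage: true })

    await browser.close()
}

getVisualWhenLoaded().catch(console.error)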

Run node crawler.js in the terminal. After a few seconds, you will notice that two files named screenshot.png and page.pdf have been created.

Nightmare: an alternative to Puppeteer

Nightmare is a high-level browser automation library similar to Puppeteer. It uses Electron, which is said to be about twice as fast as its predecessor, PhantomJS.

If you dislike Puppeteer to some extent or are put off by the size of the Chromium bundle, then Nightmare is an ideal choice. First, run the following command to install the Nightmare library: npm install nightmare

Once Nightmare is downloaded, we will use it to find ScrapingBee's website through the Google search engine. Create a file named crawler.js, then copy and paste the following code into it:

const Nightmare = require('nightmare')
const nightmare = Nightmare()

nightmare
    .goto('https://www.google.com/')
    .type("input[title='Search']", 'ScrapingBee')
    .click("input[value='Google Search']")
    .wait('#rso > div:nth-child(1) > div > div > div.r > a')
    .evaluate(
        () =>
            document.querySelector(
                '#rso > div:nth-child(1) > div > div > div.r > a'
            ).href
    )
    .end()
    .then((link) => {
        console.log('ScrapingBee Web Link:', link)
    })
    .catch((error) => {
        console.error('Search failed:', error)
    })

First, a Nightmare instance is created, then the instance is pointed at the Google search engine by calling goto(). Once loaded, the search box is fetched using its selector, and its value (an input tag) is changed to "ScrapingBee". When that is done, the search form is submitted by clicking the "Google Search" button. Nightmare is then told to wait until the first result link has loaded, after which a DOM method is used to fetch the value of the href attribute of the anchor tag containing the link.

Finally, after all operations are completed, the link will be printed to the console.

Summary

  • Node.js is a runtime environment that allows JavaScript to run on the server side. Thanks to its event loop, it has a "non-blocking" nature.
  • HTTP clients (such as Axios, Superagent, and Request) are used to send HTTP requests to a server and receive its responses.
  • Cheerio extracts the best of jQuery for the sole purpose of server-side web scraping, but it does not execute JavaScript code.
  • JSDOM creates a DOM from an HTML string, in line with the standard JavaScript specification, and lets you perform DOM operations on it.
  • Puppeteer and Nightmare are high-level browser automation libraries that allow you to manipulate web applications programmatically, as if a real person were interacting with them.
