How to use Web Scraping and Puppeteer to analyze Christie’s, Sotheby’s and Phillips’ Auctions.

Mike Danilchyk
3 min read · Jun 29, 2020


Web Scraping is one of the most popular methods of extracting data from web pages so that it can be organized and analyzed. In essence it can be called "site parsing": information is collected and exported in a more convenient format, whether that is a table or an API.

Web Scraping tools let you collect new or updated data not only manually but also automatically, depending on your goals.

What is Web Scraping used for?

· Collecting data for market research, so that strategically important business decisions can be prepared in a short time.
· Extracting specific information (phone numbers, e-mails, addresses) from various websites to build your own lists.
· Collecting product data for competitor analysis.
· Cleaning up site data before migration.
· Collecting financial data.
· Tracking resumes and job vacancies in HR work.

The Lansoft team has used this method quite successfully. That is why we want to share one of our data-collection cases: analyzing art-object data sets for the New York company Pryph.

Pryph analyzes famous auction houses such as Christie’s, Sotheby’s and Phillips and draws conclusions about the popularity of different artists.

For this work we chose Puppeteer, a JavaScript library for Node.js that controls a Chrome browser without a user interface (headless mode).

With this library it is quite easy to read data from different websites automatically, or to build so-called web scrapers that simulate user actions.

Admittedly, there are lighter-weight ways to scrape sites with Node.js, such as plain HTTP requests combined with an HTML parser.

The reasons for choosing Puppeteer in our case were:
· only 3 sites needed to be analyzed, each with clear sections and structure;
· the tool is actively promoted by Google;
· it emulates a real user working in the UI, which lowers the risk of being banned or blocked as a potential DDoS attack.

So, our challenge was to visit the websites of the auction houses and collect sales data for all lots, for each type of auction, from 2006 to 2019.

For example, here is a piece of code written with Puppeteer to extract links to pictures of lots from the auction house Phillips:
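The original code embed is not reproduced here, so below is a minimal sketch of the approach. The listing URL and the CSS selector are illustrative assumptions, not Phillips’ actual markup, which must be inspected first:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Launch headless Chrome and open an auction listing page.
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://www.phillips.com/auctions', {
    waitUntil: 'networkidle2',
  });

  // Collect the image URL of every lot on the page.
  // '.lot img' is a placeholder selector for illustration only.
  const imageLinks = await page.$$eval('.lot img', imgs =>
    imgs.map(img => img.src)
  );

  console.log(imageLinks);
  await browser.close();
})();
```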

In a similar way, the Lansoft team needed to find, for each lot, the artist’s name, a description of the work, the price, details of the sale and a link to the artwork.

Examples of lots links:
https://www.phillips.com/detail/takashi-murakami/HK010120/110
https://www.sothebys.com/en/buy/auction/2020/contemporary-art-evening-auction/lynette-yiadom-boayke-cloister?locale=en

For example, in the picture above we can see the artist’s name, TAKASHI MURAKAMI, the title of the work, “Blue Flower Painting B”, and the estimate of $231,000–359,000. We collected all the necessary fields and saved them in CSV files, divided by year.

It looked like this:

As a result, we received sets of CSV files with sales for different years. Each file was about 6,000 lines long. The client then applied his own algorithms to perform trend analysis for various artists.

The results can be found at: http://pryph.org/insights

But there are some nuances when working with Puppeteer:
· some resources may block access if suspicious activity is detected;
· Puppeteer is not very efficient; its performance can be improved by throttling animations, limiting network calls, etc.;
· you must close the browser instance to end the session;
· the page/browser context is different from the Node context in which the application runs;
· using a browser, even in headless mode, is not efficient or fast enough for large-scale data analysis.
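To illustrate the point about ending the session: a common pattern is to close the browser in a `finally` block, so the session ends even if the scraping code throws. The URL here is a placeholder:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com');
    // ... scrape the page here ...
  } finally {
    // Always end the session, even if scraping throws,
    // so no orphaned Chrome processes are left behind.
    await browser.close();
  }
})();
```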


Mike Danilchyk

Co-Founder & CTO — Lansoft.dev | CTO — Web3soft | Blockchain, Crypto and NFT Expert