- Web Scraping With Django Unchained
- Web Scraping With Django Using
- Web Scraping With Django For Beginners
Dec 14, 2020 Django is a high-level framework which is written in Python which allows us to create server-side web applications. In this article, we will see how to create a News application using Django. In this article, we will see how to create a News application using Django. Resolving the Complexities of Web Scraping with Python Picking the right tools, libraries, and frameworks. First and foremost, I can't stress enough the utility of browser tools for visual inspection. Effectively planning our web scraping approach upfront can probably save us hours of head scratching in advance.
In August this year, Django 3.1 arrived with support for Django async views. This was fantastic news but most people raised the obvious question – What can I do with it? There have been a few tutorials about Django asynchronous views that demonstrate asynchronous execution while calling asyncio.sleep
. But that merely led to the refinement of the popular question – What can I do with it besides sleep-ing?
The short answer is – it is a very powerful technique to write efficient views. For a detailed overview of what asynchronous views are and how they can be used, keep on reading. If you are new to asynchronous support in Django and like to know more background, read my earlier article: A Guide to ASGI in Django 3.0 and its Performance.
Django Async Views
Django now allows you to write views which can run asynchronously. First let’s refresh your memory by looking at a simple and minimal synchronous view in Django:
It takes a request object and returns a response object. In a real world project, a view does many things like fetching records from a database, calling a service or rendering a template. But they work synchronously or one after the other.
In Django’s MTV (Model Template View) architecture, Views are disproportionately more powerful than others (I find it comparable to a controller in MVC architecture though these things are debatable). Once you enter a view you can perform almost any logic necessary to create a response. This is why Asynchronous Views are so important. It lets you do more things concurrently.
It is quite easy to write an asynchronous view. For example the asynchronous version of our minimal example above would be:
This is a coroutine rather than a function. You cannot call it directly. An event loop needs to be created to execute it. But you do not have to worry about that difference since Django takes care of all that.
Note that this particular view is not invoking anything asynchronously. If Django is running in the classic WSGI mode, then a new event loop is created (automatically) to run this coroutine. So in this case, it might be slightly slower than the synchronous version. But that’s because you are not using it to run tasks concurrently.
So then why bother writing asynchronous views? The limitations of synchronous views become apparent only at a certain scale. When it comes to large scale web applications probably nothing beats FaceBook.
Views at Facebook
In August, Facebook released a static analysis tool to detect and prevent security issues in Python. But what caught my eye was how the views were written in the examples they had shared. They were all async!
Note that this is not Django but something similar. Currently, Django runs the database code synchronously. But that may change sometime in the future.
If you think about it, it makes perfect sense. Synchronous code can be blocked while waiting for an I/O operation for several microseconds. However, its equivalent asynchronous code would not be tied up and can work on other tasks. Therefore it can handle more requests with lower latencies. More requests gives Facebook (or any other large site) the ability to handle more users on the same infrastructure.
Even if you are not close to reaching Facebook scale, you could use Python’s asyncio as a more predictable threading mechanism to run many things concurrently. A thread scheduler could interrupt in between destructive updates of shared resources leading to difficult to debug race conditions. Compared to threads, coroutines can achieve a higher level of concurrency with very less overhead.
Misleading Sleep Examples
As I joked earlier, most of the Django async views tutorials show an example involving sleep. Even the official Django release notes had this example:
To a Python async guru this code might indicate the possibilities that were not previously possible. But to the vast majority, this code is misleading in many ways.
Firstly, the sleep happening synchronously or asynchronously makes no difference to the end user. The poor chap who just opened the URL linked to that view will have to wait for 0.5 seconds before it returns a cheeky “Hello, async world!”. If you are a complete novice, you may have expected an immediate reply and somehow the “hello” greeting to appear asynchronously half a second later. Of course, that sounds silly but then what is this example trying to do compared to a synchronous time.sleep()
inside a view?
The answer is, as with most things in the asyncio world, in the event loop. If the event loop had some other task waiting to be run then that half second window would give it an opportunity to run that. Note that it may take longer than that window to complete. Cooperative Multithreading assumes that everyone works quickly and hands over the control promptly back to the event loop.
Secondly, it does not seem to accomplish anything useful. Some command-line interfaces use sleep to give enough time for users to read a message before disappearing. But it is the opposite for web applications - a faster response from the web server is the key to a better user experience. So by slowing the response what are we trying to demonstrate in such examples?
The best explanation for such simplified examples I can give is convenience. It needs a bit more setup to show examples which really need asynchronous support. That’s what we are trying to explore here.
Better examples
A rule of thumb to remember before writing an asynchronous view is to check if it is I/O bound or CPU-bound. A view which spends most of the time in a CPU-bound activity for e.g. matrix multiplication or image manipulation would really not benefit from rewriting them to async views. You should be focussing on the I/O bound activities.
Invoking Microservices
Most large web applications are moving away from a monolithic architecture to one composed of many microservices. Rendering a view might require the results of many internal or external services.
In our example, an ecommerce site for books renders its front page - like most popular sites - tailored to the logged in user by displaying recommended books. The recommendation engine is typically implemented as a separate microservice that makes recommendations based on past buying history and perhaps a bit of machine learning by understanding how successful its past recommendations were.
In this case, we also need the results of another microservice that decides which promotional banners to display as a rotating banner or slideshow to the user. These banners are not tailored to the logged in user but change depending on the items currently on sale (active promotional campaign) or date.
Let’s look at how a synchronous version of such a page might look like:
Here instead of the popular Python requests library we are using the httpx library because it supports making synchronous and asynchronous web requests. The interface is almost identical.
The problem with this view is that the time taken to invoke these services add up since they happen sequentially. The Python process is suspended until the first service responds which could take a long time in a worst case scenario.
Let’s try to run them concurrently using a simplistic (and ineffective) await call:
Notice that the view has changed from a function to a coroutine (due to async def
keyword). Also note that there are two places where we await for a response from each of the services. You don’t have to try to understand every line here, as we will explain with a better example.
Interestingly, this view does not work concurrently and takes the same amount of time as the synchronous view. If you are familiar with asynchronous programming, you might have guessed that simply awaiting a coroutine does not make it run other things concurrently, you will just yield control back to the event loop. The view still gets suspended.
Let’s look at a proper way to run things concurrently:
If the two services we are calling have similar response times, then this view should complete in _half _the time compared to the synchronous version. This is because the calls happen concurrently as we would want.
Let’s try to understand what is happening here. There is an outer try…except block to catch request errors while making either of the HTTP calls. Then there is an inner async…with block which gives a context having the client object.
The most important line is one with the asyncio.gather call taking the coroutines created by the two client.get
calls. The gather call will execute them concurrently and return only when both of them are completed. The result would be a tuple of responses which we will unpack into two variables response_p
and response_r
. If there were no errors, these responses are populated in the context sent for template rendering.
Microservices are typically internal to the organization hence the response times are low and less variable. Yet, it is never a good idea to rely solely on synchronous calls for communicating between microservices. As the dependencies between services increases, it creates long chains of request and response calls. Such chains can slow down services.
Why Live Scraping is Bad
We need to address web scraping because so many asyncio examples use them. I am referring to cases where multiple external websites or pages within a website are concurrently fetched and scraped for information like live stock market (or bitcoin) prices. The implementation would be very similar to what we saw in the Microservices example.
But this is very risky since a view should return a response to the user as quickly as possible. So trying to fetch external sites which have variable response times or throttling mechanisms could be a poor user experience or even worse a browser timeout. Since microservice calls are typically internal, response times can be controlled with proper SLAs.
Ideally, scraping should be done in a separate process scheduled to run periodically (using celery or rq). The view should simply pick up the scraped values and present them to the users.
Serving Files
Django addresses the problem of serving files by trying hard not to do it itself. This makes sense from a “Do not reinvent the wheel” perspective. After all, there are several better solutions to serve static files like nginx.
But often we need to serve files with dynamic content. Files often reside in a (slower) disk-based storage (we now have much faster SSDs). While this file operation is quite easy to accomplish with Python, it could be expensive in terms of performance for large files. Regardless of the file’s size, this is a potentially blocking I/O operation that could potentially be used for running another task concurrently.
Imagine we need to serve a PDF certificate in a Django view. However the date and time of downloading the certificate needs to be stored in the metadata of the PDF file, for some reason (possibly for identification and validation).
We will use the aiofiles library here for asynchronous file I/O. The API is almost the same as the familiar Python’s built-in file API. Here is how the asynchronous view could be written:
This example illustrates why we need asynchronous template rendering in Django. But until that gets implemented, you could use aiofiles library to pull local files without skipping a beat.
There are downsides to directly using local files instead of Django’s staticfiles. In the future, when you migrate to a different storage space like Amazon S3, make sure you adapt your code accordingly.
Handling Uploads
Web Scraping With Django Unchained
On the flip side, uploading a file is also a potentially long, blocking operation. For security and organizational reasons, Django stores all uploaded content into a separate ‘media’ directory.
If you have a form that allows uploading a file, then we need to anticipate that some pesky user would upload an impossibly large one. Thankfully Django passes the file to the view as chunks of a certain size. Combined with aiofile’s ability to write a file asynchronously, we could support highly concurrent uploads.
Again this is circumventing Django’s default file upload mechanism, so you need to be careful about the security implications.
Where To Use
Django Async project has full backward compatibility as one of its main goals. So you can continue to use your old synchronous views without rewriting them into async. Asynchronous views are not a panacea for all performance issues, so most projects will still continue to use synchronous code since they are quite straightforward to reason about.
In fact, you can use both async and sync views in the same project. Django will take care of calling the view in the appropriate manner. However, if you are using async views it is recommended to deploy the application on ASGI servers.
This gives you the flexibility to try asynchronous views gradually especially for I/O intensive work. You need to be careful to pick only async libraries or mix them with sync carefully (use the async_to_sync and sync_to_async adaptors).
Hopefully this writeup gave you some ideas.
Thanks to Chillar Anand and Ritesh Agrawal for reviewing this post. All illustrations courtesy of Old Book Illustrations
Macro Recorder and Diagram Designer
Download FMinerFor Windows:Free Trial 15 Days, Easy to Install and Uninstall Completely
Pro and Basic edition are for Windows, Mac edition just for Mac OS 10. Recommended Pro/Mac edition with full features.
or
FMiner is a software for web scraping, web data extraction, screen scraping, web harvesting, web crawling and web macro support for windows and Mac OS X.
It is an easy to use web data extraction tool that combines best-in-class features with an intuitive visual project design tool, to make your next data mining project a breeze.
Whether faced with routine web scrapping tasks, or highly complex data extraction projects requiring form inputs, proxy server lists, ajax handling and multi-layered multi-table crawls, FMiner is the web scrapping tool for you.
With FMiner, you can quickly master data mining techniques to harvest data from a variety of websites ranging from online product catalogs and real estate classifieds sites to popular search engines and yellow page directories.
Simply select your output file format and record your steps on FMiner as you walk through your data extraction steps on your target web site.
FMiner's powerful visual design tool captures every step and models a process map that interacts with the target site pages to capture the information you've identified.
Using preset selections for data type and your output file, the data elements you've selected are saved in your choice of Excel, CSV or SQL format and parsed to your specifications.
And equally important, if your project requires regular updates, FMiner's integrated scheduling module allows you to define periodic extractions schedules at which point the project will auto-run new or incremental data extracts.
Easy to use, powerful web scraping tool
- Visual design tool
Design a data extraction project with the easy to use visual editor in less than ten minutes.
- No coding required
Use the simple point and click interface to record a scrape project much as you would click through the target site.
- Advanced features
Extract data from hard to crawl Web 2.0 dynamic websites that employ Ajax and Javascript.
- Multiple Crawl Path Navigation Options
Drill through site pages using a combination of link structures, automated form input value entries, drop-down selections or url pattern matching.
- Keyword Input Lists
Upload input values to be used with the target website's web form to automatically query thousands of keywords and submit a form for each keyword.
- Nested Data Elements
Breeze through multilevel nested extractions. Crawl link structures to capture nested product catalogue, search results or directory content.
- Multi-Threaded Crawl
Expedite data extraction with FMiner's multi-browser crawling capability.
- Export Formats
Export harvested records in any number of formats including Excel, CSV, XML/HTML, JSON and popular databases (Oracle, MS SQL, MySQL).
- CAPCHA Tests
Get around target website CAPCHA protection using manual entry or third-party automated decaptcha services.
More Features>>
- If you want us build an FMiner project to scrape a website:
Request a Customized Project (Starting at $99), we can make any complex project for you.
Web Scraping With Django Using
This is working very very well. Nice work. Other companies were quoting us $5,000 - $10,000 for such a project. Thanks for your time and help, we truly appreciate it.
Web Scraping With Django For Beginners
--Nick