Scrape Google Lens with Python
This blog post is a step-by-step tutorial on scraping Google Lens results with Python.
What will be scraped
Using Google Lens API from SerpApi
If you don't need an explanation, have a look at the full code example in the online IDE.
from serpapi import GoogleSearch
import json
params = {
'api_key': '...',
'engine': 'google_lens',
'url': 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg',
'hl': 'en',
}
search = GoogleSearch(params) # data extraction on the SerpApi backend
google_lens_results = search.get_dict() # JSON -> Python dict
del google_lens_results['search_metadata']
del google_lens_results['search_parameters']
print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))
Why use an API?
There are a couple of reasons why you may want to use an API, ours in particular:
- No need to create a parser from scratch and maintain it.
- No need to bypass blocks from Google, such as solving CAPTCHAs or IP blocks.
- No need to pay for proxies and CAPTCHA solvers.
- No need to use browser automation.
SerpApi handles everything on the backend with fast response times (under ~4.3 seconds per request) and without browser automation, which makes it much faster. Response times and status rates are shown on the SerpApi Status page:
Head to the Google Lens playground for a live and interactive demo.
Preparation
Install library:
pip install google-search-results
google-search-results is a SerpApi API package.
Code Explanation
Import libraries:
from serpapi import GoogleSearch
import json
| Library | Purpose |
|---|---|
| GoogleSearch | to scrape and parse Google results using the SerpApi web scraping library. |
| json | to convert the extracted data to a JSON object. |
The parameters are defined for generating the URL. If you want to pass other parameters to the URL, you can do so using the params dictionary:
params = {
'api_key': '...',
'engine': 'google_lens',
'url': 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg',
'hl': 'en',
}
| Parameters | Explanation |
|---|---|
| api_key | Parameter defines the SerpApi private key to use. You can find it under your account -> API key. |
| engine | Set parameter to google_lens to use the Google Lens API engine. |
| url | Parameter defines the URL of an image to perform the Google Lens search. |
| hl | Parameter defines the language to use for the Google Lens search. It's a two-letter language code. Head to the Google languages page for a full list of supported Google languages. |
Note: You can also add other API Parameters.
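For example, the general-purpose SerpApi no_cache parameter forces a fresh search instead of returning a cached result. A minimal sketch of the same params dictionary with this optional parameter added:
# same search as above, with the optional no_cache parameter added
params = {
    'api_key': '...',            # your SerpApi private key
    'engine': 'google_lens',     # SerpApi engine to use
    'url': 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg',
    'hl': 'en',                  # interface language
    'no_cache': True             # skip the SerpApi cache and fetch fresh results
}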
Then, we create a search object where the data is retrieved from the SerpApi backend. In the google_lens_results dictionary, we get the data from JSON:
search = GoogleSearch(params) # data extraction on the SerpApi backend
google_lens_results = search.get_dict() # JSON -> Python dict
The google_lens_results dictionary, in addition to the necessary data, contains information about the request itself. The request information is not needed, so we remove the corresponding keys using the del statement:
del google_lens_results['search_metadata']
del google_lens_results['search_parameters']
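If you prefer the cleanup not to raise a KeyError when one of these keys is missing, dict.pop() with a default value is a more defensive alternative. A minimal sketch:
google_lens_results.pop('search_metadata', None)    # remove request metadata, ignore if absent
google_lens_results.pop('search_parameters', None)  # remove request parameters, ignore if absent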
After all the data is retrieved, it is output in JSON format:
print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))
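Besides printing, you might want to keep the results for later analysis. A minimal sketch that writes the same dictionary to a file (the google_lens_results.json filename is arbitrary):
# save the extracted data to disk for later use
with open('google_lens_results.json', 'w', encoding='utf-8') as file:
    json.dump(google_lens_results, file, indent=2, ensure_ascii=False)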
Output
{
"reverse_image_search": {
"link": "https://www.google.com/search?tbs=sbi:AMhZZiurdULpuTy4_1HSkPv2ZrEBN9afXDH2j7s2drhaSQmdFuOJlf9HaxhrjxEfBrWzj1xi-ZONFSwWi3UlhnMtRXlu68S24Kv5fLuNstTqFQfpUQXGbPBuplF8jDJuvLTDAJow06N44R7keGB1GOU5fRzsc4rirzA"
},
"knowledge_graph": [
{
"title": "Black cat",
"link": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&hl=en&gl=US",
"more_images": {
"link": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&ved=0EOTpBwgAKAAwAA&source=.lens.button&tbm=isch&hl=en&gl=US",
"serpapi_link": "https://serpapi.com/search.json?device=desktop&engine=google&gl=US&google_domain=google.com&hl=en&q=Black+cat&tbm=isch"
},
"thumbnail": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w",
"images": [
{
"title": "Image #1 for Black cat",
"source": "https://vbspca.com/tag/stigma/",
"link": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w",
"size": {
"width": 293,
"height": 172
}
},
... other images
]
},
... other knowledge graph results
],
"visual_matches": [
{
"position": 1,
"title": "Pet Talk: Smoke can create problems quickly for your cat | VailDaily.com",
"link": "https://www.vaildaily.com/opinion/pet-talk-smoke-can-create-problems-quickly-for-your-cat/",
"source": "vaildaily.com",
"source_icon": "https://encrypted-tbn0.gstatic.com/favicon-tbn?q=tbn:ANd9GcSXpzpJuQgYt20Jd-moiGdOr6HoDpS-WQ_vjcfrNvtLJy_gjDrYJIs3abOVeBb7g24x5kLNBg2T-KGdiQ_NkFkcBjt2s7exhkQg46swp-DMTF3S1_lemg",
"thumbnail": "https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQCjR3dx5H8xz9fSevbe6JqPtBlakSxJwrECbaMS64UcP05CwC4"
},
... other visual matches results
]
}
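Because the response is a regular Python dictionary, you can also work with individual fields instead of the whole JSON. A minimal sketch that prints a short summary of each visual match, assuming the visual_matches key shown above is present:
# each visual match is a dict with position, title, link, source and thumbnail keys
for match in google_lens_results.get('visual_matches', []):
    print(f"{match['position']}. {match['title']} -> {match['link']}")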
DIY solution
This section shows a comparison between our solution and a DIY solution.
When you open a regular Google Lens link, it redirects to another link. The GIF below shows this:
The data on the resulting page is correspondingly different, and there is no way to extract it without reverse engineering. For simplicity, the DIY solution uses playwright, which helps to extract data from the modified link.
The data extraction itself is done with selectolax because it has the Lexbor parser, which is incredibly fast. In terms of syntax, it is very similar to both bs4 and parsel, making it easy to use. Please note that selectolax does not currently support XPath.
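To illustrate how similar the syntax is, here is a minimal sketch that runs the same CSS query with selectolax and bs4 (the HTML snippet is made up for the example):
from selectolax.lexbor import LexborHTMLParser
from bs4 import BeautifulSoup

html = '<div class="title">Black cat</div>'  # made-up snippet for illustration

# selectolax: css_first() returns the first node matching the selector
print(LexborHTMLParser(html).root.css_first('.title').text())               # Black cat

# bs4: select_one() is the equivalent call
print(BeautifulSoup(html, 'html.parser').select_one('.title').get_text())   # Black cat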
Example code to integrate:
from playwright.sync_api import sync_playwright
from selectolax.lexbor import LexborHTMLParser
import json
def run(playwright):
image_url = 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg'
page = playwright.chromium.launch(headless=True).new_page()
page.goto(f'https://lens.google.com/uploadbyurl?url={image_url}&hl=en')
parser = LexborHTMLParser(page.content())
page.close()
reverse_image_search = {
'link': parser.root.css_first('.kuwdsf .VfPpkd-RLmnJb').attributes['href']
}
knowledge_graph = {
'title': parser.root.css_first('.DeMn2d').text(),
'subtitle': parser.root.css_first('.XNTym').text() if parser.root.css_first('.XNTym') else None,
'link': parser.root.css_first('.OCDsub .VfPpkd-RLmnJb').attributes['href'],
'more_images': parser.root.css_first('[aria-label="More Images"]').attributes['href'],
'thumbnail': parser.root.css_first('.oLfv5c .FH8DCc').attributes['src'],
'images': [
{
'title': image.attributes['aria-label'],
'source': image.attributes['href'],
'link': image.css_first('.wETe9b').attributes['src']
}
for image in parser.root.css('.Y02Gld a')
]
}
visual_matches = [
{
'title': result.css_first('.UAiK1e').text(),
'link': result.css_first('.GZrdsf').attributes['href'],
'source': result.css_first('.fjbPGe').text(),
'source_icon': result.css_first('.KRdrw').attributes['src'],
'thumbnail': result.css_first('.jFVN1').attributes['src']
}
for result in parser.root.css('.xuQ19b')
]
google_lens_results = {
'reverse_image_search': reverse_image_search,
'knowledge_graph': knowledge_graph,
'visual_matches': visual_matches
}
print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))
with sync_playwright() as playwright:
run(playwright)
Note: This code does not work in the online IDE because Replit does not support Playwright. You can follow the steps described below to check how the DIY solution works locally.
Preparation
Install library:
pip install playwright selectolax
Install the required browser:
playwright install chromium
Code Explanation
Import libraries:
from playwright.sync_api import sync_playwright
from selectolax.lexbor import LexborHTMLParser
import json
| Library | Purpose |
|---|---|
| sync_playwright | for the synchronous API. Playwright also has an asynchronous API that uses the asyncio module. |
| LexborHTMLParser | a fast HTML5 parser with CSS selectors that uses the Lexbor engine. |
| json | to convert the extracted data to a JSON object. |
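For reference, here is a minimal sketch of what the asynchronous variant could look like with the same Google Lens URL; the rest of this tutorial sticks to the synchronous API:
import asyncio
from playwright.async_api import async_playwright

image_url = 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg'

async def main():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(f'https://lens.google.com/uploadbyurl?url={image_url}&hl=en')
        html = await page.content()  # the same HTML you would pass to LexborHTMLParser
        await browser.close()

asyncio.run(main())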
Declare a function:
def run(playwright):
# further code ...
The image_url variable is defined, which contains the URL of the image:
image_url = 'https://user-images.githubusercontent.com/81998012/210290011-c175603d-f319-4620-b886-1eaad5c94d84.jpg'
Initialize playwright, connect to chromium, launch() a browser, open a new_page(), and goto() a given URL:
page = playwright.chromium.launch(headless=True).new_page()
page.goto(f'https://lens.google.com/uploadbyurl?url={image_url}&hl=en')
| Parameters | Explanation |
|---|---|
| playwright.chromium | is a connection to the Chromium browser instance. |
| launch() | will launch the browser; the headless argument runs it in headless mode. The default is True. |
| new_page() | creates a new page in a new browser context. |
| page.goto() | will make a request to the provided website. |
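While debugging selectors, it can help to watch the browser instead of running it headless. A minimal sketch with headless=False and Playwright's optional slow_mo argument, which delays every action by the given number of milliseconds:
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # visible browser, every action slowed down by 500 ms for easier debugging
    browser = playwright.chromium.launch(headless=False, slow_mo=500)
    page = browser.new_page()
    page.goto('https://lens.google.com/')
    browser.close()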
After the page has loaded, pass the HTML content to Lexbor and close the browser:
parser = LexborHTMLParser(page.content())
page.close()
The first thing to extract is the reverse image search link. To do this, you need to pass the .kuwdsf .VfPpkd-RLmnJb selector that is responsible for this element to the css_first() method. Then extract the value of the href attribute from attributes:
reverse_image_search = {
'link': parser.root.css_first('.kuwdsf .VfPpkd-RLmnJb').attributes['href']
}
The algorithm for extracting data from the knowledge graph works similarly. The difference is in extracting title and subtitle. For them, the text content is retrieved, so the corresponding text() method is used. Sometimes there may not be a subtitle, so a ternary expression is used for such cases:
knowledge_graph = {
'title': parser.root.css_first('.DeMn2d').text(),
'subtitle': parser.root.css_first('.XNTym').text() if parser.root.css_first('.XNTym') else None,
'link': parser.root.css_first('.OCDsub .VfPpkd-RLmnJb').attributes['href'],
'more_images': parser.root.css_first('[aria-label="More Images"]').attributes['href'],
'thumbnail': parser.root.css_first('.oLfv5c .FH8DCc').attributes['src'],
'images': [
{
'title': image.attributes['aria-label'],
'source': image.attributes['href'],
'link': image.css_first('.wETe9b').attributes['src']
}
for image in parser.root.css('.Y02Gld a')
]
}
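Keep in mind that class names like .DeMn2d are generated by Google and change over time, so any css_first() call above will raise an AttributeError once its selector stops matching. A small defensive helper could return None instead (a sketch; the safe_attr name is made up):
def safe_attr(node, selector, attribute=None):
    # return the attribute (or text) of the first match, or None if the selector no longer matches
    found = node.css_first(selector)
    if found is None:
        return None
    return found.attributes.get(attribute) if attribute else found.text()

# usage, assuming the parser object from the code above:
# title = safe_attr(parser.root, '.DeMn2d')
# thumbnail = safe_attr(parser.root, '.oLfv5c .FH8DCc', 'src')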
For both knowledge graph images and visual matches, list comprehensions are used to provide a concise way to create lists. To find multiple elements and iterate over them, the css() method is used:
visual_matches = [
{
'title': result.css_first('.UAiK1e').text(),
'link': result.css_first('.GZrdsf').attributes['href'],
'source': result.css_first('.fjbPGe').text(),
'source_icon': result.css_first('.KRdrw').attributes['src'],
'thumbnail': result.css_first('.jFVN1').attributes['src']
}
for result in parser.root.css('.xuQ19b')
]
The google_lens_results dictionary is created and the previously extracted data is added to the corresponding keys:
google_lens_results = {
'reverse_image_search': reverse_image_search,
'knowledge_graph': knowledge_graph,
'visual_matches': visual_matches
}
After all the data is retrieved, it is output in JSON format:
print(json.dumps(google_lens_results, indent=2, ensure_ascii=False))
Run your code using context manager:
with sync_playwright() as playwright:
run(playwright)
Output
{
"reverse_image_search": {
"link": "https://www.google.com/search?tbs=sbi:AMhZZivbhNZ5ZFwCBpcEUAlEHVFDQnaZIC-4PcD5za7g6xuScvksUbf8osCVDaAg70m3b2eMkaodmPSm_1PiNZgCOEV5wma9PX1piaCV3GtLReFcsjRlP7On4aF3HUJAyPinMnEYGIATNPvQ7PLMoMZlmUXj4uQ1xHw"
},
"knowledge_graph": {
"title": "Black cat",
"subtitle": null,
"link": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&hl=en&gl=US",
"more_images": "https://www.google.com/search?q=Black+cat&kgmid=/m/03dj64&ved=0EOTpBwgAKAAwAA&source=.lens.button&tbm=isch&hl=en&gl=US",
"thumbnail": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w",
"images": [
{
"title": "Image #1 for Black cat",
"source": "https://vbspca.com/tag/stigma/",
"link": "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQEdppH7x_edJGSKSky2KSKK773r4HOp55AnejH0-sYBpO3-M5w"
},
... other images
]
},
"visual_matches": [
{
"title": "Pet Talk: Smoke can create problems quickly for your cat | VailDaily.com",
"link": "https://www.vaildaily.com/opinion/pet-talk-smoke-can-create-problems-quickly-for-your-cat/",
"source": "vaildaily.com",
"source_icon": "https://encrypted-tbn0.gstatic.com/favicon-tbn?q=tbn:ANd9GcSXpzpJuQgYt20Jd-moiGdOr6HoDpS-WQ_vjcfrNvtLJy_gjDrYJIs3abOVeBb7g24x5kLNBg2T-KGdiQ_NkFkcBjt2s7exhkQg46swp-DMTF3S1_lemg",
"thumbnail": "https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQCjR3dx5H8xz9fSevbe6JqPtBlakSxJwrECbaMS64UcP05CwC4"
},
... other visual matches results
]
}
Links
Add a Feature Request or a Bug