Crawler Hints Update: Cloudflare Supports IndexNow and Announces General Availability

2021-10-18

In the midst of the hottest summer on record, Cloudflare held its first ever Impact Week. We announced a variety of products and initiatives that aim to make the Internet and our planet a better place, with a focus on environmental, social, and governance projects. Today, we’re excited to share an update on Crawler Hints, an initiative announced during Impact Week. Crawler Hints is a service that improves the operating efficiency of approximately 45% of the Internet traffic that comes from web crawlers and bots.

Crawler Hints achieves this efficiency improvement by giving crawlers information about what they have crawled previously and whether it makes sense to crawl a website again.

Today we are excited to announce two updates for Crawler Hints:

The first: Crawler Hints now supports IndexNow, a new protocol that allows websites to notify search engines whenever content on their website is created, updated, or deleted. By collaborating with Microsoft and Yandex, Cloudflare can help improve the efficiency of their search engine infrastructure, customer origin servers, and the Internet at large.
The second: Crawler Hints is now generally available to all Cloudflare customers for free. Customers can benefit from these more efficient crawls with a single button click. If you want to enable Crawler Hints, you can do so in the Cache Tab of the Dashboard.

What problem does Crawler Hints solve?

Crawlers help make the Internet work. Crawlers are automated services that travel the Internet looking for… well, whatever they are programmed to look for. To power experiences that rely on indexing content from across the web, search engines and similar services operate massive networks of bots that crawl the Internet to identify the content most relevant to a user query. But because content on the web is always changing, and there is no central clearinghouse for when these changes happen on websites, search engine crawlers have a Sisyphean task. They must continuously wander the Internet, making guesses on how frequently they should check a given site for updates to its content.

Companies that run search engines have worked hard to make the process as efficient as possible, pushing the state of the art for crawl cadence and infrastructure efficiency. But there remains one clear area of waste: excessive crawling.

At Cloudflare, we see traffic from all the major search crawlers, and have spent the last year studying how often these bots revisit a page that hasn't changed since they last saw it. Every one of these visits is a waste. And, unfortunately, our observation suggests that 53% of this crawler traffic is wasted.
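To make the idea of a "wasted" visit concrete, here is a minimal sketch of how such a share could be computed from a crawl log. It is an illustration only, not the methodology behind the figure above: the log format (a URL paired with a content fingerprint) and the choice of all visits as the denominator are assumptions, and `wasted_crawl_share` is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple


def wasted_crawl_share(visits: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of visits that found unchanged content.

    `visits` is a chronological sequence of (url, content_fingerprint)
    pairs. A visit counts as wasted when the fingerprint matches the one
    recorded at the previous visit to the same URL. First visits are never
    wasted. Both the log format and this definition are assumptions.
    """
    last_seen: Dict[str, str] = {}
    total = 0
    wasted = 0
    for url, fingerprint in visits:
        total += 1
        if last_seen.get(url) == fingerprint:
            wasted += 1
        last_seen[url] = fingerprint
    return wasted / total if total else 0.0


if __name__ == "__main__":
    # Made-up log: each page is revisited once without having changed.
    log = [
        ("/a", "v1"),
        ("/a", "v1"),  # wasted: nothing changed since the last visit
        ("/a", "v2"),
        ("/b", "v1"),
        ("/b", "v1"),  # wasted
    ]
    print(wasted_crawl_share(log))
```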

With Crawler Hints, we expect to make this task a bit more tractable by providing an additional heuristic to the people who run these crawlers. This will allow them to know when content has been changed or added to a site instead of relying on preferences or previous changes that might not reflect the true change cadence for a site. Crawler Hints aims to increase the proportion of relevant crawls and limit crawls that don’t find fresh content, improving customer experience and reducing the need for repeated crawls.

Cloudflare sits in a unique position on the Internet to help give crawlers hints about when they should recrawl a site. Don’t knock on a website’s door every 30 seconds to see if anything is new when Cloudflare can proactively tell your crawler when it’s the best time to index new or changed content. That’s Crawler Hints in a nutshell!
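The "every 30 seconds" picture can be turned into rough arithmetic. The sketch below compares blind polling with one fetch per actual change. The figure of four changes per day is a made-up input, and the model assumes at most one useful fetch per change, so treat the result as an illustration rather than a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def polling_fetches_per_day(interval_seconds: int) -> int:
    """Number of fetches per day when polling at a fixed interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return SECONDS_PER_DAY // interval_seconds


def wasted_polling_share(interval_seconds: int, changes_per_day: int) -> float:
    """Share of polling fetches that find nothing new.

    Assumption: each content change makes at most one fetch useful.
    """
    fetches = polling_fetches_per_day(interval_seconds)
    if fetches == 0:
        # The interval is longer than a day, so nothing is fetched at all.
        return 0.0
    useful = min(changes_per_day, fetches)
    return 1 - useful / fetches


if __name__ == "__main__":
    # Hypothetical site that changes four times a day, polled every 30 seconds.
    print(polling_fetches_per_day(30))
    print(f"{wasted_polling_share(30, 4):.2%}")
```

With a hint delivered at each change, the same hypothetical site would need only four fetches a day instead of several thousand.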

If you want to learn more about Crawler Hints, see the original blog post.

What is IndexNow?

IndexNow is a standard written by Microsoft and Yandex. It aims to provide an efficient way of signaling to search engines and other crawlers when they should crawl content. Cloudflare’s Crawler Hints now supports IndexNow.

“In its simplest form, IndexNow is a simple ping so that search engines know that a URL and its content has been added, updated, or deleted, allowing search engines to quickly reflect this change in their search results.”- www.indexnow.org

By enabling Crawler Hints on your website, with the simple click of a button, Cloudflare will take care of signaling to these search engines when your content has changed via the IndexNow protocol. You don’t need to do anything else!
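For readers curious what such a ping looks like on the wire, here is a minimal sketch based on the public IndexNow documentation. It does not describe how Cloudflare sends hints internally. The endpoint is the shared one listed at indexnow.org as we understand it, and the key, host, and URLs are placeholders. The code only builds the request. It does not send anything.

```python
import json
from typing import List
from urllib.parse import urlencode

# Assumption: the shared endpoint from the public IndexNow documentation.
# Participating search engines also expose their own /indexnow endpoints.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_ping_url(changed_url: str, key: str) -> str:
    """Build the GET URL that reports a single added, updated, or deleted page."""
    query = urlencode({"url": changed_url, "key": key})
    return f"{INDEXNOW_ENDPOINT}?{query}"


def build_batch_payload(host: str, key: str, urls: List[str]) -> str:
    """Build the JSON body used to report several URLs in one POST request."""
    return json.dumps({"host": host, "key": key, "urlList": urls})


if __name__ == "__main__":
    # Placeholder values: the key is one the site owner generates and hosts
    # on their own domain so search engines can verify ownership.
    print(build_ping_url("https://www.example.com/new-post", "hypothetical-key"))
    print(
        build_batch_payload(
            "www.example.com",
            "hypothetical-key",
            ["https://www.example.com/a", "https://www.example.com/b"],
        )
    )
```

Site owners who use Crawler Hints never need to construct these requests themselves. The sketch is only meant to show how small the protocol is.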

What does this mean for search engine operators? With Crawler Hints you’ll receive a near real-time, pushed feed of change events of Cloudflare websites (that have opted in). This, in turn, will dramatically improve not just the quality of your results, but also the energy efficiency of running your bots.

Collaborating with industry leaders

Cloudflare is in a unique position: a sizable portion of the Internet is proxied behind us. As a result, we are able to see trends in the way bots access web resources. That visibility allows us to be proactive about signaling which crawls are required and which are not. We are excited to work with partners to make these insights useful to our customers. Search engines are key constituents in this equation. We are happy to collaborate and share this vision of a more efficient Internet with Microsoft Bing and Yandex. We have been testing our interaction via IndexNow with Bing and Yandex for months, with some early successes.

This is just the beginning. Crawler Hints is a continuous process that will require working with more and more partners to improve Internet efficiency more generally. While this may take time and participation from other key parts of the industry, we are open to collaborating with any interested participant who relies on crawling to power user experiences.

“The cache data from CDNs is a really valuable signal for content freshness. Cloudflare, as one of the top CDNs, is key in the adoption of IndexNow to become an industry-wide standard with a large portion of the internet actually using it. Cloudflare has built a really easy 1-click button for their users to start using it right away. Cloudflare’s mission of helping build a better Internet resonates well with why I started IndexNow i.e. to build a more efficient and effective Search.”- Fabrice Canel, Principal Program Manager

“Yandex is excited to join IndexNow as part of our long-term focus on sustainability. We have been working with the Cloudflare team in early testing to incorporate their caching signals in our crawling mechanism via the IndexNow API. The results are great so far.”- Maxim Zagrebin, Head of Yandex Search

"DuckDuckGo is supportive of anything that makes search more environmentally friendly and better for end users without harming privacy. We're looking forward to working with Cloudflare on this proposal."- Gabriel Weinberg, CEO and Founder

How do Cloudflare customers benefit?

Crawler Hints doesn’t just benefit search engines. For our customers and origin owners, Crawler Hints will ensure that search engines and other bot-powered experiences always have the freshest version of your content, translating into happier users and ultimately influencing search rankings. Crawler Hints will also mean less traffic hitting your origin, reducing resource consumption. Your site performance will improve as well: your human customers will not be competing with bots!

And for Internet users? When you interact with bot-fed experiences — which we all do every day, whether we realize it or not, like search engines or pricing tools — these will now deliver more useful results from crawled data, because Cloudflare has signaled to the owners of the bots the moment they need to update their results.

How can I enable Crawler Hints for my website?

Crawler Hints is free to use for all Cloudflare customers and promises to revolutionize web efficiency. If you’d like to see how Crawler Hints can benefit how your website is indexed by the world’s biggest search engines, please feel free to opt in to the service:

Sign in to your Cloudflare Account.
In the dashboard, navigate to the Cache tab.
Click on the Configuration section.
Locate the Crawler Hints sign-up card and enable it. It's that easy.

How to enable Crawler Hints with one click

Once you’ve enabled it, we will begin sending hints to search engines about when they should crawl particular parts of your website. Crawler Hints holds tremendous promise to improve the efficiency of the Internet.

What’s next?

We’re thrilled to collaborate with industry leaders Microsoft Bing and Yandex to bring IndexNow to Crawler Hints, and to bring Crawler Hints to a wide audience in general availability. We look forward to working with additional companies that run crawlers to help make this process more efficient for the whole Internet.

