How to keep robots out of your web site

You know that search engines are designed to help people find information quickly on the Internet, and search engines acquire much of their information through robots (also referred to as spiders or crawlers), which scan websites for them.

These spiders or crawlers explore the web, looking for and recording every kind of information. They typically start from URLs submitted by users, from links they find on websites, from sitemap files, or from the top level of a site.

Once the robot accesses the home page, it recursively accesses all the pages linked from that page. A robot can also explore every page it can find on a particular server.

When the robot finds a web page, it works on indexing the title, the keywords, the text, and so on. Sometimes, though, you may want to prevent search engines from indexing some of your pages, such as news postings or specially marked web pages (for example, affiliate pages). Keep in mind that whether individual robots follow these conventions is purely voluntary.

ROBOTS EXCLUSION PROTOCOL

So if you want robots to keep out of some of your web pages, you can ask them to ignore the pages you don't want indexed. To do that, place a robots.txt file in the root directory of your web server.

For example, if you have a directory called e-books and you want to ask robots to keep out of it, your robots.txt file should read:

User-agent: *
Disallow: /e-books/
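A robots.txt file can also carry separate rules for different robots. The sketch below is illustrative only; Googlebot is a real crawler name, but the directory paths are hypothetical:

# hypothetical paths, for illustration only
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /e-books/
Disallow: /private/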

If you don't have enough control over your server to set up a robots.txt file, you can try adding a META tag to the head section of any HTML document.

For example, a tag like the following tells robots not to index the page and not to follow the links on it:

<meta name="robots" content="noindex, nofollow">
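To show where the tag belongs, here is a minimal sketch of a page; the title and body content are invented placeholders:

<!DOCTYPE html>
<html>
<head>
  <title>Hypothetical affiliates page</title>
  <!-- ask robots not to index this page or follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
<body>
  ...
</body>
</html>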

Support for the META tag among robots is not as widespread as support for the Robots Exclusion Protocol, but most of the major web indexes currently support it.

NEWS POSTINGS

If you want to keep the search engines out of your news postings, you can add an "X-No-Archive" line to your postings' headers:

X-No-Archive: Yes
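In a full set of news headers, the line sits alongside the usual fields. This is a minimal sketch with an invented sender, group, and subject:

From: someone@example.com
Newsgroups: misc.test
Subject: A posting you want kept out of archives
X-No-Archive: Yes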

But although common news clients let you add an X-No-Archive line to the headers of your news postings, some of them don't allow you to do so.

The problem is that most search engines assume that all the information they find is public unless it is marked otherwise.

So be careful: although the robot and archive exclusion standards may help keep your material out of the major search engines, there are others that respect no such rules.

If you're really concerned about the privacy of your e-mail and Usenet postings, you should use anonymous remailers and PGP. You can read about them here:

http://www.well.com/user/abacard/remail.html
http://www.io.com/~combs/htmls/crypto.html
http://world.std.com/~franl/pgp/

Even if you're not particularly concerned about privacy, remember that anything you write may be indexed and archived somewhere for eternity, so use the robots.txt file as much as you need to.
