Setup & config options
Apache 2.4+ LAMP server
Most of the articles, descriptions and instructions written here are applicable to the most common Debian-based Linux derivatives. Depending on the respective operating system, there may be minor or major discrepancies.
This website is for educational purposes only. Please do not deploy any of it in production environments.
No warranty or compensation is given for loss of data or hardware.
It should also be mentioned that this modest web server is hosted on a Raspberry Pi 4 Model B at home.
Raspberry Pi is a series of small single-board computers (SBCs) developed in the United Kingdom by the Raspberry Pi Foundation in association with Broadcom. The mini-computer with its ARM processor has quickly become a favourite of hobbyists, and projects can be started with any suitable Linux distribution. Even an aged RasPi, e.g. a Model B+ or 2B, can handle simple tasks quite well.
Place the »robots.txt« file in the top-level directory (document root) of your Apache v2.4+ web server, e.g.
/var/www/html/
Use all lower case for the filename: »robots.txt« - never Robots.TXT or anything similar.
»robots.txt« tells search engine spiders (bots) how to behave when indexing your web content.
If you have no »robots.txt« file in place, the web server log will record a 404 error whenever a bot tries to access it. Upload a blank text file named »robots.txt« if you wish to stop getting 404 error messages.
Some search engines allow you to specify the address of an XML sitemap, but if your site is small you do not need to create one.
Please also note that if your »robots.txt« contains errors and spiders cannot recognize the directives, they will simply continue crawling through your domain structure.
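Before uploading, you can check locally how crawlers will interpret your rules. A minimal sketch using the robots.txt parser from Python's standard library (the rule set and URLs are made-up examples):

```python
from urllib.robotparser import RobotFileParser

# A minimal example rule set, parsed line by line.
rules = """\
User-agent: *
Disallow: /cgi-bin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Ordinary pages stay crawlable; the excluded folder does not.
print(rp.can_fetch("Googlebot", "https://example.com/index.php"))    # True
print(rp.can_fetch("Googlebot", "https://example.com/cgi-bin/test")) # False
```

The same check against a live site works with rp.set_url(...) plus rp.read().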
Two subdomains lead to /images/ and /contents/. Alter your main .htaccess file in the document root, and in future the system will redirect requests from both folders /images/ and /contents/ to the root.
Redirect 301 /contents/robots.txt http://example.com/robots.txt
Redirect 301 /images/robots.txt http://example.com/robots.txt
10-Jan 2021
Updated 08-Feb 2021
The longer edition, validated with the »Bing Webmaster Tools«:
# robots.txt
# https://yourdomain.tld
# 27-Jul 2022

# Deny bots that do not benefit
User-agent: AdsBot-Google
User-agent: AdsBot-Google-Mobile
User-agent: APIs-Google
User-agent: Amazonbot
User-agent: Baiduspider
User-agent: Barkrowler
User-agent: Bytespider
User-agent: DomainStatsBot
User-agent: facebookexternalhit
User-agent: Facebot
User-agent: linkdexbot
User-agent: MJ12bot
User-agent: PetalBot
User-agent: SBSearch
User-agent: Silk
User-agent: Sogou
User-agent: YandexBot
User-agent: YandexImages
Disallow: /

# Global rules
User-agent: Applebot
User-agent: Bingbot
User-agent: BingPreview
User-agent: coccocbot-image
User-agent: coccocbot-web
User-agent: DuckDuckBot
User-agent: DuckDuckBot-Https
User-agent: DuckDuckGo-Favicons-Bot
User-agent: Ecosia-Bot
User-agent: Exabot
User-agent: Google-Read-Aloud
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-News
User-agent: Googlebot-Video
User-agent: ia_archiver
User-agent: IonCrawl
User-agent: Mediapartners-Google
User-agent: MojeekBot
User-agent: Msnbot
User-agent: Neevabot
User-agent: Quantify
User-agent: SeekportBot
User-agent: SeznamBot
User-agent: Slurp
User-agent: Storebot-Google
User-agent: Twitterbot
Disallow: /cgi-bin/
Disallow: /directory/
Disallow: /directory-2/
Disallow: /page.html/
Disallow: /page-2.php/
Disallow: /pages/page-3.asp

# Specific rules
User-agent: WellKnownBot
Allow: /.well-known/security.txt
Allow: /ads.txt
Allow: /humans.txt
Allow: /robots.txt
Disallow: /

# Sitemaps
Sitemap: https://dosboot.org/sitemap.xml
Sitemap: https://dosboot.org/urllist.txt
Sitemap: https://dosboot.org/sitemap-images.xml
Sitemap: https://dosboot.org/urllist-images.txt
Sitemap: https://dosboot.org/sitemap-videos.xml
Sitemap: https://dosboot.org/urllist-videos.txt

# End Of File
The shorter edition, validated with the »Bing Webmaster Tools«:
# robots.txt
# https://yourdomain.tld
# 27-Jul 2022

# Deny bots that do not benefit
User-agent: AdsBot-Google
User-agent: AdsBot-Google-Mobile
User-agent: APIs-Google
User-agent: Amazonbot
User-agent: Baiduspider
User-agent: Barkrowler
User-agent: Bytespider
User-agent: DomainStatsBot
User-agent: facebookexternalhit
User-agent: Facebot
User-agent: linkdexbot
User-agent: MJ12bot
User-agent: PetalBot
User-agent: SBSearch
User-agent: Silk
User-agent: Sogou
User-agent: YandexBot
User-agent: YandexImages
Disallow: /

# Global rules
User-agent: *
Disallow: /cgi-bin/
Disallow: /directory/
Disallow: /directory-2/
Disallow: /page.html/
Disallow: /page-2.php/
Disallow: /pages/page-3.asp

# Specific rules
User-agent: WellKnownBot
Allow: /.well-known/security.txt
Allow: /ads.txt
Allow: /humans.txt
Allow: /robots.txt
Disallow: /

# Sitemaps
Sitemap: https://dosboot.org/sitemap.xml
Sitemap: https://dosboot.org/urllist.txt
Sitemap: https://dosboot.org/sitemap-images.xml
Sitemap: https://dosboot.org/urllist-images.txt
Sitemap: https://dosboot.org/sitemap-videos.xml
Sitemap: https://dosboot.org/urllist-videos.txt

# End Of File
This is no secret: https://dosboot.org/robots.txt
21-Jan 2021
<!-- sitemap.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://www.yourdomain.tld/index.php</loc>
    <lastmod>2017-10-13T00:27:25+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.00</priority>
  </url>
  <url>
    <loc>http://www.yourdomain.tld/contents/admin.html</loc>
    <lastmod>2017-10-13T00:27:25+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.50</priority>
  </url>
</urlset>
<!-- sitemap-images.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<!-- just one image in your page -->
  <url>
    <loc>http://www.yourdomain.tld/contents/admin.html</loc>
    <image:image>
      <image:loc>http://www.yourdomain.tld/images/loony.jpg</image:loc>
    </image:image>
  </url>
<!-- more than just one image in your page -->
  <url>
    <loc>http://www.yourdomain.tld/index.php</loc>
    <image:image>
      <image:loc>http://www.yourdomain.tld/images/fiddler.png</image:loc>
    </image:image>
    <image:image>
      <image:loc>http://www.yourdomain.tld/images/backpiper.gif</image:loc>
    </image:image>
  </url>
</urlset>
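Since spiders silently skip a sitemap they cannot parse, it pays to verify well-formedness before uploading. A small sketch with Python's standard library (the URL is a placeholder taken from the example above):

```python
import xml.etree.ElementTree as ET

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.yourdomain.tld/index.php</loc>
    <lastmod>2017-10-13T00:27:25+00:00</lastmod>
  </url>
</urlset>"""

# fromstring() raises ParseError on malformed XML - a cheap sanity check.
root = ET.fromstring(sitemap)

# Collect all <loc> entries, honouring the sitemap namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
locs = [url.findtext("sm:loc", namespaces=ns) for url in root.findall("sm:url", ns)]
print(locs)  # ['http://www.yourdomain.tld/index.php']
```

For a file on disk, ET.parse("sitemap.xml") does the same job.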
Assuming the sitemaps are located in http://yourdomain.tld/,
ping the URLs below to get the updated *.xml files fetched.
Bing (MSN) / Yahoo! Slurp
https://www.bing.com/webmaster/ping.aspx?sitemap=https://yourdomain.tld/sitemap.xml
18-Sep 2018 : Anonymous URL Submission Tool Being Retired
Saying goodbye is never easy, but the time has come to announce the withdrawal of anonymous, non-signed-in support for Bing's URL submission tool. Webmasters will still be able to log in and access the Submit URL tool in Bing Webmaster Tools ... Bing blogs
Strangely, the submission of simple text-based feeds (urllist.txt) to Bing is still accepted.
Google Inc.
https://www.google.com/webmasters/sitemaps/ping?sitemap=https://yourdomain.tld/sitemap.xml
https://www.google.com/webmasters/sitemaps/ping?sitemap=https://yourdomain.tld/sitemap-images.xml
You should never overdo it with anonymous submissions.
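The sitemap address should be URL-encoded when it is appended as a query parameter. A short sketch of building such a ping URL with Python's standard library (domain is a placeholder):

```python
from urllib.parse import urlencode

sitemap_url = "https://yourdomain.tld/sitemap.xml"

# urlencode() percent-encodes the value, so ':' and '/' cannot
# be misread as part of the ping URL itself.
ping = ("https://www.google.com/webmasters/sitemaps/ping?"
        + urlencode({"sitemap": sitemap_url}))
print(ping)
# https://www.google.com/webmasters/sitemaps/ping?sitemap=https%3A%2F%2Fyourdomain.tld%2Fsitemap.xml
```

The resulting URL can then be fetched with curl or wget to trigger the ping.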
09-Oct 2017
Updated 04-Jul 2021
Here are two examples to avoid 404 document error messages.
Both files are targeted at all non-direct / non-reseller setups.
Put the files in the document root of your web server and the bots are satisfied.
»ads.txt«
# ads.txt https://example.com/
example.com, *, DIRECT, *
example.com, *, RESELLER, *
# no advertisements to find here
»sellers.json«
{
  "contact_email": "",
  "contact_address": "",
  "version": "1.0",
  "identifiers": [
    { "name": "", "value": "" }
  ],
  "sellers": [
    {
      "seller_id": "",
      "seller_type": "",
      "name": "",
      "domain": "example.com",
      "comment": "nothing to sell/resell here"
    }
  ]
}
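Bots fetching »sellers.json« expect valid JSON, so a quick syntax check before uploading saves log noise. A sketch with Python's json module (the stub mirrors the example above):

```python
import json

# A stub sellers.json; json.loads() raises json.JSONDecodeError
# on any syntax error, e.g. a trailing comma.
raw = """{
  "version": "1.0",
  "sellers": [
    { "seller_id": "", "seller_type": "", "name": "",
      "domain": "example.com", "comment": "nothing to sell/resell here" }
  ]
}"""

doc = json.loads(raw)
print(doc["version"], len(doc["sellers"]))  # 1.0 1
```

For the real file, json.load(open("sellers.json")) performs the same check.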
23-May 2021
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
  <title>Main header</title>
  <link>http://example.com</link>
  <description>Use a short header description</description>
  <language>en-gb</language>
  <item>
    <title>First article</title>
    <description>Use a short description</description>
    <!-- Link to article -->
    <link>http://example.com/one.html</link>
    <!-- Date published -->
    <pubDate>Mon, 30 Apr 2018 00:00:00 +0200</pubDate>
  </item>
  <item>
    <title>Second article</title>
    <description>Use a short description</description>
    <!-- Link to article -->
    <link>http://example.com/second.html</link>
    <!-- Date published -->
    <pubDate>Mon, 30 Apr 2018 00:00:00 +0200</pubDate>
  </item>
</channel>
</rss>
The <pubDate>
element is not required, but be aware that its format is strict.
The +0200 indicates a time 2 hours ahead of GMT.
If even one character is missing - e.g. a closing tag - you get a white screen in your web browser.
The same happens if you insert characters other than standard letters and numerals, such as an unescaped ampersand (&).
20-Jul 2018