Setup & config options
Apache 2.4+ LAMP server
Most of the articles, descriptions and instructions written here are applicable to the most common Debian-based Linux derivatives. Depending on the respective operating system, there may be minor or major discrepancies.
This website is for educational purposes only. Please do not deploy any of it in production environments.
No warranty or compensation is given for loss of data or hardware.
It should also be mentioned that this modest web server is hosted on a Raspberry Pi 4 Model B at home.
Raspberry Pi is a series of small single-board computers (SBCs) developed in the United Kingdom by the Raspberry Pi Foundation in association with Broadcom. The mini-computer with its ARM processor has quickly become a favourite of hobbyists, and projects can be started with any suitable Linux distribution. Even an aged RasPi, e.g. a Model B+ or 2B, can handle simple tasks quite well.
Place the »robots.txt« file in the top-level directory (document root) of your Apache v2.4+ web server, e.g.
/var/www/html/
Use all lower case for the filename: »robots.txt« - never Robots.TXT or anything similar.
»robots.txt« tells search engine spiders (bots) how to behave when indexing your web content.
If you have no »robots.txt« file in place, the web server log will record a 404 error whenever a bot tries to access it. Upload a blank text file named »robots.txt« if you wish to stop getting 404 error messages.
Some search engines allow you to specify the address of an XML sitemap, but if your site is small you do not need to create one.
Please also note that if your »robots.txt« contains errors and spiders cannot recognize the directives, they will simply continue crawling through your domain structure.
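Before uploading, you can check locally how crawlers will interpret your rules. A minimal sketch using the robots.txt parser from Python's standard library (the rule set and URLs are made-up examples):

```python
from urllib.robotparser import RobotFileParser

# A minimal example rule set, parsed line by line.
rules = """\
User-agent: *
Disallow: /cgi-bin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Ordinary pages stay crawlable; the excluded folder does not.
print(rp.can_fetch("Googlebot", "https://example.com/index.php"))    # True
print(rp.can_fetch("Googlebot", "https://example.com/cgi-bin/test")) # False
```

The same check against a live site works with rp.set_url(...) plus rp.read().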
Two subdomains lead to /images/ and /contents/. Alter your main .htaccess file in the document root, and in future the system will redirect requests from both folders /images/ and /contents/ to the root.
Redirect 301 /contents/robots.txt http://example.com/robots.txt
Redirect 301 /images/robots.txt http://example.com/robots.txt
10-Jan 2021
Updated 08-Feb 2021
The longer edition, validated with the »Bing Webmaster Tools«:
# robots.txt
# https://yourdomain.tld
# 27-Jul 2022

# Deny bots that do not benefit
User-agent: AdsBot-Google
User-agent: AdsBot-Google-Mobile
User-agent: APIs-Google
User-agent: Amazonbot
User-agent: Baiduspider
User-agent: Barkrowler
User-agent: Bytespider
User-agent: DomainStatsBot
User-agent: facebookexternalhit
User-agent: Facebot
User-agent: linkdexbot
User-agent: MJ12bot
User-agent: PetalBot
User-agent: SBSearch
User-agent: Silk
User-agent: Sogou
User-agent: YandexBot
User-agent: YandexImages
Disallow: /

# Global rules
User-agent: Applebot
User-agent: Bingbot
User-agent: BingPreview
User-agent: coccocbot-image
User-agent: coccocbot-web
User-agent: DuckDuckBot
User-agent: DuckDuckBot-Https
User-agent: DuckDuckGo-Favicons-Bot
User-agent: Ecosia-Bot
User-agent: Exabot
User-agent: Google-Read-Aloud
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-News
User-agent: Googlebot-Video
User-agent: ia_archiver
User-agent: IonCrawl
User-agent: Mediapartners-Google
User-agent: MojeekBot
User-agent: Msnbot
User-agent: Neevabot
User-agent: Quantify
User-agent: SeekportBot
User-agent: SeznamBot
User-agent: Slurp
User-agent: Storebot-Google
User-agent: Twitterbot
Disallow: /cgi-bin/
Disallow: /directory/
Disallow: /directory-2/
Disallow: /page.html/
Disallow: /page-2.php/
Disallow: /pages/page-3.asp

# Specific rules
User-agent: WellKnownBot
Allow: /.well-known/security.txt
Allow: /ads.txt
Allow: /humans.txt
Allow: /robots.txt
Disallow: /

# Sitemaps
Sitemap: https://dosboot.org/sitemap.xml
Sitemap: https://dosboot.org/urllist.txt
Sitemap: https://dosboot.org/sitemap-images.xml
Sitemap: https://dosboot.org/urllist-images.txt
Sitemap: https://dosboot.org/sitemap-videos.xml
Sitemap: https://dosboot.org/urllist-videos.txt

# End Of File
The shorter edition, validated with the »Bing Webmaster Tools«:
# robots.txt
# https://yourdomain.tld
# 27-Jul 2022

# Deny bots that do not benefit
User-agent: AdsBot-Google
User-agent: AdsBot-Google-Mobile
User-agent: APIs-Google
User-agent: Amazonbot
User-agent: Baiduspider
User-agent: Barkrowler
User-agent: Bytespider
User-agent: DomainStatsBot
User-agent: facebookexternalhit
User-agent: Facebot
User-agent: linkdexbot
User-agent: MJ12bot
User-agent: PetalBot
User-agent: SBSearch
User-agent: Silk
User-agent: Sogou
User-agent: YandexBot
User-agent: YandexImages
Disallow: /

# Global rules
User-agent: *
Disallow: /cgi-bin/
Disallow: /directory/
Disallow: /directory-2/
Disallow: /page.html/
Disallow: /page-2.php/
Disallow: /pages/page-3.asp

# Specific rules
User-agent: WellKnownBot
Allow: /.well-known/security.txt
Allow: /ads.txt
Allow: /humans.txt
Allow: /robots.txt
Disallow: /

# Sitemaps
Sitemap: https://dosboot.org/sitemap.xml
Sitemap: https://dosboot.org/urllist.txt
Sitemap: https://dosboot.org/sitemap-images.xml
Sitemap: https://dosboot.org/urllist-images.txt
Sitemap: https://dosboot.org/sitemap-videos.xml
Sitemap: https://dosboot.org/urllist-videos.txt

# End Of File
This is no secret: https://dosboot.org/robots.txt
21-Jan 2021
<!-- sitemap.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://www.yourdomain.tld/index.php</loc>
    <lastmod>2017-10-13T00:27:25+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.00</priority>
  </url>
  <url>
    <loc>http://www.yourdomain.tld/contents/admin.html</loc>
    <lastmod>2017-10-13T00:27:25+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.50</priority>
  </url>
</urlset>
<!-- sitemap-images.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<!-- just one image in your page -->
  <url>
    <loc>http://www.yourdomain.tld/contents/admin.html</loc>
    <image:image>
      <image:loc>http://www.yourdomain.tld/images/loony.jpg</image:loc>
    </image:image>
  </url>
<!-- more than just one image in your page -->
  <url>
    <loc>http://www.yourdomain.tld/index.php</loc>
    <image:image>
      <image:loc>http://www.yourdomain.tld/images/fiddler.png</image:loc>
    </image:image>
    <image:image>
      <image:loc>http://www.yourdomain.tld/images/backpiper.gif</image:loc>
    </image:image>
  </url>
</urlset>
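Since spiders silently skip a sitemap they cannot parse, it pays to verify well-formedness before uploading. A small sketch with Python's standard library (the URL is a placeholder taken from the example above):

```python
import xml.etree.ElementTree as ET

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.yourdomain.tld/index.php</loc>
    <lastmod>2017-10-13T00:27:25+00:00</lastmod>
  </url>
</urlset>"""

# fromstring() raises ParseError on malformed XML - a cheap sanity check.
root = ET.fromstring(sitemap)

# Collect all <loc> entries, honouring the sitemap namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
locs = [url.findtext("sm:loc", namespaces=ns) for url in root.findall("sm:url", ns)]
print(locs)  # ['http://www.yourdomain.tld/index.php']
```

For a file on disk, ET.parse("sitemap.xml") does the same job.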
Assuming the sitemaps are located in http://yourdomain.tld/,
ping the URLs below to get the updated *.xml files fetched.
Bing (MSN) / Yahoo! Slurp
https://www.bing.com/webmaster/ping.aspx?sitemap=https://yourdomain.tld/sitemap.xml
18-Sep 2018 : Anonymous URL Submission Tool Being Retired
Saying goodbye is never easy, but the time has come to announce the withdrawal of anonymous, non-signed-in support for Bing's URL submission tool. Webmasters will still be able to log in and access the Submit URL tool in Bing Webmaster Tools ... Bing blogs
Strangely, the submission of simple text-based feeds (urllist.txt) to Bing is still accepted.
Google Inc.
https://www.google.com/webmasters/sitemaps/ping?sitemap=https://yourdomain.tld/sitemap.xml
https://www.google.com/webmasters/sitemaps/ping?sitemap=https://yourdomain.tld/sitemap-images.xml
You should never overdo it with anonymous submissions.
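The sitemap address should be URL-encoded when it is appended as a query parameter. A short sketch of building such a ping URL with Python's standard library (domain is a placeholder):

```python
from urllib.parse import urlencode

sitemap_url = "https://yourdomain.tld/sitemap.xml"

# urlencode() percent-encodes the value, so ':' and '/' cannot
# be misread as part of the ping URL itself.
ping = ("https://www.google.com/webmasters/sitemaps/ping?"
        + urlencode({"sitemap": sitemap_url}))
print(ping)
# https://www.google.com/webmasters/sitemaps/ping?sitemap=https%3A%2F%2Fyourdomain.tld%2Fsitemap.xml
```

The resulting URL can then be fetched with curl or wget to trigger the ping.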
09-Oct 2017
Updated 04-Jul 2021
Here are two examples to avoid 404 document error messages.
Both files are targeted at all non-direct / non-reseller setups.
Put the files in the document root of your web server and the bots are satisfied.
»ads.txt«
# ads.txt https://example.com/
example.com, *, DIRECT, *
example.com, *, RESELLER, *
# no advertisements to find here
»sellers.json«
{
  "contact_email": "",
  "contact_address": "",
  "version": "1.0",
  "identifiers": [
    { "name": "", "value": "" }
  ],
  "sellers": [
    {
      "seller_id": "",
      "seller_type": "",
      "name": "",
      "domain": "example.com",
      "comment": "nothing to sell/resell here"
    }
  ]
}
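Bots fetching »sellers.json« expect valid JSON, so a quick syntax check before uploading saves log noise. A sketch with Python's json module (the stub mirrors the example above):

```python
import json

# A stub sellers.json; json.loads() raises json.JSONDecodeError
# on any syntax error, e.g. a trailing comma.
raw = """{
  "version": "1.0",
  "sellers": [
    { "seller_id": "", "seller_type": "", "name": "",
      "domain": "example.com", "comment": "nothing to sell/resell here" }
  ]
}"""

doc = json.loads(raw)
print(doc["version"], len(doc["sellers"]))  # 1.0 1
```

For the real file, json.load(open("sellers.json")) performs the same check.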
23-May 2021
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
  <title>Main header</title>
  <link>http://example.com</link>
  <description>Use a short header description</description>
  <language>en-gb</language>
  <item>
    <title>First article</title>
    <description>Use a short description</description>
    <!-- Link to article -->
    <link>http://example.com/one.html</link>
    <!-- Date published -->
    <pubDate>Mon, 30 Apr 2018 00:00:00 +0200</pubDate>
  </item>
  <item>
    <title>Second article</title>
    <description>Use a short description</description>
    <!-- Link to article -->
    <link>http://example.com/second.html</link>
    <!-- Date published -->
    <pubDate>Mon, 30 Apr 2018 00:00:00 +0200</pubDate>
  </item>
</channel>
</rss>
The <pubDate>
element is not required, but be aware that its format is strict.
The +0200 indicates a time 2 hours ahead of GMT.
If even one character is missing - e.g. a closing tag - you get a white screen in your web browser.
The same happens if you insert characters other than standard letters and numerals, such as an unescaped ampersand (&).
20-Jul 2018