Apache: Rewrite to IF - Printable Version
Apache: Rewrite to IF - stryder - Sep 7, 2016

Currently I'm looking at a slight change in a configuration file for the Apache webserver. I'm doing this on a development system before applying it to this actual site. One of the main reasons for this is the sparse examples and documentation on how to utilise the <If>, <ElseIf> and <Else> additions to Apache. This means I'm having to experiment with the settings to see what works and what causes 500 Server Errors. As I go, I thought I would post what I've worked out and how things can slowly be shifted away from mod_rewrite, as it might be helpful to other webmasters. (I apologise to all that think this is "gobbledygook".)

A simple example: redirect example.com to www.example.com if the file or directory exists.

Code:
<If "(-f %{REQUEST_FILENAME} || -d %{REQUEST_FILENAME}) && (%{HTTP_HOST} == 'example.com')">

This might not seem like much, however previously my lax rewrites were causing redirects even when a file or directory didn't exist, which isn't really necessary (in fact it increases the amount of resources used by the server, albeit marginally). With the above If statement, the 404 ErrorDocument will handle any requests that don't match the criteria (unless otherwise specified in other If statements).

If you intend to use it with HTTPS, then it makes sense to follow it up with:

Code:
#ElseIf: the attempted request uses HTTP (and exists), redirect to HTTPS

Just make sure that both redirects begin https:// (the first example doesn't have an s in it). It just tests whether the page was accessed using HTTP and whether the file being requested exists. (It will default to a 404 on failure, which again avoids redirects.) A completed sketch of this If/ElseIf pair appears further down. This method should be compatible with sites using Cloudflare that don't even have SSL certificates on the server themselves.

Restricting request methods to only GET, POST and HEAD:

Code:
<If "(%{REQUEST_METHOD} != 'GET') && (%{REQUEST_METHOD} != 'POST') && (%{REQUEST_METHOD} != 'HEAD')">

(Note the tests have to be joined with &&, not ||; since any method differs from at least two of the three, an || version would match every request.) This took me a little while to work out, since I was trying to do it in a more concise manner by checking the request method against an array (to shorten the conditional). Other request methods can be used by some servers/services, however I try to tighten it where possible.

An addition that isn't an If statement might be capable of taking it further:

Code:
<Limit PUT DELETE CONNECT OPTIONS PATCH PROPFIND PROPPATCH MKCOL COPY MOVE LOCK UNLOCK>

I'll try to add some more as I work out the correct conditionals.

edit:

Code:
<LimitExcept GET POST HEAD>

RE: Apache: Rewrite to IF - stryder - Sep 8, 2016

Don't allow POSTing with less than HTTP/1.1:

Code:
<If "(%{REQUEST_METHOD} == 'POST') && (! %{THE_REQUEST} =~ m#^POST(.*)HTTP/1.1$#)">

This example makes use of regex (regular expressions) to test the raw request line.

RE: Apache: Rewrite to IF - stryder - Sep 11, 2016

It seems I wasn't the only one looking at SSL inclusion for their site, so I decided to write a rather lengthy response to a question about it, thanks to what I've done here: https://stackoverflow.com/a/39433261/4136214

In the article (as that's what it became) I touched on how the HSTS protocol requires elevation to follow a particular pattern: http://example.com needs to be elevated through a 301 redirect to https://example.com before being 301-redirected yet again to https://www.example.com. It unfortunately negates the version I was using to reduce how many redirections occur.
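To make the first pair of examples above concrete, here is a minimal completed sketch. The posts only show the opening conditions, so the Redirect lines and the ElseIf expression are assumptions (mod_alias, with www.example.com standing in for your own host):

Code:
<If "(-f %{REQUEST_FILENAME} || -d %{REQUEST_FILENAME}) && (%{HTTP_HOST} == 'example.com')">
    # Bare domain and the target exists: send the client to the www host
    Redirect permanent "/" "https://www.example.com/"
</If>
<ElseIf "%{HTTPS} == 'off' && (-f %{REQUEST_FILENAME} || -d %{REQUEST_FILENAME})">
    # Right host but plain HTTP, and the target exists: elevate to HTTPS
    Redirect permanent "/" "https://www.example.com/"
</ElseIf>

Anything that matches neither branch falls through to the normal handler (and the 404 ErrorDocument on a miss), which is the point of the technique.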
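For the method restriction, the post doesn't show the body of the block; a sketch assuming the request is simply denied with a 403 (the <LimitExcept> form from the edit enforces the same policy without an expression):

Code:
<If "(%{REQUEST_METHOD} != 'GET') && (%{REQUEST_METHOD} != 'POST') && (%{REQUEST_METHOD} != 'HEAD')">
    # Any method other than GET, POST or HEAD is refused
    Require all denied
</If>

# Equivalent, per the edit above:
<LimitExcept GET POST HEAD>
    Require all denied
</LimitExcept>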
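The HTTP/1.1 POST check can be written slightly more simply with the expression parser's !~ (does not match) operator; again the Require all denied body is an assumption, as the post doesn't show what happens on a match:

Code:
<If "%{REQUEST_METHOD} == 'POST' && %{THE_REQUEST} !~ m#^POST(.*)HTTP/1\.1$#">
    # A POST that didn't arrive as HTTP/1.1: refuse it
    Require all denied
</If>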
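And the HSTS-compliant chain from the Sep 11 post, sketched as If/ElseIf branches with the usual example.com placeholders. Each hop is a separate 301: first HTTP is elevated to HTTPS on the same host, then the bare HTTPS domain is moved to www:

Code:
<If "%{HTTPS} == 'off' && %{HTTP_HOST} == 'example.com'">
    # Step 1a: elevate the bare domain to HTTPS on the same host
    Redirect permanent "/" "https://example.com/"
</If>
<ElseIf "%{HTTPS} == 'off' && %{HTTP_HOST} == 'www.example.com'">
    # Step 1b: elevate the www host to HTTPS on the same host
    Redirect permanent "/" "https://www.example.com/"
</ElseIf>
<ElseIf "%{HTTP_HOST} == 'example.com'">
    # Step 2: already on HTTPS, move the bare domain to www
    Redirect permanent "/" "https://www.example.com/"
</ElseIf>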
RE: Apache: Rewrite to IF - stryder - Nov 18, 2017

Traffic Calming - Slowing Down Robots

As you might or might not be aware, robots (spiders, crawlers, agents) can be a regular pain in the *insert expletive here*. While some attempt to take a site's robots.txt into consideration to identify where to crawl and how often, others can be downright abusive and eat up as much bandwidth as they can. This can lead to site instability where the server literally can't handle all the requests.

Most webmaster-related articles and examples tend to use the robots.txt for good bots and block the requests of "bad" bots. I, however, have been looking at a slightly different approach. While indeed there are bad bots (ones that attempt to exploit, consume resources or scrape the site for data), others are actually more roguish than bad. Rogue bots, in that instance, are ones that just need better control methods applied to them. So I came up with a traffic calming method using SetEnvIfNoCase, RewriteCond and server environment strings (a sketch of the configuration appears at the end of this post).

First, a rewrite works out what time of day it is in relation to a traffic pattern (in this instance we're looking to cover the high-load hours in the day); during these times a server environment variable of "trafficcalm" is set. Then a rewrite checks for both "throttlebot" and "trafficcalm" together, that the request isn't for robots.txt (that should never be calmed), and that it falls within 10 and 50 seconds past the minute. If that is the case, we throttle the bot with a 503 Service Unavailable; otherwise a server environment variable of "normaltraffic" is set.

Additionally, having a Retry-After header with the value of 43 on the 503 response and adding a Crawl-delay with the value of "43" to robots.txt means that should a robot retry after 43 seconds, it will eventually cycle around to accessing during the "normaltraffic" window.

The reason to calm rather than block is that it still means a robot can spider the site, so it won't affect SEO as much. The robot can't be blocked for hours either, as that would cause problems, so an intermittent block on a per-minute basis has a calming effect. Most good robots (which aren't affected by this measure) tend to be able to retry crawling after a duration of time, provided the resource they are looking for isn't missing for hours.
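The code block for this post did not survive, so the following is a reconstructed sketch of the approach described above, assuming Apache 2.4 with mod_setenvif, mod_rewrite and mod_headers. The user-agent names, the 09:00-17:59 high-load window and the environment variable names are illustrative assumptions, not stryder's actual values:

Code:
RewriteEngine On

# Tag rogue (not bad) crawlers by user agent - these names are placeholders
SetEnvIfNoCase User-Agent "(ExampleBot|OtherCrawler)" throttlebot

# During the assumed high-load hours (09:00-17:59) flag the request
RewriteCond %{TIME_HOUR} ^(09|1[0-7])$
RewriteRule ^ - [E=trafficcalm:1]

# Tagged bot + calm window + not robots.txt + 10-50 seconds past the minute:
# throttle it with a 503
RewriteCond %{ENV:throttlebot} =1
RewriteCond %{ENV:trafficcalm} =1
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{TIME_SEC} -ge 10
RewriteCond %{TIME_SEC} -le 50
RewriteRule ^ - [R=503,L]

# Anything else from a tagged bot counts as normal traffic
RewriteCond %{ENV:throttlebot} =1
RewriteRule ^ - [E=normaltraffic:1]

# Ask the calmed bot to retry in 43 seconds (needs 2.4.10+ for the
# expression condition); the matching Crawl-delay: 43 line lives in robots.txt
Header always set Retry-After "43" "expr=%{REQUEST_STATUS} == 503"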