[4/5] ids-functions.pl: Use If-Modified-Since header to reduce file downloads.

Message ID 20220323040452.2609-4-stefan.schantl@ipfire.org
State Superseded
Series [1/5] ids-functions.pl: Drop downloader code for sourcefire based ruleset.

Commit Message

Stefan Schantl March 23, 2022, 4:04 a.m. UTC
When sending the "If-Modified-Since" header, the server is asked whether
a newer version of the file is available.

If that is the case, the file is sent and stored by the downloader
function. If the file has not been changed since the last download, the
server responds with status code "304" (Not Modified).

This tells us that the currently stored file is still up-to-date and we
can safely skip the download attempt for this provider.

Signed-off-by: Stefan Schantl <stefan.schantl@ipfire.org>
---
 config/cfgroot/ids-functions.pl | 38 ++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)
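
For illustration, a minimal standalone Perl sketch of the conditional download
flow described above (not the submitted patch itself; the URL, the file path
and the use of HTTP::Date::time2str are assumptions made for this example):

	use strict;
	use warnings;

	use File::stat;
	use HTTP::Date qw(time2str);
	use HTTP::Request;
	use LWP::UserAgent;

	my $url  = "https://rules.example.org/ruleset.tar.gz"; # hypothetical provider URL
	my $file = "/var/tmp/ruleset.tar.gz";                  # hypothetical local copy

	my $request = HTTP::Request->new(GET => $url);

	# Only send If-Modified-Since if a previously downloaded copy exists.
	my $stat = stat($file);
	$request->header('If-Modified-Since' => time2str($stat->mtime)) if ($stat);

	my $ua = LWP::UserAgent->new();
	my $response = $ua->request($request);

	if ($response->code == 304) {
		# Not Modified - the stored file is still up-to-date.
		print "Ruleset is up-to-date, no download required.\n";
	} elsif ($response->is_success) {
		# Store the new file and stamp it with the server's Last-Modified
		# time, so the next run can send a matching If-Modified-Since header.
		open(my $fh, '>', $file) or die "Could not write $file: $!";
		binmode($fh);
		print $fh $response->content;
		close($fh);

		my $last_modified = $response->headers->last_modified;
		utime(time(), $last_modified, $file) if ($last_modified);
	}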
  

Comments

Michael Tremer March 23, 2022, 9:34 a.m. UTC | #1
Hello,

I like this mechanism. It is quite powerful, since it will save a lot of unnecessary downloads and CPU cycles. However, I am not sure if all providers will support this, because some use crappy CDNs which do not care at all about conforming to any RFC.

> On 23 Mar 2022, at 04:04, Stefan Schantl <stefan.schantl@ipfire.org> wrote:
> 
> When using the "If-Modified-Since" header, the server can be requested
> if a modified version of the file can be served.
> 
> In case that is true, the file will be sent and stored by the downloader
> function. If the file has not been touched since the last time, the
> server will respond with the code "304" (Not modified).
> 
> This tells us, that the current stored file is the latest one (still up-to-date)
> and we safely can skip the download attempt for this provider.
> 
> Signed-off-by: Stefan Schantl <stefan.schantl@ipfire.org>
> ---
> config/cfgroot/ids-functions.pl | 38 ++++++++++++++++++++++++++++++---
> 1 file changed, 35 insertions(+), 3 deletions(-)
> 
> diff --git a/config/cfgroot/ids-functions.pl b/config/cfgroot/ids-functions.pl
> index dfbeb1a7d..d7df41dd1 100644
> --- a/config/cfgroot/ids-functions.pl
> +++ b/config/cfgroot/ids-functions.pl
> @@ -365,9 +365,25 @@ sub downloadruleset ($) {
> 		my $tmp = File::Temp->new( SUFFIX => ".tmp", DIR => "/var/tmp/", UNLINK => 0 );
> 		my $tmpfile = $tmp->filename();
> 
> +		# Genarate and assign file name and path to store the downloaded rules file.
> +		my $dl_rulesfile = &_get_dl_rulesfile($provider);
> +
> +		# Load perl module to deal with file atributes.
> +		use File::stat;

I don’t like modules in the middle of a script. This makes testing a new perl update more difficult. If I can load the script, I expect it to have all dependencies it needs. This does not work when modules are being loaded later on. I would say that we need to cover those things with a module test then.

I would prefer reliability over performance (I assume that this is your argument for loading it only when you need it).

> +
> +		# Get the mtime of the rulesfile if it exists.
> +		my $mtime = (stat($dl_rulesfile)->mtime) if (-f $dl_rulesfile);

I suggest making this a little bit simpler:

Create the request first, then stat the file (you don’t need to check if it exists first because stat will just tell you).

Then add the header if you have a timestamp; if not, then don’t add a useless header. This is a lot more straightforward :)
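
In code, that order could look roughly like this (a sketch only; it assumes "use File::stat;" and "use HTTP::Date qw(time2str);" at the top of the script, and reuses $url and $dl_rulesfile from the surrounding function):

	# Create the request first.
	my $request = HTTP::Request->new(GET => $url);

	# stat() simply returns undef if the file does not exist,
	# so no separate -f check is needed.
	my $stat = stat($dl_rulesfile);

	# Only add the header when we actually have a timestamp.
	if ($stat) {
		$request->header('If-Modified-Since' => time2str($stat->mtime));
	}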

> +
> +		# Convert the mtime into gmtime format.
> +		my $gmtime = gmtime($mtime || 0);
> +
> 		# Pass the requested url to the downloader.
> 		my $request = HTTP::Request->new(GET => $url);
> 
> +		# Add the If-Modified-Since header to the request, containing the omited and converted
> +		# mtime of the downloaded rules file, if one is present.
> +		$request->header( 'If-Modified-Since' => "$gmtime" );

Will this survive any redirects? I know that some providers redirect to AWS S3, and you will need to make sure that the header is still sent when you follow the redirect - assuming S3 supports this, which I think they would.
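
One way to make sure the header travels with every request LWP issues - including requests generated while following redirects - could be an LWP::UserAgent handler (a sketch; it assumes the request_prepare handler is invoked for redirected requests as well, and that $mtime holds the local file's timestamp):

	my $ua = LWP::UserAgent->new();

	# Attach the header to every outgoing request of this user agent.
	$ua->add_handler(request_prepare => sub {
		my ($request, $ua, $handler) = @_;

		$request->header('If-Modified-Since' => time2str($mtime)) if ($mtime);

		return;
	});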

> +
> 		my $dl_attempt = 1;
> 		my $response;
> 
> @@ -381,6 +397,14 @@ sub downloadruleset ($) {
> 				# Break loop.
> 				last;
> 
> +			# Check if the server responds with 304 (Not Modified).
> +			} elsif ($response->code == 304) {
> +				# Log to syslog.
> +				&_log_to_syslog("Ruleset is up-to-date, no update required.");
> +
> +				# Nothing to be done, the ruleset is up-to-date.
> +				return;
> +
> 			# Check if we ran out of download re-tries.
> 			} elsif ($dl_attempt eq $max_dl_attempts) {
> 				# Obtain error.
> @@ -406,6 +430,10 @@ sub downloadruleset ($) {
> 		# Get the remote size of the downloaded file.
> 		my $remote_filesize = $headers->content_length;
> 
> +		# Get the timestamp from header, when the file has been modified the
> +		# last time.
> +		my $last_modified = $headers->last_modified;
> +
> 		# Load perl stat module.
> 		use File::stat;

You are loading the module again.

> 
> @@ -428,9 +456,6 @@ sub downloadruleset ($) {
> 			return 1;
> 		}
> 
> -		# Genarate and assign file name and path to store the downloaded rules file.
> -		my $dl_rulesfile = &_get_dl_rulesfile($provider);
> -
> 		# Check if a file name could be obtained.
> 		unless ($dl_rulesfile) {
> 			# Log error message.
> @@ -449,6 +474,13 @@ sub downloadruleset ($) {
> 		# Overwrite the may existing rulefile or tarball with the downloaded one.
> 		move("$tmpfile", "$dl_rulesfile");
> 
> +		# Check if the server respond contained a last_modified value.
> +		if ($last_modified) {
> +			# Assign the last modified timestamp from server as mtime to the
> +			# rules file.
> +			utime(time(), "$last_modified", "$dl_rulesfile");
> +		}
> +
> 		# Delete temporary file.
> 		unlink("$tmpfile");
> 
> -- 
> 2.30.2
> 

-Michael
  
Stefan Schantl March 24, 2022, 6:50 p.m. UTC | #2
Hello Michael,

thanks for your thoughts and feedback.

> Hello,
> 
> I like this mechanism. It is quite powerful since it will save a lot
> of unnecessary downloads and CPU cycles. However, I am not sure if
> all providers will support this because some use crappy CDNs which do
> not give anything about conforming to any RFC.

Currently I'm only aware that content hosted on github.com cannot be
checked with this header. They are using ETags, which can also be used
for cache management, but work in a completely different way.

There are plenty more providers, some of which require a subscription
which I do not own - so they currently remain untested.

> 
> > On 23 Mar 2022, at 04:04, Stefan Schantl
> > <stefan.schantl@ipfire.org> wrote:
> > 
> > When using the "If-Modified-Since" header, the server can be
> > requested
> > if a modified version of the file can be served.
> > 
> > In case that is true, the file will be sent and stored by the
> > downloader
> > function. If the file has not been touched since the last time, the
> > server will respond with the code "304" (Not modified).
> > 
> > This tells us, that the current stored file is the latest one
> > (still up-to-date)
> > and we safely can skip the download attempt for this provider.
> > 
> > Signed-off-by: Stefan Schantl <stefan.schantl@ipfire.org>
> > ---
> > config/cfgroot/ids-functions.pl | 38
> > ++++++++++++++++++++++++++++++---
> > 1 file changed, 35 insertions(+), 3 deletions(-)
> > 
> > diff --git a/config/cfgroot/ids-functions.pl b/config/cfgroot/ids-
> > functions.pl
> > index dfbeb1a7d..d7df41dd1 100644
> > --- a/config/cfgroot/ids-functions.pl
> > +++ b/config/cfgroot/ids-functions.pl
> > @@ -365,9 +365,25 @@ sub downloadruleset ($) {
> >                 my $tmp = File::Temp->new( SUFFIX => ".tmp", DIR =>
> > "/var/tmp/", UNLINK => 0 );
> >                 my $tmpfile = $tmp->filename();
> > 
> > +               # Genarate and assign file name and path to store
> > the downloaded rules file.
> > +               my $dl_rulesfile = &_get_dl_rulesfile($provider);
> > +
> > +               # Load perl module to deal with file atributes.
> > +               use File::stat;
> 
> I don’t like modules in the middle of a script. This makes testing a
> new perl update more difficult. If I can load the script, I expect it
> to have all dependencies it needs. This does not work when modules
> are being loaded later on. I would say that we need to cover those
> things by a module test then.
> 
> I would prefer reliability over performance (I assume that this your
> argument why to load it only when you need it).

I'll move loading of the modules to the top of the script. This is also
the way all other (library) scripts do it.

This may lead to some more resource consumption, but it will help us to
execute the script in a safe way - even when anything gets updated.
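
For instance, the top of the script could then start with something like this (the exact module list is an assumption; only File::stat, File::Temp and HTTP::Request are visible in this patch):

	use strict;
	use warnings;

	use File::stat;
	use File::Temp;
	use HTTP::Request;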

Module tests generally are a good idea - I'm thinking about implementing
some for "ids-functions.pl" at a later time (after some important
changes are finished).

> 
> > +
> > +               # Get the mtime of the rulesfile if it exists.
> > +               my $mtime = (stat($dl_rulesfile)->mtime) if (-f
> > $dl_rulesfile);
> 
> I suggest making this a little bit simpler:
> 
> Create the request first, then stat the file (you don’t need to check
> if it exists first because stat will just tell you).
> 
> Then add the header if you have a timestamp; if not, then don’t add a
> useless header. This is a lot more straight forward :)
> 
> > +
> > +               # Convert the mtime into gmtime format.
> > +               my $gmtime = gmtime($mtime || 0);
> > +
> >                 # Pass the requested url to the downloader.
> >                 my $request = HTTP::Request->new(GET => $url);
> > 
> > +               # Add the If-Modified-Since header to the request,
> > containing the omited and converted
> > +               # mtime of the downloaded rules file, if one is
> > present.
> > +               $request->header( 'If-Modified-Since' => "$gmtime"
> > );
> 
> Will this survive any redirects? I know that some redirect to AWS S3
> and you will need to make sure that the header is being sent when you
> follow the redirect - assuming S3 supports this which I think they
> would.

If the servers in between are not entirely broken or do silly stuff,
they should.

> 
> > +
> >                 my $dl_attempt = 1;
> >                 my $response;
> > 
> > @@ -381,6 +397,14 @@ sub downloadruleset ($) {
> >                                 # Break loop.
> >                                 last;
> > 
> > +                       # Check if the server responds with 304
> > (Not Modified).
> > +                       } elsif ($response->code == 304) {
> > +                               # Log to syslog.
> > +                               &_log_to_syslog("Ruleset is up-to-
> > date, no update required.");
> > +
> > +                               # Nothing to be done, the ruleset
> > is up-to-date.
> > +                               return;
> > +
> >                         # Check if we ran out of download re-tries.
> >                         } elsif ($dl_attempt eq $max_dl_attempts) {
> >                                 # Obtain error.
> > @@ -406,6 +430,10 @@ sub downloadruleset ($) {
> >                 # Get the remote size of the downloaded file.
> >                 my $remote_filesize = $headers->content_length;
> > 
> > +               # Get the timestamp from header, when the file has
> > been modified the
> > +               # last time.
> > +               my $last_modified = $headers->last_modified;
> > +
> >                 # Load perl stat module.
> >                 use File::stat;
> 
> You are loading the module again.

Thanks, this is an additional point why loading all required modules at
the top is a good idea. You do not have to think about which modules are
already loaded or might be reloaded, etc.
> 
> > 
> > @@ -428,9 +456,6 @@ sub downloadruleset ($) {
> >                         return 1;
> >                 }
> > 
> > -               # Genarate and assign file name and path to store
> > the downloaded rules file.
> > -               my $dl_rulesfile = &_get_dl_rulesfile($provider);
> > -
> >                 # Check if a file name could be obtained.
> >                 unless ($dl_rulesfile) {
> >                         # Log error message.
> > @@ -449,6 +474,13 @@ sub downloadruleset ($) {
> >                 # Overwrite the may existing rulefile or tarball
> > with the downloaded one.
> >                 move("$tmpfile", "$dl_rulesfile");
> > 
> > +               # Check if the server respond contained a
> > last_modified value.
> > +               if ($last_modified) {
> > +                       # Assign the last modified timestamp from
> > server as mtime to the
> > +                       # rules file.
> > +                       utime(time(), "$last_modified",
> > "$dl_rulesfile");
> > +               }
> > +
> >                 # Delete temporary file.
> >                 unlink("$tmpfile");
> > 
> > -- 
> > 2.30.2
> > 
> 
> -Michael

-Stefan
  
Michael Tremer March 28, 2022, 3:15 p.m. UTC | #3
Hello,

> On 24 Mar 2022, at 18:50, Stefan Schantl <stefan.schantl@ipfire.org> wrote:
> 
> Hello Michael,
> 
> thanks for your thoughts and feedback.
> 
>> Hello,
>> 
>> I like this mechanism. It is quite powerful since it will save a lot
>> of unnecessary downloads and CPU cycles. However, I am not sure if
>> all providers will support this because some use crappy CDNs which do
>> not give anything about conforming to any RFC.
> 
> Currently I'm only aware that content which is hosted on github.com
> cannot be checked with this header flag. They are using Etags which
> also can be used for cache management, but works in a completely
> different way.

Yes, but we can still use them.

You can just store the Etag and send it with your next request.
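
A rough sketch of how that could look ($stored_etag and where it is cached are hypothetical and not part of this patch; $url and $ua as in the downloader):

	# Send the previously stored ETag, if we have one.
	my $request = HTTP::Request->new(GET => $url);
	$request->header('If-None-Match' => $stored_etag) if ($stored_etag);

	my $response = $ua->request($request);

	if ($response->code == 304) {
		# Not Modified - the cached ruleset is still current.
		return;
	} elsif ($response->is_success) {
		# Remember the new ETag for the next download attempt.
		my $etag = $response->header('ETag');
		# ... store $etag in a small cache file next to the ruleset ...
	}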

> There are plenty more providers, some which requires a subscription
> which I do not own - so they currently remains untested. 
> 
>> 
>>> On 23 Mar 2022, at 04:04, Stefan Schantl
>>> <stefan.schantl@ipfire.org> wrote:
>>> 
>>> When using the "If-Modified-Since" header, the server can be
>>> requested
>>> if a modified version of the file can be served.
>>> 
>>> In case that is true, the file will be sent and stored by the
>>> downloader
>>> function. If the file has not been touched since the last time, the
>>> server will respond with the code "304" (Not modified).
>>> 
>>> This tells us, that the current stored file is the latest one
>>> (still up-to-date)
>>> and we safely can skip the download attempt for this provider.
>>> 
>>> Signed-off-by: Stefan Schantl <stefan.schantl@ipfire.org>
>>> ---
>>> config/cfgroot/ids-functions.pl | 38
>>> ++++++++++++++++++++++++++++++---
>>> 1 file changed, 35 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/config/cfgroot/ids-functions.pl b/config/cfgroot/ids-
>>> functions.pl
>>> index dfbeb1a7d..d7df41dd1 100644
>>> --- a/config/cfgroot/ids-functions.pl
>>> +++ b/config/cfgroot/ids-functions.pl
>>> @@ -365,9 +365,25 @@ sub downloadruleset ($) {
>>> my $tmp = File::Temp->new( SUFFIX => ".tmp", DIR =>
>>> "/var/tmp/", UNLINK => 0 );
>>> my $tmpfile = $tmp->filename();
>>> 
>>> +  # Genarate and assign file name and path to store
>>> the downloaded rules file.
>>> +  my $dl_rulesfile = &_get_dl_rulesfile($provider);
>>> +
>>> +  # Load perl module to deal with file atributes.
>>> +  use File::stat;
>> 
>> I don’t like modules in the middle of a script. This makes testing a
>> new perl update more difficult. If I can load the script, I expect it
>> to have all dependencies it needs. This does not work when modules
>> are being loaded later on. I would say that we need to cover those
>> things by a module test then.
>> 
>> I would prefer reliability over performance (I assume that this your
>> argument why to load it only when you need it).
> 
> I'll move the load of the modules to the top of the script. This is
> also the way all other (library) scripts are doing this.

Thank you.

> This may will lead to some more resource consumption, but will help us
> to execute the script in a safe way - even when anything got updated.
> 
> Module tests generally are a good way - I'm thinking about implementing
> some at a later time (after some important changes are finished) for
> the "ids-functions.pl".
> 
>> 
>>> +
>>> +  # Get the mtime of the rulesfile if it exists.
>>> +  my $mtime = (stat($dl_rulesfile)->mtime) if (-f
>>> $dl_rulesfile);
>> 
>> I suggest making this a little bit simpler:
>> 
>> Create the request first, then stat the file (you don’t need to check
>> if it exists first because stat will just tell you).
>> 
>> Then add the header if you have a timestamp; if not, then don’t add a
>> useless header. This is a lot more straight forward :)
>> 
>>> +
>>> +  # Convert the mtime into gmtime format.
>>> +  my $gmtime = gmtime($mtime || 0);
>>> +
>>> # Pass the requested url to the downloader.
>>> my $request = HTTP::Request->new(GET => $url);
>>> 
>>> +  # Add the If-Modified-Since header to the request,
>>> containing the omited and converted
>>> +  # mtime of the downloaded rules file, if one is
>>> present.
>>> +  $request->header( 'If-Modified-Since' => "$gmtime"
>>> );
>> 
>> Will this survive any redirects? I know that some redirect to AWS S3
>> and you will need to make sure that the header is being sent when you
>> follow the redirect - assuming S3 supports this which I think they
>> would.
> 
> If the servers in between are not entirely broken or do silly stuff,
> they should.

This is under the control of the client. It should not matter what the server responds with.
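
If in doubt, this is easy to verify from the client side, for example (a sketch relying on the request() and redirects() accessors of HTTP::Response; $ua and $request as in the snippets above):

	my $response = $ua->request($request);

	# The request that produced the final response (after any redirects).
	my $final_request = $response->request;
	print "Final URL: ", $final_request->uri, "\n";
	print "If-Modified-Since sent: ",
		$final_request->header('If-Modified-Since') // "(none)", "\n";

	# Any intermediate redirect responses that were followed.
	foreach my $redirect ($response->redirects) {
		print "Redirected via: ", $redirect->header('Location'), "\n";
	}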

>> 
>>> +
>>> my $dl_attempt = 1;
>>> my $response;
>>> 
>>> @@ -381,6 +397,14 @@ sub downloadruleset ($) {
>>> # Break loop.
>>> last;
>>> 
>>> +  # Check if the server responds with 304
>>> (Not Modified).
>>> +  } elsif ($response->code == 304) {
>>> +  # Log to syslog.
>>> +  &_log_to_syslog("Ruleset is up-to-
>>> date, no update required.");
>>> +
>>> +  # Nothing to be done, the ruleset
>>> is up-to-date.
>>> +  return;
>>> +
>>> # Check if we ran out of download re-tries.
>>> } elsif ($dl_attempt eq $max_dl_attempts) {
>>> # Obtain error.
>>> @@ -406,6 +430,10 @@ sub downloadruleset ($) {
>>> # Get the remote size of the downloaded file.
>>> my $remote_filesize = $headers->content_length;
>>> 
>>> +  # Get the timestamp from header, when the file has
>>> been modified the
>>> +  # last time.
>>> +  my $last_modified = $headers->last_modified;
>>> +
>>> # Load perl stat module.
>>> use File::stat;
>> 
>> You are loading the module again.
> 
> Thanks, this is an additional point why, loading all required modules
> on top are a good idea. You do not have to think about which modules
> are already loaded or may be reloaded etc.
>> 
>>> 
>>> @@ -428,9 +456,6 @@ sub downloadruleset ($) {
>>> return 1;
>>> }
>>> 
>>> -  # Genarate and assign file name and path to store
>>> the downloaded rules file.
>>> -  my $dl_rulesfile = &_get_dl_rulesfile($provider);
>>> -
>>> # Check if a file name could be obtained.
>>> unless ($dl_rulesfile) {
>>> # Log error message.
>>> @@ -449,6 +474,13 @@ sub downloadruleset ($) {
>>> # Overwrite the may existing rulefile or tarball
>>> with the downloaded one.
>>> move("$tmpfile", "$dl_rulesfile");
>>> 
>>> +  # Check if the server respond contained a
>>> last_modified value.
>>> +  if ($last_modified) {
>>> +  # Assign the last modified timestamp from
>>> server as mtime to the
>>> +  # rules file.
>>> +  utime(time(), "$last_modified",
>>> "$dl_rulesfile");
>>> +  }
>>> +
>>> # Delete temporary file.
>>> unlink("$tmpfile");
>>> 
>>> -- 
>>> 2.30.2
>>> 
>> 
>> -Michael
> 
> -Stefan
  

Patch

diff --git a/config/cfgroot/ids-functions.pl b/config/cfgroot/ids-functions.pl
index dfbeb1a7d..d7df41dd1 100644
--- a/config/cfgroot/ids-functions.pl
+++ b/config/cfgroot/ids-functions.pl
@@ -365,9 +365,25 @@  sub downloadruleset ($) {
 		my $tmp = File::Temp->new( SUFFIX => ".tmp", DIR => "/var/tmp/", UNLINK => 0 );
 		my $tmpfile = $tmp->filename();
 
+		# Generate and assign file name and path to store the downloaded rules file.
+		my $dl_rulesfile = &_get_dl_rulesfile($provider);
+
+		# Load perl module to deal with file attributes.
+		use File::stat;
+
+		# Get the mtime of the rulesfile if it exists.
+		my $mtime = (stat($dl_rulesfile)->mtime) if (-f $dl_rulesfile);
+
+		# Convert the mtime into gmtime format.
+		my $gmtime = gmtime($mtime || 0);
+
 		# Pass the requested url to the downloader.
 		my $request = HTTP::Request->new(GET => $url);
 
+		# Add the If-Modified-Since header to the request, containing the obtained and converted
+		# mtime of the downloaded rules file, if one is present.
+		$request->header( 'If-Modified-Since' => "$gmtime" );
+
 		my $dl_attempt = 1;
 		my $response;
 
@@ -381,6 +397,14 @@  sub downloadruleset ($) {
 				# Break loop.
 				last;
 
+			# Check if the server responds with 304 (Not Modified).
+			} elsif ($response->code == 304) {
+				# Log to syslog.
+				&_log_to_syslog("Ruleset is up-to-date, no update required.");
+
+				# Nothing to be done, the ruleset is up-to-date.
+				return;
+
 			# Check if we ran out of download re-tries.
 			} elsif ($dl_attempt eq $max_dl_attempts) {
 				# Obtain error.
@@ -406,6 +430,10 @@  sub downloadruleset ($) {
 		# Get the remote size of the downloaded file.
 		my $remote_filesize = $headers->content_length;
 
+		# Get the timestamp from the header which indicates when the file
+		# has been modified the last time.
+		my $last_modified = $headers->last_modified;
+
 		# Load perl stat module.
 		use File::stat;
 
@@ -428,9 +456,6 @@  sub downloadruleset ($) {
 			return 1;
 		}
 
-		# Genarate and assign file name and path to store the downloaded rules file.
-		my $dl_rulesfile = &_get_dl_rulesfile($provider);
-
 		# Check if a file name could be obtained.
 		unless ($dl_rulesfile) {
 			# Log error message.
@@ -449,6 +474,13 @@  sub downloadruleset ($) {
 		# Overwrite the may existing rulefile or tarball with the downloaded one.
 		move("$tmpfile", "$dl_rulesfile");
 
+		# Check if the server response contained a last_modified value.
+		if ($last_modified) {
+			# Assign the last modified timestamp from server as mtime to the
+			# rules file.
+			utime(time(), "$last_modified", "$dl_rulesfile");
+		}
+
 		# Delete temporary file.
 		unlink("$tmpfile");