Fix bug 11558 updxlrator: use mirror mode for SHA1 filenames

Message ID 43f953c8-7f0a-c7a2-2835-bae749b2a9b9@mail.com
State Dropped
Series Fix bug 11558 updxlrator: use mirror mode for SHA1 filenames

Commit Message

Justin Luth Dec. 31, 2017, 6:12 a.m. UTC
Most Microsoft updates now contain an SHA1 hash in the filename.
Since these files are uniquely identifiable, use mirror mode
(which hashes just the filename instead of the entire URL)
to cache them. But first check the URL cache, in case the file
has already been downloaded and cached under its full URL.
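
As a rough illustration of the difference between the two modes
(a sketch only: the md5_hex keys and the cache_key_for() helper are
made up for the example and are not the exact scheme check_cache uses):

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Basename qw(basename);

# $unique: key the cache entry on the full URL.
# $mirror: key the cache entry on the filename only, so the same file
#          fetched from a different mirror maps to the same entry.
my ($unique, $mirror) = (0, 1);

# Hypothetical helper, just to show the difference between the modes.
sub cache_key_for {
    my ($url, $mode) = @_;
    return $mode == $mirror ? md5_hex(basename($url)) : md5_hex($url);
}

my $url = "http://7.au.download.windowsupdate.com/d/msdownload/update/others/2015/03/16743052_f84687743a71a750edef8ffedd978602a2592000.cab";

# Lookup order described above: try the URL-based key first, and only
# fall back to the filename-based key for SHA1-named Microsoft files.
print "URL key:      ", cache_key_for($url, $unique), "\n";
print "filename key: ", cache_key_for($url, $mirror), "\n";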

This is a HUGELY needed fix. Windows 10 updates are 5+ GB
per month, and we lose several days of bandwidth downloading
duplicates from different mirrors. Sometimes a single client
will request the same patch from multiple mirrors. That's bad.
This patch will save a ton of bandwidth and a lot of disk space.

The patch limits the SHA1 test to Microsoft only, but it could
easily be extended to other vendors if the need arises.
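
If that need ever comes up, the guard could read from a small vendor
allow-list instead of hard-coding one vendor. A rough sketch (the
%sha1_vendors hash and wants_sha1_fallback() are hypothetical and not
part of this patch; only "microsoft" is actually covered by the change):

use strict;
use warnings;

# Hypothetical allow-list of vendors whose filenames carry an SHA1 hash.
my %sha1_vendors = map { $_ => 1 } qw(microsoft);

sub wants_sha1_fallback {
    my ($vendorid, $url) = @_;
    return $sha1_vendors{lc $vendorid}
        && $url =~ m@[0-9a-f]{40}\.[^\.]+@i;
}

# The sample file name from the test URL below matches the pattern.
my $hit = wants_sha1_fallback("Microsoft",
    "16743052_f84687743a71a750edef8ffedd978602a2592000.cab");
print $hit ? "fallback\n" : "no fallback\n";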

Signed-off-by: Justin Luth <jluth@mail.com>
---
This is a slight hack, because the fix is tucked away in a
somewhat obscure function. Someone could completely redesign
this with more modular functions that create the hash, check
whether the file exists, and so on. But this patch is neatly
contained in one section of the code and doesn't modify anything
else, so I think the simplicity and elegance warrant the hackiness.

Because the fix is tucked away in the check_cache function,
I added one comment in the Microsoft section of the main loop,
clearly alerting future programmers to the change.
Originally I had put the SHA1 test there instead, but doing so
required pre-processing the caches and renaming the hash
identifiers. This patch avoids that ugly business.

This patch works beautifully because it never downloads anything
extra. If you have already cached the URL, the file will not be
downloaded again just to create a filename-keyed copy. But if you
now hit a different mirror, you will download the file one more
time (as before), and after that every other mirror will be
served from the filename-keyed cache entry.
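
To make that concrete, here is a tiny demonstration (illustration
only, nothing from the real code): the mirror hostname varies but the
filename does not, so once the filename-keyed entry exists, every
other mirror maps to it:

use strict;
use warnings;
use File::Basename qw(basename);

# The same update offered by several mirrors of download.windowsupdate.com.
my @mirrors = map { "http://$_/d/msdownload/update/others/2015/03/16743052_f84687743a71a750edef8ffedd978602a2592000.cab" }
    qw(7.au.download.windowsupdate.com
       3.au.download.windowsupdate.com
       au.download.windowsupdate.com
       download.windowsupdate.com);

my %seen;   # stands in for the filename-keyed cache
for my $url (@mirrors) {
    my $file   = basename($url);
    my $status = $seen{$file}++ ? "served from cache:" : "downloaded once:  ";
    print "$status $url\n";
}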

In the bug report there is a script that can be tweaked to RENAME
an existing URL hash into a filename hash, in case any site really
wants to avoid even that one possible re-download of a file it
already has. But since I haven't seen anyone else complaining
about this problem, I doubt anyone would be interested.

A good test URL (that is a small file, not 1+ GB) is
7.au.download.windowsupdate.com/d/msdownload/update/others/2015/03/16743052_f84687743a71a750edef8ffedd978602a2592000.cab
You can use numbers other than 7, remove the "7.", or remove the
"7.au." prefix to access different mirrors of the same file.
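
If you want to check it end to end, something like the following
works (the proxy address 192.168.0.1:800 is just an example -- use
your own IPFire GREEN address and proxy port). The second fetch
should come back noticeably faster once the accelerator has the
file cached:

use strict;
use warnings;
use LWP::UserAgent;
use Time::HiRes qw(time);

my $url = "http://7.au.download.windowsupdate.com/d/msdownload/update/others/2015/03/16743052_f84687743a71a750edef8ffedd978602a2592000.cab";

my $ua = LWP::UserAgent->new;
# Example proxy address only; point this at your own IPFire proxy.
$ua->proxy('http', 'http://192.168.0.1:800/');

for my $pass (1, 2) {
    my $t0  = time();
    my $res = $ua->get($url);
    printf "pass %d: %s in %.2fs\n", $pass, $res->status_line, time() - $t0;
}
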
---
  config/updxlrator/updxlrator | 13 +++++++++++++
  1 file changed, 13 insertions(+)
  

Patch

diff --git a/config/updxlrator/updxlrator b/config/updxlrator/updxlrator
index 5baaaae58..ff23b3a95 100644
--- a/config/updxlrator/updxlrator
+++ b/config/updxlrator/updxlrator
@@ -86,6 +86,8 @@  while (<>) {
  	&&   ($source_url !~ m@\&@)
  	   )
  	{
+		# NOTE: check_cache will change to $mirror instead of $unique if the filename contains an SHA1 hash
+		# and the URL is not found in cache!
  		$xlrator_url = &check_cache($source_url,$hostaddr,$username,"Microsoft",$unique);
  	}
  
@@ -400,6 +402,17 @@  sub check_cache
  		&debuglog("Retrieving file from cache ($updsource) for $hostaddr");
  		&setcachestatus("$updcachedir/$vendorid/$uuid/access.log",time);
  		$cacheurl="http://$netsettings{'GREEN_ADDRESS'}:$http_port/updatecache/$vendorid/$uuid/$updfile";
+	}
+	elsif (
+		($cfmirror == $unique) &&
+		(lc($vendorid) eq "microsoft") &&
+		($source_url =~ m@.*[0-9a-f]{40}\.[^\.]+@i)
+	      )
+	{
+			# Most Microsoft updates now have an SHA1 hash in the name. These should be treated as unique files.
+			# Since it wasn't found in the URL cache, switch to mirror mode and try again using just the filename.
+			&debuglog("SHA1: $vendorid $uuid not cached. Reprocessing as mirror $source_url");
+			$cacheurl = &check_cache($source_url,$hostaddr,$username,$vendorid,$mirror);
  	}
  		else
  	{
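
One note on the recursion in check_cache: the fallback call passes
$mirror, so on the second pass the ($cfmirror == $unique) guard is
false and the function carries on with the normal download path; it
can never recurse more than once. A much-simplified model of that
control flow (not the real check_cache; the hashes below just stand
in for the on-disk cache):

use strict;
use warnings;
use File::Basename qw(basename);

my ($unique, $mirror) = (0, 1);
my (%url_cache, %filename_cache);   # stand-ins for the on-disk cache

sub check_cache_sketch {
    my ($url, $vendorid, $cfmirror) = @_;
    my $key    = $cfmirror == $mirror ? basename($url) : $url;
    my $cached = $cfmirror == $mirror ? $filename_cache{$key} : $url_cache{$key};

    if ($cached) {
        return "cache hit ($key)";
    }
    elsif ($cfmirror == $unique
           && lc($vendorid) eq "microsoft"
           && $url =~ m@[0-9a-f]{40}\.[^\.]+@i) {
        # Not in the URL cache: retry exactly once, keyed on the filename.
        return check_cache_sketch($url, $vendorid, $mirror);
    }
    # A second (or non-SHA1) pass ends up here and triggers the normal download.
    return "miss -> download and cache ($key)";
}

my $url = "http://au.download.windowsupdate.com/d/msdownload/update/others/2015/03/16743052_f84687743a71a750edef8ffedd978602a2592000.cab";
my $result = check_cache_sketch($url, "Microsoft", $unique);
print "$result\n";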