Lazy distro mirrors with squid

I have a problem that I think a lot of fellow developers probably share: I have enough computers (or virtual machines!) running the same operating system version(s) that I would benefit from a local mirror, but not so many that it’s actually reasonable for me to run a full one. A full mirror would entail rsyncing a bunch of content daily, much of which may be packages I would never use. And an ordinary proxy server isn’t terribly practical, because with a bunch of semi-round-robin mirrors, it’s likely that two systems would pull the same package from different mirrors. A proxy server would have no way of knowing (ahead of time) that the two documents were actually the same.

What I wanted for a long time was a “lazy” mirror: something that would appear to my systems as a full mirror, but would act more as a proxy. When a client installed a particular version of a particular package for the first time, the lazy mirror would fetch it from a “real” mirror and then cache it for a long time. Subsequent requests for the same package from my “mirror” would be served from cache. I was convinced that this was impossible to do with a proxy server. Worse, I wanted to mirror multiple repos: Fedora and CentOS and EPEL, and maybe even Ubuntu. There’s no way squid can do that.

I was wrong. squid is pretty awesome. We just pull a few tricks:

  • Instead of using squid as a traditional proxy server that listens on port 3128, use it as a reverse proxy / accelerator that listens on port 80. (This is, incidentally, what sites like Wikipedia do.)
  • Massage the refresh_pattern rules to cache RPM files (etc.) for a very long time. Normally it is an awful, awful idea for proxy servers to interfere with the Cache-Control / Expires headers that sites serve. But in the case of a mirror, we know that any update to a package will necessarily bump the version number in the URL. Ergo, we can pretty safely cache RPMs indefinitely.
  • Set up name-based virtual hosting with squid, so that centos-mirror.lan and fedora-mirror.lan can point to different mirrors.

Two other important steps involve setting up cache_dir reasonably (by default, at least in the packages on CentOS 6, squid will only cache data in RAM), and bumping up maximum_object_size from the default of 4MB.

Here is the relevant section of my squid.conf. (The “irrelevant” section of my squid.conf is a bunch of acl lines that I haven’t really customized and can probably be deleted.)

# Listen on port 80, not 3128
# 'accel' tells squid that it's a reverse proxy
# 'defaultsite' sets the hostname that will be used if none is provided
# 'vhost' tells squid that it'll use name-based virtual hosting. I'm not
#   sure if this is actually needed.
http_port 80 accel defaultsite=mirror.lowell.lan vhost

# Create a disk-based cache of up to 10GB in size:
# (10000 is the size in MB. 16 and 256 seem to set how many subdirectories
#  are created, and are default values.)
cache_dir ufs /var/spool/squid 10000 16 256

# Use the LFUDA cache eviction policy -- Least Frequently Used, with
#  Dynamic Aging. http://www.squid-cache.org/Doc/config/cache_replacement_policy/
# It's more important to me to keep bigger files in cache than to keep
# more, smaller files -- I am optimizing for bandwidth savings, not latency.
cache_replacement_policy heap LFUDA

# Do unholy things with refresh_pattern.
# The top two are new lines, and probably aren't everything you would ever
# want to cache -- I don't account for VM images, .deb files, etc.
# They're cached for 129600 minutes, which is 90 days.
# refresh-ims and override-expire are described in the configuration here:
#  http://www.squid-cache.org/Doc/config/refresh_pattern/
# but basically, refresh-ims makes squid check with the backend server
# when someone does a conditional get, to be cautious.
# override-expire lets us override the specified expiry time. (This is
#  illegal per the RFC, but works for our specific purposes.)
# You will probably want to tune this part.
refresh_pattern -i \.rpm$ 129600 100% 129600 refresh-ims override-expire
refresh_pattern -i \.iso$ 129600 100% 129600 refresh-ims override-expire
refresh_pattern ^ftp:           1440    20%     10080
refresh_pattern ^gopher:        1440    0%      1440
refresh_pattern -i (/cgi-bin/|\?) 0     0%      0
refresh_pattern .               0       20%     4320

# This is OH SO IMPORTANT: squid defaults to not caching objects over
# 4MB, which may be a reasonable default, but is awful behavior on our
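# pseudo-mirror. Let's make it 4GB:
# (One commenter below found that this had to appear before cache_dir to
#  take effect on their squid3 install, so move it up if big files don't cache.)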
maximum_object_size 4096 MB

# Now, let's set up several mirrors. These work sort of like Apache
# name-based virtual hosts -- you get different content depending on
# which hostname you use in your request, even on the same IP. This lets
# us mirror more than one distro on the same machine.

# cache_peer is used here to set an upstream origin server:
#   'mirror.us.as6453.net' is the hostname of the mirror I connect to.
#   'parent' tells squid that this is a 'parent' server, not a peer
#    '80 0' sets the HTTP port (80) and ICP port (0)
#    'no-query' stops ICP queries, which should only be used between squid servers
#    'originserver' tells squid that this is a server that originates content,
#      not another squid server.
#    'name=as6453' tags it with a name we use on the next line.
# cache_peer_domain is used for virtual hosting.
#    'as6453' is the name we set on the previous line (for cache_peer)
#    subsequent words are virtual hostnames it answers to. (This particular
#     mirror has Fedora and Debian content mirrored.) These are the hostnames
#     you set up and will use to access content.
# Taken together, these two lines tell squid that, when it gets a request for
#  content on fedora-mirror.lowell.lan or debian-mirror.lowell.lan, it should
#  route the request to mirror.us.as6453.net and cache the result.
cache_peer mirror.us.as6453.net parent 80 0 no-query originserver name=as6453
cache_peer_domain as6453 fedora-mirror.lowell.lan debian-mirror.lowell.lan

# Another, for CentOS:
cache_peer mirrors.seas.harvard.edu parent 80 0 no-query originserver name=harvard
cache_peer_domain harvard centos-mirror.lowell.lan

You will really want to customize this. The as6453.net and harvard.edu mirrors happen to be geographically close to me and very fast, but that might not be true for you. Check out the CentOS mirror list and Fedora mirror list to find something close by. (And perhaps fetch a file or two with wget to check speeds.) And I’m reasonably confident that you don’t have a lowell.lan domain in your home.

If you can find one mirror that has all the distros you need, you don’t need to bother with virtual hosts.
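
If you do keep the virtual hosts, be aware that the config above was written against the squid shipped with CentOS 6. As far as I know, cache_peer_domain was dropped in Squid 4 in favor of cache_peer_access with a dstdomain ACL. A rough equivalent of the two mirror stanzas might look like the sketch below. The ACL names (fedora_debian_hosts, centos_hosts) are made up for illustration, and I haven't tested this on every squid version, so treat it as a starting point:

```
# Hypothetical ACL names; the hostnames are the same ones used above.
acl fedora_debian_hosts dstdomain fedora-mirror.lowell.lan debian-mirror.lowell.lan
acl centos_hosts dstdomain centos-mirror.lowell.lan

cache_peer mirror.us.as6453.net parent 80 0 no-query originserver name=as6453
cache_peer_access as6453 allow fedora_debian_hosts
cache_peer_access as6453 deny all

cache_peer mirrors.seas.harvard.edu parent 80 0 no-query originserver name=harvard
cache_peer_access harvard allow centos_hosts
cache_peer_access harvard deny all
```

The explicit "deny all" lines keep each peer from being used for the other virtual hosts.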

You can edit the respective repos in /etc/yum.repos.d/ to point to the hostnames you set up. Pay attention to whether the mirror matches the URL structure the file defaults to or not.
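
As a rough illustration, the change to a Fedora repo file might look something like this. The path after the hostname is an assumption based on the usual Fedora mirror layout, so compare it against what your chosen mirror actually serves. $releasever and $basearch are the standard yum variables.

```
# /etc/yum.repos.d/fedora.repo (only the lines that change are shown)
[fedora]
# Comment out the stock metalink= or mirrorlist= line, then add:
baseurl=http://fedora-mirror.lowell.lan/fedora/linux/releases/$releasever/Everything/$basearch/os/
```

Leave the gpgcheck and gpgkey lines alone. The packages are still signed by the distro, no matter which mirror or cache they come through.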

You can just drop the hostnames in /etc/hosts if you don’t have a home DNS server, e.g.:

172.16.1.100 fedora-mirror.lowell.lan centos-mirror.lowell.lan

11 thoughts on “Lazy distro mirrors with squid”

  1. Oh, a quick tip:

    You can test if you’re properly caching things by running something like this:

    wget -S http://fedora-mirror.lowell.lan/fedora/linux/releases/20/Fedora/x86_64/iso/Fedora-20-x86_64-netinst.iso

    The -S flag shows HTTP headers, including the ones squid inserts (X-Cache, X-Cache-Lookup, Via). On the first run, X-Cache will be a MISS (since it’s presumably new), but if you re-run that command, it should be a HIT, and considerably faster.

    You might try that with something smaller than an ISO file, but I wanted to test that the maximum object size was set high enough as well.
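
    If you would rather not eyeball the headers, a small shell helper can pull out just the verdict. This is only a sketch: xcache_status is a name I made up, and it assumes the header looks like "X-Cache: MISS from <host>", which is what squid normally sends.

```shell
# Print the verdict (HIT or MISS) from every X-Cache header read on stdin.
# The X-Cache-Lookup header is deliberately not matched.
xcache_status() {
    awk 'tolower($1) == "x-cache:" { print toupper($2) }'
}

# Usage (wget writes the headers to stderr, hence the 2>&1):
#   wget -S -O /dev/null "$URL" 2>&1 | xcache_status
```

    Run it twice against the same URL. The first run should print MISS and the second HIT.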

  2. Thank you! I was thinking of a way to do this with cgi, perl, php or even a custom executable. I also considered squid and other proxies, but thought to myself it would surely be a fail; however, as a rule I spend at least a day searching before I reinvent the wheel, skip over it entirely, or press the ‘retail button’. I was about to move on to my real projects and just go the rsync full mirror route, but you have done the test work that says this will not waste my time. Now I can have a small set of local repos which update only what’s used and continue on to make my testing environment, and things will be fast while using minimal WAN access. For this you’re getting a google+ post (that also copies to my Twitter, Linkedin, and Facebook profiles) as it’s the least I can do.

  3. Pingback: RPMGOT | S12

  4. Hi, nice setup. The only thing I didn’t like is the fact that you have to make changes on the target boxes, i.e. modifying hosts and stuff. Besides, it probably won’t handle downloads from servers that aren’t in squid.conf. So my question is: is it possible to force squid to hit the files in cache by name and size only, ignoring the whole URI except the file name? I would also like it running as a transparent (intercept) proxy.

  5. Hi,
    First of all, thanks a lot for writing up a simple approach to running a mirror with squid. I had been desperately searching for this kind of setup for a long time. Keep up the good work.

    My issue: I am able to install the packages if I switch over to the original sources.list file, and I am also able to browse to the link that errors. So connectivity issues. I am clueless about what I did wrong. Please help.

    (1) I have squid3 on Ubuntu_14.04_LTS

    (2) My squid.conf

    root@fab-stg-repo-h1:~#cat /etc/squid3/squid.conf
    http_access allow all
    http_port 80 accel defaultsite=fab-stg-repo-h1 vhost

    # This needs to be specified before the cache_dir, otherwise it’s ignored ?!

    maximum_object_size 6144 MB

    cache_dir ufs /squid3 50000 16 256

    cache_replacement_policy heap LFUDA

    refresh_pattern -i .rpm$ 129600 100% 129600 refresh-ims override-expire
    refresh_pattern -i .iso$ 129600 100% 129600 refresh-ims override-expire
    refresh_pattern -i .deb$ 129600 100% 129600 refresh-ims override-expire
    refresh_pattern -i .gz$ 129600 100% 129600 refresh-ims override-expire
    refresh_pattern -i .bz2$ 129600 100% 129600 refresh-ims override-expire
    refresh_pattern -i .tar.xz$ 129600 100% 129600 refresh-ims override-expire
    refresh_pattern -i .tar.gz$ 129600 100% 129600 refresh-ims override-expire
    refresh_pattern -i .zip$ 129600 100% 129600 refresh-ims override-expire
    refresh_pattern ^ftp: 1440 20% 10080
    refresh_pattern ^gopher: 1440 0% 1440
    refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
    refresh_pattern . 0 20% 4320
    #

    cache_peer us.archive.ubuntu.com parent 80 0 no-query originserver name=ubuntu
    cache_peer_domain ubuntu fab-stg-repo-h1

    (3) Client side sources.list

    deb http://fab-stg-repo-h1/ubuntu/ precise main restricted universe
    deb-src http://fab-stg-repo-h1/ubuntu/ precise main restricted universe

    deb http://fab-stg-repo-h1/ubuntu/ precise-security main restricted universe
    deb http://fab-stg-repo-h1/ubuntu/ precise-updates main restricted universe
    deb http://fab-stg-repo-h1/ubuntu/ precise-backports main restricted universe
    deb-src http://fab-stg-repo-h1/ubuntu/ precise-security main restricted universe
    deb-src http://fab-stg-repo-h1/ubuntu/ precise-updates main restricted universe
    deb-src http://fab-stg-repo-h1/ubuntu/ precise-backports main restricted universe

    deb http://fab-stg-repo-h1/ubuntu/ precise partner
    deb-src http://fab-stg-repo-h1/ubuntu/ precise partner

    deb http://fab-stg-repo-h1/ubuntu precise main
    deb-src http://fab-stg-repo-h1/ubuntu precise main

    (4) Result of apt-get update on client side:

    Ign http://fab-stg-repo-h1 precise-backports/main Translation-en_US
    Ign http://fab-stg-repo-h1 precise-backports/restricted Translation-en_US
    Ign http://fab-stg-repo-h1 precise-backports/restricted Translation-en
    Ign http://fab-stg-repo-h1 precise-backports/universe Translation-en_US
    Fetched 1,212 kB in 1s (1,043 kB/s)
    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise/Release Unable to find expected entry ‘partner/source/Sources’ in Release file (Wrong sources.list entry or malformed file)

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise/universe/source/Sources 404 Not Found

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-security/main/source/Sources 404 Not Found

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-security/restricted/source/Sources 404 Not Found

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-security/universe/source/Sources 404 Not Found

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-security/main/binary-amd64/Packages 404 Not Found

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-security/restricted/binary-amd64/Packages 404 Not Found

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-security/universe/binary-amd64/Packages 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-security/main/binary-i386/Packages 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-security/restricted/binary-i386/Packages 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-security/universe/binary-i386/Packages 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-updates/main/source/Sources 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-updates/restricted/source/Sources 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-updates/universe/source/Sources 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-updates/main/binary-amd64/Packages 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-updates/restricted/binary-amd64/Packages 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-updates/universe/binary-amd64/Packages 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-updates/main/binary-i386/Packages 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-updates/restricted/binary-i386/Packages 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-updates/universe/binary-i386/Packages 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-backports/main/source/Sources 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-backports/restricted/source/Sources 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-backports/universe/source/Sources 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-backports/main/binary-amd64/Packages 500 Internal Server Error

    W: Failed to fetch http://fab-stg-repo-h1/ubuntu/dists/precise-backports/restricted/binary-amd64/Packages 500 Internal Server Error

    E: Some index files failed to download. They have been ignored, or old ones used instead.
    root@fab-stage-stm-h5:/etc/apt/sources.list.d#

    (5) Logs on squid server

    root@fab-stg-repo-h1:~# tail /var/log/squid3/access.log

    1461948485.306 0 10.100.163.162 TCP_MISS/500 4172 GET http://fab-stg-repo-h1/ubuntu/dists/precise-updates/main/i18n/Translation-en – HIER_NONE/- text/html
    1461948485.306 0 10.100.163.162 TCP_MISS/500 4199 GET http://fab-stg-repo-h1/ubuntu/dists/precise-updates/restricted/i18n/Translation-en_US – HIER_NONE/- text/html
    1461948485.307 0 10.100.163.162 TCP_MISS/500 4190 GET http://fab-stg-repo-h1/ubuntu/dists/precise-updates/restricted/i18n/Translation-en – HIER_NONE/- text/html
    1461948485.307 0 10.100.163.162 TCP_MISS/500 4193 GET http://fab-stg-repo-h1/ubuntu/dists/precise-updates/universe/i18n/Translation-en_US – HIER_NONE/- text/html
    1461948485.307 0 10.100.163.162 TCP_MISS/500 4184 GET http://fab-stg-repo-h1/ubuntu/dists/precise-updates/universe/i18n/Translation-en – HIER_NONE/- text/html
    1461948485.307 0 10.100.163.162 TCP_MISS/500 4187 GET http://fab-stg-repo-h1/ubuntu/dists/precise-backports/main/i18n/Translation-en_US – HIER_NONE/- text/html
    1461948485.307 0 10.100.163.162 TCP_MISS/500 4205 GET http://fab-stg-repo-h1/ubuntu/dists/precise-backports/restricted/i18n/Translation-en_US – HIER_NONE/- text/html
    1461948485.308 0 10.100.163.162 TCP_MISS/500 4196 GET http://fab-stg-repo-h1/ubuntu/dists/precise-backports/restricted/i18n/Translation-en – HIER_NONE/- text/html
    1461948485.308 0 10.100.163.162 TCP_MISS/500 4199 GET http://fab-stg-repo-h1/ubuntu/dists/precise-backports/universe/i18n/Translation-en_US – HIER_NONE/- text/html
    1461948592.626 0 10.100.162.223 TCP_IMS_HIT/304 306 GET http://fab-stg-repo-h1/ubuntu/dists/precise/Release – HIER_NONE/- -

    root@fab-stg-repo-h1:~# tail /var/log/squid3/cache.log

    2016/04/29 09:48:04| TCP connection to us.archive.ubuntu.com/80 failed
    2016/04/29 09:48:05| TCP connection to us.archive.ubuntu.com/80 failed
    2016/04/29 09:48:05| TCP connection to us.archive.ubuntu.com/80 failed
    2016/04/29 09:48:05| TCP connection to us.archive.ubuntu.com/80 failed
    2016/04/29 09:48:05| TCP connection to us.archive.ubuntu.com/80 failed
    2016/04/29 09:48:05| TCP connection to us.archive.ubuntu.com/80 failed
    2016/04/29 09:48:05| TCP connection to us.archive.ubuntu.com/80 failed
    2016/04/29 09:48:05| Detected DEAD Parent: ubuntu
    2016/04/29 09:48:05| Detected REVIVED Parent: ubuntu
    2016/04/29 09:53:09| temporary disabling (Not Found) digest from us.archive.ubuntu.com

    root@fab-stg-repo-h1:~# tail /var/log/squid3/netdb.state

    91.189.91.0 1 1 17.00000 77.00000 1461923671 1461923371 us.archive.ubuntu.com
    91.189.91.0 1 1 17.00000 77.00000 1461923671 1461923371 us.archive.ubuntu.com
    91.189.91.0 1 1 17.00000 77.00000 1461923671 1461923371 us.archive.ubuntu.com
    91.189.91.0 1 1 17.00000 77.00000 1461923671 1461923371 us.archive.ubuntu.com
    91.189.91.0 5 5 13.80000 78.60000 1461939067 1461938767 us.archive.ubuntu.com
    10.100.161.0 1 1 1.00000 1.00000 1461936952 1461936652 fab-stg-repo-h1
    91.189.91.0 6 6 14.44000 78.48000 1461941472 1461941172 us.archive.ubuntu.com
    10.100.161.0 1 1 1.00000 1.00000 1461936952 1461936652 fab-stg-repo-h1
