Gzipping a website in an Amazon S3 bucket

Amazon S3 makes it easy to host static websites. It goes without saying that since the websites are static, there is no server-side "sugar" such as dynamic HTTP headers that depend on the user's request. Unfortunately, this also means the S3 service can't choose between a compressed and an uncompressed response depending on the request.

At the same time, modern webpages keep growing: a page with jQuery and Bootstrap can easily reach 300+ KB, and downloading several such files makes the response time even worse. Search engines penalize sites with high latency, and no matter how fast your server is, it will most likely not reduce latency as much as compression does. And it's not only about search engines, it's quite reasonable on its own: slow responses make users unhappy. Remember how you feel when your OS takes several minutes to boot? It's the same.

So the lack of compression sucks. Really. And no, I don't know a way to make S3 behave like a full web server. But it doesn't matter. Can you name a browser that doesn't support gzip compression? Do you know anyone who uses one? Personally I don't. Even lynx supports gzipped responses. So most likely nothing bad happens if we make our site gzip-only. Yes, I know, I dreamed about a better solution too, but this is the best I've found so far. If you can suggest something better, please contact me, I will be happy to hear about your approach. Meanwhile, gzipping webpages before submitting them to the S3 bucket works quite well for me, and I'm going to describe how I use it.

And finally, one more piece of bad news (the last, I promise): gzipping the same file repeatedly produces archives with different hash sums, because gzip stores a modification timestamp in its header. At the same time, tools like s3cmd synchronize a bucket with a local directory based on hash sums and object sizes. This means you can't just re-gzip the whole site every time you update a page. Actually you can, but then you either have to synchronize the bucket manually, which is a huge pain in the ass, or you trigger useless re-uploads of unchanged pages that you have to pay for.
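
To see the effect in isolation, here is a tiny demonstration in plain Python (it is not part of my script, and the sample page is made up): gzip writes a timestamp into the archive header, so compressing the same bytes twice yields different MD5 sums.

import gzip
import hashlib
import time

page = b"<html><body>Hello, S3!</body></html>"  # stand-in for a generated page

first = gzip.compress(page)
time.sleep(2)                                   # pretend this is tomorrow's rebuild
second = gzip.compress(page)

# Identical input, different archives: the gzip header contains a timestamp,
# so the MD5 sums that s3cmd compares no longer match.
print(hashlib.md5(first).hexdigest())
print(hashlib.md5(second).hexdigest())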

So what do I suggest? Automated synchronization of changed files only. For this purpose I keep two directories: one with the latest uncompressed content, and another with the published gzipped content. Before publication, the latest content is compared with the decompressed published content; if they differ, the latest version is compressed and replaces the old archive in the second directory. If they are the same, nothing happens and the gzipped page stays untouched, as intended. This way s3cmd synchronizes only the pages that actually changed. A sketch of the check is shown below.
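
To make the idea concrete, here is a minimal sketch of that check for a single page; the function and file handling below are my own illustration, not code taken from the script discussed later.

import gzip
import os

def sync_compressed(src_path, dst_path):
    """Re-gzip src_path into dst_path only if the content really changed."""
    with open(src_path, "rb") as src:
        latest = src.read()
    if os.path.exists(dst_path):
        with gzip.open(dst_path, "rb") as published:
            if published.read() == latest:
                # Page is unchanged: keep the old archive so its MD5 stays
                # stable and s3cmd skips it during synchronization.
                return
    # New or changed page: replace the published archive.
    with gzip.open(dst_path, "wb") as dst:
        dst.write(latest)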

As a static site generator I use Pelican, so my website is built with make. Deploying to Amazon S3 involves these Makefile targets:

compress:
    python tools/aws-s3-gzip-compression.py $(OUTPUTDIR) $(S3_PUBLICATION_DIR)

s3_gzip_upload: publish compress
    s3cmd sync $(S3_PUBLICATION_DIR)/ s3://$(S3_BUCKET) --acl-public --add-header \
      "Content-Encoding:gzip" --mime-type="application/javascript; charset=utf-8" \
      --add-header "Cache-Control: max-age 86400" --exclude '*' --include '*.js' && \
    s3cmd sync $(S3_PUBLICATION_DIR)/ s3://$(S3_BUCKET) --acl-public --add-header \
      "Content-Encoding:gzip" --mime-type="text/css; charset=utf-8" --add-header \
      "Cache-Control: max-age 86400" --exclude '*' --include '*.css' && \
    s3cmd sync $(S3_PUBLICATION_DIR)/ s3://$(S3_BUCKET) --acl-public --add-header \
      "Content-Encoding:gzip" --mime-type="text/html; charset=utf-8" --exclude '*' \
      --include '*.html' && \
    s3cmd sync $(S3_PUBLICATION_DIR)/ s3://$(S3_BUCKET) --acl-public --add-header \
      "Content-Encoding:gzip" --mime-type="application/xml; charset=utf-8" --exclude \
      '*' --include '*.xml'  && \
    s3cmd sync $(S3_PUBLICATION_DIR)/static/ s3://$(S3_BUCKET)/static/ --acl-public \
      --add-header "Cache-Control: max-age 86400" && \
    s3cmd sync $(S3_PUBLICATION_DIR)/theme/ s3://$(S3_BUCKET)/theme/ --acl-public \
      --add-header "Cache-Control: max-age 86400" && \
    s3cmd sync $(S3_PUBLICATION_DIR)/ s3://$(S3_BUCKET) --acl-public --delete-removed

The publish target simply builds the website and puts the resulting content into the $(OUTPUTDIR) directory, the compress target gzips the changed files and puts them into $(S3_PUBLICATION_DIR), and s3_gzip_upload synchronizes that local directory with the S3 bucket while setting some useful headers.

Let's talk about compression first. It's done by a separate, self-written Python script that you can find and download from my aws-s3-gzip-compression.py GitHub Gist. It does exactly what I suggested above: automated synchronization of changed files only. And not all files, only the ones that compress well: the script handles HTML, JavaScript, CSS and XML files. You can use the Makefile snippet above as a usage example for this script.
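
The Gist is the authoritative version; purely as an illustration, the surrounding directory walk could look roughly like this (the extension list and helper names are my assumptions, and sync_compressed() is the helper sketched earlier in this post):

import os
import shutil

COMPRESSIBLE = {".html", ".js", ".css", ".xml"}

def compress_site(output_dir, publication_dir):
    """Walk the generated site, gzip text-like files that changed, copy the rest."""
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(publication_dir, os.path.relpath(src, output_dir))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            if os.path.splitext(name)[1].lower() in COMPRESSIBLE:
                sync_compressed(src, dst)  # compare-then-compress helper from above
            else:
                # Images and other binaries go to the publication dir unchanged.
                shutil.copy2(src, dst)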

Now about synchronization with the S3 bucket. There's a tool called s3cmd that is easy to install. As mentioned earlier, it synchronizes a local directory with an S3 bucket based on MD5 hashes and object sizes. The important part for compression is the Content-Encoding: gzip HTTP header, which tells the browser to decompress the page before rendering it. With the --include and --exclude options I specify that only HTML, JavaScript, CSS and XML files get this header. Since it's a publicly available website, the --acl-public option makes the objects readable anonymously.

Defining the character set in the HTTP header helps the browser render pages faster; it's even recommended to specify it there rather than in a meta tag inside the page. Another rendering acceleration approach is the Cache-Control header. It tells the browser that once this HTTP resource (S3 object) has been loaded, it can be cached, and any access to it within max-age will be served from the local cache instead of the server. That is a great win if you have large CSS and JavaScript files.
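
A quick way to verify that the headers actually arrive is to request a couple of objects and print the relevant response headers; the endpoint and paths below are placeholders for your own bucket website.

from urllib.request import urlopen

# Placeholder endpoint; substitute your bucket's website URL and real paths.
site = "http://www.example.com.s3-website-us-east-1.amazonaws.com"

page = urlopen(site + "/index.html")
print(page.headers.get("Content-Encoding"))  # expected: gzip
print(page.headers.get("Content-Type"))      # expected: text/html; charset=utf-8

style = urlopen(site + "/theme/css/main.css")
print(style.headers.get("Cache-Control"))    # expected: max-age=86400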
