  • Anthony Dang 1404 posts 2558 karma points MVP c-trib
    Dec 10, 2012 @ 14:01
    0

    Automated media upload to multiple Amazon S3 CDNs with local media fallback

    I have a specific requirement for a project regarding CDNs and media, and just wanted some input on how I think this may be implemented.

    Requirements...

    1. The client wants to be able to upload media from anywhere in the CMS, e.g. the Media section, DAMP, or the WYSIWYG editor.

    2. The media should be uploaded automatically to a CDN.

    3. The site is multi-region, hence the CDN we upload to must be region specific, i.e. if a folder in the media section is GB, then any media uploaded to that folder will be automatically uploaded to a specific CDN for GB. Yes, multiple CDNs, but only uploading to a specific one each time!

    4. If for some reason an automated CDN upload has failed, the editor/admin is notified.

     

    Proposed implementation...

    A concern is that if we're uploading a 20 MB PDF (or something larger) through DAMP, then there's going to be a broken link on the site for at least a few seconds while the file makes its way to the CDN. So what we think would be good is a field on the media item called "CDN Path".

    If the "CDN Path" is empty, then the rendered link is for the local media.
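
    The fallback rule above could be sketched roughly like this. It is a minimal Python illustration, not Umbraco code: `MediaItem`, `local_path`, `cdn_path` and `media_url` are all hypothetical names standing in for the media item and its proposed "CDN Path" field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaItem:
    """Hypothetical stand-in for an Umbraco media item."""
    local_path: str                  # e.g. "/media/1234/brochure.pdf"
    cdn_path: Optional[str] = None   # the proposed "CDN Path" field


def media_url(item: MediaItem) -> str:
    """Return the CDN URL once it has been set, otherwise the local media path."""
    cdn = (item.cdn_path or "").strip()
    return cdn if cdn else item.local_path
```

    Until the background upload writes a value into the field, links keep pointing at the local file, so nothing is broken in the meantime.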

    Regarding uploading...

    We don't want the file to be uploaded to the CDN in the same request, so I'm thinking of using Webhooks for Umbraco to push notifications to a console app.

    The console app will take in the CDN credentials, the file system path of the media file to upload, and the media id. When the console app finishes uploading to the CDN, it will push the CDN path back to the Umbraco media item.

    Umbraco will listen using either WebApi, an extended MediaService, or possibly Darren's Media Service Package.

    If the CDN upload fails, it will be retried a number of times (eg. 5 times). If it still fails, then an email is sent to the editor/admin with the logged errors.
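
    As a rough sketch of that retry-then-notify behaviour (Python for brevity; `upload_with_retry`, `upload` and `notify` are hypothetical names, and the real upload and email calls are left to the caller):

```python
import time
from typing import Callable, List, Optional


def upload_with_retry(
    upload: Callable[[], str],
    notify: Callable[[List[str]], None],
    attempts: int = 5,
    delay_seconds: float = 0.0,
) -> Optional[str]:
    """Try the upload up to `attempts` times; on total failure pass the logged errors to `notify`."""
    errors: List[str] = []
    for attempt in range(1, attempts + 1):
        try:
            # Assumed to return the CDN path on success and raise on failure.
            return upload()
        except Exception as exc:
            errors.append(f"attempt {attempt}: {exc}")
            if attempt < attempts and delay_seconds > 0:
                time.sleep(delay_seconds)
    # Every attempt failed: hand the collected errors to the notifier (e.g. an email sender).
    notify(errors)
    return None
```

    Returning `None` on failure means the "CDN Path" field simply stays empty, so the site keeps serving the local file.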

    The path and credentials for the individual CDNs will be stored in a configuration content node under the relevant region. So when an upload to the media section occurs, we traverse up the media tree to find a region root (e.g. GB), go to the content tree with the same root name (GB), then get the values from a configuration node in that tree.
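
    The lookup could look something like the sketch below. It assumes the region settings have already been read out of the content tree into a plain dictionary keyed by region name; `MediaNode`, `find_region_root` and `cdn_settings_for` are hypothetical names, not Umbraco APIs.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MediaNode:
    """Hypothetical media tree node with a link to its parent folder."""
    name: str
    parent: Optional["MediaNode"] = None


def find_region_root(node: MediaNode, regions: Dict[str, dict]) -> Optional[str]:
    """Walk up from the uploaded item until a folder named after a known region is found."""
    current: Optional[MediaNode] = node
    while current is not None:
        if current.name in regions:
            return current.name
        current = current.parent
    return None


def cdn_settings_for(node: MediaNode, regions: Dict[str, dict]) -> Optional[dict]:
    """Return the CDN settings of the region the node lives under, or None if there is none."""
    region = find_region_root(node, regions)
    return regions[region] if region is not None else None
```

    Media uploaded outside any region folder gets `None` back, which would naturally map to "no CDN, serve locally".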

     

    Does all this sound reasonable? Is there a better/cleaner way to achieve what we're aiming for?


  • Nic Wise 51 posts 85 karma points
    Dec 10, 2012 @ 16:11
    2

    Some thoughts:

    1. Get a proper CDN :) CloudFront should do transparent pass-through, i.e. you make the user hit http://cdn.yoursite.com/foo.jpg. That's a CNAME to the CDN host. When the user connects to that, if the CDN doesn't have the file, it just talks back to your source server, gets it, and caches the result.

    No FTPing stuff around. We've done both here, e.g. http://takethat.com has a hook on all media saving (uploads and the media folder) and FTPs stuff up to a remote server. http://www.universal-music.co.jp uses the "passthru" model, which is SO SO much easier.

    Cloudfront may look more expensive than just a "plain S3 bucket", but the cost per year difference is likely to be the same as your day rate. Think about that for a second :)

    FTR, we use a CDN which is provided by our host. I think they just have a box or VM in various data centers around the world, and leverage off a few others. You could do the same with a copy of squid/varnish/haproxy running in different AWS zones/DCs, and some DNS magic (Route 53 would be a good start for this). But then you've just recreated CloudFront, so why bother? If you want to go outside of AWS, there are lots of other places which do it too. Akamai is the most famous, but also the most insanely expensive. CacheFly is another that comes to mind (as they sponsor a number of podcasts I listen to).

    2. If the link is broken for 30 seconds on upload, but it's saved you 3-4 days of work, explain the trade-off to the customer. We did. A lot of customers are happy to save a few thousand pounds where they can :) If it's a new article, how many people will actually read it in the first 30 seconds? Can you get them to upload it, save it (and hence the CDN moving thing kicks in), THEN publish it?

    3. You could also write a "cdn sync" service (guess what ours is called?). This just uses a .NET file watcher to watch a set of folders, and if it finds new files or changes, throws the file names into a queue. Another thread picks items off the queue and moves the files to the appropriate remote locations. We use this with UNC paths and FTP servers. It's a lot cleaner and easier to debug than using an Umbraco hook. We use this to keep the media etc. folders in sync between the editing instance of the site and the (usually 2-3) public-facing front ends.
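
    A rough sketch of that watcher-plus-queue shape, in Python rather than .NET. Note the assumptions: a simple polling scan stands in for the .NET file watcher, a local folder copy stands in for the UNC/FTP transfer, and `scan_for_changes` and `sync_worker` are hypothetical names.

```python
import os
import queue
import shutil
import threading
from typing import Dict


def scan_for_changes(folder: str, seen: Dict[str, float], pending: "queue.Queue") -> None:
    """Queue every file under `folder` that is new or modified since the last scan."""
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if seen.get(path) != mtime:
                seen[path] = mtime
                pending.put(path)


def sync_worker(pending: "queue.Queue", source_root: str, target_root: str) -> None:
    """Copy queued files to the target location until a None sentinel arrives."""
    while True:
        path = pending.get()
        if path is None:
            break
        relative = os.path.relpath(path, source_root)
        destination = os.path.join(target_root, relative)
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        shutil.copy2(path, destination)
```

    In use, one thread would call `scan_for_changes` on a timer while `threading.Thread(target=sync_worker, args=(pending, source, target))` drains the queue, so a slow transfer never blocks change detection.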

    4. Just have one cdn. Keep it simple.

    Our setup now is very much like this:

    https://d2868cy5s1ejmq.cloudfront.net/Cloudfront-Diagram_Website_Updated.jpeg

    But we have 2 distinct domains: http://www.universal-music.co.jp (the main site, no CDN) and http://japan.cdn.umgi.net/ (the CDN). The main site is like any other - no CDN there at all. The CDN is like I described above, and we have a specific local node in SE Asia (Singapore I think) for the Japanese market - but we have nodes in the UK, Holland, 2 in the US and a few others too.

    We have a helper function (you can find it here: http://fastchicken.co.nz/2011/12/22/christmas-css-and-cdn-fun/), and we just embed media in a wrapper:

    <link rel="stylesheet" type="text/css" href="<%= CDNHelper.WrapUrl("/css/base.css") %>" />

    This function just adds a (web.config) setting to the front, so in production, that becomes

    http://cdn-jp.umgi.net/css/base.css?20121210

    And in dev, it stays as-is.
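
    For anyone who doesn't want to click through, the idea of such a wrapper can be sketched as below. This is a Python illustration of the behaviour described, not the actual CDNHelper code; `wrap_url`, `cdn_host` and `cache_key` are hypothetical names, with the two settings assumed to come from configuration.

```python
def wrap_url(path: str, cdn_host: str = "", cache_key: str = "") -> str:
    """Prefix a site-relative path with the CDN host and append the cache key.

    With no CDN host configured (e.g. in dev) the path is returned unchanged.
    """
    if not cdn_host:
        return path
    url = cdn_host.rstrip("/") + "/" + path.lstrip("/")
    if cache_key:
        separator = "&" if "?" in url else "?"
        url += separator + cache_key
    return url
```

    So `wrap_url("/css/base.css", "http://cdn-jp.umgi.net", "20121210")` gives the production URL shown above, and `wrap_url("/css/base.css")` leaves the path alone.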

    The only gotcha is that you must change the FULL URL if you change the content of a file, e.g. if you have /docs/foo.pdf and you upload a new one, it either needs to be called /docs/foo2.pdf or /docs/foo.pdf?<keygoeshere> (20121210 in the case above), and you need some way to rotate the key (we keep it either in the web.config or in the database). The key is per site in our case, which means we expire everything or nothing, but you could easily generate it from an MD5 hash of the file content, have one key for your user-uploaded stuff and one for your static content, or something. Just don't do the MD5 on each page request :)

    Chuck in some more info about your setup - where is the server physically? What kind of budget do you have? How many downloads do you expect on these CDN-hosted files? Why are you splitting it up by country in the first place? Is it because of legal reasons (EU data must be hosted in the EU etc)? etc. More info == better. Unless you are looking at some fairly decent traffic, or if your main market is a long way from your hosting, you may not NEED a CDN.

  • Anthony Dang 1404 posts 2558 karma points MVP c-trib
    Dec 10, 2012 @ 17:19
    0

    Cheers Nic.

    http://cdn.yoursite.com/foo.jpg is exactly what we need!

    "We've done both here - eg http://takethat.com has a hook on all media saving (uploads and the media folder) and FTP's stuff up to a remote server. http://www.universal-music.co.jp uses the "passthru" model, which is SO SO much easier."

    By hook do you mean web hook? 

     

    More info...

    The project is to build a framework for a company that owns over 100 sub-companies (brands). Each brand has a presence in many regions. The purpose of the build is to give each brand a ready-to-go platform (ultimate uber starter kit to end all starter kits) to build their own multi-region site. The requirements are very specific. They want the platform to be solid and have ready-to-go features. That way, when the parent company gives the platform to a brand, the brand can give it to any agency on the planet. This ensures that each brand is running the same version of Umbraco and has the same core features and packages.

     

    where is the server physically?

    Many hosting environments, all over the world. I just found out that many of these brands will host their sites on the same infrastructure - this will probably be on a SAN with web heads and a cms server. So a file system watcher would require a different watcher for each multi-region site (there could be dozens), or a single one which needs to be configured every time a new site is made. However the latter is not an option at all as each brand will manage their own site development.

     

    What kind of budget do you have?

    Cost (in my understanding) is not really a big concern, as this platform will unify the way the organisation does things, and ultimately save a hell of a lot of money.

     

    How many downloads do you expect on these CDN-hosted files?

    No idea at all. However it depends on the local markets and what campaigns they run. 

     

    Why are you splitting it up by country in the first place?

    As part of it being multi-region, we were told that the structure of the brands (companies) is such that the billing (yes, billing) is to be spread among the regional markets. This is the purpose of the multiple CDNs. Each region pays for its own CDN. It's an interesting business requirement, but not all that surprising.

     

    Amazon is what we've been instructed to supply functionality for. I'm not sure if it's S3 or CloudFront. Is there a real difference in difficulty or weirdness of implementation?

     

     


  • Ian Smedley 97 posts 191 karma points
    Dec 10, 2012 @ 17:26
    0

    +1 for using cloudfront as a 'front' to your website - it seems so much easier, no uploading to do.

    CloudFront respects any cache headers you send, with 24 hours being the default when none are set. That's useful if there are items which you know will update regularly, but useless for items that will change irregularly.

    Recently I built something that gets dynamically updated every minute (an image to go in an E-mail) - as long as the header is set to expire every minute cloudfront should 'protect' the generating servers, that's my plan - it seems like a really good and easy to use system.
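
    A minimal sketch of the kind of headers I mean, assuming the origin sets them per response (Python for illustration; `cache_headers` is a hypothetical helper, and how your distribution's TTL settings interact with these headers is worth checking in the CloudFront docs):

```python
import time
from email.utils import formatdate
from typing import Dict, Optional


def cache_headers(max_age_seconds: int, now: Optional[float] = None) -> Dict[str, str]:
    """Build response headers telling caches to keep the response for `max_age_seconds`."""
    now = time.time() if now is None else now
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}",
        # Expires is the older equivalent, formatted as an HTTP date in GMT.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }
```

    For the one-minute image that would be `cache_headers(60)`.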

    Couple that with the above wrapping methods, and I think I might use this on all new major website projects!

    If something isn't urgent to update (modify), I guess you could use an app that monitors file changes (like point 3) and then send an invalidation request - though the invalidation request may take as long as 15 minutes (in my experience it does take this long!) - so perhaps adding a querystring to the end of the media item would work best (as long as you include this in the cloudfront setup)

     

  • Nic Wise 51 posts 85 karma points
    Dec 10, 2012 @ 17:44
    0

    Tony: yes, I meant webhook. Well, Umbraco's API.... AfterPublish and all those.

    Sounds like the base "starter kit" would be good, and then have them do templates, css, js etc. If you can keep them out of the doc types, that'll make your life easier (or rather, just "adding", not moving or removing)

    If you build the front-end CDN thing in from day 1, then it's easy - they can pick the provider they want, or just leave it blank and serve it off the main server. Or if you have 100 sites, but they are somewhat low traffic, just serve the same static content off a different webhead....

    "Campaigns": oh, so email is possible :) Then yes, a CDN, 'cos you get nothing for a week, then 50-100k hits in 10 mins. It's freaking insane - almost took down our servers the first time the Japanese did it without the CDN!

    "they pay their own": then, as you already have separate sites for each brand, just have the CDN configuration in web.config, and they can use their own CDN setup, on their own AWS account (or whatever they like), and pay for it with their own credit card :)

    Might be different if you have one Umb instance for all of them, but if it's 100 instances for 100 brand sites, then.... easy(er).

  • Anthony Dang 1404 posts 2558 karma points MVP c-trib
    Dec 10, 2012 @ 23:58
    0

    There are 100+ brands, so there will be 100+ Umbraco installations on various infrastructure. Each install will have 10+ regions.

    I probably should have mentioned that we've already built the platform/uber starter kit. All that's left in this phase is the CDN. How (and if) the separate brands use the CDN functionality is up to them. They just want it to magically work with no configuration at all. Just entering CDN credentials. Each region will take care of the entering of their credentials so web.config is not an option.

    Ian, they want everything to be instant as the different brands will be building all types of stuff. The consideration of lag time will not be acceptable to some of the builds. I'll have to find out if the client uses s3 or cloudfront.

    Suffice it to say, this is one of the most interesting builds I've worked on. Lots of "no that won't work", and lots of gotchas :)

     

  • Ian Smedley 97 posts 191 karma points
    Dec 11, 2012 @ 10:44
    0

    Cool, it looks like the best method would be to use CloudFront to become cdn.yoursite.com like Nic suggests. Through the API I'm sure you could programmatically create a new distribution if one hadn't been set up before.

    The advantage is that new files are instant, because if the CloudFront endpoint doesn't have your file in its local cache, it will grab it directly from your site, store it in the local-region cache, and serve the file to the user!
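
    To make the pull-through behaviour concrete, here is a toy model of a single edge cache. It is purely illustrative and has nothing to do with CloudFront's actual implementation; `PullThroughCache` and `fetch_from_origin` are hypothetical names.

```python
from typing import Callable, Dict


class PullThroughCache:
    """Toy edge cache: fetch from the origin on a miss, serve the stored copy afterwards."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.origin_hits = 0

    def get(self, url: str) -> bytes:
        if url not in self._store:
            # Cache miss: go back to the source site once, then remember the result.
            self.origin_hits += 1
            self._store[url] = self._fetch(url)
        return self._store[url]
```

    The cache is keyed on the full URL, so a brand new file is a miss and is fetched straight away, while a changed file under an unchanged URL keeps being served from the stored copy.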

    The only issue now would be invalidating, or updating a file with the same name. Perhaps the CDNHelper.WrapUrl( ) function could be aware of the file's last modified status, or a hash of the file, and append an appropriate query string? Instantly it'll be a new file that CloudFront doesn't have, and it'll be forced to grab the new file from your site.

    S3 is just storage; it's not a content delivery network. Traditionally you might store your files in S3 and then either serve them directly from there (maybe for low-bandwidth or one-off files) or put CloudFront in front to serve them from a CDN. But now you can use CloudFront with your own site acting as the main storage, which is a bit easier when working with CMS sites, as there is no additional file management to worry about.
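
    A last-modified variant could be as small as the sketch below. It assumes the URL helper can see the file on disk, and that the CDN is set up to treat the query string as part of the cache key; `versioned_url` is a hypothetical name.

```python
import os


def versioned_url(url_path: str, file_path: str) -> str:
    """Append the file's last-modified time (whole seconds) as a cache-busting query string."""
    version = int(os.path.getmtime(file_path))
    separator = "&" if "?" in url_path else "?"
    return f"{url_path}{separator}v={version}"
```

    Re-uploading the file changes its modified time, which changes the URL, so the CDN sees it as a file it has never fetched.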

  • Nic Wise 51 posts 85 karma points
    Dec 11, 2012 @ 12:15
    0

    Anthony: CDNs bring lag, i.e. you change something, and it takes a while to get out there unless you cache-bust the URL (i.e. change the URL). That's part of how they work, if they are working properly.

    Personally, I'd put it in the docs: put a couple of config settings in web.config (or some other config file you can change from the Umbraco UI?), and have them leave it blank (no CDN) while they are building the site. Once it goes live, then put the CDN stuff in. Otherwise, they are going to want to kill you 'cos their CSS will never update :)
