  • ThisIsMeJKB 6 posts 99 karma points
    Nov 03, 2016 @ 23:05

    Undergoing a Massive Data Import: Thoughts on Tuning?

    Hey all (my first forum post),

    I'm in the midst of standing up a book/author catalog site using Umbraco 7.3.5. I've been given a massive amount of data, in the form of spreadsheets, to seed the database with, and by massive I mean well over 500k titles.

    I've used a combination of the ContentService, MediaService, and PetaPoco classes to start getting the data into Umbraco, but at the current speed it looks like it's going to take a long time (we're talking weeks, maybe) to seed the database.
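
    For reference, the core of my import loop looks roughly like this. It's a simplified sketch: the "book" document type alias, the property aliases, and the shape of the parsed rows are illustrative stand-ins, not my exact code.

        using System.Collections.Generic;
        using Umbraco.Core;
        using Umbraco.Core.Models;
        using Umbraco.Core.Services;

        public class BookImporter
        {
            private readonly IContentService _contentService =
                ApplicationContext.Current.Services.ContentService;

            // rows: one dictionary per spreadsheet row, already parsed.
            public void Import(int parentId, IEnumerable<IDictionary<string, string>> rows)
            {
                foreach (var row in rows)
                {
                    // Create an unsaved content node under the catalog parent.
                    IContent book = _contentService.CreateContent(row["title"], parentId, "book");
                    book.SetValue("title", row["title"]);
                    book.SetValue("isbn", row["isbn"]);

                    // raiseEvents: false skips the save/publish event handlers,
                    // which is where a lot of the per-node overhead lives.
                    _contentService.SaveAndPublishWithStatus(book, raiseEvents: false);
                }
            }
        }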

    My question is this: can any of the Umbraco superhumans out there who know the plumbing of the system give me some ideas/tips on how to "tune" this process and increase its speed?

    Here are a few things I have done already:

    1) Everywhere I save/publish a piece of content I have the "raiseEvents" parameter set to false (figured that might help).

    2) Examine is turned off completely. I have log4net configured to use the papertrailapp.com service, and the logs showed that after a couple of hours of the import running, Examine would choke and cause the application to restart (killing the import). I stumbled across a couple of threads on this forum about that, and I modified my config files so that nothing is indexed anymore, which helped a lot as well (see the config sketch after this list).

    3) Both the site/migration script and the DB are hosted in Microsoft Azure: the DB is a Standard S2 (50 DTU) SQL Azure instance, and the Umbraco installation is on a virtual machine, both in the same geo-region (East US). I've considered either mirroring the VM or just beefing it up to see if that helps, but I haven't done either yet.
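
    For item 2, the change was in ExamineSettings.config. It was roughly this; a sketch of the edit using the stock Umbraco 7 provider names, so check them against your own file:

        <!-- enableDefaultEventHandler="false" stops Umbraco from re-indexing
             content on every save/publish while the import runs. -->
        <ExamineIndexProviders>
          <providers>
            <add name="InternalIndexer"
                 type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
                 supportUnpublished="true"
                 supportProtected="true"
                 enableDefaultEventHandler="false" />
            <add name="ExternalIndexer"
                 type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
                 enableDefaultEventHandler="false" />
            <add name="InternalMemberIndexer"
                 type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine"
                 supportUnpublished="true"
                 supportProtected="true"
                 enableDefaultEventHandler="false" />
          </providers>
        </ExamineIndexProviders>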

    Is there anything else I could possibly be doing to increase the throughput?

    Any takers? =)

    Thanks! - JB

  • Damiaan 442 posts 1301 karma points MVP 6x c-trib
    Nov 04, 2016 @ 07:45

    I don't think it's the best idea to import 500k titles into Umbraco. Leave them in the existing store and pull them out at runtime in the controller.

    1+2) Don't publish every imported document. Only save, and publish at the end (see the sketch below).

    3) Azure SQL: pay more, get more performance. I know it's lame, and it's sad, but it's true: performance is limited by the tier you are paying for.
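
    Something like this; an untested sketch against the v7 ContentService API, with the "book" alias and the shape of the rows as placeholders for whatever your import script produces:

        using System.Collections.Generic;
        using System.Linq;
        using Umbraco.Core;
        using Umbraco.Core.Models;
        using Umbraco.Core.Services;

        public static class BulkImport
        {
            public static void SaveThenPublish(int parentId, IEnumerable<IDictionary<string, string>> rows)
            {
                IContentService cs = ApplicationContext.Current.Services.ContentService;

                // Save everything without events and without publishing...
                foreach (var row in rows)
                {
                    IContent book = cs.CreateContent(row["title"], parentId, "book");
                    book.SetValue("title", row["title"]);
                    cs.Save(book, raiseEvents: false);
                }

                // ...then publish the whole branch in one pass at the end
                // (ToList forces the publish results to be fully evaluated).
                var results = cs.PublishWithChildrenWithStatus(
                    cs.GetById(parentId), includeUnpublished: true).ToList();
            }
        }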

  • ThisIsMeJKB 6 posts 99 karma points
    Nov 04, 2016 @ 12:42

    Thanks for the response, Damiaan.

    The problem with leaving it in the "existing store" is that there isn't one; all of these titles are scattered across numerous spreadsheets. I've written a script that reads through all the files in a directory, parses the necessary info out of each file, and creates the corresponding content nodes in Umbraco.

    I've built many (much smaller) Umbraco sites in the past and assumed that dealing with this much data would pose a challenge, but Umbraco has too many other positives for me not to attempt using it (the caching and the searching/indexing being two of the big ones).

    To your point about not publishing: I thought about doing that (and I suppose I'll give it a try now), but I had some concerns about what attempting to publish that many documents at once might do.

    And in regards to Azure SQL, I did plan to play around with that some more today to see how it affects things.

    Do you know if it's possible to leverage the built-in Examine/Lucene engine with data in custom tables? Or would I pretty much need to create my own indexing and searching routines on top of the custom tables?
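
    From what I can find, Examine ships a SimpleDataIndexer that can index non-content data through an ISimpleDataService. Something like this sketch is what I have in mind; the Books table, its columns, and the field names are hypothetical:

        using System.Collections.Generic;
        using Examine;
        using Examine.LuceneEngine;
        using Umbraco.Core;

        // Feeds rows from a custom table into a Lucene index. The indexer itself
        // would be registered in ExamineSettings.config with its dataService
        // attribute pointing at this class.
        public class BookSimpleDataService : ISimpleDataService
        {
            public IEnumerable<SimpleDataSet> GetAllData(string indexType)
            {
                var db = ApplicationContext.Current.DatabaseContext.Database;
                foreach (var book in db.Query<dynamic>("SELECT Id, Title, Isbn FROM Books"))
                {
                    yield return new SimpleDataSet
                    {
                        NodeDefinition = new IndexedNode { NodeId = (int)book.Id, Type = indexType },
                        RowData = new Dictionary<string, string>
                        {
                            { "title", (string)book.Title },
                            { "isbn", (string)book.Isbn }
                        }
                    };
                }
            }
        }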

    Thanks again!

    -JB
