Performance
I just wondered if anyone has any performance metrics they would like to share with regard to the time taken to import content using CMSImport.
I know the source data and the hardware will affect this but I am trying to get an idea for the time to import ~10000 records, each with about 20-30 fields.
Would you consider it sufficient to test the import with the free version (500 records) and then multiply the result up, or does performance degrade as the quantity increases?
tia
Jay
Hi Jay,
The performance will degrade as the quantity increases. Importing 10,000 records has been done before, but it will take some time (approx. an hour; if you import into a single folder it might take even longer) and you need to tweak some config settings. These are documented in the manual: http://www.cmsimport.com/documentation.aspx
I have thought about writing my own DataLayer in the past, but if Umbraco then changes something in their DB schema, CMSImport could cause strange issues.
The current version is also compatible with Umbraco v4.0, but the next version will be compatible with 4.5+ only. That means I can use the optimized mode of a document, and I must say that's much faster. I don't have exact numbers yet.
Cheers,
Richard
Hi Richard
Can you please point to where in the manual there is mention of performance tuning for imports? I've read through and can't find anything.
I am importing 3000 nodes, but (and I think this is key) with *related media*. I've had to split them down into 250-record batches because of the time involved, and they're still taking 20 minutes-ish each. For a one-time import, this is painful but doable; for automated future updates this will be problematic. Is there anything I can do on my end to extend timeouts and (preferably) speed up imports?
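To make the batch arithmetic concrete, here is a minimal Python sketch of splitting a record list into fixed-size batches. The helper name `batches` is hypothetical and not part of CMSImport; the figures (3000 records, 250 per batch, roughly 20 minutes each) are the ones quoted in this post.

```python
def batches(records, size):
    """Yield consecutive slices of `records`, each holding at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Figures from the post: 3000 nodes in 250-record batches, about 20 minutes each.
records = list(range(3000))          # stand-in for the real source rows
chunks = list(batches(records, 250))
total_minutes = len(chunks) * 20     # rough total if every batch takes ~20 minutes
```

With these numbers the list splits into 12 batches, so the whole run comes to roughly four hours.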
Thanks
Jonathan
Hi,
The good news is that it will not time out, since I increase the timeout at the start of the import. Also, updates to a given node are imported faster than the initial import. And for media, if the item already exists, the import will use the existing reference.
There are a few tricks you can use. I thought I had added them to the manual; I will do so for 2.0.
The first is to disable the cache update on every action. You can do this by setting ContinouslyUpdateXmlDiskCache to false in UmbracoSettings.config:
<!-- Update disk cache every time content has changed -->
<ContinouslyUpdateXmlDiskCache>False</ContinouslyUpdateXmlDiskCache>
It's also good not to have many nodes underneath one node. If all 3000 nodes are stored underneath one root node, it might come in handy to use a DateFolder or AlphabetFolder package to structure the nodes automatically.
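As an illustration of the idea only (this is not how the DateFolder or AlphabetFolder packages are implemented), the following Python sketch groups records into alphabet or year/month buckets so that no single parent ends up holding every node. All function names here are hypothetical.

```python
from collections import defaultdict
from datetime import date


def alphabet_folder(name):
    """Return the uppercase first letter of `name`, or '#' if it does not start with a letter."""
    first = name.strip()[:1].upper()
    return first if first.isalpha() else "#"


def date_folder(d):
    """Return a 'YYYY/MM' folder path for a date."""
    return "%04d/%02d" % (d.year, d.month)


def group_by_folder(records, key):
    """Group records under the folder name computed by `key`."""
    folders = defaultdict(list)
    for record in records:
        folders[key(record)].append(record)
    return dict(folders)


by_letter = group_by_folder(["apple", "Avocado", "banana", "42nd"], alphabet_folder)
```

Here `by_letter` holds three folders ("A", "B" and "#") instead of one flat list of four nodes.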
If you have set autopublish to true, you can also set XmlCacheEnabled to false. This prevents the XML cache file from being written over and over again on every publish. The cost is a slightly slower site startup, since the XML file is normally used during startup to load the nodes.
<!-- Enable / disable xml content cache on disk, only needed for faster startup time-->
<XmlCacheEnabled>False</XmlCacheEnabled>
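The two switches above could be flipped before an import and restored afterwards with a small script. The following is a minimal Python sketch, not part of CMSImport or Umbraco: the function name `set_flags` is hypothetical, and it assumes each setting appears in the config file as a simple element containing True or False, as in the snippets above. It works on the file's text rather than a parsed tree, so comments and layout are left untouched.

```python
import re

# Setting names as quoted in this thread (the spelling "Continously" is the setting's own).
IMPORT_FLAGS = ("ContinouslyUpdateXmlDiskCache", "XmlCacheEnabled")


def set_flags(config_text, value, names=IMPORT_FLAGS):
    """Return `config_text` with each named element's content set to 'True' or 'False'."""
    literal = "True" if value else "False"
    for name in names:
        pattern = re.compile(
            r"(<%s>)\s*(?:true|false)\s*(</%s>)" % (name, name), re.IGNORECASE
        )
        config_text, count = pattern.subn(
            lambda m: m.group(1) + literal + m.group(2), config_text
        )
        if count == 0:
            # Fail loudly rather than silently leaving a setting unchanged.
            raise KeyError("setting not found: " + name)
    return config_text
```

One possible use is to read the config file into a string, call `set_flags(text, False)` before the import and `set_flags(text, True)` after it, writing the result back each time and keeping a backup of the original file.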
If you have set publish to true, it's also best to use the latest Umbraco version, since it handles locking of the Lucene indexes much better and no weird exceptions are thrown.
That is basically it. In 2.0 I will drop support for Umbraco 4.0. Then I can use the optimized mode of the API, which will improve performance as well. Still, the API is a bit slow. This will be addressed in version 5 of Umbraco.
Please let me know if you have any additional questions.
Cheers,
Richard
Hi Richard
Thanks for all these tips - I'll give them a try - and sorry, Jay, for hijacking your thread (seemed to continue the discussion, though).
A question of clarification:
If I set XmlCacheEnabled to false, the XML cache is not updated when each node is published. Fine. So what are the implications? My understanding is that the Lucene indexer gets triggered on node publish (correct me if I'm wrong). Will the indexer still be called (i.e. does it index against the node on publish, or does it use the XML cache)?
And the XML cache is required to actually serve the site, right? So does the cache get created in its entirety on app start (if it hasn't been updated after each publish)? Or node by node when pages are requested by the client?
I don't have a very strong understanding of what the XML cache does / how it's used by the system.
Thanks again
Jonathan
Hi Jonathan,
No worries. The XmlCacheEnabled setting only affects how fast the site starts up. The internal cache is kept in memory, and when you publish, this internal cache is still updated. During app start, the cache is built from the database instead of the XML file. This is a bit slower, but it only happens once, and in your situation it will improve your import process.
Publish will still trigger the Lucene indexer; it just doesn't write the whole cache to disk.
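The behaviour described above can be pictured with a toy model. This Python sketch is purely illustrative and is not Umbraco code: the class `ToyContentCache` and all of its attributes are hypothetical stand-ins for the database, the in-memory cache, the XML file on disk and the indexer.

```python
class ToyContentCache:
    """Toy model: publish always updates memory and the index; disk writes are optional."""

    def __init__(self, database, xml_cache_enabled=True):
        self.database = database              # dict of node id -> content (stands in for the DB)
        self.xml_cache_enabled = xml_cache_enabled
        self.disk_file = None                 # stands in for the XML cache file on disk
        self.memory = {}                      # the in-memory cache that serves the site
        self.disk_writes = 0
        self.indexed = []                     # node ids handed to the indexer

    def start(self):
        """Fill the in-memory cache at app start and report which source was used."""
        if self.xml_cache_enabled and self.disk_file is not None:
            self.memory = dict(self.disk_file)
            return "disk"
        self.memory = dict(self.database)     # slower path, but it only runs once
        return "database"

    def publish(self, node_id, content):
        self.database[node_id] = content
        self.memory[node_id] = content        # the in-memory cache is always updated
        self.indexed.append(node_id)          # the indexer is always triggered
        if self.xml_cache_enabled:
            self.disk_file = dict(self.memory)  # whole cache rewritten on every publish
            self.disk_writes += 1
```

In this model, publishing many nodes with the flag off skips every disk write, while the in-memory cache and the index are still kept up to date; the only cost is that `start()` falls back to the database.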
Cheers,
Richard
Hi Richard,
I am facing an issue while importing data from Excel with a licensed copy of CMSImport,
in a traditional load-balanced environment (Umbraco v7.1.8).
The upload does not complete.
Half of the nodes are created and published, some are created but not published, and others are missing.
I need to know what effect these optimization settings will have in a load-balanced environment.
I also need help with updating to a newer version of the plugin.
Regards, Mayank Parekh.