We have a large website with a lot of legacy URLs, many of which are stored in RSS feeds. The URL Tracker table grows quickly; this week it passed 1.5 million rows - partly, of course, because no one is doing the obvious and fixing the links in the admin interface.
The large number of rows has a huge performance impact on the site - SQL Server was averaging 40% CPU usage, mainly due to queries hitting the URL Tracker table. There are no indexes, and there can't be, due to the size of the URL string fields.
What I wanted to suggest is storing an extra table field containing a hash of the URL being added. That field can be indexed, so queries should get significantly faster - what do you think? You could still store the original URLs for reference and display, but any searches against the data would use the hash value instead.
Sounds like a good suggestion to me... Haven't done it before.
You'd like to help out?
How? Have you done this in the past?
Creating the hash is pretty simple - SQL Server's CHECKSUM function (which returns an int) can be used to do this when adding the rows: http://msdn.microsoft.com/en-us/library/ms189788.aspx
So (in my head):
- Add a new field containing the hash, generated when saving records to the table.
- Add an index to that field.
- Change all existing code in the package to search on the hash instead of the URL. An indexed search on the INT hash would be materially faster than the row-by-row scan and comparison of the URL strings that is happening currently.
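The steps above could be sketched in T-SQL roughly like this. This is only a sketch: the table name icUrlTracker and column name OldUrl are taken from elsewhere in this thread, and the hash column name is made up for illustration.

```sql
-- 1. Add a persisted computed column holding the CHECKSUM of the URL
--    (OldUrlHash is a hypothetical name).
ALTER TABLE icUrlTracker
    ADD OldUrlHash AS CHECKSUM(OldUrl) PERSISTED;

-- 2. Index the INT hash column; this is cheap because the key is 4 bytes.
CREATE INDEX IX_icUrlTracker_OldUrlHash
    ON icUrlTracker (OldUrlHash);

-- 3. Search on the hash first, then compare the full URL to rule out
--    CHECKSUM collisions; the index narrows the scan to a handful of rows.
SELECT *
FROM icUrlTracker
WHERE OldUrlHash = CHECKSUM(N'/some/legacy/path')
  AND OldUrl = N'/some/legacy/path';
```

The final equality check on OldUrl matters because CHECKSUM is not collision-free; the index does the heavy lifting and the string comparison only runs on the few candidate rows.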
How about just adding an index to the OldUrl field? That might give a nice performance boost (I'd have to test performance to know)...?
Multiple hashes, or hashes based on more than one field, can also be created if different types of searching are required.
I tried that :)
Problem is the fields are too big. SQL Server has a maximum index key size of 900 bytes. With the URLs stored as nvarchar, each character uses two bytes, so you would need to drop the maximum length of the URLs from 1000 to 450 characters - this could cause issues with URLs then being too long to store in the field...
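To illustrate the limit described above (column sizes assumed for illustration):

```sql
-- An nvarchar(1000) column can occupy up to 2000 bytes, well over the
-- 900-byte index key limit: SQL Server warns at CREATE INDEX time and
-- errors on any insert whose key actually exceeds 900 bytes.
CREATE INDEX IX_icUrlTracker_OldUrl ON icUrlTracker (OldUrl);

-- Only by shrinking the column to nvarchar(450) (450 * 2 = 900 bytes)
-- would the key always fit - at the cost of truncating longer URLs:
-- ALTER TABLE icUrlTracker ALTER COLUMN OldUrl nvarchar(450);
```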
Ah okay, well I'll try this when I've got some spare time and do some performance tests. Thanks for the help!
Btw, it's better to use HASHBYTES instead for Unicode strings.
The only thing I would say there is that CHECKSUM might be better than HASHBYTES, because you'd be storing an INT vs a VARBINARY, and searching and indexing INTs should be faster than VARBINARY.
Good luck, looking forward to seeing it, shout if you want any assistance :)
I suppose it's a trade-off between the reliability of unique hashes and performance...
For performance CHECKSUM is absolutely better, but CHECKSUM(N'google.com') and CHECKSUM(N'g-oogle.com') both give the same result, which is wrong in this situation.
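For comparison, a rough sketch of the HASHBYTES alternative; the computed column name is hypothetical, and the table/column names are taken from this thread:

```sql
-- HASHBYTES returns a cryptographic hash as varbinary - far less likely
-- to collide than CHECKSUM's int, but a wider index key (20 bytes for SHA1
-- vs 4 bytes for an int).
SELECT HASHBYTES('SHA1', N'google.com');  -- varbinary(20)
SELECT CHECKSUM(N'google.com');           -- int

-- HASHBYTES is deterministic, so it can also back a persisted,
-- indexable computed column (OldUrlSha1 is a hypothetical name):
ALTER TABLE icUrlTracker
    ADD OldUrlSha1 AS HASHBYTES('SHA1', OldUrl) PERSISTED;

CREATE INDEX IX_icUrlTracker_OldUrlSha1
    ON icUrlTracker (OldUrlSha1);
```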
I just executed some tests with the addition of this pull request with an index on the icUrlTracker table.
The results are awesome! I had a clean table and added 2.000.000 404 records. A request for a non-existing URL took 70ms with no data.
After adding those 2.000.000 records, the load time of the page went up to ~ 800ms, which sucks :D
After adding the index as suggested in the PR, the load time went back to ~ 70ms. That's an awesome performance increase imo.
So I'll release a new version with this index and call it a day :-)
BRILLIANT news, thanks for taking the time to check it out and implement, looking forward to upgrading! :D