Copied to clipboard

Flag this post as spam?

This post will be reported to the moderators as potential spam to be looked at


  • Bradley Kronson 58 posts 109 karma points
    May 27, 2014 @ 14:53
    Bradley Kronson
    0

    Feature request: internal storing of urls hashed ?

    We have a large web site which has a lot of legacy urls, many of which are stored in RSS feeds. The URL tracker quickly runs up in number of rows, this week it hit over 1.5 million rows - partially of course due to no-one doing the obvious and fixing the links in the admin interface.

    The large number of rows has a huge performance impact on the site - SQL Server was averaging 40% of the CPU usage, mainly due to queries hitting the URL Tracker table - there are no indexes and there can't be due the size of the URL string fields.

    What I wanted to suggest is maybe storing an extra table field of the url being added as a hash of it - this way it can be indexed and the queries can should increase quite significantly in speed- what do you think? You could still store the original urls for reference and display, but any searches based on the data would use the hash value instead

  • Stefan Kip 1578 posts 4013 karma points c-trib
    Jun 04, 2014 @ 17:14
    Stefan Kip
    0

    Sounds like a good suggestion to me... Haven't done it before. You'd like to help out?

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 08:01
    Bradley Kronson
    0

    Sure :)

  • Stefan Kip 1578 posts 4013 karma points c-trib
    Jun 05, 2014 @ 10:28
    Stefan Kip
    0

    How? Have you done this in the past?

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:05
    Bradley Kronson
    0

    Creating the hash is pretty simple - SQL Server (CHECKSUM) can be used to do this when adding the rows (returns an int)

    http://msdn.microsoft.com/en-us/library/ms189788.aspx

    So (in my head)
    - add a new field which contains the hash which gets generated when saving the records to the table.
    - add an index to the field
    - change all existing code in the package to use the hash for searching instead of the URL (indexed based searching on the INT hash would be materially faster than a row by row scan an dcomparison of the table for the URLs which is what is happening currently)

  • Stefan Kip 1578 posts 4013 karma points c-trib
    Jun 05, 2014 @ 11:14
    Stefan Kip
    0

    How about just adding an index to the OldUrl field? That might give a nice performance boost (I'd have to test performance to know)...?

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:24
    Bradley Kronson
    0

    multiple hashes or hashes based on more than one field can also be created if different types of searching is required

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:28
    Bradley Kronson
    0

    I tried that :)

    Problem is the fields are too big. SQL has a maximum size of 900 bytes for a field to be indexed. With the URLs stored as nvarchars, each character uses up two bytes, so you would need to drop the maximum length of the URLs from 1000 to 449 characters - this could cause issues with the URLs then being too long to store in the field....

  • Stefan Kip 1578 posts 4013 karma points c-trib
    Jun 05, 2014 @ 11:29
    Stefan Kip
    0

    Ah okay, well I'll try this when I've got some spare time and do some performance tests. Thanks for the help!

  • Stefan Kip 1578 posts 4013 karma points c-trib
    Jun 05, 2014 @ 11:32
  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:42
    Bradley Kronson
    0

    The only thing I would say there is that CHECKSUM might be better than HASHBYTES because you are storing INTs vs VARBINARY and searching and indexing INTs should be faster than a VARBINARY

    Good luck, looking forward to seeing it, shout if you want any assistance :)

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:43
    Bradley Kronson
    0

    I suppose its a trade off between the reliability of the unique hashes and performance....

  • Stefan Kip 1578 posts 4013 karma points c-trib
    Jun 05, 2014 @ 11:44
    Stefan Kip
    0

    For performance CHECKSUM is absolutely better, but CHECKSUM(N'google.com') and CHECKSUM(N'g-oogle.com') both give the same result, which is wrong in this situation.

  • Bradley Kronson 58 posts 109 karma points
    Jun 05, 2014 @ 11:55
    Bradley Kronson
    0

    Yup :)

  • Stefan Kip 1578 posts 4013 karma points c-trib
    Jun 06, 2014 @ 17:35
    Stefan Kip
    0

    I just executed some tests with the addition of this pull request with an index on the icUrlTracker table.
    The results are awesome! I had a clean table and added 2.000.000 404 records. A request for a non-existing URL took 70ms with no data.
    After adding those 2.000.000 records, the load time of the page went up to ~ 800ms, which sucks :D
    After adding the index as suggested in the PR, the load time went back to ~ 70ms. That's an awesome performance increase imo.
    So I'll release a new version with this index and call it a day :-)

  • Bradley Kronson 58 posts 109 karma points
    Jun 06, 2014 @ 19:48
    Bradley Kronson
    0

    BRILLIANT news, thanks for taking the time to check it out and implement, looking forward to upgrading! :D

Please Sign in or register to post replies

Write your reply to:

Draft