How to migrate blog posts from WordPress to Umbraco

LouisDeconinckEco 1 post 71 karma points
Sep 10, 2018 @ 12:04
How to migrate blog posts from WordPress to Umbraco?

At my job, I was asked to manually copy and paste posts from WordPress to Umbraco. This is very tedious work, so I was wondering whether there is an automated way to migrate blog posts from WordPress to Umbraco.
Alex Skrypnyk 6182 posts 24284 karma points MVP 8x admin c-trib
Sep 10, 2018 @ 12:35
Hi
Try the CMSImport package; it supports WordPress migration: https://our.umbraco.com/packages/developer-tools/cmsimport/
Alex
Alex Skrypnyk 6182 posts 24284 karma points MVP 8x admin c-trib
Sep 18, 2018 @ 12:23
Hi Louis
Did you solve the issue? Were you able to migrate the content?
Alex
Nambi Ramamoorthy 6 posts 115 karma points
Aug 09, 2019 @ 17:55
Hi Louis, were you able to do it?
Alex Skrypnyk, while importing, how can I create nodes in a hierarchy based on the post date, e.g. 2019/01/30? I need to create folders based on the post date. I used the CMSImport tool and tried ApplicationEventHandling, but I don't know where to add the node creation code.
Nicholas Westby 2054 posts 7104 karma points c-trib
Aug 09, 2019 @ 19:38
I typically just build an import tool that reads in the XML file that you export from WordPress. The import tool then just creates content nodes, and sometimes uploads media (though I tend to just copy the images over in the same folder structure so importing into the Umbraco media section is not necessary).
This import tool can be wherever you want. For example, in a controller (maybe a button click sends a web request to that controller), or in a Razor view (in which case you visit the page corresponding to that view to initiate the import).
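A minimal sketch of that approach, assuming the export is a standard WordPress WXR file that declares the usual `wp` and `content` namespace prefixes. It reads the published posts with LINQ to XML. `WxrPost` and `WxrReader` are hypothetical names used only for illustration, and creating the Umbraco nodes from the result is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical model for illustration; not part of Umbraco or WordPress.
public class WxrPost
{
    public string Title { get; set; }
    public string Html { get; set; }
    public DateTime PublishDate { get; set; }
}

public static class WxrReader
{
    // Returns the published posts found in a WordPress export (WXR) document.
    public static List<WxrPost> ReadPublishedPosts(string xml)
    {
        var doc = XDocument.Parse(xml);

        // Resolve the namespaces from the prefixes declared on the root element,
        // so the WXR version in the namespace URI does not need to be hard-coded.
        var wp = doc.Root.GetNamespaceOfPrefix("wp");
        var content = doc.Root.GetNamespaceOfPrefix("content");
        if (wp == null || content == null)
        {
            throw new InvalidOperationException("The 'wp' or 'content' namespace is not declared.");
        }

        return doc.Descendants("item")
            // Skip attachments, pages, and anything that is not published.
            .Where(x => (string)x.Element(wp + "post_type") == "post")
            .Where(x => (string)x.Element(wp + "status") == "publish")
            .Select(x => new WxrPost
            {
                Title = ((string)x.Element("title") ?? string.Empty).Trim(),
                Html = (string)x.Element(content + "encoded") ?? string.Empty,
                // Assumes wp:post_date uses the "yyyy-MM-dd HH:mm:ss" format.
                PublishDate = DateTime.ParseExact(
                    (string)x.Element(wp + "post_date"),
                    "yyyy-MM-dd HH:mm:ss",
                    CultureInfo.InvariantCulture)
            })
            .ToList();
    }
}
```

Calling `WxrReader.ReadPublishedPosts(File.ReadAllText(path))` from a controller action or a Razor view would give a list to loop over when creating the content nodes.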
Nicholas Westby 2054 posts 7104 karma points c-trib
Aug 09, 2019 @ 19:50
Here's some sample code in case it helps others. Note that I've replaced a few bits here and there to protect the innocent. Also, it's incomplete (e.g., it references classes that I'm not showing here). It's also very specific to the website we were building. Use it more as a point of reference than actual code you can copy/paste:
// Namespaces.
using Archetype.Models;
using Newtonsoft.Json;
using Rhythm.Core;
using Rhythm.Core.Enums;
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Web.Hosting;
using System.Xml.XPath;
using Umbraco.Core;
using Umbraco.Core.Logging;

/// <summary>
/// Parses the XML generated by a WordPress export.
/// </summary>
public class BlogXmlParser
{

    #region Constants

    private const string XmlFilePathSuffix = @"sample-temp\wordpress-export.xml";

    #endregion

    #region Properties

    private static bool DoneImporting = false;
    private static object ImportLock = new object();

    #endregion

    #region Public Methods

    /// <summary>
    /// Imports the blog articles from the XML file to Umbraco content nodes.
    /// </summary>
    public static void ImportBlogs()
    {
        // The import has already been run, so it is disabled here. Remove this
        // early return (it makes the code below unreachable) to run it again.
        return;
        if (DoneImporting)
        {
            return;
        }
        lock (ImportLock)
        {
            if (DoneImporting)
            {
                return;
            }
            DoneImporting = true;
            var blogs = ParseXmlFile();
            //var blogs = GetSampleBlogs();
            AddToUmbraco(blogs);
        }
    }

    #endregion

    #region Private Methods

    /// <summary>
    /// When testing, this can be used to get just a few blog articles.
    /// </summary>
    /// <returns>
    /// A few blog articles.
    /// </returns>
    private static IEnumerable<ArticleModel> GetSampleBlogs()
    {
        var blogs = ParseXmlFile();
        return blogs.Where(x => new[]
        {
            x.Excerpt,
            x.ImagePath,
            x.NodeName,
            x.Tags.Any()
                ? "Has Tags"
                : null,
            x.Text,
            x.Title
        }.All(y => !string.IsNullOrWhiteSpace(y)))
            .Take(5).ToArray();
    }

    /// <summary>
    /// Parses the XML file to create instances of ArticleModel.
    /// </summary>
    /// <returns>
    /// A collection of ArticleModel instances.
    /// </returns>
    private static IEnumerable<ArticleModel> ParseXmlFile()
    {
        var imageInfo = GetImages();
        var articles = new List<ArticleModel>();
        var articleItems = GetBlogXPathNodeIterator();
        foreach (var articleItem in articleItems)
        {
            var casted = articleItem as XPathNavigator;
            var title = casted.SelectSingleNode("title").Value.Trim();
            var content = casted.SelectSingleNode("*[name() = 'content:encoded']").Value;
            content = ConvertMarkdownToMarkup(content);
            // The excerpt element may be missing, so guard against a null node.
            var excerpt = casted.SelectSingleNode("*[name() = 'excerpt:encoded']")?.Value;
            excerpt = excerpt == null
                ? null
                : excerpt.Trim().Replace("\r\n", " ");
            var publishDate = DateTime.Parse(casted.SelectSingleNode("pubDate").Value);
            var link = casted.SelectSingleNode("link").Value;
            var imageId = casted.SelectSingleNode("*[name() = 'wp:postmeta'][./*[name() = 'wp:meta_key']/text() = '_thumbnail_id']/*[name() = 'wp:meta_value']")?.Value;
            var image = string.IsNullOrWhiteSpace(imageId)
                ? null
                : GetImageUrl(imageId, imageInfo);
            var nodeName = GetBestNodeTitle(title, link);
            content = CorrectImagePaths(content);
            articles.Add(new ArticleModel()
            {
                ImagePath = image,
                NodeName = nodeName,
                PublishDate = publishDate,
                Tags = GetTags(casted),
                Text = content,
                Title = title,
                Excerpt = excerpt
            });
        }
        return articles;
    }

    /// <summary>
    /// Stores the blog articles as Umbraco content nodes.
    /// </summary>
    /// <param name="articles">
    /// The blog articles.
    /// </param>
    private static void AddToUmbraco(IEnumerable<ArticleModel> articles)
    {
        var blogRootId = default(int?);
        //TODO: Uncomment this when running the import.
        //blogRootId = 2108;
        if (!blogRootId.HasValue)
        {
            throw new Exception("Need to set a blog root ID.");
        }
        articles = articles.OrderBy(x => x.NodeName).ToArray();
        var contentService = ApplicationContext.Current.Services.ContentService;
        var articleCount = articles.Count();
        foreach (var article in articles)
        {
            var articleNode = contentService
                .CreateContentWithIdentity(article.NodeName, blogRootId.Value, "blogArticle");
            var properties = GetNodeProperties(article);
            foreach (var property in properties)
            {
                articleNode.SetValue(property.Key, property.Value);
            }
            contentService.SaveAndPublishWithStatus(articleNode);
            articleCount--;
            LogHelper.Info<BlogXmlParser>($@"Imported a blog article, ""{article.Title}"". {articleCount} to go.");
        }
    }

    /// <summary>
    /// Returns the Umbraco node properties for a blog article.
    /// </summary>
    /// <param name="article">
    /// The blog article.
    /// </param>
    /// <returns>
    /// The property values, stored in a dictionary with the key being the property alias.
    /// </returns>
    private static Dictionary<string, object> GetNodeProperties(ArticleModel article)
    {
        var richHeader = string.IsNullOrWhiteSpace(article.Title)
            ? null
            : $"<p>{WebUtility.HtmlEncode(article.Title)}</p>";
        var tags = (article.Tags ?? new List<string>()).ToArray();
        var serializedTags = JsonConvert.SerializeObject(tags);
        return new Dictionary<string, object>()
        {
            { "metaDescription", article.Excerpt },
            { "header", richHeader },
            { "tags", serializedTags },
            { "releaseDate", article.PublishDate },
            { "mainContent", GetArticleWidgets(article.Text) },
            { "image", GetImageArchetype(article.ImagePath, article.Title) }
        };
    }

    /// <summary>
    /// Returns the archetype to use for an image on a blog article.
    /// </summary>
    /// <param name="url">
    /// The article's image URL.
    /// </param>
    /// <param name="altText">
    /// The alt text to use for the image.
    /// </param>
    /// <returns>
    /// The Archetype model, serialized as a string.
    /// </returns>
    private static string GetImageArchetype(string url, string altText)
    {
        var model = new ArchetypeModel()
        {
            Fieldsets = new List<ArchetypeFieldsetModel>()
            {
                new ArchetypeFieldsetModel()
                {
                    Alias = "legacyImage",
                    Disabled = false,
                    Properties = new List<ArchetypePropertyModel>()
                    {
                        new ArchetypePropertyModel()
                        {
                            Alias = "alternateText",
                            Value = altText
                        },
                        new ArchetypePropertyModel()
                        {
                            Alias = "image",
                            Value = url
                        }
                    }
                }
            }
        };
        return model.SerializeForPersistence();
    }

    /// <summary>
    /// Returns the archetype widgets to use on a blog article.
    /// </summary>
    /// <param name="bodyCopy">
    /// The article's body copy.
    /// </param>
    /// <returns>
    /// The Archetype widgets, serialized as a string.
    /// </returns>
    private static string GetArticleWidgets(string bodyCopy)
    {
        var model = new ArchetypeModel()
        {
            Fieldsets = new List<ArchetypeFieldsetModel>()
            {
                new ArchetypeFieldsetModel()
                {
                    Alias = "blogArticleContainer",
                    Disabled = false,
                    Properties = new List<ArchetypePropertyModel>()
                    {
                        new ArchetypePropertyModel()
                        {
                            Alias = "mainContent",
                            Value = new ArchetypeModel()
                            {
                                Fieldsets = new List<ArchetypeFieldsetModel>()
                                {
                                    new ArchetypeFieldsetModel()
                                    {
                                        Alias = "richText",
                                        Disabled = false,
                                        Properties = new List<ArchetypePropertyModel>()
                                        {
                                            new ArchetypePropertyModel()
                                            {
                                                Alias = "text",
                                                Value = bodyCopy
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                },
                new ArchetypeFieldsetModel()
                {
                    Alias = "articleCarousel",
                    Disabled = false,
                    Properties = new List<ArchetypePropertyModel>()
                    {
                        new ArchetypePropertyModel()
                        {
                            Alias = "header",
                            Value = @"<p>Similar Blog Posts</p>"
                        }
                    }
                }
            }
        };
        return model.SerializeForPersistence();
    }

    /// <summary>
    /// Gets the tags for the specified article.
    /// </summary>
    /// <param name="article">
    /// The article from the XML file.
    /// </param>
    /// <returns>
    /// The tags.
    /// </returns>
    private static IEnumerable<string> GetTags(XPathNavigator article)
    {
        var ignoreCase = StringComparison.InvariantCultureIgnoreCase;
        var invalidTags = new[] { "Uncategorized" };
        var tags = new List<string>();
        var tagNodes = article.Select("category");
        foreach (var tagNode in tagNodes)
        {
            var casted = tagNode as XPathNavigator;
            tags.Add(casted.Value);
        }
        tags = tags
            .Where(x => !string.IsNullOrWhiteSpace(x))
            .Where(x => !invalidTags.Any(y => y.Equals(x, ignoreCase)))
            .Select(x => ToTitleCase(x, false))
            .ToList();
        return tags;
    }

    /// <summary>
    /// Attempts to get the image URL for an image with the specified WordPress image ID.
    /// </summary>
    /// <param name="imageId">
    /// The WordPress image ID.
    /// </param>
    /// <param name="images">
    /// The image dictionary.
    /// </param>
    /// <returns>
    /// The image URL, or null.
    /// </returns>
    private static string GetImageUrl(string imageId, Dictionary<string, ImageModel> images)
    {
        if (string.IsNullOrWhiteSpace(imageId))
        {
            return null;
        }
        // A single lookup instead of ContainsKey followed by the indexer.
        ImageModel image;
        return images.TryGetValue(imageId, out image)
            ? image.UrlPath
            : null;
    }

    /// <summary>
    /// Parses the XML file to extract image information.
    /// </summary>
    /// <returns>
    /// The images information, stored in a dictionary by the WordPress ID for each image.
    /// </returns>
    private static Dictionary<string, ImageModel> GetImages()
    {
        var knownExtensions = new[] { ".png", ".jpg" };
        var knownStarts = new[] { "https://www.sample.com/blog/wp-content/uploads/" };
        var images = new Dictionary<string, ImageModel>();
        var imageItems = GetImageXPathNodeIterator();
        foreach (var imageItem in imageItems)
        {
            var casted = imageItem as XPathNavigator;
            var imageId = casted.SelectSingleNode("*[name() = 'wp:post_id']").Value;
            var url = casted.SelectSingleNode("*[name() = 'wp:attachment_url']")?.Value
                ?? string.Empty;
            var isEmpty = string.IsNullOrWhiteSpace(url);
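            // Ordinal comparisons avoid culture-specific surprises; ignoring case
            // also accepts upper-case extensions such as ".JPG".
            var invalidExtension = !knownExtensions.Any(x => url.EndsWith(x, StringComparison.OrdinalIgnoreCase));
            var invalidStart = !knownStarts.Any(x => url.StartsWith(x, StringComparison.OrdinalIgnoreCase));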
            if (isEmpty || invalidExtension || invalidStart)
            {
                throw new Exception("Invalid URL for an image.");
            }
            var urlPath = (new Uri(url)).PathAndQuery;
            images[imageId] = new ImageModel()
            {
                UrlPath = urlPath
            };
        }
        return images;
    }

    /// <summary>
    /// Gets the image XPath node iterator for the XML file containing blog data.
    /// </summary>
    /// <returns>
    /// The XPath node iterator.
    /// </returns>
    private static XPathNodeIterator GetImageXPathNodeIterator()
    {
        var imageXPath = @"rss/channel/item[./*[(name() = 'wp:post_type')]/text()='attachment']";
        return GetXPathNodeIterator(imageXPath);
    }

    /// <summary>
    /// Gets the article XPath node iterator for the XML file containing blog data.
    /// </summary>
    /// <returns>
    /// The XPath node iterator.
    /// </returns>
    private static XPathNodeIterator GetBlogXPathNodeIterator()
    {
        var articleXPath = @"rss/channel/item[./*[(name() = 'wp:post_type')]/text()='post'][./*[(name() = 'wp:status')]/text()='publish']";
        return GetXPathNodeIterator(articleXPath);
    }

    /// <summary>
    /// Gets the XPath node iterator for the XML file containing blog data,
    /// using the specified XPath.
    /// </summary>
    /// <param name="xpath">
    /// The XPath.
    /// </param>
    /// <returns>
    /// The XPath node iterator.
    /// </returns>
    private static XPathNodeIterator GetXPathNodeIterator(string xpath)
    {
        var invalidChars = new[] { ((char)3).ToString() };
        var path = GetImportPath();
        var contents = File.ReadAllText(path);
        foreach (var invalidChar in invalidChars)
        {

            // Need to use StringBuilder to avoid an out of memory exception.
            contents = new StringBuilder(contents)
                .Replace(invalidChar, string.Empty).ToString();

        }
        var doc = new XPathDocument(new StringReader(contents));
        var nav = doc.CreateNavigator();
        var articleResult = nav.Evaluate(xpath) as XPathNodeIterator;
        return articleResult;
    }

    /// <summary>
    /// Returns the path to the XML file that should be imported.
    /// </summary>
    /// <returns>
    /// The path to the file (e.g., "C:\r\sample.com\sample-temp\wordpress-export.xml").
    /// </returns>
    private static string GetImportPath()
    {
        var ignoreCase = StringComparison.InvariantCultureIgnoreCase;
        var homeFolder = HostingEnvironment.MapPath("~/");
        var basePath = new DirectoryInfo(homeFolder);
        while (basePath != null && !"src".Equals(basePath.Name, ignoreCase))
        {
            basePath = basePath.Parent;
        }
        if (basePath != null)
        {
            basePath = basePath.Parent;
        }
        if (basePath == null)
        {
            throw new DirectoryNotFoundException("Unable to find the directory containing the XML file to be imported.");
        }
        var path = Path.Combine(basePath.FullName, XmlFilePathSuffix);
        if (!File.Exists(path))
        {
            throw new FileNotFoundException("Unable to locate the XML file to be imported.", path);
        }
        return path;
    }

    /// <summary>
    /// Converts a string to title case.
    /// </summary>
    /// <param name="value">
    /// The string to convert to title case.
    /// </param>
    /// <param name="replaceSpaceChars">
    /// Replace characters that are similar to spaces (e.g., dashes)?
    /// </param>
    /// <returns>
    /// The string, in title case.
    /// </returns>
    private static string ToTitleCase(string value, bool replaceSpaceChars = true)
    {
        var spaceChars = new[] { "-", "_" };
        if (string.IsNullOrWhiteSpace(value))
        {
            return value;
        }
        if (replaceSpaceChars)
        {
            foreach (var spaceChar in spaceChars)
            {
                value = value.Replace(spaceChar, " ");
            }
        }
        return CultureInfo.GetCultureInfo("en-US").TextInfo.ToTitleCase(value);
    }

    /// <summary>
    /// Returns the last path segment for a given URL.
    /// </summary>
    /// <param name="url">
    /// The URL (e.g., "http://site.com/something/that/is/a-path").
    /// </param>
    /// <returns>
    /// The last segment (e.g., "a-path").
    /// </returns>
    private static string LastPathSegment(string url)
    {
        if (string.IsNullOrWhiteSpace(url))
        {
            return url;
        }
        url = url.TrimEnd('/');
        var lastSlashPos = url.LastIndexOf('/');
        if (lastSlashPos >= 0)
        {
            return url.Substring(lastSlashPos + 1);
        }
        else
        {
            return url;
        }
    }

    /// <summary>
    /// Removes the common characters that get converted into a dash when constructing
    /// the slug for a URL.
    /// </summary>
    /// <param name="value">
    /// The value to remove characters from.
    /// </param>
    /// <returns>
    /// The value without slug characters.
    /// </returns>
    private static string RemoveSlugCharacters(string value)
    {
        var dash1 = "-";
        var dash2 = "–";
        var chars = new[] { "&", " ", "?", ".", "!", "'", ",", "(", ")", dash1, dash2 };
        if (string.IsNullOrWhiteSpace(value))
        {
            return string.Empty;
        }
        foreach (var character in chars)
        {
            value = value.Replace(character, string.Empty);
        }
        return value;
    }

    /// <summary>
    /// Are the two values roughly equal, ignoring characters that get converted to
    /// a dash when constructing a URL slug.
    /// </summary>
    /// <param name="value1">
    /// The first value.
    /// </param>
    /// <param name="value2">
    /// The second value.
    /// </param>
    /// <returns>
    /// True, if the two values are roughly equal; otherwise, false.
    /// </returns>
    private static bool AreSlugwiseEqual(string value1, string value2)
    {
        var ignoreCase = StringComparison.InvariantCultureIgnoreCase;
        value1 = RemoveSlugCharacters(value1 ?? string.Empty);
        value2 = RemoveSlugCharacters(value2 ?? string.Empty);
        return value1.Equals(value2, ignoreCase);
    }

    /// <summary>
    /// Returns the best title to use for an Umbraco node based on the specified
    /// original title and link.
    /// </summary>
    /// <param name="title">
    /// The original title.
    /// </param>
    /// <param name="link">
    /// The URL of the article.
    /// </param>
    /// <returns>
    /// The best title.
    /// </returns>
    /// <remarks>
    /// A title is formed from the specified link to avoid creating a redirect (because
    /// the name of an Umbraco node determines the URL). If that one roughly matches
    /// the original title, the original title is preferred; otherwise, the title
    /// generated from the link is used.
    /// 
    /// The two possible titles are roughly equal if publishing them in Umbraco would
    /// produce identical URL slugs (e.g., "Some Page" and "Some - Page?" would both
    /// produce a URL slug of "some-page").
    /// </remarks>
    private static string GetBestNodeTitle(string title, string link)
    {
        var lastSegment = LastPathSegment(link);
        var linkTitle = ToTitleCase(lastSegment);
        var roughlyEqual = AreSlugwiseEqual(title, linkTitle);
        if (roughlyEqual)
        {
            return title;
        }
        if (string.IsNullOrWhiteSpace(linkTitle))
        {
            return title;
        }
        return linkTitle;
    }

    /// <summary>
    /// Converts a markdown string to an HTML string.
    /// </summary>
    /// <param name="value">
    /// The markdown string.
    /// </param>
    /// <returns>
    /// The HTML.
    /// </returns>
    private static string ConvertMarkdownToMarkup(string value)
    {
        var md = new MarkdownSharp.Markdown();
        var result = md.Transform(value);
        return result;
    }

    /// <summary>
    /// Checks for images to see if they are in the expected folders.
    /// </summary>
    /// <param name="value">
    /// The HTML.
    /// </param>
    /// <returns>
    /// True, if an unknown path was detected; otherwise, false.
    /// </returns>
    /// <remarks>
    /// This function is called manually; currently, it's not being called
    /// in a systematic way (it's more for a sanity check).
    /// </remarks>
    private static bool ContainsUnknownImage(string value)
    {
        var knownPaths = new[]
        {
            "http://www.sample.com/blog/wp-content/uploads/",
            "https://www.sample.com/blog/wp-content/uploads/",
            "data:image/jpeg;base64,",
            "data:&lt;;base64,",
            "data:%3c;base64,",
            "https://c1.staticflickr.com/"
        };
        if (string.IsNullOrWhiteSpace(value))
        {
            return false;
        }
        value = value.ToLower();
        var containsImage = value.Contains("<img");
        if (containsImage)
        {
            var lines = value.SplitBy(StringSplitDelimiters.LineBreak);
            foreach (var line in lines)
            {
                if (line.Contains("<img"))
                {
                    if (!knownPaths.Any(x => line.Contains(x)))
                    {
                        // An image line with no known path: report it.
                        return true;
                    }
                }
            }
        }
        return false;
    }

    /// <summary>
    /// Corrects image paths so they start with "/" rather than "http".
    /// </summary>
    /// <param name="value">
    /// The HTML that may contain an image path.
    /// </param>
    /// <returns>
    /// The HTML with the corrected image paths.
    /// </returns>
    private static string CorrectImagePaths(string value)
    {
        var replacements = new Dictionary<string, string>()
        {
            { @"src=""http://www.sample.com/blog/wp-content/uploads/", @"src=""/blog/wp-content/uploads/" },
            { @"src=""https://www.sample.com/blog/wp-content/uploads/", @"src=""/blog/wp-content/uploads/" }
        };
        foreach (var pair in replacements)
        {
            value = value.Replace(pair.Key, pair.Value);
        }
        return value;
    }

    #endregion

}
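For the date-based hierarchy asked about earlier in the thread (folders such as 2019/01/30), one option is to resolve the folder for each article before creating it, and to pass that folder's ID as the parent instead of the blog root ID. The following is only a sketch: `DateFolders`, `GetSegments`, and `EnsureFolders` are hypothetical names, and the two delegates stand in for whatever looks up and creates folder nodes in your site (for example, wrappers around the content service used above, with a folder document type of your own).

```csharp
using System;
using System.Globalization;

public static class DateFolders
{
    // Returns the folder names for a date, e.g. 30 January 2019 gives "2019", "01", "30".
    public static string[] GetSegments(DateTime publishDate)
    {
        var culture = CultureInfo.InvariantCulture;
        return new[]
        {
            publishDate.ToString("yyyy", culture),
            publishDate.ToString("MM", culture),
            publishDate.ToString("dd", culture)
        };
    }

    // Walks down from rootId one level per segment, reusing a child folder when
    // one with the right name exists and creating it otherwise.
    // findChild(parentId, name) returns the child's ID, or null if there is none.
    // createChild(parentId, name) creates the folder and returns its ID.
    // Returns the ID of the day folder, to be used as the article's parent.
    public static int EnsureFolders(
        int rootId,
        DateTime publishDate,
        Func<int, string, int?> findChild,
        Func<int, string, int> createChild)
    {
        var parentId = rootId;
        foreach (var name in GetSegments(publishDate))
        {
            parentId = findChild(parentId, name) ?? createChild(parentId, name);
        }
        return parentId;
    }
}
```

Keeping the lookup and creation behind delegates means the folder logic can be tried out against an in-memory dictionary before it is pointed at real content.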
Nambi Ramamoorthy 6 posts 115 karma points
Aug 16, 2019 @ 05:30
@Nicholas Westby: Thanks. You rock. Many hugs to you.