New blog, migrating from Blogger

It has been over 10 years since I started blogging on a few technical/software related topics.

Having my blog tied to a platform like Blogger didn't really encourage me to do anything with it, so I'm finally getting around to starting the blog anew. I'll try to document the migration process in case anyone else wants to do the same.

# Markdown from HTML

Thankfully most platforms do provide an API to get their content and Blogger has a pretty simple one.

I first tried plain manual copy and paste because I don't have that many posts, and this works fine with VuePress. However, I noticed that while the content rendered fine and worked, Blogger was still hosting the images. I didn't want to migrate the content manually only to have my images break at a later date.

This means I really need to script it. I tend to avoid writing code a lot more than I used to, and I start by trying to do something the simplest way I can, especially if it is a one-time task.

JavaScript or another dynamic scripting language would generally be the common choice, but I still find myself more productive with C#, especially for anything with web requests. Using fetch is nice and simple in JS, but with utility libraries like ServiceStack.Client I can make a simple synchronous request.

var data = "https://www.googleapis.com/blogger/v3/blogs/894911436813129381/posts?key=abcd1234"
                .GetJsonFromUrl()
                .FromJson<BloggerPostsResponse>();

I find this a lot easier to reason about than needing to throw in managing promises. For production code, using async/await is nearly always the better choice but for these kinds of scripts, I just don't need to worry about blocking code being an issue.

I can easily load the data into a class, declaring only the properties I need to perform the migration.

public class BloggerPostsResponse
{
    public string NextPageToken { get; set; }
    public List<PostItem> Items { get; set; }
}

public class PostItem
{
    public string Id { get; set; }
    public DateTime Published { get; set; }
    public string Url { get; set; }
    public string Title { get; set; }
    public string Content { get; set; }
    public List<string> Labels { get; set; }
}
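
The NextPageToken property matters once a blog has more posts than fit in one response: the Blogger v3 posts list accepts it back as the pageToken query parameter. Below is a minimal sketch of that paging loop, not the code from my script. The names PostsPage, BloggerPaging, BuildPostsUrl and FetchAll are hypothetical, and the fetch itself is passed in as a delegate so the loop can be tried without touching the network.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down page type: only what the loop needs.
public class PostsPage
{
    public string NextPageToken { get; set; }
    public List<string> Items { get; set; } = new List<string>();
}

public static class BloggerPaging
{
    // Builds the posts URL, adding pageToken only when there is one.
    public static string BuildPostsUrl(string blogId, string apiKey, string pageToken = null)
    {
        var url = "https://www.googleapis.com/blogger/v3/blogs/" + blogId
                  + "/posts?key=" + Uri.EscapeDataString(apiKey);
        return string.IsNullOrEmpty(pageToken)
            ? url
            : url + "&pageToken=" + Uri.EscapeDataString(pageToken);
    }

    // Calls fetchPage with null first, then with each returned token,
    // until a page comes back without a NextPageToken.
    public static List<string> FetchAll(Func<string, PostsPage> fetchPage)
    {
        var all = new List<string>();
        string token = null;
        do
        {
            var page = fetchPage(token);
            if (page?.Items != null) all.AddRange(page.Items);
            token = page?.NextPageToken;
        } while (!string.IsNullOrEmpty(token));
        return all;
    }
}
```

In a real run, fetchPage would wrap the GetJsonFromUrl call shown earlier, using BuildPostsUrl to produce the URL for each token.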

# Generating Markdown

I'm using VuePress for this blog, and I've opted to start from a theme by Ahmad Mostafa. Before dealing with the images, I'm going to populate the frontmatter and the content, leaving the images hosted on Blogger for now.

private static string mdTemplate = @"---
title: {0}
date: {1}
{2}
---

{3}";

//.. fetch content

foreach (var postItem in initialResponse.Items)
{
    var tags = "tags:\n-" + postItem.Labels.Join("\n-");
    var mdOutput = mdTemplate.Fmt(postItem.Title, postItem.Published.ToString("yyyy-MM-dd"),
        postItem.Labels is { Count: > 0 } ? tags : "",
        postItem.Content);
    Console.WriteLine(mdOutput);
}

This generates the following markdown (excluding content).

---
title: Message based architectures empower polyglot developers
date: 2015-10-15
tags:
-.NET
-architecture
-design
-Java
-servicestack
---
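
YAML block sequences are normally written with a space after each dash, and a title containing a colon followed by a space needs quoting to stay a single string. A slightly more defensive variant of the frontmatter step could look like the sketch below. FrontMatter, Quote and Build are hypothetical names, and quoting every value is my assumption about what is safest, not something VuePress requires.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class FrontMatter
{
    // Wraps a value in double quotes, escaping backslashes and quotes,
    // so values containing ':' or '#' stay a single YAML string.
    public static string Quote(string value) =>
        "\"" + value.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";

    // Builds the frontmatter block; the tags section is skipped when there are no labels.
    public static string Build(string title, DateTime published, IReadOnlyList<string> labels)
    {
        var sb = new StringBuilder();
        sb.Append("---\n");
        sb.Append("title: ").Append(Quote(title)).Append('\n');
        sb.Append("date: ")
          .Append(published.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))
          .Append('\n');
        if (labels != null && labels.Count > 0)
        {
            sb.Append("tags:\n");
            foreach (var label in labels)
                sb.Append("- ").Append(Quote(label)).Append('\n');
        }
        sb.Append("---\n");
        return sb.ToString();
    }
}
```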

Writing this all out to a filename derived from the original URL means we now have the bulk of the migration working.
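
Deriving the filename can be as small as taking the last path segment of the post URL and swapping the extension. A minimal sketch, assuming post URLs end in a slug followed by .html; PostPaths and MarkdownFileName are hypothetical names, and the URL in the usage note is made up.

```csharp
using System;
using System.IO;

public static class PostPaths
{
    // Takes the last path segment of the post URL and swaps its extension for .md.
    public static string MarkdownFileName(string postUrl)
    {
        var path = new Uri(postUrl).AbsolutePath;
        return Path.GetFileNameWithoutExtension(path) + ".md";
    }
}
```

For example, MarkdownFileName("http://example.blogspot.com/2015/10/my-post.html") gives my-post.md.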

# Still HTML

VuePress deals with this HTML fine, but I decided it would be better to just have straight markdown if possible. A quick google led me to ReverseMarkdown. Applying this to the content before writing out my markdown file gives us clean markdown!

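// HTML -> Markdown (Converter comes from the ReverseMarkdown NuGet package)
var converter = new ReverseMarkdown.Converter();
postItem.Content = converter.Convert(postItem.Content);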

# Dealing with images

When authoring Blogger posts, you can conveniently paste an image straight from your clipboard. Once pasted, Blogger hosts the image for you and embeds a link to its own servers. Viewers see a scaled version of your image, but since I want to preserve the highest quality available, I want to change the URL before I fetch it.

Example of an image URL:

http://3.bp.blogspot.com/-c7HULJrBVQc/UjPb4X_VPfI/AAAAAAAABhI/8cLVanpxfNg/s400/iso10x5test.png

The s400 part of the URL above changes based on the image and how it is used in the authored blog post. Replacing the number after the s with something really high returns the image at its original size, since Blogger never scales beyond it. This is handy: I can hack in a high scale number to download the original, save it, and map it to a path on my new hosting that replaces the old URL.
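
As a standalone illustration of that URL rewrite, here is a minimal sketch. BloggerImages and WithScale are hypothetical names, and it assumes the scale is always the second-to-last path segment in the form s followed by digits; anything else is returned unchanged.

```csharp
using System;
using System.Linq;

public static class BloggerImages
{
    // Swaps the "s<digits>" segment that precedes the file name for a larger one.
    public static string WithScale(string imageUrl, int size = 10000)
    {
        var parts = imageUrl.Split('/');
        if (parts.Length < 2) return imageUrl;

        var scale = parts[parts.Length - 2];
        var looksLikeScale = scale.Length > 1
                             && scale[0] == 's'
                             && scale.Skip(1).All(char.IsDigit);
        if (!looksLikeScale) return imageUrl;

        parts[parts.Length - 2] = "s" + size;
        return string.Join("/", parts);
    }
}
```

Applied to the example URL above, only the s400 segment changes to s10000 and the rest of the URL is untouched.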

VuePress wants to manage our image assets for us, but since the images are all referenced using <img rather than Markdown, we'll host the images in the .vuepress/public folder ourselves.

This can be done before the HTML -> Markdown conversion since we are just changing the src of images and the href of anchors.

foreach (var postItem in uniquePosts)
{
    // Name the image folder after the post slug, e.g. ".../my-post.html" -> "my-post"
    var imgDirName = postItem.Url.Split("/").Last().Replace(".html", "");
    var images = GetImageLinks(postItem.Content).Where(x => x.Contains("blogspot.com")).ToList();
    var imgDirPath = ($"..\\..\\..\\images\\archive\\{imgDirName}\\").MapAbsolutePath();

    foreach (var img in images)
    {
        Directory.CreateDirectory(imgDirPath);
        // Segment 7 of the URL is the scale (e.g. s400), segment 8 is the file name
        var scale = img.Split("/")[7];
        var bytes = img.Replace($"/{scale}/", "/s10000/").GetBytesFromUrl();
        var imgName = img.Split("/")[8];
        File.WriteAllBytes((Path.Join(imgDirPath, imgName)).MapAbsolutePath(), bytes);
        // Point the post at the locally hosted copy
        postItem.Content = postItem.Content.Replace(img, $"/images/archive/{imgDirName}/{imgName}");
    }

    // ...the loop body continues with the anchor links below

Blogger wraps the images in anchors so readers can view the full size while the post embeds a smaller one. I'm only using one resolution, so pointing each anchor's href at the same local image keeps that behaviour.

// Update links
var anchorLinks = GetAnchorLinks(postItem.Content)
    .Where(x => x.Contains("blogspot.com"))
    .ToList();

foreach (var link in anchorLinks)
{
    var imgName = link.Split("/")[8];
    postItem.Content = postItem.Content.Replace(link, $"/images/archive/{imgDirName}/{imgName}");
}
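
GetImageLinks and GetAnchorLinks are not shown in the snippets above. A regex-based sketch of what such helpers might look like follows; it is an illustration under assumptions, not the implementation from my Program.cs. HtmlLinks and GetAttributeValues are hypothetical names, and a pattern like this only suits simple, well-formed markup such as Blogger's output (it would, for instance, also pick up a data-src attribute).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HtmlLinks
{
    // Returns the distinct values of one attribute on every occurrence of a tag.
    public static List<string> GetAttributeValues(string html, string tag, string attribute)
    {
        var pattern = $@"<{Regex.Escape(tag)}\b[^>]*?\b{Regex.Escape(attribute)}\s*=\s*[""']([^""']+)[""']";
        return Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .Distinct()
            .ToList();
    }

    public static List<string> GetImageLinks(string html) =>
        GetAttributeValues(html, "img", "src");

    public static List<string> GetAnchorLinks(string html) =>
        GetAttributeValues(html, "a", "href");
}
```

For anything messier than exported blog HTML, a real HTML parser would be the safer choice.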

# Replacing GitHub gists

In some of my posts, I was embedding GitHub gists. Vuepress doesn't like this, so I decided to just import into code fences since they are also styled per language by default.

// Replace Gist embed scripts with code fence.
var scripts = GetScriptUrls(postItem.Content)
    .Where(x => x.Contains("gist.github.com"))
    .ToList();

foreach (var script in scripts)
{
    var url = script.Substring(0, script.Length - 3);
    var gistId = url.Split("/").Last();
    var gistJson = ("https://api.github.com/gists/" + gistId).GetJsonFromUrl(request =>
    {
        request.Headers.Add("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0");
    });
    var gistDoc = JsonDocument.Parse(gistJson);
    GistFile gistFile = new GistFile();
    foreach (var file in gistDoc.RootElement.GetProperty("files").EnumerateObject())
    {
        gistFile = file.Value.ToString().FromJson<GistFile>();
    }

    postItem.Content = postItem.Content.Replace($"<script src=\"{script}\"></script>", 
        $@"
```
{gistFile.content}
```
");
}
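
The loop above keeps only the last file of a gist and emits a fence without a language. A variant that writes one fence per file and uses the file extension as the language hint could look like the sketch below. GistFences and ToCodeFences are hypothetical names, and it assumes the gist JSON has a files object keyed by file name with a content string on each entry. The fence is built with new string so that the sample does not contain a literal fence of its own.

```csharp
using System.IO;
using System.Text;
using System.Text.Json;

public static class GistFences
{
    // Turns every file in a gist JSON document into a fenced code block.
    public static string ToCodeFences(string gistJson)
    {
        var fence = new string('`', 3);
        var sb = new StringBuilder();
        using var doc = JsonDocument.Parse(gistJson);
        foreach (var file in doc.RootElement.GetProperty("files").EnumerateObject())
        {
            // "hello.cs" -> "cs"; files without an extension get a plain fence.
            var lang = Path.GetExtension(file.Name).TrimStart('.');
            var content = file.Value.GetProperty("content").GetString() ?? "";
            sb.Append(fence).Append(lang).Append('\n');
            sb.Append(content.TrimEnd('\r', '\n')).Append('\n');
            sb.Append(fence).Append("\n\n");
        }
        return sb.ToString();
    }
}
```

The extension is only a rough hint; whether it highlights correctly depends on the languages the site's highlighter recognises.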

I've posted the full Program.cs I used up on GitHub in case it is useful to anyone. It is extremely hacky but made the migration process a lot more straightforward. I had to deal with importing gists into code fences and common formatting problems, and while a couple of posts needed some minor edits, near enough is good enough.