Automating my Blog Migrations
The process of migrating all of the posts from chrisblogs.bearblog.dev and asr.bearblog.dev to this blog is now well underway. All of the posts from chrisblogs.bearblog.dev have been transferred over, and I have the posts from asr.bearblog.dev ready to go.
I thought I’d take a moment to document how I did this, mainly for my own reference but also because I’m feeling pretty pleased with myself for making it work.
Moving these posts over to this blog has involved a handful of fairly simple Python scripts. It probably could all have been done with just one, had I known exactly what issues I was going to face, but I didn’t, and so it wasn’t.
The first script did the bulk of the work. This is a simple crawler that downloads all of the blog posts from each site and dumps them into a text file. Initially I tried to use the RSS feed for the blogs, but this only returned the most recent 10 posts. Instead I crawled the links on the /blog/ page and grabbed the HTML of each page I found, dumping it into a text file and populating the link: part of the YAML header with the URL of the post.
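For the curious, that first script boiled down to something like the sketch below. This isn’t the exact code I ran - the way post links are picked out of the /blog/ index and the dump layout are assumptions that will vary with the template - but it shows the shape of it:

```python
# A rough sketch of the crawler, using requests and BeautifulSoup. The link
# filter and the dump layout are assumptions rather than the exact script.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://chrisblogs.bearblog.dev"

def crawl_posts(base_url, out_path="posts_dump.txt"):
    # Collect every internal link on the /blog/ index page.
    index = requests.get(urljoin(base_url, "/blog/"))
    soup = BeautifulSoup(index.text, "html.parser")
    links = {urljoin(base_url, a["href"])
             for a in soup.select("a[href]")
             if a["href"].startswith("/")}

    with open(out_path, "w", encoding="utf-8") as out:
        for url in sorted(links):
            page = requests.get(url)
            # Write a minimal YAML header, then the raw HTML of the post.
            out.write("---\n")
            out.write(f"link: {url}\n")
            out.write("---\n")
            out.write(page.text)
            out.write("\n\n===POST BREAK===\n\n")

if __name__ == "__main__":
    crawl_posts(BASE)
```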
Bear asks for YAML metadata for each post in its backend, which is then rendered into the HTML of the page. Exactly where this information ends up is dictated by the blog template you use, and I wanted to put it all back into a YAML header. This involved a second script that parsed the HTML of the dumped pages, looking for the date and time of the post to insert into published_date: and the tags at the bottom of the post to insert into tags:, then removing those elements from the HTML.
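Roughly, that second pass looked like this. The <time> element and the selector for the tag links are guesses at the template markup and would need checking against the actual pages, but the pattern is the same: pull the value out, write it into the header, and drop the element from the HTML:

```python
# A sketch of the metadata pass. The <time> element and the tag-link selector
# are assumptions about the Bear template markup, not confirmed selectors.
from bs4 import BeautifulSoup

def extract_metadata(html):
    soup = BeautifulSoup(html, "html.parser")

    # Pull the publication date out of the page, then remove the element.
    published = ""
    time_tag = soup.find("time")
    if time_tag:
        published = time_tag.get("datetime", time_tag.get_text(strip=True))
        time_tag.decompose()

    # Collect the tag links at the foot of the post, then remove them too.
    tags = []
    for tag_link in soup.select("a[href^='/blog/?q=']"):
        tags.append(tag_link.get_text(strip=True).lstrip("#"))
        tag_link.decompose()

    header = (
        "---\n"
        f"published_date: {published}\n"
        f"tags: [{', '.join(tags)}]\n"
        "---\n"
    )
    return header + str(soup)
```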
Having done this I thought I was good to go, but then I noticed that my headers were still missing anything in the title: field (well, not missing exactly - they all simply contained the name of the blog itself) and that the link: I had grabbed still contained the base domain rather than just the slug of the page. This called for a script that looked for the first <h1> tag on each page and placed its text into the title: field of the header, and then cleaned the URLs in the link: part of the header.
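In sketch form (with the splitting of each dump entry into header lines and HTML body assumed to be handled elsewhere, and the field names matching the ones above):

```python
# A sketch of the title/slug pass. Assumes each post has already been split
# into its YAML header lines and its HTML body by earlier processing.
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def add_title_and_slug(header_lines, html):
    soup = BeautifulSoup(html, "html.parser")

    # Take the text of the first <h1> on the page as the post title.
    h1 = soup.find("h1")
    title = h1.get_text(strip=True) if h1 else ""

    fixed = []
    for line in header_lines:
        if line.startswith("link:"):
            # Strip the base domain, keeping only the slug of the page.
            url = line.split(":", 1)[1].strip()
            line = f"link: {urlparse(url).path.strip('/')}"
        fixed.append(line)
    fixed.append(f"title: {title}")
    return fixed
```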
The final scripts were another URL cleaner and another crawler. I realised that I have a terrible habit of writing absolute links when I’m blogging, and these exported posts were full of them - as were the posts remaining in situ on the blog, which would be fine except that I’ve since changed the URL of the blog from chrisreads.bearblog.dev to chrisb.bearblog.dev.
The cleaner went through all of the dumped posts looking for absolute links and rewriting them. The crawler went through my remaining blog, identifying posts that contained absolute links and putting their URLs into a .csv file so that I could go through and manually clean them up in Bear’s backend.
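Sketched together, those last two look roughly like this - the old-domain pattern and the file names are placeholders rather than the real thing:

```python
# A sketch of the final cleaner and crawler. The domain pattern, file names
# and link filter are placeholders, not the exact script.
import csv
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

OLD_DOMAIN = re.compile(r"https?://chrisreads\.bearblog\.dev")

def clean_dump(in_path="posts_dump.txt", out_path="posts_clean.txt"):
    # Turn absolute links to the old domain into root-relative ones.
    with open(in_path, encoding="utf-8") as f:
        text = f.read()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(OLD_DOMAIN.sub("", text))

def report_absolute_links(base_url="https://chrisb.bearblog.dev",
                          out_path="absolute_links.csv"):
    # Walk the live blog and list any post that still uses absolute links.
    index = requests.get(urljoin(base_url, "/blog/"))
    soup = BeautifulSoup(index.text, "html.parser")
    post_urls = {urljoin(base_url, a["href"])
                 for a in soup.select("a[href]")
                 if a["href"].startswith("/")}

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["post_url"])
        for url in sorted(post_urls):
            if OLD_DOMAIN.search(requests.get(url).text):
                writer.writerow([url])
```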
This whole process took maybe an hour or two, and was much less hassle than when I’ve tried to migrate blogs in the past. My initial scraper also now doubles as a handy blog archiver, so I can take regular backups of the blog (since I have a bad habit of writing posts directly into Bear and not saving them anywhere else).
All in all, not bad for an evening of work.