Metablogging

OpenAI has been doing some very interesting work around Machine Learning (ML) and automated text composition. The release of GPT-2 showed what could be done by building a large model and training it on a large corpus of examples. That corpus, called WebText, was roughly 8 million web pages scraped from the outbound links in Reddit posts that got 3 or more upvotes. They did some very nice cleaning of the data, removing duplicates and near duplicates, which kept the model from being overtrained on often re-quoted gibberish and made it less likely to just paraphrase popular content. With over a billion parameters it was essentially storing a huge collection of statistical samples and meta-statistics about how those can be strung together.
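
To make the cleaning step a little more concrete, here is a minimal sketch of the simplest form of that idea: hash a normalized copy of each document and drop anything already seen. This is only an illustration of de-duplication in general, not the actual WebText pipeline, which handled near duplicates with fuzzier matching.

```python
import hashlib
import re

def normalize(text):
    # Lowercase and collapse punctuation/whitespace so trivially
    # reformatted copies hash to the same value.
    return re.sub(r"\W+", " ", text.lower()).strip()

def dedupe(documents):
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello, world!", "hello   world", "Something else entirely."]
print(dedupe(docs))  # the reformatted copy of "Hello, world!" is dropped
```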

While nowhere near a perfect model, GPT-2 was good enough to be useful in a number of limited cases. Given a chunk of text it could generate long, rambling text that seemed somewhere between a narrow stream of consciousness and a broken record stuck on a topic. At first glance it reads like a human wrote it. Perfect for auto-completion, chatting with humans, text summarization, even code completion. Many of these smaller applications made GPT-2 seem like magic.

OpenAI did not stop there. They continued on with developing a much larger and more capable model called GPT-3. It is a bruising 175 billion parameter model that makes GPT-2 look like a bug compared to a human. The model is too large for most people to casually run: the heavy GPU machines I have at my disposal are about an order of magnitude too small, and most desktop machines would be way underpowered. Training the model was an amazing feat and OpenAI should be proud of what they accomplished.

Now what can you do with this massive model? It was trained on a lot of source code, and it can actually produce working code in a few languages. Given a small prompt it can create a large, coherent response, much like what we expect from a student writing a test essay. There have even been text adventures built on the model.

Many of the examples generated since then have led people to think that creating Intellectual Property (IP) may be a fading profession. Some of the examples included (a rough sketch of this style of prompting follows below):

  • Generate a SQL query from an English description of what was needed.
  • Generate CSS and React layouts from a simple description.
  • Generate LaTeX for a described formula.
  • Create Keras code for another ML model.

While impressive for auto-generation, none of them have been impressive enough to empty out computer science programs, nor will they be any time soon. There may be some coding boot camps that should start saving their money going forward, but that was always true. Many low-level tasks that should have been automated away long ago are finally about to be.
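
As promised above, here is a minimal sketch of the kind of prompting behind the first example, English to SQL. It assumes access to the 2020-era OpenAI completions API and an API key; the prompt text, example schema, and engine choice are my own illustration, not the setups used in the demos that circulated.

```python
import openai  # pip install openai; this calls a hosted API, not a local model

openai.api_key = "YOUR_API_KEY"  # placeholder

# Few-shot prompt: show a worked English-to-SQL example, then let GPT-3
# complete the next one. The table and columns are made up for illustration.
prompt = """Translate English to SQL.

English: list the names of all customers in Texas
SQL: SELECT name FROM customers WHERE state = 'TX';

English: total sales per product in 2019, highest first
SQL:"""

response = openai.Completion.create(
    engine="davinci",   # the original GPT-3 completion engine
    prompt=prompt,
    max_tokens=64,
    temperature=0,      # keep the output as deterministic as possible
    stop=["\n\n"],      # stop at the blank line after the generated query
)
print(response.choices[0].text.strip())
```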

Blogging

So you may have guessed what this has to do with blogging: machines will start to write the content, and humans will only be involved in prompting the text and reading it. If that was your guess, you’re pretty close to the mark. What you may have missed is what this says about blogs today: the content may not be worth spending any more time reading than it takes to prompt an AI to write it.

It’s been done. Liam Porr used GPT-3 to write a blog. He would start with a title, an introduction, and a photo; GPT-3 did the rest. As of his article on 8/3/20 he had 26,000 visitors and 60 subscribers. The first post made it to the number one spot on Hacker News. Here is the link. Go read it, I’ll wait. It is about what you see in most blog content, mostly because it is generated from most other blog content. That’s right, it sounds like what you would read in a thousand other places because the creators did just that to create GPT-3. They mined the web and built a statistical model of how to put the next word down based on all the words that came before it. They created a way to generate the average version of any idea!
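
To be clear about what "put the next word down based on the words before it" means, here is a toy sketch of the same objective: a word-level bigram model that only looks at one previous word. GPT-3 is a 175-billion-parameter transformer looking at a much longer context, but the training goal, predict the next word given what came before, is the same in spirit.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    # Count how often each word follows each other word.
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length=20):
    # Walk forward, sampling each next word from the counts for the current word.
    word = start
    out = [word]
    for _ in range(length):
        options = counts.get(word)
        if not options:
            break
        choices, freqs = zip(*options.items())
        word = random.choices(choices, weights=freqs)[0]
        out.append(word)
    return " ".join(out)

corpus = ("the model writes the average version of any idea "
          "and the average idea reads like every other idea")
model = train_bigram(corpus)
print(generate(model, "the"))
```

Run it a few times and you get a slightly different ramble each time, which is the "average version of any idea" effect at a laughably small scale.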

So Mr. Porr’s experiment shows that he has a machine that got more than 5 times the number of visitors that I have gotten with this blog, which has been around twice as long. 60 subscribers? I might have that many bots reading this blog, but I’m not sure I need a second hand to count the number of people who have read more than 2 posts here. Why would I continue, given that comparison?

I’ll tell you. Most of what I see out there is the drivel of the average, just saying the same thing over and over. Often blogs are marketing/PR efforts to drive eyeballs to a web page. Other times they are just a way for someone to pad their resume or make a little side income. In both of those cases the content is little more than what GPT-3 would create. Let’s call that the noise floor, since it just clutters up the Internet and the search results for a topic.

There are a few cases where blogs represent a new concept, or at least a new view on an important topic. Let’s call those the signal. I can remember reading some blog entries that were just great seeds for further thought. The writers were doing those of us who read them a service by posting. There are a few blogs I have added to my RSS feed to make sure I don’t miss any future gems.

My hope is that I will rise above adding to the noise and actually add to the signal. If I do that for anyone, I’ll be happy that I did it. In the meantime it has really gotten me thinking about the concept of a blog and where so many of them have gone.

A few days ago I ran across this article, where I think the author has been thinking along the same lines as I have. So many blogs are just awful and only serve as a gravitational source for Google searches. Black holes of content?

The author is talking about the Blog Idea Generator, which is a GPT-3 tool that prompts blog ideas to get writers out of slumps (I have more ideas than time to write, so I only view this with amusement). Put in what you would type into a Google search, or maybe your tags, and let the ML model give you the writing prompt. It literally takes the search terms you want to be hit by and generates the topic. In this case the human is led by the machine. (Does anyone else see this as the first step to AI domination of human cattle? :-P )

Well, I decided to try it and see how it would do for a few posts on this blog. Here are the results.

Let’s start with “anticryptography”.

Title

Anticryptography Definition Of Anticryptography By Medical Dictionary

Description

Information Security

I’m not going to lose any sleep over that one.


Next, “code as data”.

Title

Data Dialogue: Code As Data

Description

Numerical codes define the order of assembly or progression of a series of activities. From a simple set of directions that explain how to assemble a cupboard to the complicated range of directives that map spacecraft journeys, there is a specific code for everything.

Wow, that was bad. That content sounds like it came out of a bad textbook that was about to show you how to print “Hello World” or abuse a GOTO statement. Kind of disappointing, though, as I half expected it to read like some Lisp diatribe from Lambda the Ultimate.


Finally, “How not to blog”.

Title

The Secret To Becoming A Successful Blogger

Description

Not all bloggers are successful. Here are the characteristics and strategies employed by bloggers who defy the odds and attract millions of readers.

While I’m not worried about being outperformed by any of these, that last one could basically replace 50% of the blogs you would hit with a random selection.

Where does this lead us?

Basically, the technology is amazing. What we have seen here is that the old adage about computers still holds: garbage in, garbage out. There are so many filler posts and artificial guideposts out there that GPT-3 has distilled the concept of fluff to the point where it can recreate it in bulk. A good programmer or writer shouldn’t worry about their job being at risk, but there are a lot of people around the edges who really are not adding value, and the machines are about to make them redundant.