This is going to be a long one folks lol…
But if you're interested in the world of Artificial Intelligence (AI) content and would like some insight into the challenges faced with engineering a system, as well as how we overcame those, feel free to read on.
There are a few gems in here that should get you thinking about how you too can approach Artificial Intelligence (AI) differently, and in a way that offers significant longevity while also improving the quality of output.
DISCLAIMER: I'm the Search Engine Optimization (SEO) lead and one of the many (allegedly) big brains behind this, but I am not the engineer/developer. Be kind if I misspeak regarding a technical aspect of our system.
DISCLAIMER 2: This is not the content offered through our content agency, which is still proudly curated by a dedicated team of journalists and professional writers.
EXAMPLE OF OUTPUT: NO POST-PRODUCTION EDITING
*Additional Resources at the End of This Post*
As many of you know, over the last 6 years one of our core divisions (QuantRadius) has focused heavily on artificial intelligence in the realm of competitive intelligence, behavioral data, Search Engine Optimization (SEO), and language (among other things).
Having a dedicated data intelligence business has been one of the major driving forces behind the success of nearly all of our ventures, from Crypto to Search Engine Optimization (SEO), and everything in-between. I can't overstate the impact that bringing in a few data scientists can have on your business (but I digress).
Dating back to the first Generative Pretrained Transformer (GPT) release, our team realized the potential of leveraging AI in a way that could create meaningful, engaging, and intent-driven content for SEO.
We spent up to five figures monthly custom-training our initial models for the first two years. Luckily, we qualified for cloud computing credits, so we weren't running too deep into the hole financially (more on those resources later and how you can get them for free).
Initially, we tapped into components from Huggingface (link for the resource at the end) and expanded significantly on top of their framework.
This allowed us to focus more on training-model optimization and data prep early on, but it also brought up several challenges whenever their modeling functions changed.
We stuck with this architecture for a year or so, adding additional layers to control aspects such as sentiment, optimizing inference speed, and fine-tuning using non-TPU hardware/multi-GPU clusters to reduce training costs.
However, despite reaching a level of quality in line with the Copy.ai's and Jaspers of the world, GPT-3-based AI has several limitations and quirks that I'm sure many of you are already intimately aware of.
Our MAJOR Issues with GPT-3 Based AI
1. It sucks for long-form:
Sometimes it gets it right, but if we're all being honest here, even companies with a billion-dollar valuation (*cough* Jasper *cough*) aren't getting it done much better than the rest, and I've tested around 45 of them EXTENSIVELY.
2. The content is difficult to optimize well through automation:
Because the AI is making content from scratch, it doesn't have the proper mathematical and semantic parameters to optimize a piece of content for SEO purposes. Additional layers of AI can be added on top of the core modules to accomplish this, but we decided to take a different (and arguably better) angle. More on that further below.
3. The models are based on easily identified mathematical equations and can be fingerprinted by Google:
Is this an issue now? Up for debate. Will it be an issue in the future? My guess is most likely. We've seen how strict Google has been about affiliate content coming from a REAL PERSON WHO HAS PURCHASED AND USED THE PRODUCT, with REAL images being the standard on sites and even in Google My Biz (GMB) posts. I've even had AdSense accounts denied with a message saying the content was 'autogenerated' – IF THE ADSENSE TEAM CAN DETECT IT, THEIR SEO ALGOS CAN TOO. And AdSense determined that it was low quality despite reading incredibly well.
Are there use cases where Google won't care? Probably. But if you think Google is going to let everyone flood the internet with AI content at true scale, I think you're going to be in for a bad time.
For those of us around long enough, I would remind you of the days of page generation in the late 90s and early 2000s, of the Fantamos Shadowmaker days, and after that the MFA (made for AdSense) farms.
Point is, these all worked for a couple years until they became a systemic problem for Google, then they got whacked like everything else. Don't let history repeat itself at the cost of your livelihood.
SO, WHAT WAS THE SOLUTION?
Going back to the drawing board, we took a step back and asked ourselves what we were really hoping to accomplish.
The answer was that we didn't want to reinvent the wheel. The wheel is a great invention. What we wanted to do was take that wheel and oil it up so it goes faster while costing less.
In a nutshell, what we wanted was a system that identifies, analyzes, compiles, and reverse-engineers content that currently ranks on Google in the wild.
We then wanted data from this system piped into a new language model that would be able to interpret that data, and apply it to the new content without the loss of all the elements that made the original content rank in the first place.
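The internals of our system obviously aren't public, but conceptually the flow reads like the sketch below, where every function is a hypothetical placeholder, not our actual code:

```python
# Hypothetical sketch of the identify -> analyze -> compile -> generate flow
# described above. Every function here is a stand-in for a proprietary piece.

def identify(keyword):
    # Stand-in: would query Google and return the top-ranking URLs.
    return [f"https://example.com/{keyword}-{i}" for i in range(3)]

def analyze(urls):
    # Stand-in: would fetch each page and extract terms, entities, structure.
    return {url: {"terms": [], "entities": []} for url in urls}

def compile_profile(page_data):
    # Stand-in: would merge per-page signals into a single target profile.
    return {"pages_analyzed": len(page_data)}

def generate(profile):
    # Stand-in: the language model would write new content to the profile.
    return f"Draft based on {profile['pages_analyzed']} ranking pages."

def run_pipeline(keyword):
    urls = identify(keyword)
    profile = compile_profile(analyze(urls))
    return generate(profile)

print(run_pipeline("fix-bike-tire"))
```

The point of the sketch is the ordering: the competitive analysis feeds the model, rather than the model writing blind from a prompt.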
Luckily, we had a SurferSEO-esque and POP-esque system already built back in 2017. This original competitive analysis tool was built utilizing first-party data from over 80,000 websites across a span of two years of forward-tested live data. To do this, we brokered deals with Software as a Service (SaaS) and other web tools to incorporate our code into each website (with the site owner's express authorization to do so).
This gave us an absolute treasure trove of real-time performance data on the what, how, and why of sites' rankings.
It was from this first-party data that our original SEO competitive intelligence tool was born, and has contributed significantly to the success of my content and Public Relations (PR) agency since that time.
But I'm getting off track. TLDR – we have a tool similar to Surfer/POP that we developed in-house and has now been incorporated into our new content AI system.
A NEW APPROACH TO AI CONTENT
The problem with long-form AI content is that you can't control many of the elements that make for an SEO-optimized piece of content.
Instead of feeding AI a prompt and letting it go to town, we developed a multi-layered language model that is able to ingest the top-ranking pages on Google for a particular topic/keyword, understand all of the semantically relevant keywords, topics, subtopics, entities, and other words of importance, and then rewrite and restructure the content in such a way as to present the same ideas in new ways.
The vast majority of topics and ideas are NOT new. The information is in the public domain and exists for all to read, learn, interpret, and then express in their own words.
That is exactly what this model does. It seeks to understand the topic, then write a new piece of content in its own words, expressing and approaching each topic, and ideas within that content piece, in a unique way.
In other words, if you asked TWO individual authors to write a piece on 'How to fix a bike tire', you'd get similar content (given that there are only so many ways you can fix a bike tire), but each would be expressed in a unique way and style.
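To make the "semantically relevant terms" idea concrete, here's a toy sketch of pulling the most frequent meaningful terms out of a couple of "top-ranking" page texts. The texts and stopword list are invented examples, nothing like a production extraction pipeline:

```python
import re
from collections import Counter

# Invented stopword list for this toy example only.
STOPWORDS = {"the", "a", "to", "you", "and", "of", "is", "it", "on"}

def term_profile(pages, top_n=5):
    # Count non-stopword terms across all "ranking" page texts.
    counts = Counter()
    for text in pages:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_n)]

pages = [
    "To fix a bike tire, remove the wheel and patch the tube.",
    "Fixing a flat bike tire: remove the tube, patch it, reinflate.",
]
print(term_profile(pages))
```

Terms that recur across multiple ranking pages (bike, tire, remove, patch, tube) become the semantic targets the rewrite has to preserve.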
WHAT ABOUT Google AND FINGERPRINTING?
Our second primary concern was the ability of Google to fingerprint our AI in the future.
Here is what we've done to overcome this:
1. A Layered Approach to AI
Our AI involves several layers, and while each layer individually could be fingerprinted, together they create a diversified approach that would be near impossible to fingerprint.
For example, we may layer a paraphrasing model, with a custom GPT-3 model we've specifically trained for certain types or parts of the content, a reverse-paraphraser that randomly expands on certain ideas in the piece, and so on.
2. Infinite Randomization
We randomize hundreds of parameters between layers so that each run of content is generated using a unique fingerprint. There are tens of millions of combinations.
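A rough sketch of both ideas together, with invented layer and parameter names: a few rewrite layers chained in sequence, and a fresh random combination of parameters (and even layer order) drawn for each run:

```python
import random

# Every function and parameter name here is invented for illustration.

def paraphrase(text, strength):
    # Stand-in paraphrasing layer with a tunable "strength" parameter.
    swaps = {"light": {}, "heavy": {"loves": "adores", "takes": "brings"}}
    for old, new in swaps[strength].items():
        text = text.replace(old, new)
    return text

def expand(text, rate):
    # Stand-in "reverse-paraphraser" that sometimes expands on an idea.
    return text + " He never misses a week." if rate > 0 else text

def run_layers(text, seed=None):
    rng = random.Random(seed)
    strength = rng.choice(["light", "heavy"])  # randomized per run
    rate = rng.choice([0.0, 0.2])              # randomized per run
    layers = [lambda t: paraphrase(t, strength), lambda t: expand(t, rate)]
    rng.shuffle(layers)                        # even layer order can vary
    for layer in layers:
        text = layer(text)
    return text

print(run_layers("John's dog Fido loves ice cream."))
```

With two values per parameter across two layers this toy already has several distinct run configurations; multiply that out over hundreds of parameters and you get the "tens of millions of combinations" claim.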
WHAT WE'RE DOING WITH THE CONTENT
As most of you know, I like keeping the vast majority of what I do low-key and off of the radar. No sense painting any type of target on my back.
However, the one use case I can discuss publicly is automated enterprise content at scale. We are combining some of our competitive intelligence tools for content gap analysis, and keyword clustering at scale, then feeding that data into our content AI to mass produce optimized content.
Before this all sounds too 'pie in the sky': no machine is perfect, we still have a team of human editors touching every piece of content, and there are significant computational costs involved with both the content AI and the SEO competitive intelligence engine that feeds it data.
Even so, we are able to reduce content costs by around 75%, allowing us to scale harder and faster than the competition.
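For a back-of-the-napkin feel for that 75% figure, here's the arithmetic with assumed example values (the rate, length, and volume are mine, not the actual numbers):

```python
# Back-of-the-napkin math; all three inputs are assumed example values.
rate_per_100_words = 6.00    # assumed human-writer rate in dollars
words_per_article = 1500     # assumed article length
articles_per_month = 200     # assumed volume

human_cost = rate_per_100_words * (words_per_article / 100) * articles_per_month
ai_cost = human_cost * (1 - 0.75)  # the ~75% reduction mentioned above

print(human_cost)  # 18000.0
print(ai_cost)     # 4500.0
```

At those assumed numbers, a $18,000/month content bill drops to $4,500, which is where the "scale harder and faster" leverage comes from.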
As a next iteration, we are toying with the idea of making a WP plugin that connects to our Application Programming Interface (API) and delivers/posts unique optimized content at scale. I can't promise this will be a thing given that I'm still uncertain whether or not I want to release this publicly.
Despite our safeguards, I'm cautious about becoming a target when we could potentially (quietly) flood the internet with undetectable quality content.
It's sites like Jasper and Copy.ai (sorry guys) that pose the biggest risk from an SEO perspective. The more people use the SAME models to produce content, the bigger a target it becomes for Google to say, 'Nice try' before wiping your site off the map.
That's it for now folks (aside from some resources I thought you might find helpful below).
Feel free to ask questions if you have any. Again, I'm involved with the human logic and SEO-optimization side of things, but not so much on the technical side of the AI, so keep that in mind 😊
RESOURCES FOR YOU
In case you're feeling froggy, here are some resources you can jump on to help you get started:
1. HuggingFace (https://huggingface.co) – an incredible AI community with loads of datasets and models to play with, build upon, and create with.
2. Training Hardware/Cloud: Google Cloud is the best (TPUs), but other cloud services can work as well. Amazon Web Services (AWS) comes in a close second only because we can get qualified for up to $100k in AWS credits.
3. AWS program for free credits (if you need help getting approved, I offer this with a guarantee):
4. OpenAI (access to GPT-3) –
PRO TIP: Found an existing AI system that is crushing it but you can't figure out how they're doing it? Pay for an enterprise plan, feed your own Machine Learning (ML) system a copy of the OG content and a copy of the new content run through their AI, then use ML to identify what was changed between the two, and how.
It's a longer road considering the amount of data you'd need to replicate their model, but it could give you a jumpstart.
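Before reaching for a full ML system, Python's standard difflib is one simple way to start surfacing what a third-party tool changed between the two versions. The two texts here are invented for illustration:

```python
import difflib

# Two invented texts standing in for "OG content" and "ran through their AI".
original = "John's dog Fido loves ice cream, so he takes him to the park."
rewritten = "As a dog lover, John takes Fido to the park for ice cream."

def word_diff(a, b):
    # Report every word-level change the rewrite introduced.
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words)
    return [
        (tag, a_words[i1:i2], b_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

for change in word_diff(original, rewritten):
    print(change)
```

Aggregating these replace/insert/delete patterns over many document pairs is the kind of training signal you'd feed the ML side.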
Oh man, this is fascinating. I'm excited to have a front-row seat to this digital arms race between Google and AI. It's very reminiscent of "link building" back when Google was still developing the Penguin algo.
I understand Google's position that content should be written by humans, for humans, but it begs the question: when does that stop mattering? If it brings value to the reader, does it matter who wrote it? On the flip side, if you have a system that perfectly identifies, analyzes, compiles, and reverse-engineers content that currently ranks on Google, will it be able to add fresh insights/perspective to the conversation outside of its programming? If not, is it just an aggregation or, at best, a re-stating of existing knowledge? Could that be what differentiates Artificial Intelligence from Intelligence?
This subject is a little beyond my expertise, so I'm kinda just shooting from the hip and thinking out loud, but I'd love to learn more. Like I said… fascinating! Keep the long-form posts coming!
Thanks for dropping by 🙂
1. if you have a system that perfectly identifies, analyzes, compiles, and reverse-engineers content that currently ranks on Google, will it be able to add fresh insights/perspective to the conversation outside of its programming
^^^There are two ways that our system achieves this. The first is that by taking the top handful of ranking pages, it identifies 'gaps' in knowledge between them and works to build a page that is more encompassing (similar to a traditional skyscraper method)
The second is that while it won't be able to add additional facts outside of those identified and ingested, it is able to communicate that message by approaching it differently or from another angle.
I'll post an example below (I wrote this myself for illustrative purposes).
ORIGINAL: John's dog Fido loves ice cream, so he takes him to the park for a treat every Tuesday
OUTPUT: As a dog lover, every Tuesday John takes time from his busy schedule to take Fido to the park for a special ice cream treat.
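The 'gap' part of the first mechanism can be illustrated with plain set operations: each page's gap is whatever the union of all covered subtopics contains that the page itself doesn't, and the more encompassing page targets that union. The subtopic sets below are invented:

```python
# Invented subtopic sets for three hypothetical top-ranking pages.
page_subtopics = {
    "page_a": {"remove wheel", "patch tube", "reinflate"},
    "page_b": {"remove wheel", "replace tube", "check rim tape"},
    "page_c": {"patch tube", "pump pressure"},
}

# The "more encompassing" page targets the union of everything covered.
all_subtopics = set().union(*page_subtopics.values())

# Each page's gap is what the union has that the page does not.
gaps = {page: all_subtopics - topics for page, topics in page_subtopics.items()}

for page in sorted(gaps):
    print(page, sorted(gaps[page]))
```

The real system scores and weights these signals rather than treating them as flat sets, but the set-difference framing is the core of the skyscraper-style comparison.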
Best solution is to hire writers and content creators who know what the f*ck they are talking about.
Is Google going to trust your article on soybean production from some versificator program straight out of Orwell, or a guy/gal with a master's in crop science from the University of Iowa?
Wait, this is Google. They are part of it. So, I will rephrase that: "Is the reader along with other webmasters (and who can provide natural links) going to trust your article…"
I still own a traditional content agency and I agree that for some niches you just can't beat a talented human. Medical, legal, technical, specialty, etc.
But can we pump out optimized content on par with the quality you'd expect from paying a writer $5-$7/100 words at 75% lower costs? Yeah, we can, finally.
The biggest benefit here is for the general informational type content and the fluff that all the big boys pump out daily.
We can scale thousands of high-quality articles monthly at a fraction of the cost. For niches where trust and authority are paramount, as the pages gain traction we offer clients the option of having a human circle back to a piece and update it (which, although gimmicky and up for debate, still seems to earn a boost from Google for updated content).
Jeb » Jess
Thanks for this. Good stuff. So your view is that Google will be able to detect Jasper, etc, because it uses "easily identifiable mathematical equations." Okay, got it. Point well taken.
I suspect that by combining multiple tools, though, this need not be the case. In other words, generate content on Jasper, have a copy editor rewrite some sections, use some kind of rephrase tool for other parts of the content, etc. Thoughts?
This would presumably confuse any kind of detection algo.
Although we have completely moved away from Generative Pretrained Transformer (GPT) except for small modules, one of the biggest takeaways here is the layering of several AI models to create something truly unique to YOU.
The problem with Jasper etc. is the millions of pieces of content being produced daily using the same algo, and not just any algo, but one that is easily detectable.
I've already run 200 sites through AdSense and a good clip were denied for 'computer generated content'. Google knows.
By contrast, all of our sites using this non-GPT AI are working just fine.
Jess ✍️ 🎓
*Using these tools as a 'helper' and having a human edit them also works
Interesting, Jess. So if you try to run Google ads to a Jasper-created blog post, say, will Google reject it because it's AI-created content? Is this the best way to detect if an article has (or hasn't) been rewritten enough to evade detection?
Jess ✍️ 🎓 » Jeb
Google doesn't care if you pay to run display ads that link to a site/page of AI content.
But as a PUBLISHER (not an advertiser), Adsense is seemingly detecting it. Our test with GPT regarding this covered 280 sites.
We have not seen this issue with our system (non-GPT based) and using the layered approach and variables system we mentioned.
Which makes sense. Google has stated that AI content is against Webmaster Guidelines and is considered by Google to be spam:
*With one update: Google has said that in certain circumstances AI content may be appropriate. My guess is things like ecom descriptions, ad copy, etc. But not blog posts.
We've already seen how strict Google is on HUMAN first-hand reviews in the affiliate space, going so far as to require first-hand pictures you take at home while using the product etc.
Thanks, Jesse. Will send you a quick Direct Message (DM). Appreciate the insight here.