Wednesday, 2 July 2025

Extraordinary C@W blog stats: AI 'training' at work?

We had a short review of the increasingly 'cosmopolitan' nature of C@W readership a while back: I set a little quiz inviting guesses as to the 2024 breakdown of hits, to which the answers were, in descending order:

  1. Hong Kong
  2. China
  3. USA
  4. Singapore
  5. UK

Well, guess what: since then, the readership stats have shot up, going stratospheric in the last month. Here's the plot for the last 3 months:

[Plot: C@W readership stats, last 3 months]


And the countries?

  1. Brazil
  2. USA
  3. India
  4. Japan
  5. Bangladesh
  6. UK

I have an acquaintance who also runs a blog: he's seen something similar, though the numbers are not so extreme and Vietnam features at the top of his list. The best explanation he can come up with is that the blogs are being used to train LLMs!

Any other suggestions?

Heaven help the "AI" that results from nearly 20 years of C@W.  I suppose we should be flattered ...

ND 

PS: in the circumstances, I thought about re-engaging with Google 'Adsense' to make a bob or two out of advertising to the increased readership.  But (a) the reader-experience isn't much improved by ads; and (b) the small print is so extensive and restrictive, I'll bet Google would rule that we've somehow been artificially boosting readership with bots, and that we wouldn't qualify.

Aren't you grateful?

11 comments:

dearieme said...

Should we all start using foul language to ensure that the bots have had a suitably liberal education?

Anonymous said...

Does LLM training make sense from those locations? I had assumed that most of the LLM training supercomputers are US-based. If so, why would they scrape from the Far East and then pipe the data over to the US to schedule LLM training?
Al

Sobers said...

What on earth can an LLM learn from scraping the internet, including this site and its esteemed posters? How can it decide which of the information it gathers is true and what's not? If I write 'The sky is green', will that mean that an LLM reading it will assign a very small probability to the idea that the sky is indeed green? Given vast swathes of what is written on the internet is bollox on stilts, how can LLMs ever get a true picture of reality if so much of their input is nonsense?

Anonymous said...

Obviously China and HK are building a giant database of everyone in Western Europe and the USA; our comments and ND's writing will add to the information from our doorbells, mobile phones and Huawei routers. If their social credit database can encompass a billion Chinese, that's a similar-sized task.

Nick Drew said...

Anon@11:23 - that was my general assumption, too. But sometimes the www / cloud works in mysterious ways. For example, my understanding is that movies that are being streamed are "located" wherever in cyberspace it is optimal for the streaming that's taking place at that point in time. So, if (say) 9pm is peak streaming time for a particular movie, it'll be "located" optimally for 9pm streaming in Japan when it IS 9pm in Japan, and it'll have been "moved westwards" for 9pm streaming in Europe when it IS 9pm in Europe.

(I may not have explained that very well)

I'll make one other empirical comment: the hour-by-hour stats have been very flat - I'll publish another post with a graph. Yet the "locations" of the "readership" have been all over the place. This feels like a sophisticated operation.

formertory said...

Isn't that the problem? Dwell fleetingly on the thought of an LLM somewhere digesting the possibility that Mad Miliband is in fact the saviour of the human race, and be very afraid as it regurgitates it as (a version of) Truth in a future world. Orwellian to the max.

Caeser Hēméra said...

@ND I think you're trying to explain 'edge' or locale-based deployments. I can shift data around the globe pretty easily to where the consumers are to reduce latency, but that's not without its costs.

Probably not what's happening here, though: there are plenty of open source LLMs out there, and governments and companies are doubtless unleashing them.

I'm more worried about China: LLMs can use textual analysis to tie together personas across sites if they have enough data, and strip away what little bits of privacy are left.

As for learning, we're already seeing signs of model collapse, and as LLMs start eating their own dogfood they'll get decreasingly less useful.

LLMs/AI have plenty of potential in some narrow areas, but in wider aspects their limitations are becoming screamingly obvious despite all the promotion and marketing.

We're already having "AI" businesses turn out to actually be some outsourced chumps, and agents to be useless at anything but the smallest of tasks in the real world. Expect more of that.

Caeser Hēméra said...

Pardon the double negative, when one was meant...

Anonymous said...

@ Sobers / bollox on stilts. If the cap fits- !

Anonymous said...

Perhaps you really do have a lot of avid readers of your site in all those countries.......ha ha

Nick Drew said...

I always suspected some of you BTL commenters were based in Brazil & Bangladesh. Thanks for helping us to go viral there!