Gemfinder Q3 2021; AI is eating the world. Or is it?

Finding gems — Image credit: Walter G. Mason, Public domain, via Wikimedia Commons

Articles I read in Q3 2021, and why they struck a chord with me. I’ve tried to organise them into these sections:

Data engineering and “AI”
Devops
Software Engineering
Writing docs
Miscellaneous

Data engineering and “AI”

Machine learning’s crumbling foundations

I believe we have all heard expressions such as “robots are eating the world” or “AI is taking over the world”.

I always take such claims with a grain of salt. And I had never read an article that laid out the basis for my skepticism this well:

ML is rife with all forms of statistical malpractice – AND it’s being used for high-speed, high-stakes automated classification and decision-making, as if it was a proven science whose professional ethos had the sober gravitas you’d expect from, say, civil engineering.

Civil engineers spend a lot of time making sure the buildings and bridges they design don’t kill the people who use them. Machine learning?

This post is a gem. There’s more, specifically on garbage in, garbage out:

The ML models failed due to failure to observe basic statistical rigor. One common failure mode?

Treating data that was known to be of poor quality as if it was reliable because good data was not available.

Obtaining good data and/or cleaning up bad data is tedious, repetitive grunt-work. It’s unglamorous, time-consuming, and low-waged. Cleaning data is the equivalent of sterilizing surgical implements – vital, high-skilled, and invisible unless someone fails to do it.

I feel this. A lot.

A huge percentage of the “data” or “AI” job ads I look at have one common theme. They emphasise using “the latest tool” that promises to get results out of data ASAP.

Yet they mention very little about making source data usable (i.e. reliable).

For this, solid software engineering practices are needed, testing in particular. That means being able to think about the source and destination data models, and about how to “extract, transform, load” the source data to the destination. Only once the data is safely at its destination can the tools that most job adverts focus on be put to use.
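To make the “extract, transform, load” idea concrete, here is a minimal sketch of a transform step that validates each source row before loading it. The `Order` model, the field names and the `run_etl` helper are all hypothetical, made up for illustration. The point is only that bad rows get rejected and reported instead of being silently treated as reliable.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Order:
    """Hypothetical destination data model: typed and validated."""
    order_id: int
    amount_cents: int
    ordered_on: date


def transform(raw: dict) -> Order:
    """Turn one raw source row into an Order, or raise ValueError."""
    try:
        order_id = int(raw["id"])
        amount_cents = round(float(raw["amount"]) * 100)
        ordered_on = date.fromisoformat(raw["date"].strip())
    except (KeyError, AttributeError, TypeError, ValueError, OverflowError) as exc:
        raise ValueError(f"unusable row {raw!r}: {exc}") from exc
    if amount_cents < 0:
        raise ValueError(f"negative amount in row {raw!r}")
    return Order(order_id, amount_cents, ordered_on)


def run_etl(rows):
    """Return (loaded, rejected) so that bad rows stay visible."""
    loaded, rejected = [], []
    for raw in rows:
        try:
            loaded.append(transform(raw))
        except ValueError as exc:
            rejected.append((raw, str(exc)))
    return loaded, rejected
```

Because `transform` is a plain function from a source row to a destination model, it can be unit tested with a handful of known good and known bad rows, well before any “latest tool” gets pointed at the data.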

This unglamorous part is often overlooked, and it is usually candidate #1 for outsourcing. The article above provides real-life examples of where this happened and things went wrong.

But a lot of thought should go into it. Whoever is doing it needs more than good programming skills. Communication skills and the ability to understand the business domain are just as important:

Producing good data and validating data-sets are the kind of unsexy, undercompensated maintenance work that all infrastructure requires – and, as with other kinds of infrastructure, it is undervalued by journals, academic departments, funders, corporations and governments.

But all technological debts accrue punitive interest. The decision to operate on bad data because good data is in short supply isn’t like looking for your car-keys under the lamp-post – it’s like driving with untrustworthy brakes and a dirty windscreen.

Devops

A Tired Raccoon’s Containerization Manifesto

By Glyph Lefkowitz, is a call to action to move to containerization. I have, on a personal (side-project) level, tried to stay away from containerization. I have been fond of a combination of Ansible and Fabric for the last decade or so.

However, this does seem like a great position to be in:

I’ve done it for a bunch of minor things I maintain and it’s improved my life greatly; I just re-build the images with the latest security updates every week or so and let them run on autopilot, never worrying about what previous changes have been made to the host. If you can do it, it’s worth it.

Can I do it? Should I? There’s only one way to find out.

Software Engineering

This section is mostly about the “PAGNI series”.

YAGNI exceptions

This “series” was triggered by Luke Plant’s YAGNI exceptions post:

I’m essentially a believer in You Aren’t Gonna Need It — the principle that you should add features to your software — including generality and abstraction — when it becomes clear that you need them, and not before.

However, there are some things which really are easier to do earlier than later, and where natural tendencies or a ruthless application of YAGNI might neglect them.

Luke’s collection at the time of writing:

Application of Zero One Many. I’ve seen a Client model with fields phone, phone2 and phone3, where phone2 and phone3 allow null values. This YAGNI exception, or “PAGNI”, is really worth the effort, i.e. creating a one-to-many relationship between Client and Phone in this case. I always try to embed this concept in my database model design, but I never knew there was a term for it!
Versioning to Embrace Change.
Logging. I’ve seen “over-logging”, and I’ve seen logs not being monitored, so I feel there are some caveats, especially now that monitoring tools like Sentry exist. These present incidents in a more manageable way compared to “linear” logging tools.
Timestamps. Fully agree on this! Django’s auto_now_add and auto_now make this a breeze.
Not going as minimalistic as possible with data. If not collected, data is lost forever. If collected, you can always create scheduled tasks to truncate it by timestamp.
Relational database. Well, I never worked on an application backend that wasn’t backed by a relational database. What Luke means is:

[..] if you need a database at all, you should jump to having a relational one straight away, and default to a relational schema, even if your earliest set of requirements could be served by a “document database” or some basic flat-file system.
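Several of the items above (Zero One Many, timestamps, truncating by timestamp, defaulting to a relational schema) can be sketched together using Python’s built-in `sqlite3` module. This is only an illustration under my own assumptions: the table names, column names and helper functions are hypothetical, and in a Django project the models and ORM would take the place of the raw SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE client (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- One row per phone number instead of phone, phone2, phone3 columns.
CREATE TABLE phone (
    id INTEGER PRIMARY KEY,
    client_id INTEGER NOT NULL REFERENCES client(id) ON DELETE CASCADE,
    number TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")


def add_client(name, numbers):
    """Insert a client with zero, one or many phone numbers."""
    cur = conn.execute("INSERT INTO client (name) VALUES (?)", (name,))
    client_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO phone (client_id, number) VALUES (?, ?)",
        [(client_id, number) for number in numbers],
    )
    conn.commit()
    return client_id


def phones_for(client_id):
    """Return the client's phone numbers in insertion order."""
    rows = conn.execute(
        "SELECT number FROM phone WHERE client_id = ? ORDER BY id",
        (client_id,),
    )
    return [row[0] for row in rows]


def purge_phones_before(cutoff):
    """Delete phone rows created before cutoff ('YYYY-MM-DD HH:MM:SS', UTC)."""
    cur = conn.execute("DELETE FROM phone WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

A fourth phone number needs no schema change here, and because every row carries a `created_at` value, a scheduled task can call something like `purge_phones_before` later on if the data turns out not to be needed.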

This post by Luke Plant led to a response post by Simon Willison in which he coins the term “PAGNI”:

PAGNIs: Probably Are Gonna Need Its

YAGNI—You Ain’t Gonna Need It—is a rule that says you shouldn’t add a feature just because it might be useful in the future—only write code when it solves a direct problem.

When should you over-ride YAGNI? When the cost of adding something later is so dramatically expensive compared with the cost of adding it early on that it’s worth taking the risk. On when you know from experience that an initial investment will pay off many times over.

Lukes’s exceptions to YAGNI are well chosen: things like logging, API versioning, created_at timestamps and a bias towards “store multiple X for a user” (a many-to-many relationship) if there’s any inkling that the system may need to support more than one.

Because I like attempting to coin phrases, I propose we call these PAGNIs—short for Probably Are Gonna Need Its.

The most relatable PAGNIs in Simon Willison’s list:

Automated deploys.
Continuous Integration (and a test framework)
Ideally with a testing styleguide
Which leads to continuous deployment, i.e. an automatic deploy when the tests pass.
API pagination. Which should be a no-brainer to someone who uses Django REST Framework (DRF) like I do.
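On the API pagination item: here is a rough, hand-rolled sketch of limit/offset pagination in plain Python. It is not how DRF implements it. The `paginate` function and the keys of the returned dictionary are my own hypothetical names, chosen only to show the shape of a paginated response.

```python
def paginate(items, limit=10, offset=0, max_limit=100):
    """Return one page of items wrapped in an envelope with paging info."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    limit = min(limit, max_limit)  # never let a client ask for everything
    page = items[offset:offset + limit]
    has_next = offset + limit < len(items)
    return {
        "count": len(items),
        "next_offset": offset + limit if has_next else None,
        "previous_offset": max(offset - limit, 0) if offset > 0 else None,
        "results": page,
    }
```

Returning an envelope from day one, rather than a bare list, means that clients already know where to look for the results and for the next page when the dataset grows.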

Simon concludes that all these drive down the cost:

One trick with all of these things is that while they may seem quite expensive to implement, they get dramatically cheaper as you gain experience and gather more tools for helping put them into practice.

Any of the ideas I’ve shown here could take an engineering team weeks (if not months) to add to an existing project—but with the right tooling they can represent just an hour (or less) work at the start of a project. And they’ll pay themselves off many, many times over in the future.

I digested this further between (1) reading it the first time, (2) bookmarking it, and (3) reading it again before writing about it.

Putting these in place does not only drive costs down.

They drive engineers’ stress down.

All these, once embedded in the team’s process to deliver software, reduce stress:

Stress to track down and resolve bugs and/or performance issues.
Stress in implementing new features alongside existing ones.
Stress in onboarding new team members to the project.
Stress in getting back to a project myself after some time not working on it!

So before complaining about burnout, next time I feel burnt out, I’ll go through the two lists above. And ask myself: what am I missing?

These items also make for a nice list of questions to ask your potential employer or client next time a job or project opportunity comes around.

P.S. Jacob Kaplan-Moss responded to the above two posts with a security-oriented post: Probably Are Gonna Need It: Application Security Edition.

Finally, another software engineering gem, unrelated to the PAGNI series:

A great video HTMX tutorial

By Matthew Freire on JustDjango.com. I admit I’m not a fan of video tutorials. I prefer consuming tutorials in written form.

But I did enjoy watching this one.

I already have trouble keeping up-to-date with everything backend and devops. HTMX feels empowering to someone like me.

With the frontend moving to frameworks such as Vue.js, ReactJS, etc., I had lost confidence in ever shipping a full product on my own, unless it had a rudimentary interface.

HTMX changes that!

For the written form refer to the author’s site: Django Formsets Tutorial - Build dynamic forms with Htmx

Writing Docs

Technical documentation writing quick tips

By Marijke Luttekes, splits quick tips into General Tips and Web-specific tips.

Of the General Tips this is the one I need to pay attention to the most:

Tip 7; Avoid down-talking your audience: People read your article to inform themselves on something they don’t know yet. Avoid down-talking your readers by leaving certain words out of your text.

Don’t tell readers that what you’re teaching them is “easy”, “beginner”, “simple” if it is not (and even if it is, be careful), otherwise you will leave a dent in their self-esteem.

The “sections list” above is my attempt to follow tip #1 from the Web-specific tips. For the record, I’ve followed the “manual” approach recommended in this answer on StackOverflow.

Each tip, while not technically advanced, might not be obvious. Each one improves writing quality, even for non-technical writing.

Miscellaneous

Hanging By A Thread

By Morgan Housel, goes through simple episodes that had, and still have, far-reaching consequences. With a great lesson to keep an open mind:

I try to keep two things in mind in a world that’s this fragile to chance.

One is to base your predictions on how people behave vs. specific events. Predicting what the world will look like in, say, 2050, is just impossible. But predicting that people will still respond to greed, fear, opportunity, exploitation, risk, uncertainty, tribal affiliations and social persuasion in the same way is a bet I’d take.

Another – made so starkly in the last year and a half – is that no matter what the world looks like today, and what seems obvious today, everything can change tomorrow because of some tiny accident no one’s thinking about. Events, like money, compound. And the central feature of compounding is that it’s never intuitive how big something can grow from a small beginning.

That’s all for Q3, folks!

Untangled Development