Navigating Hope and Fear in a Socio-Technical Future

I had just finished a talk on Living with Legacy, which covers a great number of concepts related to maintainability and viability in legacy systems, at UberConf in Denver when, the very next day, we had another great socio-technical meltdown. I wrote a bit about this in The No-IT Disaster. What happens when technology is such a critical part of our society, our finances, and our security, and yet it is delivered in the same unsafe ways it was during the dot-com bubble: fast, unstable, and poorly designed?

The recent IT outage is a very straightforward example of the very real and guaranteed set of socio-technical emergencies that will continue to get worse throughout the world in the coming decades if not handled now. This is not a phenomenon to be explained by a single set of decisions or culprits. This was not just Microsoft, CrowdStrike, and the operations teams of the affected companies, though I’m sure their executives will blame them. That is reasonable, if completely short-sighted. No, this is much more a systemic issue in technology, business and society. This situation was caused by an agreement between society and technologists to just not look too deep. To not slow down the giants, to believe the hype, to hope that technology can save us from technology, and to not take responsibility as a group for the failures. In all other complex systems of engineering, we have navigated this relationship, be it in buildings, airplanes, or energy. Only high technology has remained immune. That immunity is ending.

I predict this outage will be the first of many that will allow us to fix what is wrong with the relationship between current IT, vendors, employees, funding, and boards of directors.

What Failed

Since I work neither for one of the vendors involved nor for one of the companies that lost millions due to the outage, I cannot give the internal facts. I would likely not be allowed to anyway. However, having dealt with hundreds of these systems and thousands of the people who create, design, and manage them, I can comment on what my experience and our contacts are telling us.

Technology Failures

Lewis Curtis wrote a wonderful piece on the combined technology and business failures (Did the CrowdStrike incident reveal the biggest Achilles heel of Enterprise IT and Cloud system management?). Lewis is a great leader and also a board-certified distinguished architect, a designation that will become tremendously important in coming years.

Simply said, the CrowdStrike application had kernel-level access to Windows machines. CrowdStrike published an update to its platform that left these Windows systems unable to start up properly. This caused significant problems in data centers but even worse problems in hosted containers, as it is much more difficult to connect to unstartable containers. The result has been a dramatically slow recovery, as so much has to be done manually.

Other excellent pieces on the issue of technology are here:

https://www.nytimes.com/2024/07/19/business/microsoft-outage-cause-azure-crowdstrike.html

CrowdStrike—How Microsoft Will Protect 8.5 Million Windows Machines

CrowdStrike outage explained: What caused it and what’s next

Process Failures

A deployment pipeline in operations is a very complicated beast. In essence, it controls exactly what platform elements get put on a machine, device, or container. It also controls whether that device can retrieve updates itself and the level of automation used to do that. In a hardened deployment pipeline, nothing can get put on a machine or downloaded to the running system without first being tested in a quality assurance environment. Some of those tests would be automated, some manual. The fundamental principle of deployment is to trust as few actors as possible.

However, it is very difficult and expensive to run a truly hardened deployment pipeline. Staff issues, testing environments, variable trust ratings, and mission-critical levels all play a role in the process of deploying a fix or update to production. There are numerous modern ways of optimizing this, such as immutable infrastructure, canary releases, automated testing, etc., but most of the world’s systems are not modern, nor can we afford them to be (more on legacy modernization in my next article).
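To make the idea of a canary release a little more concrete, here is a minimal sketch in Python. It is an illustration only, not a description of how CrowdStrike, Microsoft, or any affected company deploys updates. The names `staged_rollout`, `apply_update`, `is_healthy`, and the stage percentages are all hypothetical, chosen for this example. The point it tries to show is simple: push the update to a small slice of machines first, check their health, and stop before the next stage if any of them fail.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    updated: list[str]  # hosts that took the update and passed the health check
    failed: list[str]   # hosts that took the update and failed the health check
    halted: bool        # True if the rollout stopped before reaching every host


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],  # hypothetical: installs the update on one host
    is_healthy: Callable[[str], bool],    # hypothetical: post-update health probe
    stage_percents: Sequence[int] = (1, 10, 50, 100),  # assumed, illustrative stages
) -> RolloutResult:
    """Roll an update out in growing stages, halting on the first unhealthy stage."""
    updated: list[str] = []
    failed: list[str] = []
    done = 0  # number of hosts already processed in earlier stages

    for percent in stage_percents:
        # Cumulative number of hosts that should be updated by the end of this stage.
        target = min(len(hosts), max(1, len(hosts) * percent // 100))
        batch = hosts[done:target]
        batch_failures = 0

        for host in batch:
            apply_update(host)
            if is_healthy(host):
                updated.append(host)
            else:
                failed.append(host)
                batch_failures += 1

        done = max(done, target)

        # Any failure in a stage stops the rollout; later stages are never touched.
        if batch_failures > 0:
            return RolloutResult(updated, failed, halted=True)

    return RolloutResult(updated, failed, halted=False)


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(200)]
    # Simulate a bad update: every host fails its health check after updating.
    result = staged_rollout(fleet, apply_update=lambda h: None, is_healthy=lambda h: False)
    print(result.halted, len(result.failed), "of", len(fleet), "hosts affected")
```

With a fleet of 200 hosts and a bad update, the first stage touches two machines and the rollout halts there, instead of all 200 failing at once. A real pipeline would add much more (automated tests before the first stage, rollback, soak time between stages, human sign-off for critical tiers), and each of those costs money and staff, which is exactly the tradeoff discussed above.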

It is critical that a skilled and qualified architect and an engineer review and approve these decisions related to society’s safety, the company’s risk appetite, and its executives’ business prioritization—in exactly that order.

Prioritization Failures

Most people, and likely all of the involved companies, will blame the technologists and the vendors for this, but the root of these problems was business prioritization and the competencies of people.

No system can be optimized to meet all levels of standards and [quality attributes](Quality Attributes | IASA – BTABoK). It is both technically impossible and exorbitantly expensive. So tradeoffs get made: no automated testing, no review of updates before they are pushed out, and other decisions to balance cost and time. People say, ‘oh we can trust our vendor so we can save money here and there’. But let us be perfectly clear. This is the responsibility of a board of directors: to protect their shareholders and customers from blatant failures as well as to optimize investment to produce the best results. Planes not being in the air is the worst result for an airline.

The board's responsibility is to appoint executives who deliver on these results, but no executive can have this level of daily involvement, nor are they likely to have the skills. For technical issues of this magnitude, the world depends on engineers, and, yes, architects. But titles do not make for responsibility or quality. [Only measured and managed competencies can do that](The Case For A Managed Career for Architects).

Thus, it is the responsibility of the board of directors to ensure that appropriate levels of professionals are applied to protect shareholder interests and to prioritize the appropriate work through executives. This has not been a priority at the board level before. I believe that an architecture practice should have board-level visibility and, yes, a certain degree of liability in these matters.

People Failures

We will spend time talking about the CrowdStrike commits and the infrastructure decisions, but this is much more about how we meet the coming demands of complex socio-technical systems.

I am putting a call out to the world of business and technology. Our current education and competency development methods are wholly inadequate to meet challenges of this level of complexity. The people who were involved in designing, developing, and running these complex systems have no clear competency measures in place. Instead, we depend on companies to train and develop their people completely independently. I can tell you for a fact how difficult it is to get companies to prioritize the training and managed experience of architects. Most of our vendors do it themselves. Most of our banks do it themselves.

It takes 5 to 8 years to create a board-certified professional-level architect, and roughly 8 to 10 or more to create a distinguished architect, even with optimum conditions. And we have anything but optimum conditions right now.

Iasa has been surveying, interviewing, and developing skills, knowledge, education, mentoring, and certification programs for over 20 years. We are currently the only organization to have a career management pathway. All other groups provide only certification. (The Open Group, BCS, and SEI all provide some aspects of the professional controls necessary, but they are all lacking in some areas as well.)

And currently, there is no consistent or required method to create and train these essential professionals. If I were on the board of a Fortune 2000 this would terrify me.

This situation leads to randomly skilled individuals being employed in these roles. It’s roughly akin to having random people take on the job of a surgeon. This is much less about passing certification and much more about how we generate between 50,000 and 100,000 new architects in the next ten years. Without this level of commitment and growth in professional skills, we will continue to have these massive systemic failures, and I believe they will get much worse.

A Bit of Context

I have a unique opportunity to get raw information with zero filtering from architects in organizations around the world, and while I never share details, I do get to hear exactly how many poor decisions live in production in the world’s most important systems. The technology debt. The manually held together connections. Think about the Amazon store… where people were secretly reviewing and approving transactions. Oh, it gets much more interesting out there.

Iasa communities and training classes interact with thousands of developers, architects, and managers. The modern technical world, with all its wonders, is somewhat like Venice. Move a couple of logs, and the whole place gets scary and may just fall into the sea.

Believe it or not, I am NOT a doomsayer, a conspiracy theorist, or even grumpy about the situation (though the lack of commitment and awareness is problematic). We had to build all these wonderful systems to get to the kind of modern society we currently enjoy. And while we could have made better decisions, that is true of everything in life.

But let me say this clearly: unless our societies, businesses, and governments commit to a few necessary changes, we will see a collapse of the things that depend so deeply on technology.

All the easy payments, the flights, Airbnbs, electricity, water, food… the supply chain. This belief comes from a study of modern approaches to technology as well as an awareness that we can avoid the worst of it, if only we work together.

The commitment will take some will, and it will cost some money, but the approach I suggest is neither new nor detrimental. It will not slow us down, or not for long. In fact, it will create net positives for all parties involved and allow us to scale our economies, technology, and humankind’s reach to unbelievable new heights.

Roughly 150 years ago, the Second Industrial Revolution was getting under way. It started with similar enthusiasm, investments, growth, and, yes, great human failures. It used child labor (AI labeling?), was horrible for the environment (data center CO2 and water?), had rickety construction techniques (modern software?), horrible working conditions (death marches?), and massive societal impacts (Instagram depression?), and it is the root of much of the world’s terrible approach to wealth.

And yet, without it, we would not be able to support the world’s population, nor have made such advancements in science and medicine. But the reason for that is… we learned to not let wealth, monopolies, power structures and convenience override the needs of workers and society related to engineering and construction. We built new trades and new professions. We educated, learned and shared working techniques. We also learned that building things well results in much longer-term wealth and growth for all. Think of all the engineering excellence that has come from learning how to build cleaner, better, faster, cheaper, but also more repeatably safe manufacturing and industrial management.

The Current Approach Won’t Keep Working

The modern technology stack is a volatile, complex, and controversial set of decisions. It can be deeply productive for a team or company, but it also comes with a lot of moving parts. What’s more, this has been true for a long time. And all of those teams have been hard at work for an even longer time. Think about the sprawl of all that code. Some estimate it at 2 trillion lines of code, deployed on billions if not trillions of platforms. And that number is just increasing. All of these decisions were made by people in companies and governments.

The layers of code, their versions, the connections, the customizations, the products, the patterns, the styles. A big, beautiful, messy, complicated set of decisions all interconnected through bigger and smaller pipes. Imagine the layers and ages of modern infrastructure above and below Manhattan. And those are relatively simple physical connections: pipes, electrical, foundations, steel supports, optical fibers. They lack the nuance of a thousand versions of protocols, connection types, data types, and above all dependencies on human systems. Phone routing, electrical grid, transportation, internet, commerce, payments, banking, health records. All layered in a kind of messy, archaic sprawl. As a technologist, it is both terrifying and beautiful. Grady Booch, one of my mentors and a great thought leader, called some of his work technology archeology for a reason.

A Call to Action

I see a number of fundamental steps to reaching our technology dreams. More regulation will not work by itself, as we are already facing huge compliance and regulation loads.

“Best Practice” becomes Working Practice

Do you read LinkedIn and Medium? Is RAG easy or hard? Are Microservices alive, dead, or did they ever even exist? Are SPAs the right design mechanism? Is DataMesh a pretty name for workgroup datamarts? Which patterns apply to what problem? Should architects code? Is agile dead or on its way back? What in the world is self-provisioning? Is platform engineering the newest way to sell DevOps platforms? And most of all, WHO DO WE BELIEVE?

I am an architect and I have been lucky enough to talk to and work with the greatest architects on this planet, from all over the world. But I have identified a nerve-wracking trend. Our thought leaders do not agree on basic practice, and our organizations believe they are all unique. We have a severe case of Snowflake thinking. Let’s look at a few possible issues. We need a repository that is free, managed by professionals, and independent of vendor marketing, hype, and coercion to maintain our working practices: what is safe, and what is not. [This is why we built the BTABoK](Business Technology Architecture Body of Knowledge | IASA – BTABoK), the only open source body of knowledge for all architects.

Name things and store techniques effectively.

Is that thing an Application, a Service, a Platform, or a Technical Capability? We have been working on the BTABoK for 15 years, and I can attest to that difficulty. Not that we can’t do it. But since there is no carrot or stick for agreement, the vendors and the thought leaders continuously redefine terms and techniques. Go try to look up the Statesman pattern in Microsoft, AWS, and other so-called Cloud Architecture Frameworks… and you will see what I mean. The vendors pay marketing teams millions to twist words.

Businesses need to allow, and even require, that their architects share working practice with other architects. This means opening up in professional circles about what is working and what isn’t.

Get Motivated by the Right Things

I was once told by a leader in a well-known company, “We do not care about our customers or how our technology fits their needs. We care about sales.” That is a direct quote from a major executive leader in a major vendor about how they approach sales. We laud every new book on architecture as a major new paradigm, when many of them simply rewrite a method from years ago with a new wrapper and a new name. It has taken us 20 years to roll out a project management methodology. The emperor has no clothes indeed.

We have all heard the phrases, “You don’t get fired for going with (insert vendor here)” and “Developers have to use the latest technology to be modern”. Or “build fast and break things”. Speed is not value. Vendors are not the solution. Each context requires a deeply capable and competent set of architects and engineers to clarify decisions and delivery strategy.

We must stop hoarding our IP and knowledge

Whether you like the BTABoK or not, it is free, it is open source, and it can be changed without a paid individual or corporate membership. Meaning: disagree with it! But help us improve it. It has patterns, structure, and a way of describing and growing quality attributes, value realization, business capabilities, engineering, complexity, and much more, and it is framework agnostic. It is designed to work for software or business architects equally. It has a robust, tested career path. If not the BTABoK, then pick something else that works in the same ways. You believe in open source, so let’s open source our IDEAS and our TECHNIQUES.

It is Time that Management Recognize Technologists

This far into the digital age, boards of directors need to be directly responsible for their technology decisions, down to the finest-grained detail. Mergers and acquisitions depend on technology. Healthcare, science, sales, stock, fiat: all of them are based on technologists. And yet, ‘the business’ is still everyone else. This should be a wake-up call. Poor executive understanding leads to negative decisions which have railroaded our dire warnings. You tell us to cut costs. Then you fire us when things explode, just like we told you they would.

Governments! People! It is time to hold the board responsible for technology failures. Then they will pay attention. This goes beyond a token C-Suite role held by someone who hasn’t been involved in a project in 20 years. The next time a company loses my credit card or identity, along with 72 million others, I want it to matter. The next time I have to cancel a client meeting due to a blackout, I want to know the board is taking technology seriously.

It is about quality of decisions not the amount you spend

It is not about just spending more; that isn’t really working. You must SPEND BETTER. I and other architects literally train for decades to both cut costs and make great investment decisions. Technical debt accrual, technical health goals, and technical strategy don’t just deserve a seat at the table. They are becoming the table.

A little more rationally, in all complex engineering fields, we are required to get signoff from legitimate professionals who have been measured against legitimate and hard-earned competencies. Not only does this create more stable outcomes, it actually saves and makes the economy money. Instead of ‘paying for two ok systems’, we pay for ‘one great one’.

Governments Must Regulate People Not Just Standards

In all complex engineering ecosystems, it is not just outputs and companies that are regulated. The role and skills of architects and engineers are not secret, and they really aren’t that different by company. I believe I am the world’s expert on architecture skills, or at least one of a dozen of them. I have interviewed and assessed hundreds of companies and thousands of architects. It is time to begin licensing. And it must be handed to a real professional society. It cannot be a vendor consortium. They create great standards. But standards do not create great people.

We are already working with many governments around the world. This is showing up in Singapore, in the Netherlands, in Ireland, and in the US (at the state level). It is time for governments to step in and help guarantee our safety by requiring inspections, licenses, and other professional-level guarantees. Not just anyone can perform a surgery, so why can anyone build a surgical support software and hardware system?

Create a Reasonable Typing, Categorization and Adoption Method for New Technology

It takes an average of 14 years for new medical techniques to be rolled out to people. This is for obvious reasons. New technology should not touch Tier 1 Socio-Technical systems. It is great for systems of engagement in non-critical sectors, though I will discuss better, faster, cleaner methods of adoption for ALL companies in future articles. This alone will create better outcomes, but there is a reason doctors are not allowed to market medicines and techniques without significant controls. We have to stop selling hype before we get blamed for its failures.

So in all of this I see a fulfillment of our role in society. I hope you will join us on our quest for value, effectiveness, ethics, and growth! Find out more at iasaglobal.org.