Off-Call

It's The Network with Leon Adato (part 1)

Episode Summary

Paige talks with Leon Adato about "the network" being an easy target to blame for system issues and how networking is in fact a shared responsibility between Dev and Ops. Leon shares unique experiences from 18 years holding the pager on-call. Part 1 wraps up with a reflection on learning your first monitoring tool (for Leon this was Tivoli) and how user expectations and the tools themselves have evolved.

Episode Notes

About Leon Adato

In my sordid career, I have been an actor, bug exterminator and wild-animal remover (nothing crazy like pumas or wildebeests. Just skunks, snakes, and raccoons.), electrician, carpenter, stage-combat instructor, ASL interpreter, and Sunday school teacher. Oh, yeah, I’ve also worked with computers.

While my first keyboard was an IBM Selectric, and my first digital experience was on an Atari 400, my professional work in tech started in 1989 (when you got Windows 286 for free on twelve 5¼” floppies when you bought Excel 1.0). Since then I’ve worked as a classroom instructor, courseware designer, helpdesk operator, desktop support staff, sysadmin, network engineer, and software distribution technician.

Then, about 25 years ago, I got involved with monitoring. I’ve worked with a wide range of tools: Tivoli, BMC, OpenView, janky perl scripts, Nagios, SolarWinds, DOS batch files, Zabbix, Grafana, New Relic, and other assorted nightmare fuel. I’ve designed solutions for companies that were modest (~10 systems), significant (5,000 systems), and ludicrous (250,000 systems). In that time, I’ve learned a lot about monitoring and observability in all its many and splendid forms.

Find Leon around the web:

 

Episode Transcription

Episode 1 - "It's The Network!" (part 1) - Leon Adato

[00:00:00] Introduction

Paige Cruz: Hi, Paige Cruz here, and I'm so excited to bring you the very first episode of Off-Call, featuring Leon Adato. We actually ended up covering so much ground, and there were just too many gems that I couldn't bear to edit out, so I've decided to split our conversation across three episodes. You're listening to part one, where we delve into the benefits of knowing just enough networking as a developer, Leon's experiences from 18 years on-call, including his worst and most surprising pages, and wrap up with reflections on the ins and outs of learning your first monitoring tool.

Enjoy!

Welcome to Off-Call, the podcast where we meet the people behind the pagers and learn a little bit about monitoring along the way. Today, I am absolutely delighted to be joined by Leon Adato, a fellow ex-Relic like myself and a whiz around networking, up and down the stack, and today we're going to get into talking about...

[00:01:14] It's Never The Network!

Paige Cruz: "It's the network!!" It's a common thing you hear developers lob as reasons for glitches and things going wrong in the system. And today we're going to unpack what is going on with the network, and is it always the network's fault?

Leon Adato: It's never the net- Oh my gosh!

Paige Cruz: Never!

Leon Adato: It's never the network! It's always DNS.

Paige Cruz: Sometimes.

Leon Adato: Okay, it's never the network except when it is.

And I think therein lies the rub, which is that for a lot of folks, especially DevOps folks who are on the dev-y side rather than the ops-y side, the network is this black box and they're not sure what's going on inside it. And therefore: I've checked all my stuff, so it must be that. Maybe not. And that's really the crux of our conversation.

Paige Cruz: Absolutely. And we still do have network engineers at many a company. I've had a whole networking team, so it's, "We'll let them look into it. It's got to be a network thing. That's what they're in charge of, right?" But I'm suspecting that it's a bit of a shared responsibility.

Something we on the dev side should care about as well.

Leon Adato: Yeah. I mean, everything in tech is a shared responsibility in the sense that, yes, there is such a thing as MTTI, right? Mean Time To Innocence. "What, me? Ha!" We want to keep that very low, but if you stop there and you really just walk away, you're a jerk.

There's still a problem, and you may not have to take responsibility or ownership for it, but you can be part of the problem-solving process. And also, I think anyone who's been in tech for more than 15 minutes knows that a lot of problems are multi-element. Yes, you're right, these three things weren't the problem, but there are two other things being masked by the network or the database or DNS or whatever it is. Once you resolve those, the other two things, which are your responsibility, now show up clearly, and you still need to be involved.

Paige Cruz: Absolutely. I am a big fan of, yeah, sharing operational responsibilities, including investigations. And that is a little bit of why I wanted to bring this podcast out into the world. I ask myself a lot, does the world need another podcast? My answer to that was no, for a very long time, until I reflected back on the privilege that you and I have both had in working for monitoring and observability companies.

And that gives us a pretty unique perspective in seeing behind the scenes, the backstage of how you render a metric chart over five years or three years, and what that means for every individual data point. There's a lot of stuff you learn, when you're responsible for building and operating a monitoring system, about how to do things better and how to interpret your data.

And that was a bit of why I wanted to bring you on. We can give developers the tools to say, "I confidently know it's not the network, and here is the data that I used to come to that conclusion." I think your ops and network team would be a lot happier if you came armed with facts and having done a mini investigation.

Leon Adato: Absolutely. And, to turn it around, a lot of us have had to work helpdesk, and we've all gotten that call, like "the internet is down," or whatever it is, and it's clearly not. But what if a user had called us and said, "Hey, server number 7 seems to be having 15 seconds, 15 full seconds of latency on it. And I think it's related to interface 5. And also, I think that one of the API function calls is running a little slow." If they had sent that, would you say, "Why are you hacking my systems?" No, you would say, "Would you like a spot on my team?"
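
To make that concrete, here is a minimal, hypothetical sketch, not something from the episode, of the kind of evidence a developer could gather before escalating: timing DNS resolution and a TCP connect using nothing but Python's standard library. The host and port below are placeholders.

    # Hypothetical example: time DNS resolution and a TCP connect so you can
    # bring numbers, not hunches, to the network team.
    import socket
    import time

    def check_endpoint(host: str, port: int = 443, timeout: float = 3.0) -> None:
        start = time.perf_counter()
        addr = socket.gethostbyname(host)  # DNS lookup ("it's always DNS")
        dns_ms = (time.perf_counter() - start) * 1000

        start = time.perf_counter()
        # A TCP connect shows whether the path and port are reachable at all
        with socket.create_connection((addr, port), timeout=timeout):
            connect_ms = (time.perf_counter() - start) * 1000

        print(f"{host}: DNS {dns_ms:.1f} ms, connect to {addr}:{port} {connect_ms:.1f} ms")

    check_endpoint("example.com")  # placeholder host

If both numbers come back healthy, the network and DNS probably aren't the culprits, and the investigation can move up the stack.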

Paige Cruz: Yeah.

Leon Adato: We should be that for each other.

We on the network team should be that for the database folks. We on the application development team should be that for the folks on the network team, or whatever. We should all have an interest, a curiosity, in how the piece that we're responsible for fits into the larger ecosystem.

[00:05:28] Leon's Best Worst Pages

Paige Cruz: Absolutely. So on this note, we're going to talk a little bit about your personal time being on the pager on-call, and then we are going to flip back to peeling back the layers of what data we have about the network and what it actually means. Because I do think we suffer from an abundance of data, and data is not necessarily information.

Data is just data. You need a bit of analysis and a bit of context around it. So how do we make sense of these piles of network data, whatever bits per second going across links? What does that all actually mean for us and our end users? So, thinking back to your time on-call, and just to confirm: you were not on-call as of this recording; you have left the life of pagers behind.

Leon Adato: It feels so good.

Paige Cruz: Yeah, it is. The water is warm over here. But thinking back to the times that you were on-call, I have to say, it's not always the most convenient responsibility to have, so do you have memories of a standout time that you've been paged and it's just been inconvenient or silly or strange?

Leon Adato: Yeah, okay, so I will mention that there is a top slot of worst pager call ever that I will not be sharing today, and I will allow your audience to let their imaginations roam wild and free as to what could possibly have happened to Leon that he wouldn't even be willing to talk about it, because I actually have no filter, and I will talk about almost everything.

Second place-

Paige Cruz: Understood.

Leon Adato: Yeah, second place goes to a boss who really loved testing whether I meant it when I said I was an Orthodox Jew and that I really wasn't online on Shabbat, and so would find increasingly obnoxious reasons to page me at those hours, to see if they would finally say the thing that would cause me to come online, so they could say, "Ha! I knew you were just slacking off!" And... nope.

Paige Cruz: "Gotcha." It's all been a front. What a, strange front to have. no,

Leon Adato: No. I would routinely turn the pager on after. And, by the way, someone else was covering on-call. It wasn't like on-call was going uncovered; I had arranged for somebody.

Paige Cruz: You had a plan!

Leon Adato: And so the rest of the team said, "Yeah, he said, blah, blah, blah, blah, blah, and oh, look, the data center's on fire." Literally, that was one of the messages. That was obnoxious. Please don't be like that.

Paige Cruz: Absolutely, yeah, this is what not to do, managers.

Leon Adato: Really. But the one that really comes to mind when you're talking about festive was the 2 am call. I got this more than once: 2 am, because there were snakes in an attic.

Now, for context, I was not working in tech. I was actually working for a pest control company that specialized in wild animal removal. So this was a normal call, but at 2 am, bleary-eyed and bushy-faced, about 20 years old, I was crawling up into somebody's attic, into the insulation, to go look for a snake that was also crawling through the insulation.

The good part about it is I live in Cleveland, Ohio. We do not really have a lot of, or any, poisonous snakes that are indigenous to the area, so I wasn't worried, necessarily. Coming through their ceiling was an option, but it was really just a matter of finding and then trapping the snake, who probably didn't want to be there any more than I did.

Paige Cruz: Totally. But something related to on-call: is the thing I'm being paged for an actual emergency? And I have to suspect that snake had been having the time of its life up in the attic. What was the rush? Why could this not have waited till 8am, till 10am?

Leon Adato: I struggle with that one, because if someone said "there's a snake in the attic" to me, I would be hard pressed to feel like, oh, that can wait.

Paige Cruz: Okay.

Leon Adato: Who among us hasn't watched Snakes on a Plane? Who among us hasn't worried about it? I can understand both sides.

Paige Cruz: If you know its location, you want to get it out before it has time to hide.

Leon Adato: Before it has time to go somewhere else, like the bathroom or whatever. Yeah, I can sympathize.

Paige Cruz: Okay, the 3am page. It's almost like a joke at this point: okay, the pager wakes you up and it's 3am. Yeah, sometimes it really does happen. But I've never been paged for a snake. I have a feeling you might take the top prize for that one. For best worst page.

Leon Adato: I hope I do. I hope that is honestly the worst thing anyone says, because obviously there are worse, and I hope that this is the absolute worst one you and all of your guests ever have had to deal with.

Paige Cruz: I'll reach out at the end of the year with the pages, the awards that I just made up.

[00:10:33] What Tivoli Taught Leon

Paige Cruz: Thinking back, I have a hypothesis that a lot of us build our understanding of monitoring and observability and system health really through the lens of the first tool we were introduced to, because everything after that, you compare to it. How is it like New Relic? How is it like Prometheus? Do you have fond memories of the first monitoring tool you used? What was it?

Leon Adato: That's such a loaded question. Do I have fond memories of it? I'm ignoring ping and traceroute, which I think are a lot of people's first experience with the concept of monitoring (how do I know something is there, or up, or not?), but those aren't really a robust monitoring system.

The first one I ever used... and we'll be honest, okay, the gray in my hair is earned. I'm 57. I've been around in tech for over 35 years now. I started in IT when Windows came for free on twelve 5 1/4" floppies.

Paige Cruz: The floppy era!!

Leon Adato: Yeah. And I mean five and a quarter inch, those, not the little ones. So anyway.

Not the compact ones. Yes, the genuinely floppy floppy disks. Anyway, my first monitoring tool was Tivoli, which, when I started using it, had already been acquired by IBM. Which made things a little bit interesting, because what people don't recognize is that Tivoli is actually a drinking company with a small software problem.

When you go to Planet Tivoli, which was their convention at the time, the bar opened at 10. It was-

Paige Cruz: Oh my!

Leon Adato: There's a whole other story going on with that. But so, Tivoli is really, basically, 15 Perl scripts in a trench coat. That is what you bought when you got Tivoli: a bunch of Perl scripts and an agent, like a super-agent that could do anything.

And that was it. So if you wanted to do something else, you had to code it; you had to take the existing scripts and modify them, or whatever. Do I have fond memories of that? I like Perl, and so I really enjoyed learning how to become a better Perl developer. Many people who are developers today are probably feeling waves of nausea coming over them as I say the words Perl and developer in the same sentence.

I apologize for that, but I had a good time. Also, the interesting thing about that first tool you use is that you have nothing to compare it to, and so it seems perfectly normal. The most outrageously impressive workflows seem totally fine.

Paige Cruz: Standard. Business as usual.

Leon Adato: Yeah, an example I use is that the first editor I ever used was Vi.

I was 16. It was the default editor on a bulletin board system I was on. And so it was that, or X, or a franken-app called Fred. And so I picked Vi, and :wq! Of course, that's how you save and get out. That makes perfect sense, right? Write, quit, and out.

Paige Cruz: So intuitive. Absolutely intuitive.

Leon Adato: And we really had no basis for comparison. And Tivoli was, okay, this is how monitoring and software distribution and inventory work, like this is how it works. Okay. It wasn't until later that I could look back on it from other tools and say, that was an absolute horror show. That was terrible. But, at the time, it really was top of its class.

Paige Cruz: I think that's what we forget. When we look back on the past, you have to take into account the context of what else you had going on and where tech was. It has not always been cloud, containers, microservice sprawl everywhere. We really built all this up from what sounded like simpler times.

Leon Adato: Yeah, it was a simplistic time, if nothing else. It was top of class. It was the best you could buy, and it cost, no exaggeration, a million dollars. And...

Paige Cruz: So monitoring being expensive is not new?

Leon Adato: Oh, no. Oh, no. Now, part of that was IBM. Yeah. Part of that was IBM just charging a million dollars for things.

And you could only run it on AIX servers, of course, because of course you could.

Paige Cruz: What is... may I ask what AIX is?

Leon Adato: It was IBM's custom version of Unix.

Paige Cruz: That's so wrong. Okay.

Leon Adato: That ran exclusively on IBM hardware, which cost extra. I mean, the million dollars was just for the software anyway.

Paige Cruz: Wow. So it's a whole ecosystem. Vendor lock-in, this is what I'm hearing, is not new.

Leon Adato: Yeah, no. In fact, we've gotten so far away from it over time that what we have now is minimal compared to back then. If you were a Sun shop, SPARC workstations running on Sun systems, that's what you were running on, and they could charge anything they wanted to. The days of CA and Unicenter and stuff. I know this is off track, but yeah, lock-in today is nothing like lock-in yesterday.

But going back to the original question: Tivoli, which I was responsible for maintaining, having never used it before. The agreement was, "Leon, we know you don't know this, so we're going to let you learn on our time and our dime, and you are going to do your best, and you're going to fix it, and you're not going to rage quit in the middle." And so it was a nice little partnership: we will forgive the mistakes you make, because you're going to keep on trying, because the company has nobody else.

Paige Cruz: They gave you that space to learn. What I love about that, and what's different from the situation a lot of devs find themselves in today, is that now it's "yeah, you can learn on the job," but we're gonna fill your time and pack your sprints so full that it's "just fit that learning in wherever you can. It's fine." And you're lucky to even get a learning budget these days.

So I would encourage companies and managers to listen to this and take some notes: give people space and time to learn, and, importantly, say, yes, mistakes will happen, we expect that, and that's okay, because in the end we will all have a better solution. That's the approach to take, because we really just throw people into monitoring. We hand them the pager after three months and say, "You've got to be onboarded by now, surely you know the ins and outs of incident response and our data. Good luck, have fun."

Leon Adato: And okay, so, cautionary tale: the thing I just explained, that they gave me space, was hard won.

They had, a year before that, brought in consultants to install Tivoli, again, a million dollars, plus another half million dollars for the consultancy itself. They had it all installed, set up, and then the leadership bought into the lie that this is self-maintaining: you set it and forget it, it's just gonna manage itself, and you just need teams to do the operations. So there was a software distribution team, there was a distributed monitoring team, there was an event correlation team, separate teams, remember, using the same framework. And one team would go in and change the framework-level settings, which would completely bork the two other teams without knowing it. And so you ended up, three months into it, with these internecine team wars of somebody changing the framework and not telling the other person, because they needed to get this done and hoped the other team didn't... blah blah blah. The entire system was collapsing under the weight of politics and lack of awareness, to the point where they either had to write off a million and a half dollars and go find something else, and there wasn't anything else, or they could take myself and two other people and say, "Look, you're just going to be in charge of the care and feeding of the Tivoli system, because it's obviously not set-it-and-forget-it, and we don't want to bring in another half million dollars of consultants to fix it, so good luck." So it was really nice that there was the space to learn, but they had already been to the bad place.

Paige Cruz: They felt the pain.

Leon Adato: Yeah, and so, managers, hopefully you don't have to piss on the electric fence first to know that it's a bad idea. There are already these cautionary tales out there. Give people space to learn, give people space to try and make mistakes. Maybe invest in a demo environment, or a QA environment, or whatever it is. I know it seems like it's a little bit extra for licensing, but it's actually not in the long run, etc.

It's really the right way to go.

Paige Cruz: A playground. Absolutely. Oh man, I almost love the cyclical nature of the trends, or just, when I hear stories of people starting out their career and I'm like, "Oh my gosh, I have been there. I have been the person that had to get put on the team because the monitoring system was falling over."

And in a strange way, the suffering kind of bonds us. I feel like I'm joining a long line of operators who have been there and done that.

Leon Adato: A long and august tradition of failure and pain. Welcome to the family.

Paige Cruz: So we're not alone. Things, I assume, are getting better.

[00:19:49] The On-Call Experience circa 2014

Paige Cruz: I personally had to retire from SRE. I was very burnt out from a few back-to-back stints at startups. Startups are a great place to get your hands on a lot of technology, often pretty cutting edge, but the broad level of responsibility while maintaining on-call for a small team just really got to me.

And so I was on-call for six years. How many years were you on-call, and how did you manage it? Because I had to tap out. I'm very honest about that. It's not for me anymore.

Leon Adato: Yeah, 18 years altogether, but I'm including the two years of pest control. Sixteen years of technical on-call: rotations as part of desktop support teams, sysadmin teams, monitoring engineering teams.

And my last on-call experience, my on-call rotation, ended in 2014, when I pivoted to start working for vendors, doing, whether we call it developer relations, advocacy, or technical evangelism, or whatever, spokesmodel for-

Paige Cruz: Oh, I like that one.

Leon Adato: Yes, really. Yeah, there we go. It was 16 years, and part of the reason why we didn't feel burnout was, again, it was a simpler time.

Again, 2014, when DevOps was just starting, and cloud... Amazon had, I think, S3 in 2014, but it hadn't become the thing that it is today. So most of the work was on premises. We were still very much in the Pets Not Cattle concept, and I'm not making this up: we had servers named Hugin and Munin, the two crows from Norse mythology, because they were all subsystems to Odin, the main server. We had routers that were named after the dwarves in Snow White. We knew that Odie and Garfield were the two VAX systems, and that if Odie was down, Garfield would take over, or whatever. We had named our pets, and we knew them on a relatively personal, not intuitive but deep, level. So when there was a failure, it was a known failure. It was simpler in the sense that there weren't microservices. There were no API calls. So when something failed, the troubleshooting process was fairly well known and well documented, because it followed a standard routine.

You either worked from the back of the box out to the wall, or from the wall in to the back of the box, those kinds of things. So on-call rotations were much more predictable in terms of what might happen.

Paige Cruz: Yeah.

Leon Adato: And that probably kept a lot of the burnout from happening, because just the pace of what could possibly be wrong was very different. That didn't make it any less stressful when I was in hour 46 of 48 hours staying awake because email was down. Oh, there were still those kinds of things. Yeah. For a very long time, at every company, there were a lot of applications and three religions. One religion was email, and then there were two other ones that depended on the company: it might be order entry, it might be customer relations, it might be whatever. But there were three religions that could never, ever be down, and then there was a bunch of other applications that you had to keep running also. And email was down for the company for 48 hours while we tried to bring it back from the dead.

Paige Cruz: Oh, that's... okay. Okay, simpler times, but incidents have always been stressful. Technology finds ways, and these systems find ways, to throw curveballs. And it sounds like there was maybe an implicit agreement or expectation that email was 100 percent available, would you say? It sounded like there was no tolerance from the users for downtime.

Leon Adato: Yeah, absolutely. Again, I won't say there was no internet, but the internet was email and a little bit of web browsing. Online order entry wasn't really a thing for some parts of that, again, those 16 years, which ended in 2014. For a long part of it, the internet wasn't a thing at all, and then it was just a cute little novelty.

You had one internet gateway, and people would walk out to it to go check AOL or something. Oh my gosh. Yeah, it was much less critical than it is now; obviously, it wasn't ubiquitous the way it is today. But even so, email was the means by which all corporate communication occurred.

[00:24:27] Wrap Up

Paige Cruz: Well, that wraps up our first episode.

From getting paged about snakes in an attic, to organizations learning that monitoring stacks don't just manage themselves, and so much more, I really hope that you enjoyed it. Stay tuned for part 2, coming soon, where we delve into how monitoring relates to the three things businesses and your CFO care about, ways Leon unwinds off-call, and getting your bearings with SLOs.

Thank you so very much for listening to Off-Call. And a big thank you to our sponsor Chronosphere, the one and only observability platform that gives you complete control so that you can focus on the data that matters, remediate faster, and optimize cost. Check us out on the web at chronosphere.io and check the show notes for the spelling on that.

Cheers!