Off-Call

It's The Network with Leon Adato (part 2)

Episode Summary

The second part (of three) of my conversation with Leon Adato

Episode Notes

It's here! Part 2 (of 3) of my conversation with Leon Adato!

We chat about how to relate monitoring to the only 3 things businesses care about, how the medical field's "See One, Teach One, Do One" can be adapted for onboarding devs, the real-life give and take with on-call, and our requirements for paging alerts!

Find Leon around the web:

Episode Transcription

Paige Cruz: Hello! This is Paige Cruz, and you're listening to Off-Call, a podcast brought to you by Chronosphere. And I'm back with the second part of three of my conversation with Leon Adato from Kentik. Now, in part one, we left off talking about an extended email outage Leon was on call for back in the day. 

Today, in part two, we'll kick off by discussing how developers can relate their work with monitoring to broader business goals, balancing life on and off call, and finding your bearings with SLOs. Let's dive in.

Leon Adato: I think every organization still has that breakdown of a couple of apps, two or three apps, that are religions. They must always be there. They are the pillars upon which the company, at least culturally, is built. And then there's a lot of applications. And the applications might be important, but they're not at that stage of business critical.  

An important thing as an IT professional is to know what those are, for two reasons. One, because you want to make sure they're up and running; as monitoring engineers, we both care about that. Two, if you want to make sure that your career is... and your projects get approved, always find a way to attach what you're doing to one of those religions.

Paige Cruz: Yes. 

Leon Adato: For monitoring engineers specifically... and I keep on saying that. "Monitoring engineer? That's not a thing." Yes, it is totally a thing! So as monitoring engineers, if we can contextualize the monitoring software against the three religions... And Paige, this is something we talked about last week, as of the time of this recording.

We were both at DevOpsDays Seattle and somebody asked, "How do you make something relevant to the business? How do you help?" I've understood for a long time, from talking with different CFOs and things, that businesses care about three things. Can you increase revenue? Can you decrease cost? Can you remove risk? And I think as monitoring engineers, we can speak to at least the second two.

If you can make monitoring show how we increase revenue, reduce cost, remove risk, right? Say it again: if we can show how we somehow decrease cost and remove risk to the business, that will help justify the work we do and allow them to enable us to do more of it.

Paige Cruz: Absolutely. I think the best thing you could do as a software engineer is really understand the business and the space that you are in, whether it's reading the financial statements of competitors who are public or asking how a competitor makes money. We should not be so distinct and so far away that we just work on the beeps and the boops and the computers. You do work for a business or an organization that has a mission. How are you connected to that? Answering that and finding your why, or why your team exists, is very important.

Leon Adato: Bob Lewis, who was a pundit in a few different technical magazines for years and years, said something probably 25-30 years ago that has stuck with me.

He said, "There are no technology projects. There are only business initiatives that have a technology component." If you don't recognize that, you will very quickly tack your way out of a job. You will continue to stress the wrong things to the wrong people at the wrong times. So that's another witticism or wisdom or whatever that you can take with you.
 

[00:03:33] On-Call Advice for Newbies 

Paige Cruz: On that theme, do you have any on-call advice for newbies? We've talked about how maybe the systems today are more complex, and yet there are so many parallels in the experience of the stress of incidents and this big responsibility of the pager that does affect your personal life. For those who are about to hop on the pager, whether they're a senior engineer at a new company or a new engineer joining on-call for the first time, what should they keep in mind?

Leon Adato: So the first piece of advice goes for both sides of the training experience. One, for the newbies, try to push for this model. For the people who are responsible for newbies, try to create or set up this model. And I call it "See One, Share One, Do One". It's modeled on the "See One, Teach One, Do One" that you find in the medical space.

See One, Share One, Do One: what I mean by that is, when you're talking about a skill or a process or a workflow, the first thing is that the new person should watch over the shoulder of the other person. Not hands on keyboard, not pair, not anything. Just watch the process start to finish. Ask questions either during or after, but make sure that they see what a good execution of that thing looks like. So when we're talking about on-call, that would be one whole on-call rotation where the new person is shadowing the other person.

Now, on-call is funny, right? It's not a constant, it's an interruption. That person has to be nearby, either in Slack or if you're in the old before times, sitting in a cubicle next to them, or whatever it is, but able to be right there when the pager goes off, when the message comes in. So that every time the regular person who's on-call gets a new ticket or a new message, they both jump into a shared space and they are seeing it and able to experience it in real time together. That's the See One.  

Share One is where that is happening, except that now there's a conversation. It's more like pair programming, or it's more a concerto for four hands. It is where both of you are working: "Oh, I'm going to go do this part, you do that part." It is up to the person who is teaching to direct that action. Do not expect the new person to say, "I'll do this part!" They will need to be given permission, and there's also going to be a lot of "You need to go do this... oh, you don't have access. Okay. We'll get you access in a minute." And so then that's the Share One.

Finally, the Do One is a reversal of the See One. It's not "on your own, good luck, kid." It is where the experienced person, the trainer, is there for every single page, every single call, looking over the shoulder of the new person and allowing them to direct the action, so that help is right there if they have a question or they're not sure.

Once you've done that, and sometimes it takes a couple of cycles. Sometimes you have to See One twice, and then Share One, and then Do One. Sometimes you have to See One once, but the Share One has to be a couple of rotations, and then the Do One they're comfortable with, or whatever it is. But once you've gone through that, somebody is really set up for success and nobody has to be dragged back in. There's no recrimination, finger pointing, blamestorming, etc. that goes on with it. That's my first and most important piece of advice.

Paige Cruz: Love it.  

Leon Adato: The second, for the newbies, is to clarify ahead of time what and who your resources are. Make sure you know before you start your on-call rotation where the knowledge bases are, where the docs are, and that you can access them. Again, make sure that you can get into every single system. It's easy enough to read a doc that says, "First go into the Gefrinkel system." And then you're on a call and there's an emergency and it's like, "Oh gosh, I can't get into Gefrinkel and the person who gives me access is off on vacation till next week." That's not the time to find out. That is the time you will find out. Know who the SMEs are for all the systems that you're responsible for. So again, know what on-call covers and what it doesn't cover. And know, if the system is really down, who the people are who own it and are in charge of it, so that you can call them in if you need to.

Before your rotation starts, meaning in the rotation before that one, spend an hour each morning reviewing the tickets or calls or pages or whatever it is. Review them with the person who is on-call. That's not the See One, Share One, Do One. That is separate, to say, "Oh, now I have a sense of the history of what has been happening right up to the moment when I'm stepping in." So that when that call comes, it's "Oh, that's been happening for the last three days," because you know that. That process has served me in good stead to make sure that I'm not completely overwhelmed or outgunned in on-call situations.

Paige Cruz: Wise words of advice. If you do not have a formal handoff process for your team, like hopping on a call and sharing the notes, get that started today. Especially if your team is growing or you're adding new people to the rotation, that explicit handoff matters. There's a lot of valuable information to get passed back and forth.

Leon Adato: Yeah, it's absolutely critical and organizations don't want to do it because obviously it takes time and it pulls people away from doing other stuff and blah, blah, blah. And yet, here is the question for a newbie to ask when they say,

"Oh, we don't really need to do this."

"So we don't really want to be successful at this then?"

"Oh no! You have to be successful."

"This is what I need to be successful."

It is as important as asking for accommodations with any sort of need. I won't say, it's not even disability. Any need you have, if you need accommodations, you have to ask for them. This is an accommodation, especially as you're starting. 

Paige Cruz: Oh yeah.  

Leon Adato: I need that hour of handoff. I need this in order to be successful. So either you don't want me to be successful, or we're going to find a way to make this work. 
 

[00:09:25] Leon's Off-Call Life 

Paige Cruz: We talked a lot about on-call. Let's talk about when you turn the pager off. What lights up your life when you're off call?

And conversely, when you were on-call, what were these pages and incidents taking you away from?

Leon Adato: Oh, that's good. Okay, my hobby is long walks on the beach and stuff like... No. Obviously, being an Orthodox Jew, religion is not really a hobby. However, a life that has a religious, moral, or ethical structure to it often has its own sort of rhythm or structure that takes up time in the day, and the week, and the season, and the year, and that also conflicts in many cases with on-call. You have to find a way to make them work together because on-call wants 100 percent of your time, and it becomes very difficult if you are required elsewhere, whether it's parenting or caring for another person, whether it's a little person or a big person or whatever, or a fur person. If you have care requirements, if you have religious requirements, if you have any of those things, getting a page in the middle of it is not possible.

There has to be a give and take to that. I spend a lot of time in my day and week, et cetera, with my faith, and that was always a piece that I had to find an accommodation for, or a way to manage and navigate.

Also, something that my family and I like to do: we like to play games. We're very into games, and not just Parcheesi and Monopoly. Never ever can play another game of Monopoly, ever, even like the Harry Potter one, I don't care, I'm monopolied out! We have King of Tokyo and Munchkin and Exploding Kittens and Too Many Bones and lots of other games. Both long-running games and short, very quick games, cooperative and competitive games. On Shabbat, anything with an on switch is off limits. In the summer, because Shabbat runs from sundown to sundown, you have all your daylight, and that's really long. The day isn't over until 9:30-10:00! Therefore, you want things to do if it's raining outside or what have you.

And I actually do archery. I don't really hunt. I just go out in the back and I shoot targets. I'm not competitive about it. It is very zen for me. If you follow my Instagram, it's mostly archery pictures because that's just fun and it's also not embarrassing as a 57-year-old white dude. There's not a lot of pictures that aren't usually incriminating in some way.

I like to read sci-fi, fantasy, mostly. 

Finally I'm a polyglot wannabe. I love languages. I like programming languages. I'm not very good at it. They call me a script kitty. It's an insult to scripts and kitties everywhere. But I do like programming languages. Perl. I'm also very comfortable in PHP. I dabble in Python because it begins with a P. So of course I have to do that one too.  

Paige Cruz: You've got the whole collection.  

Leon Adato: Yeah, all the P languages are all in there. 

Also, I support a Go open source library, so I do a little bit of Go and stuff like that. So that's what I do in the off hours. And that's what on-call is competing with, or at least used to compete with, because I haven't done it in years. I think I got out at the right time because I think if I was on-call now, it would be a lot more stress inducing, honestly.

Paige Cruz: Absolutely. What I really wanted to highlight is that just because we're on-call, yeah, there's not an incident happening every second of every day. But when you're in a 24/7 rotation for one week on primary, you do try to live your life, and it is really frustrating if you are supporting an unreliable system or you happen to have that large incident.

We are real people with real lives and hobbies outside of technology and it's nice to be able to indulge in that when we're not officially on the clock.  

Leon Adato: It is not hard to move through the world and see somebody walking out of the movie theater, out of the regular "legitimate" theater, out of the amusement park, walking out of spaces, on the phone saying, "Yeah, I just have to get to my laptop, give me a minute."
 

[00:13:32] Tempering On-Call With A Dose of Reality 

Leon Adato: The one that was most heartbreaking for me, I was at Cisco Live, one of the years. Cisco Live, for those people who've never been, is somewhere around 20,000-25,000 people, usually in Vegas. The tickets themselves are north of $1,500 apiece, not counting travel and hotel and all the rest of it. It is not a small thing to commit to going to Cisco Live for a week to learn and to hear talks and to build your career and grow.

I remember multiple years walking in on the first day, this massive, like just stream of humanity coming into the space. And there's an equally large stream of humanity coming out. This is the beginning of the day. And every single one of them you could see are on their cell phone. "So yeah no, I know it's down. I understand. I'm no, I'm jumping on a plane. I'm coming back. I get it." 

You're like, they just got here and now their job is demanding that they come back. That's another piece of "how do you balance on-call": make sure that on-call is tempered with a dose of reality. If you're going to an event, whether it's a two-day DevOpsDays or a week-long vendor event or whatever it is, make sure that there isn't some bizarre expectation that you will also be doing on-call. That there are people to cover, even if it's second-tier on-call. Remember, we're talking mostly about what we consider as front line.

But there's also escalation. Make sure that if you are an escalation point, there is a mitigation strategy so that you aren't being dragged out of these very expensive events or experiences. The joke is, "I'm going in for kidney surgery but I'll have my cell phone on the entire time and I'll be able to fix that system in about 10 minutes once the anesthesia wears off."

Paige Cruz: I'll be reachable.  

Leon Adato: Yeah, no. No, I won't! And it's okay.  
 

[00:15:27] The Only Experience That Matters Is The Users' 

Leon Adato: Back to what we were saying about the snake in the attic, some people would say that email being down is absolutely business critical. I think a lot of people would be like, "Mmmm, I think it can make it 15 minutes. I think that we can last that long."

That's part of it: now we have these SLAs and SLOs and SLIs. Be aware of what those are, make sure they're set reasonably and intelligently, and live within them. Every call can't be an emergency to you or to people around you.

Paige Cruz: Yes, and if you are confused about where to start, I think a lot of developers get bogged down in the SLI piece, the measuring.

The place I always start is the terms and conditions. If you have access to the sales contracts, I actually look at the legal language, at what we are legally promising and agreeing to with our customers, and then I work backwards. And you'll find a lot of interesting stuff in those contracts once you start looking.
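As a small worked example of "working backwards" from a contract, here is a minimal sketch. The 99.9% availability figure and the 30-day period are assumptions made up for illustration, not numbers from the conversation or from any real contract, and `allowed_downtime_minutes` is a hypothetical helper name.

```python
# Hypothetical example: turn a contractual availability promise into
# the number of minutes of downtime a period can absorb.
# The 99.9% figure and the 30-day period are illustrative assumptions.

def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed in `days` days at the promised availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


budget = allowed_downtime_minutes(99.9)
print(f"{budget:.1f} minutes of downtime per 30 days")
```

Seen this way, the contract gives a concrete budget, and an SLO can then be set somewhat tighter than the promise so there is room to react before the promise is broken.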

I encourage folks to not get bogged down in the weeds of this metric and the rate and the average over time (please just don't use averages), but look at what the customer promise is. The legally binding customer promise. Work backwards from there, and if it's unfamiliar or strange to go to the terms of service or contracts, good! This is part of learning about the business. I think it's a good growing pain to experience.
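To see why averages are a poor fit here, consider a minimal sketch with made-up latency numbers (the values and the `percentile` helper are illustrative assumptions, not data from the show). A handful of very slow requests barely registers in the mean, yet the affected users wait fifty times longer than everyone else.

```python
import math
import statistics

# Made-up latencies in milliseconds: 95 fast requests and 5 very slow ones.
latencies_ms = [100] * 95 + [5000] * 5


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

The mean lands at 345 ms, a latency that no request actually experienced, while the median shows the typical request at 100 ms and the 99th percentile exposes the 5,000 ms tail.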

Leon Adato: To emphasize something about that SLI, remember that SLIs, SLOs, SLAs, SLMO-USEs... it's almost always based on the user's experience. The reason I say that is that we get a lot of alerts and tickets and messages and interruptions that say that system number seven is experiencing 15 seconds of latency or whatever it is.

What's the user's experience? If the user's experience is the same, then yes, that is something that has to be fixed, but it's not an emergency. And I think that a lot of times, we draw too much of a straight-line correlation between an element, or even a subsystem, having a slow response, and the user's experience.

Almost all on-call tickets, the things that, again, drag you out of bed at two o'clock in the morning, the things that pull you away from what you were doing, should be focused on "user experience has been impacted in this way, and we need to resolve it." Not "a disk is full," or whatever. Those are things that have to be resolved, but they don't need to be a pageable offense.
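One way to picture that rule is a minimal sketch of alert routing. The `Alert` class, the `route` function, and the example alert names are all hypothetical, invented for illustration rather than taken from any real alerting tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    user_impacting: bool  # is the user's experience degraded right now?


def route(alert: Alert) -> str:
    """Page a human only when users are affected; everything else becomes a ticket."""
    return "page" if alert.user_impacting else "ticket"


# Illustrative alerts (names are made up).
alerts = [
    Alert("checkout requests failing for customers", user_impacting=True),
    Alert("disk 85% full on one host", user_impacting=False),
    Alert("system seven latency 15s, users unaffected", user_impacting=False),
]

for alert in alerts:
    print(f"{route(alert):6} <- {alert.name}")
```

The hard part in practice is deciding how `user_impacting` gets set. The sketch only shows that the paging decision hangs on that one question, with everything else queued for working hours.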

Paige Cruz: Yes!!!  

Leon Adato: That will also help you to make sure that the things you're interrupted for and the number of times you're interrupted while you're on-call will be reasonable and manageable.  

Paige Cruz: Everybody should have that as the goal. Reasonable and manageable and actionable pages. 
 

[00:18:14] Wrap Up 

Paige Cruz: That brings us to the end of part two today, with lots to mull over, like tying your monitoring work to either increasing revenue, decreasing costs, or managing risk, aka the three things businesses care about. And some recommendations for very fun board games you could be playing instead of getting paged by that inactionable high CPU alert.

I truly hope that you enjoyed listening today, and that you're looking forward to Part 3, which drops in the near future, where I ask Leon how much networking has really changed in the last decade and we unpack the essential need-to-know networking concepts for developers to get familiar with. Thank you again for listening to Off-Call, and thank you to our sponsor, Chronosphere, the observability platform that puts you in control.

Check us out on the web at www.chronosphere.io. And no, I won't be spelling all that. You can find it in the show notes. Cheers!