DAVID LEDGERWOOD: Randy, thanks for joining me. Good to have you on.
RANDY LAYMAN: Thanks so much. It’s fun to be here.
LEDGE: Cool. Would you mind giving a background story of yourself and your work, let the audience know where you're coming from?
RANDY: Certainly. I've been in software for way longer than I care to admit, 20-something years, and now I'm at a company called AVOXI. We do telephone systems for around the world. Interesting to deal with foreign governments and their telephony regulations. A lot of the stuff that we do is pretty cool, working on the Google Cloud, Docker, Go, some of the cutting-edge technology. We try and be what I call forward-leaning.
I've done this for a while. I used to work at a company called Vocalocity – now goes under the name Vonage Business. A handful of patents doing some cool engineering stuff.
Now I am the VP of engineering, which means I just do some architecture and I got to deal with sucky stuff like budgets and the not-fun parts of this business too.
LEDGE: You had to take the path of leave the code and go manage. Did you choose that or it just happened? I have this conversation all the time. Like, how much do you get to code? Do you never get to code? Why did you go the management route? Sometimes people just want to stay engineers.
RANDY: Years ago when I was at Vocalocity, one of our board members came and talked to the engineering leadership and said, you know, as a rank-and-file developer, you should probably be spending 90% of your time writing code or individual technical contribution. Then as you get beyond that into a team lead, maybe 75%. Engineering management, maybe you're down to 60 or 50%. All the way up to the CTO, you still need to be doing this technical stuff. Maybe it's not writing code but it's executing test plans or architecting systems or really being on the technical side.
He was advocating that we still need to be like 10% all the way at the CTO level. That was, I thought, pretty cool. Now, we were a company of like a hundred people. It wasn't like the CTO of Cisco or Google where you got tens of thousands of people you're managing. I thought that was interesting, that you could still go through the management ranks and still be very technical.
For a large portion of my career, I've bounced back and forth between this architect role where it's purely technical, with the people relations of convincing teams as to why they want to build it that way, versus the more being in the management side and really doing the mentoring and the team-building aspect of, what's the right roles we need, what's the right composition we need, what's the right mix of different people we need within the team?
I kept bouncing back and forth for a while, and then probably the last four or five years has been just straight on the management track. Mainly because it's what the companies I've been at have needed, not necessarily that's what I want to do. Like you said, a lot of engineers, you want to get in, you just want to stay technical. I think that's great, that's wonderful, but that's not always what the company needs.
I've ended up in this role where we needed someone to do a little bit of the leadership, a little bit of the team management aspect of it. Go out and work with the recruiting team, go out and work with some of the product team to figure out what it is we need to do.
LEDGE: Right on. Before we started, I love this, you were talking about how you instituted a team book club. That you're doing some learning that I think is really pertinent. I'd love if you'd talk about that.
RANDY: So, one of the things at AVOXI we want to recognize is that everyone is not at the end of their career, and everyone still has this opportunity to grow and get better and improve themselves. It's now called book club.
Another thing, we're in the midst of launching a new product. For a lot of the team members, this is their first Software as a Service product or the first Software as a Service product that doesn't have a large operations team to support it. We have one DevOps engineer – and he’s awesome but he's only one guy and so we need to provide him a little bit of support.
So we're going through the Google SRE book, one chapter a week, and everybody reads the chapter. Then we sit around over lunch and talk about, well, what's that mean to us and how would we do monitoring? Or, how do our postmortem responses not quite line up with what Google recommends, and should they? Because not every problem is a Google-shaped problem.
The team's really responding to that well. In some ways it bonds the team together and says, hey, we're all going through this together, we're learning together, and that's really good. Then some of our junior engineers have this safe environment where they can say, "I don’t know what that means. Can you explain to me how what Google calls Borg really relates to what we're doing?" and of course that's Kubernetes, which we use a lot of.
Understanding how all the pods in the deployment or the services, everything in Kubernetes fits together. It's an environment where those junior devs wouldn’t have felt comfortable ahead of time really asking those questions. Now, it's a venue for us to really share some of the things that people might be embarrassed to ask about as well.
LEDGE: Yeah, right on. What's everybody learning? That's the interesting and high-value, high-attention space right now around this SRE conversation. The Google ecosystem obviously being, I think central to that, that it's started to gain traction because of their work.
So, what are some of those best practices in learning for people who, A, haven’t read the book but, B, haven't really implemented it in practice?
RANDY: There's a lot of answers to that question. The one thing I think the team has taken the most to so far is the SLI/SLO/SLA concept, and how that relates to an error budget.
Google talks a lot about how, as a system staying up, you have a certain amount of errors that are acceptable, and 100% uptime is incredibly complicated to get to. They make the comment that every additional nine you put on the end of your uptime is usually an order of magnitude more cost. At some point, for some businesses, it doesn't make sense, other than for the people who make health devices and nuclear power plants. So, for some people, 100% is great. Phone companies: AT&T established five nines as our target, but for some businesses it's not as high.
LEDGE: Five nines means, if I recall correctly, seven minutes of downtime per year.
RANDY: It's between six and seven, yeah. It's not a lot of downtime that you get every year.
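To make the "nines" arithmetic concrete, here is a minimal sketch that converts an availability target into allowed downtime. It assumes a 365-day year and counts every minute equally; the function name is illustrative. Under those assumptions five nines works out to roughly 5.26 minutes per year, slightly tighter than the figures quoted from memory in the conversation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def allowed_downtime_minutes(nines: int, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed in a period for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return period_minutes * unavailability


if __name__ == "__main__":
    # Each extra nine cuts the allowed downtime by a factor of ten.
    for n in range(2, 6):
        availability = 100 * (1 - 10 ** -n)
        print(f"{n} nines ({availability:.3f}%): "
              f"{allowed_downtime_minutes(n):.2f} minutes per year")
```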
LEDGE: Just put that in your head, because Gmail has not achieved that. Slack definitely hasn't achieved that. Thank God, because we all need to have lunch once in a while, but…
RANDY: The question is, how much would you have to pay, how many ads on Gmail would you have to look at for them to achieve five nines? How much would my monthly Slack bill be if they were giving me a five-nines service?
That's the intentional tradeoff and that's what we've really taken to heart, is the error budget. For us to be down for six minutes a month or for us to have X hundreds or thousands of calls that fail – which if any of our customers are listening, we don't want any of your calls to fail but the reality is a few of them might.
LEDGE: We also don't want to charge you $1,000 per call.
RANDY: Exactly. So how do we get to those tradeoffs? Keeping track of how many errors you have in a month and then converting that into a, “You know what? We've hit our error budget for the month. We've got to stop making feature changes and really focus on the quality of the product or just not make changes at all.”
LEDGE: Right. It's an actual tangible risk calculation that can be translated into dollars. That's really valuable.
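The error-budget idea discussed here can be sketched in a few lines. This is an illustration only, not AVOXI's implementation: the `ErrorBudget` class, the 99.9% target, and the call counts are all assumptions, and the budget is computed against calls seen so far rather than a forecast of the month's volume.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float         # target success ratio, e.g. 0.999 (assumed value)
    total_calls: int   # calls attempted so far this month
    failed_calls: int  # calls that failed so far this month

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates for the traffic seen so far."""
        return self.total_calls * (1 - self.slo)

    @property
    def remaining(self) -> float:
        """Failures still available before the budget is spent."""
        return self.allowed_failures - self.failed_calls

    @property
    def exhausted(self) -> bool:
        """True when the budget is used up and feature work should pause."""
        return self.remaining <= 0


if __name__ == "__main__":
    budget = ErrorBudget(slo=0.999, total_calls=1_000_000, failed_calls=1_200)
    if budget.exhausted:
        print("Error budget spent: freeze feature changes, focus on quality.")
    else:
        print(f"{budget.remaining:.0f} failures left in this month's budget.")
```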
RANDY: Exactly. The other thing that goes along with that then is also, how do you mitigate it?
LEDGE: Well, how do you track it at all? You're talking about an enormous amount of data. How do you even know? I think a lot of people would be like, "Hell, we haven't looked at our logs like that." How do you even process that?
RANDY: In telephony, we're a little bit fortunate in that phone calls come to us from our upstream carriers. So we have partners and we can get records from them of how many calls they sent us, and we know how many calls we actually processed. The delta is usually the errors.
Now we do have to go through sometimes and correct the vendor and say, "That call really wasn’t for us."
LEDGE: But you still have to, in some kind of machine way, trawl all that data then.
RANDY: Absolutely. In our case, you've got 30 different vendors who are providing you data in at least 30 different formats. Daily, you got to download that data, suck it into a database, and then cross-reference it with our own data to calculate the number that went sideways.
LEDGE: So, 30 different ETL transactions – or more than transactions, they're paradigms I guess.
RANDY: Thirty-one if you include our own data. We got to get our own data in there too.
LEDGE: Sure. When I talk to machine learning/AI folks, they talk about the dirty secret of all that is that you spend the first two years building ETL and data ingestion, and then finally you get to do something interesting with your data.
RANDY: For us, it's fortunate. We're not in the AI with unstructured data. It's much more of the structured format where we're able to say, okay, these columns map to those things. There's no heuristics. It's all straightforward algorithmic-driven.
We're very thankful in that regard that the vendors we have have relatively good formats, clean formats. Mostly it's CSV or, god forbid, XML files that you got to parse and load into some structured database.
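The reconciliation described above (load each vendor's records, normalize them, and compare against your own) might look roughly like this. Everything specific here is hypothetical: the vendor names, the column names, and the idea that a shared call ID is available in both data sets. A real pipeline would load into a database rather than in-memory sets.

```python
import csv
import io

# Hypothetical mapping: which column holds the call ID in each vendor's CSV export.
VENDOR_CALL_ID_COLUMN = {
    "vendor_a": "CallID",
    "vendor_b": "id",
}


def load_call_ids(vendor: str, csv_text: str) -> set[str]:
    """Parse one vendor's CSV export and return the call IDs it reports sending."""
    column = VENDOR_CALL_ID_COLUMN[vendor]
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader}


def missing_calls(sent_by_vendors: set[str], processed_by_us: set[str]) -> set[str]:
    """Calls the vendors say they sent that never show up in our own records."""
    return sent_by_vendors - processed_by_us


if __name__ == "__main__":
    vendor_a = load_call_ids("vendor_a", "CallID,Duration\na-1,30\na-2,45\n")
    vendor_b = load_call_ids("vendor_b", "id,secs\nb-1,10\n")
    ours = {"a-1", "b-1"}  # call IDs from our own processing records
    print(sorted(missing_calls(vendor_a | vendor_b, ours)))
```

Calls flagged this way are only candidates for the error count; as noted in the conversation, some turn out not to have been meant for you at all and need correcting with the vendor.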
LEDGE: Right. So, get back to the SRE stuff because that was a little bit of a tangent.
RANDY: It’s still on that topic of the error budgets. Then it's a, so how do we mitigate the risk? We have a new release going out, what do we want to do to make it so that it doesn't risk all of our phone calls? That's where you get the concepts of canary releasing and feature flags and segmenting your users into different clusters. None of which we do yet but all of which are, now because of this discussion, something the team's very interested and excited to do.
LEDGE: Absolutely. Feature flagging is a big one. I think some of these things a smaller company can do, smaller teams can do, and some are clearly out of scope for a limited team size.
The feature flagging is a thing that makes a lot of sense for anyone that has multiple customers and is trying to keep their release velocity up there.
RANDY: It's also a great way just to get early feedback from customers. We use our staging environments and give some selected users access to that to get their feedback. But if I could turn features on in production, I have a lot more stability in my production system because my CI/CD pipeline isn’t restarting it five times a day.
They're able to get that great experience and they can actually use the product all day long. That's something very beneficial that we're looking forward to, but haven't quite gotten to executing on yet.
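One common way to turn a feature on in production for only some users is a percentage rollout keyed on a stable hash. The sketch below is a generic illustration of that technique, not something the team described building; the `in_rollout` function, the flag name, and the user IDs are made up for the example.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically decide whether a user falls inside a percentage rollout.

    Hashing flag and user together gives each user a stable bucket from 0 to 99
    per flag, so the same user always gets the same answer and raising the
    percentage only ever adds users.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    # Hypothetical flag and users: expose the feature to about 10% of users.
    for user in ("user-1", "user-2", "user-3"):
        state = "on" if in_rollout("new-call-routing", user, 10) else "off"
        print(user, state)
```

Because the decision is a pure function of the flag and the user ID, no per-user state has to be stored, and a canary can be widened gradually by raising `percent`.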
LEDGE: CI/CD, how long down the road are you there? I often hear a lot about CI. Not as many have accomplished CD.
RANDY: We are not CD-to-production. We are CD into our staging environment.
So a developer merges code in. It goes through the unit testing on the master branch. We then deploy that into our Kubernetes staging cluster, depending upon what the change is. That might or might not cause massive changes.
You know, you change the wrong component and, yeah, we’ll rebuild a whole lot of stuff. If you change the right components, just one container restarts. We're doing CI/CD into staging and then we push into production about every week.
So, come Monday afternoon, we take a snapshot at staging. Say, here's all the images that we're running in Kube right now. Let's go put those into production. We get to push those out, right now about once a week. We want to hit the once a day, is our target for 2019.
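The weekly "snapshot staging, push to production" step amounts to comparing two sets of image versions. Here is a minimal sketch of that comparison, assuming each environment can be summarized as a mapping from deployment name to image tag; the names, the registry URL, and the `plan_promotion` helper are hypothetical, and no real cluster is queried.

```python
def plan_promotion(staging: dict[str, str], production: dict[str, str]) -> dict[str, str]:
    """Return the deployments whose production image differs from the staging snapshot."""
    return {
        name: image
        for name, image in staging.items()
        if production.get(name) != image
    }


if __name__ == "__main__":
    # Hypothetical snapshots: deployment name -> container image currently running.
    staging = {
        "api": "registry.example.com/api:1.5.0",
        "web": "registry.example.com/web:2.1.0",
    }
    production = {
        "api": "registry.example.com/api:1.4.2",
        "web": "registry.example.com/web:2.1.0",
    }
    for name, image in plan_promotion(staging, production).items():
        print(f"promote {name} -> {image}")
```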
LEDGE: Wow. Jeez, what's the path there? That's intense. Everybody wants to do that. Well done. Tell the stories. That's crazy.
RANDY: The biggest thing for us right now is more automated testing. We've spent a lot of time, a lot of effort on automated testing. Got to give a shout-out to Katie, one of our great QA engineers, who has done a ton of work in this area.
We're at a point now where we feel pretty good but we still have about two hours' worth of manual effort to certify a release. We don't really want to have Katie spend two hours every day certifying a release, so we need to get that down to more like a 5-to-10-minute runtime. Once we can get there, then I think we're able to crank that number up to releasing most things every day.
LEDGE: How big is the team? You talk about it like it’s not huge, so I'm curious.
RANDY: Nineteen people including myself. We're split across two offices and a couple of remotes.
LEDGE: How do you handle that collocation, hybrid, remote type of a thing? I hear a lot of folks talk that hybrid is actually the hardest team configuration.
RANDY: I think that hybrid is probably the hardest. I think the easiest is fully collocated. The second easiest is fully distributed. When you get into that hybrid, people are now, "Well, I'm expecting my interactions with one person to be one way but with another to be another way." That's hard.
For us, first of all the skillsets that we have remote are different. If you need to go work with telephony people, you're going to be working with people who are remote. If you need to work with the frontend devs or QA or DevOps or that middle API layer, they're going to be in the office with you.
In some ways, it's by skillset for us, and that's an artefact of where we can recruit talent – which is very hard to do especially in the telephony space. But then the other thing that we do is we do periodically get everybody together. It's important for that human connection. Even the two offices that we do have, we get everybody from one office to go visit the other one a couple of times a year.
LEDGE: Most of the leaders that I talk to in the hybrid environment talk about, everybody becomes sort of using the toolsets of fully remote.
It just so happens that you've got a lot of people who are remote but in their own desk, and happen to be next to everybody else because you have to do it or you lose that connectivity. Your muscle memory and behavior all go easily down the path of collocated, and you drop off your remote people.
I don't know if you've had that experience, or which tools or techniques work the best.
RANDY: I'll say, we do not have that experience. I've heard a bunch of people say the exact same thing and my thought is we're slightly different because of where that skillset divide is. If I've got a UI issue, if I got a screen that's not rendering, I don't have to open that up in the Slack channel, I can turn around and tap the guy on the shoulder.
A couple remote guys are really senior and they're really sharp. What we have in the offices is more of a spectrum – from a couple of people this is their first job, into the senior people. Mentoring is immensely hard remote. It's hard to see that guy struggling, even through the daily standups. So, having that same office, A, really helps you mentor them because you can see them, but it also helps mentor them because they're more comfortable looking and saying, "Oh, I don't see his headphones on. I can go ask him a question."
LEDGE: Right. Okay. It's like, unless you can actively manage accessibility and sort of online-ness and all those things.
I know in our own environments I had to play with different things, because I'm the guy that's like, "Is he on a recording? What's he doing?" I had LED lights on my desk and I coded different statuses like, "No, I'm listening to something. This is really important. Don't call me right now."
RANDY: In our offices we do what's called the headphone rule. If someone's got their headphones on, you send them a Slack message to say, "When you've got a chance, ping me." That's for us.
You’d asked about tools as well. We are big users of Slack. All of our engineers are on Slack and we're pretty much responsive on Slack all the time. We also do Slack calls. You're in the system and it's like our point is not getting across, let's hop on a Slack call real quick because it’ll be faster to talk it out.
Now that Slack has added screen share, we've even stopped using Google Hangouts for screen share. It's hop in, "Hey, can you see my screen? Here we go. Let's work through this."
LEDGE: Absolutely. What we also found is, internal heuristic was if you can't solve this in five minutes of Slack, call, because it's just not going to work. You'd see these half-hour long Slack missives, paragraph upon paragraph. It's like, "Y'all, just stop. Just pick the thing up."
We humans, all of us, especially engineers, seem to just go to, “I'm just going to type forever.” If you're not careful, the call becomes hard to make.
RANDY: I feel like it's inertia. I'm in this mode, I'm thinking in this mode, and I'm not thinking about changing my mode. I think that that's part of it. Then part of it is also the interruption. It's asynchronous – I'll type a message and they're going to read it when they get to it, even though I'm in the middle of a discussion and it's obvious that they're right there with me.
LEDGE: What do you recommend or just learnings from and interests in the telephony space? It's changed so much. Now, like you just said, you've got all these apps that are making calls and everything. Yet, you're building cutting-edge SaaS tools in telephony. How does all that fit together?
RANDY: I will say the one thing that I've learned the most doing telephony is, I'm amazed phone calls actually work. I talked to a guy who did credit card processing the other week and he's like, "You know, at times I'm amazed my card actually goes through just because of how complicated the systems are." Phone systems are the same way. Probably every system we've ever worked on is the same way. It gets so complicated.
Really for the communications part, that's the interesting part, is how they all come together. What we do as business communications is the public face of the business which, worldwide people still tend to phone call although SMS and chat are there – and that’s part of the AVOXI suite. We want to bring all those channels of communication together so that your sales team, your support team, it doesn't matter the channel the customer's on, you're able to be there and service them.
That's the product that we're working towards. Something like a Slack. Something like a Google Hangouts. Those are really geared towards internal collaboration and not quite the public face. While there are some plugins that do that in Slack and there are some connectors that Google builds for you, that's not their first-class citizen. They're not there to target the outside world.
That's really our first step, is dealing with the outside coming in. Then our secondary thought is then internal collaboration – which we still support but it's not the primary thrust of our product.
LEDGE: Right. You have to think about the, like you said, 31 different providers of all the ways that the customer can be getting to me as the business user from the outside. I suppose it's easy to forget about those things. That we only exist to work on Slack because there are people like you guys that are bringing us customers to talk to.
Well played. Well played.
Well, Randy, it's good to have you on. Thanks for the insights about the engineering team, and it's…
RANDY: Certainly. It's been a lot of fun. If you ever have a gap in your schedule in the future, I'd be happy to hop back in, we can talk some more.
LEDGE: Absolutely. Appreciate it.