Sean Moriarity, creator of the Axon deep learning framework, co-creator of the Nx library, and author of Machine Learning in Elixir and Genetic Algorithms in Elixir, published by the Pragmatic Bookshelf, speaks with SE Radio host Gavin Henry about what deep learning (neural networks) means today. Using a practical example with deep learning for fraud detection, they explore what Axon is and why it was created. Moriarity describes why the BEAM is ideal for machine learning, and why he dislikes the term “neural network.” They discuss the need for deep learning, its history, how it offers a good fit for many of today’s complex problems, where it shines and when not to use it. Moriarity goes into depth on a range of topics, including how to get datasets in shape, supervised and unsupervised learning, feed-forward neural networks, Nx.Serving, decision trees, gradient descent, linear regression, logistic regression, support vector machines, and random forests. The episode considers what a model looks like, what training is, labeling, classification, regression tasks, hardware resources needed, EXGBoost, JAX, PyTorch Ignite, and Explorer. Finally, they look at what’s involved in the ongoing lifecycle or operational side of Axon once a workflow is put into production, so you can safely back it all up and feed in new data.
This transcript was automatically generated. To suggest improvements in the text, please contact content@computer.org and include the episode number and URL.
Gavin Henry 00:00:18 Welcome to Software Engineering Radio. I’m your host, Gavin Henry, and today my guest is Sean Moriarty. Sean is the author of Machine Learning in Elixir and Genetic Algorithms in Elixir, both published by the Pragmatic Bookshelf, co-creator of the Nx library, and creator of the Axon deep learning framework. Sean’s interests include mathematics, machine learning, and artificial intelligence. Sean, welcome to Software Engineering Radio. Is there anything I missed that you’d like to add?
Sean Moriarty 00:00:46 No, I think that’s great. Thanks for having me.
Gavin Henry 00:00:48 Excellent. We’re going to have a chat about what deep learning means today, what Axon is and why it was created, and finally go through an anomaly fraud detection example using Axon. So deep learning. Sean, what is it today?
Sean Moriarty 00:01:03 Yeah, deep learning I would say is best described as a way to learn hierarchical representations of inputs. So it’s essentially a composition of functions with learned parameters. And that’s really a fancy way to say it’s a bunch of linear algebra chained together. And the idea is that you can take an input and then transform that input into structured representations. So for example, if you give it an image of a dog, a deep learning model can learn to extract, say, edges from that dog in one layer and then extract colors from that dog in another layer, and then it learns to take those structured representations and use them to classify the image as, say, a cat or a dog or an apple or an orange. So it’s really just a fancy way to say linear algebra.
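The "composition of functions with learned parameters" idea can be sketched in a few lines. This is a minimal illustration in plain Python, not Axon or Elixir code; the helper names `dense` and `compose` and the hand-picked weights are assumptions made for the example, whereas in a real model the weights would be learned during training.

```python
def dense(weights, bias):
    """Return a layer: a function computing relu(W . x + b) for a list x."""
    def layer(x):
        out = []
        for row, b in zip(weights, bias):
            z = sum(w * xi for w, xi in zip(row, x)) + b
            out.append(max(0.0, z))  # ReLU non-linearity
        return out
    return layer


def compose(*layers):
    """Chain layers so the output of one becomes the input of the next."""
    def model(x):
        for layer in layers:
            x = layer(x)
        return x
    return model


# Hypothetical parameters chosen by hand for illustration only.
layer1 = dense([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])
layer2 = dense([[1.0, 1.0]], [0.0])
model = compose(layer1, layer2)

print(model([2.0, 1.0]))  # [2.5]
```

Each layer is just a matrix-vector product followed by a simple non-linearity, which is the sense in which a deep model is "linear algebra chained together."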
Gavin Henry 00:01:54 And what does Elixir bring to this problem space?
Sean Moriarty 00:01:57 Yeah, so Elixir as a language offers a lot in my opinion. So the thing that really drew me in is that Elixir I think is a very beautiful language. It’s a way to write really idiomatic functional programs. And when you’re dealing with complex mathematics, I think it simplifies a lot of things. Math is really well expressed functionally in my opinion. Another thing that it offers is it is built on top of the Erlang VM, which has, I would say 30 years of deployment success. It’s really a super powerful tool for building scalable fault tolerant applications. We have some advantages over say like Python, especially when dealing with problems that require concurrency and other things. So really Elixir as a language offers a lot to the machine learning space.
Gavin Henry 00:02:42 We’ll dig into the next section, the history of Axon and why you created it, but why do we need deep learning versus traditional machine learning?
Sean Moriarty 00:02:51 Yeah, I think that’s a good question. I think to start, it’s better to answer the question why we need machine learning in general. So back in, I would say like the fifties when artificial intelligence was a very new nascent field, there was this big conference of like academics, Marvin Minsky, Alan Turing, some of the more famous academics you can think of attended where they all wanted to decide essentially how we can make machines that think. And the prevailing thought at that time was that we could use formal logic to encode a set of rules into machines on how to reason, how to think about, you know, how to speak English, how to take images and classify what they are. And the idea was really that you could do this all with formal logic and this kind of subset grew into what is now called expert systems.
Sean Moriarty 00:03:40 And that was kind of the prevailing wisdom for quite a long time. I think there honestly are still probably active projects where they’re trying to use formal logic to encode very complex things into machines. And if you think of languages like Prolog, that’s kind of something that came out of this field. Now anyone who speaks English as a second language can tell you why this is maybe a very challenging problem, because English is one of those languages that has a ton of exceptions. And anytime you try to encode something formally and you run into these edge cases, I would say it’s very difficult to do so. So for example, if you think of an image of an orange or an image of an apple, it’s difficult for you to describe in an if-else statement style what makes that image an apple or what makes that image an orange.
Sean Moriarty 00:04:27 And so we need to encode things, I would say, probabilistically, because there are edge cases. Simple rules are better than rigorous or complex rules. So for example, it’s much simpler for me to say, hey, there’s an 80% chance that this picture is an orange. There’s a very popular example in Ian Goodfellow’s book Deep Learning. He says, if you try to come up with a rule for what birds fly, your rule would start as all birds fly except penguins, except young birds. And then the rule goes on and on, when it’s actually much simpler to say all birds fly or 80% of birds fly. I mean you can think of that as a way to probabilistically encode that rule there. So that’s why we need machine learning.
Gavin Henry 00:05:14 And if machine learning in general’s not suitable for what we’re trying to do, that’s when deep learning comes in.
Sean Moriarty 00:05:20 That’s correct. So deep learning comes in when you’re dealing with what’s essentially called the curse of dimensionality. So when you’re dealing with inputs that have a lot of dimensions or higher dimensional spaces, deep learning is really good at breaking down these high dimensional spaces, these very complex problems, into structured representations that it can then use to create these probabilistic or uncertain rules. Deep learning really thrives in areas where feature engineering is really difficult. So a great example is images: computer vision especially is one of the classical examples of deep learning shining, overtaking traditional machine learning methods early on in that space. And then large language models are just another one where, you know, there’s a ton of examples of natural language processing being very difficult for someone to do feature engineering on, and deep learning kind of blowing it away because you don’t really need to do any feature engineering at all, because you can take this higher dimensional complex problem and break it down into structured representations that can then be used to classify inputs and outputs essentially.
Gavin Henry 00:06:27 So just to give a brief example of the oranges and apples thing before we move on to the next section, how would you break down a picture of an orange into what you’ve already mentioned, layers? So ultimately you can run it through algorithms or a model. I think they’re the same thing, aren’t they? And then spit out a thing that says this is 80% an orange.
Sean Moriarty 00:06:49 Yeah. So if you were to take that problem like a picture of an orange and, and apply it in the traditional machine learning sense, right? So let’s say I have a picture of an orange and I have pictures of apples and I want to differentiate between the two of them. So in a traditional machine learning problem, what I would do is I would try to come up with features that describe the orange. So I might pull together pixels and break down that image and say if 90% of the pixels are orange, then this value over here is a one. And I would try to do some complex feature engineering like that.
Gavin Henry 00:07:21 Oh, the color orange, you mean.
Sean Moriarty 00:07:22 The color orange. Yeah, that’s right. Or if this distribution of pixels is red, then it’s an apple, and I would pass it into something like a support vector machine or a linear regression model that can’t necessarily deal with higher dimensional inputs. And then I would try my best to classify that as an apple or an orange. With something like deep learning, I can pass that into a neural network, which like I said is just a composition of functions, and my composition of functions would then transform those pixels, that high dimensional representation, into a learned representation. So the idea that neural networks learn like specific features, let’s say that one layer learns edges, one layer learns colors, is correct and incorrect at the same time. It’s kind of like at times neural networks can be a black box. We don’t necessarily know what they’re learning, but we do know that they learn useful representations. So then I would pass that into a neural network and my neural network would essentially transform those pixels into something that it could then use to classify that image.
Gavin Henry 00:08:24 So a layer in this parlance would be an equation or a function in Elixir.
Sean Moriarty 00:08:30 That’s right. Yeah. So we map layers directly to Elixir functions. So in like the PyTorch and in the Python world, that’s really like a PyTorch module. But in Elixir we map layers directly to functions.
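The hand-built feature described here ("if 90% of the pixels are orange, then this value is a one") might look roughly like the following Python sketch. The RGB bounds in `is_orange` and the 0.9 threshold are illustrative assumptions, not values from the episode; choosing them by hand is exactly the feature engineering effort that a deep learning model is meant to take over.

```python
def is_orange(pixel):
    """Rough, hypothetical test for an orange-ish (r, g, b) pixel."""
    r, g, b = pixel
    return r > 200 and 100 <= g <= 180 and b < 80


def orange_fraction(pixels):
    """Fraction of pixels in the image that pass the orange test."""
    return sum(1 for p in pixels if is_orange(p)) / len(pixels)


def mostly_orange_feature(pixels, threshold=0.9):
    """Engineered feature: 1 if at least `threshold` of the pixels are orange."""
    return 1 if orange_fraction(pixels) >= threshold else 0


# A toy 10-pixel "image": nine orange pixels and one green pixel.
image = [(255, 140, 0)] * 9 + [(30, 120, 30)]
print(mostly_orange_feature(image))
```

A traditional model such as a support vector machine would then be trained on a handful of features like this one rather than on the raw pixels.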
Gavin Henry 00:08:43 And to get the first inputs to the function, that would be where you’re deciding what part of an image you could use to differentiate things like the curve of the orange or the color or that type of thing.
Sean Moriarty 00:08:57 Yep. So I would take a numerical representation of the image and then I would pass that into my deep learning model. But one of the strengths is that I don’t necessarily need to make a ton of choices about what images or what inputs I pass into my deep learning model because it does a really good job of essentially doing that discrimination and that pre feature engineering work for me.
Gavin Henry 00:09:17 Okay. Before we get deeper into this, because I’ve got a million questions, what shouldn’t deep learning be used for? Because people tend to just grab it for everything at the moment, don’t they?
Sean Moriarty 00:09:27 Yeah, I think it’s a good question. It’s also a difficult question, I think.
Gavin Henry 00:09:32 Or if you take your consultancy hat off and just say right.
Sean Moriarty 00:09:35 Yeah. Yeah. So I think the problems that deep learning shouldn’t be used for obviously are just like simple problems you can solve with code. I think people have a tendency to reach for machine learning when simple rules will do much better. Simple heuristics might do much better. So for example, if I wanted to classify tweets as positive or negative, maybe a simple rule is to just look at emojis and if it has a happy face then you know it’s a happy tweet. And if it has a frowny face, it’s a negative tweet. Like there’s a lot of examples in the wild of just people being able to come up with clever rules that do much better than deep learning in some spaces. I think another example is the fraud detection problem. Maybe I just look for links with redirects: if someone is sending like phishing texts or phishing emails, I’ll just look for links with redirects in an email or a text and then say, hey, that’s spam. Regardless of if the link or if the actual content is spammy, just use that as my heuristic. That’s just an example of something where I can solve a problem with a simple solution rather than deep learning. Deep learning comes into the equation when you need, I would say, a higher level of accuracy or higher level of precision on some of these problems.
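The emoji heuristic for tweets needs no model at all. A minimal Python sketch follows, assuming small hypothetical emoji sets and a third "unknown" outcome for tweets the rule cannot decide; neither detail comes from the episode.

```python
# Hypothetical emoji sets; a real rule would likely cover many more.
POSITIVE = {"\U0001F642", "\U0001F600"}  # slightly smiling face, grinning face
NEGATIVE = {"\U0001F641", "\U0001F61E"}  # slightly frowning face, disappointed face


def tweet_sentiment(text):
    """Classify a tweet using only the emojis it contains."""
    has_positive = any(emoji in text for emoji in POSITIVE)
    has_negative = any(emoji in text for emoji in NEGATIVE)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    return "unknown"  # no emoji, or mixed signals


print(tweet_sentiment("Great day at the beach \U0001F642"))
```

A rule like this is cheap to run and easy to explain, which makes it a useful baseline before any training budget is spent.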
Gavin Henry 00:10:49 Excellent. So I’m gonna move us on to talk about Axon which you co-created or created.
Sean Moriarty 00:10:55 That’s correct, yes.
Gavin Henry 00:10:56 So what is Axon, if you could just go through that again.
Sean Moriarty 00:10:59 Yeah, Axon is a deep learning framework written in Elixir. So we have a bunch of different projects in the Elixir machine learning ecosystem. The base of all of our projects is the Nx project, which a lot of people, if you’re coming from the Python ecosystem, can think of as NumPy. Nx is implemented as a behaviour for interacting with tensors, which are multidimensional arrays in the machine learning terminology. And then Axon is built on top of Nx operations and it kind of takes away a lot of the boilerplate of working with deep learning models. So it offers ways for you to create neural networks, to create deep learning models, and then to also train them, to work with things like mixed precision, work with pre-trained models, et cetera. So it takes away a lot of the boilerplate that you would need. Now, for people getting introduced to the ecosystem, you don’t necessarily need Axon to do any deep learning, like you could write it all in Nx if you wanted to, but Axon makes it easier for people to get started.
Gavin Henry 00:11:57 Why was it created? There’s a lot of other open source tools out there, isn’t there?
Sean Moriarty 00:12:01 Yeah, so the project started really, I would say it was back in 2020. I was finishing college and I got really interested in machine learning frameworks and reverse engineering things, and I at the time had written this book called Genetic Algorithms in Elixir, and Brian Cardarella, the CEO of DockYard, which is an Elixir consultancy that does a lot of open source work, reached out to me and said, hey, would you be interested in working with José Valim on machine learning tools for the Elixir ecosystem? Because his assumption was that if I knew about genetic algorithms, those sound a lot like machine learning related, and it’s not necessarily the case. Genetic algorithms are really just a way to solve intractable optimization problems with pseudo evolutionary approaches. And he just assumed that, you know, maybe I would be interested in doing that. And at the time I absolutely was because I had just graduated college and I was looking for something to do, looking for something to work on and somewhere to prove myself I would say.
Sean Moriarty 00:12:57 And what better opportunity than to work with José Valim, who had created Elixir and really built this ecosystem from the ground up. And so we started working on the Nx project, and the project initially started with us working on a project called EXLA, which is Elixir bindings for a linear algebra compiler called XLA from Google, which is built into TensorFlow and that’s what JAX is built on top of. And we got pretty far along in that project and then kind of needed something to prove that Nx would be useful. So we thought, you know, at the time deep learning was easily the most popular and honestly probably less popular than it is now, which is crazy to say because it was still crazy popular then. It was just pre-ChatGPT and pre some of these foundation models that are out, and we really needed something to prove that the projects would work. So we decided to build Axon, and Axon was really like the first exercise of what we were building in Nx.
Gavin Henry 00:13:54 I just did a show with José Valim on Livebook, Elixir, and the complete machine learning ecosystem. So we do explore, just for the listeners there, what Nx is and all the different parts like Bumblebee and Axon and Scholar as well. So I’ll refer people to that because we’re just gonna focus on the deep learning part here. There are a few versions of Axon as I understand, based on influences from other languages. Why did it evolve?
Sean Moriarty 00:14:22 Yeah, so it evolved for I would say two reasons. As I was writing the library, I quickly realized that some things were very difficult to express in the way you would express them in TensorFlow and PyTorch, which were two of the frameworks I knew going into it. And the reason is that with Elixir everything is immutable and so dealing with immutability is challenging, especially when you’re trying to translate things from the Python ecosystem. So I ended up reading a lot about other attempts at implementing functional deep learning frameworks. One that comes to mind is Thinc (thinc.ai), which is I think by the people that created spaCy, which is a natural language processing framework in Python. And I also looked at other inspirations from like Haskell and other ecosystems. The other reason that Axon kind of evolved in the way it did is just because I enjoy tinkering with different APIs and coming up with unique ways to do things. But really the core of the framework is very, very similar to something like Keras and something like PyTorch Ignite, which is a training framework in PyTorch, and that’s because I want the framework to feel familiar to people coming from the Python ecosystem. So if you are familiar with how to do things in Keras, then picking up Axon should just be very natural because it’s very, very similar minus a few catches with immutability and functional programming.
Gavin Henry 00:15:49 Yeah, it’s really difficult creating anything to get the interfaces and the APIs and the function names correct. So if you can borrow that from another language and save some brain space, that’s a good way to go, isn’t it?
Sean Moriarty 00:16:00 Exactly. Yeah. So I figured if we could reduce the cognitive load or the time it takes for someone to transition from other ecosystems, then we could do really, really well. And Elixir as a language being a functional programming language is already unfamiliar for people coming from mutable languages and imperative programming languages like Python. So doing anything we could to make the transition easier I think was very important from the start.
Gavin Henry 00:16:24 What does Axon use from the Elixir machine learning ecosystem? I did just mention that show 588 will have more, but just if we can refresh.
Sean Moriarty 00:16:34 Yeah, so Axon is built on top of Nx. We also have a library called Polaris, which is a library of optimizers inspired by the Optax project in the Python ecosystem. And those are the only two projects really that it relies on. We try to have a minimal dependency approach where you know we’re not bringing in a ton of libraries, only the foundational things that you need. And then you can optionally bring in a library called EXLA, which is for GPU acceleration if you want to use it. And most people are going to want to do that because otherwise you’re gonna be using the pure Elixir implementation of a lot of the Nx functions and it’s going to be very slow.
Gavin Henry 00:17:12 So that would be like when a language has a C library to speed things up potentially.
Sean Moriarty 00:17:17 Exactly, yeah. So we have a bunch of these compilers and backends that I’m sure you get into in that episode and that kind of accelerates things for us.
Gavin Henry 00:17:26 Excellent. You mentioned optimizing deep learning models. We did an episode with William Falcon, episode 549 on that which I’ll refer our listeners to. Is that optimizing the learning or the inputs or how do you define that?
Sean Moriarty 00:17:40 Yeah, he is the PyTorch Lightning guy, right?
Gavin Henry 00:17:43 That’s right.
Sean Moriarty 00:17:43 Pretty familiar because I spent a lot of time looking at PyTorch Lightning as well when designing Axon. So when I refer to optimization here I’m talking about gradient based optimization or stochastic gradient descent. So these are implementations of deep learning optimizers like the Adam optimizer and, you know, traditional SGD and then RMSProp and some other ones out there, not necessarily optimizing in terms of memory optimization and performance optimization.
Gavin Henry 00:18:10 Now I’ve just finished pretty much most of your book that’s available to read at the moment. And if I can remember correctly, I’m gonna have a go here. Gradient descent is the example where you’re trying to measure the depth of an ocean and then you’re going left and right and the next measurement you take, if that’s deeper than the next one, then you know to go that way sort of thing.
Sean Moriarty 00:18:32 Yeah, exactly. That’s my sort of simplified explanation of gradient descent.
Gavin Henry 00:18:37 Can you say it instead of me? I’m sure you do a better job.
Sean Moriarty 00:18:39 Yeah, yeah. So the way I like to describe gradient descent is you get dropped in a random point in the ocean or some lake and you have just a depth finder, you don’t have a map and you want to find the deepest point in the ocean. And so what you do is you take measurements of the depth all around you and then you move in the direction of steepest descent or you move basically to the next spot that brings you to a deeper point in the ocean and you kind of follow this greedy approach until you reach a point where everywhere around you is at a higher elevation or higher depth than where you started. And if you follow this approach, it’s kind of a greedy approach but you’ll essentially end up at a point that’s deeper than where you started for sure. But you know, it might not be the deepest point but it’s gonna be a pretty deep part of the ocean or the lake. I mean that’s kind of in a way how gradient descent works as well. Like we can’t prove necessarily that wherever your loss function, which is a way to measure how good deep learning models do that your loss function when optimized through gradient descent has actually reached an optimal point or like the actual minimum of that loss. But if you reach a point that’s small enough or deep enough, then it’s the model that you’re using is going to be good enough in a way.
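The depth-finder analogy maps onto a very small loop. Below is a minimal Python sketch of plain gradient descent on a one-dimensional loss; the quadratic loss, the starting point, and the learning rate are illustrative assumptions, and optimizers such as Adam or RMSProp refine this same basic update.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, i.e. in the downhill direction."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# Hypothetical loss with a single lowest point at x = 3.
def loss(x):
    return (x - 3.0) ** 2


def grad(x):
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad, start=-5.0)
print(x_min, loss(x_min))
```

This toy loss has only one valley, so the loop ends up very close to the true minimum. The loss surface of a deep learning model has many valleys, which is why the procedure is only expected to find a point that is deep enough rather than the deepest one.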
Gavin Henry 00:19:56 Cool. Well let’s try and scoop all this up and go through a practical example in the remaining time. We’ve probably got about half an hour, let’s see how we go. So I’ve hopefully picked a good example to do fraud detection with Axon. So that could be, should we do credit card fraud or go with that?
Sean Moriarty 00:20:17 Yeah, I think credit card fraud’s good.
Gavin Henry 00:20:19 So when I did a bit of research in the machine learning ecosystem in your book, me and José spoke about Bumblebee and getting an existing model, which I did a search for on Hugging Tree.
Sean Moriarty 00:20:31 Hugging Face. Yep.
Gavin Henry 00:20:31 Hugging Face. Yeah, I always say Hugging Tree, and there’s things on there, but I just want to go from scratch with Axon if we can.
Sean Moriarty 00:20:39 Yep, yep, that’s fine.
Gavin Henry 00:20:40 So at a high level, before we define things and drill into things, what would your workflow be for detecting credit card fraud with Axon?
Sean Moriarty 00:20:49 The first thing I would do is try to find a viable data set and that would be either an existing data set online or it would be something derived from like your company’s data or some internal data that you have access to that maybe nobody else has access to.
Gavin Henry 00:21:04 So that would be something where your customer’s reported that there’s been a transaction they didn’t make on their credit card statement, whether that’s through credit card details being stolen or they’ve put ’em into a fake website, et cetera. They’ve been compromised somewhere. And of course these people would have millions of customers so they’d probably have a lot of records that were fraud.
Sean Moriarty 00:21:28 Correct. Yeah. And then you would take features of those, of those transactions and that would include like the price that you’re paying the merchant, the location of where the transaction was. Like if the transaction is somewhere overseas and you live in the US then obviously that’s kind of a red flag. And then you take all these, all these features and then like you said, people reported if it’s fraud or not and then you use that as kind of like your true benchmark or your true labels. And one of the things you’re gonna find when you’re working through this problem is that it’s a very unbalanced data set. So obviously when you’re dealing with like transactions, especially credit card transactions on the scale of like millions, then you might run into like a couple thousand that are actually fraudulent. It’s not necessarily common in that space.
Gavin Henry 00:22:16 It’s not common for what sorry?
Sean Moriarty 00:22:17 What I’m trying to say is if you have millions of transactions, then a very small percentage of them are actually gonna be fraudulent. So what you’re gonna end up with is you’re gonna have a ton of transactions that are legitimate and then maybe 1% or less than 1% of them are gonna be fraudulent transactions.
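The practical consequence of this imbalance can be shown with a few lines of arithmetic. The following Python sketch assumes a hypothetical set of 1,000 transactions with 1% fraud; it shows that a "model" that never flags anything still scores 99% accuracy, and it computes the inverse-frequency class weights that are one common way to compensate during training.

```python
# Hypothetical labels: 1 = fraud, 0 = legitimate (1% fraud).
labels = [1] * 10 + [0] * 990

# A useless baseline that predicts "legitimate" for every transaction.
always_legit = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(always_legit, labels)) / len(labels)
fraud_caught = sum(1 for p, y in zip(always_legit, labels) if p == 1 and y == 1)


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}


weights = class_weights(labels)
print(accuracy, fraud_caught, weights)
```

Here each fraudulent example would count 50 times as much as it otherwise would, while each legitimate one counts about half as much. This is also why accuracy alone is a misleading measure on a data set like this.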
Gavin Henry 00:22:33 And the phrase where they say rubbish in and rubbish out, it’s extremely important to get this good data and bad data differentiated and then pick apart what is of interest in that transaction. Like you mentioned the location, the amount of the transaction, is that a big specific topic in its own right to try and do that? Was that not feature engineering that you mentioned before?
Sean Moriarty 00:22:57 Yeah, I mean absolutely there’s definitely some feature engineering that has to go into it and trying to identify like what features are more likely to be indicative of fraud than others and
Gavin Henry 00:23:07 And that’s just another word for, in that big blob of JSON for example, we’re interested in the IP address, the amount, you know, or their spend history, that type of thing.
Sean Moriarty 00:23:17 Exactly. Yeah. So trying to spend some time with the data is really more important than going into and diving right into designing a model and training a model.
Gavin Henry 00:23:29 And if it’s a fairly common thing you’re trying to do, there may be data sets that have been predefined, like you mentioned, that you could go and buy or go and use you know, that you trust.
Sean Moriarty 00:23:40 Exactly, yeah. So someone might have already gone through the trouble of designing a data set for you and you know, labeling a data set and in that case going with something like that that’s already kind of engineered can save you a lot of time but maybe if it’s not as high quality as what you would want, then you need to do the work yourself.
Gavin Henry 00:23:57 Yeah because you might have your own data that you want to mix up with that.
Sean Moriarty 00:24:00 Exactly, yes.
Gavin Henry 00:24:02 So self improve it.
Sean Moriarty 00:24:02 Yep. Your organization’s data is probably gonna have a bit of a different distribution than any other organization’s data so you need to be mindful of that as well.
Gavin Henry 00:24:10 Okay, so now we’ve got the data set and we’ve decided on what features of that data we’re gonna use, what would be next?
Sean Moriarty 00:24:19 Yeah, so then the next thing I would do is I would go about designing a model or defining a model using Axon. And in this case like fraud detection, you can design a relatively simple, I would say feedforward neural network to start and that would probably be just a single function that takes an input and then creates an Axon model from that input and then you can go about training it.
Gavin Henry 00:24:42 And what is a model in Axon world? Is that not an equation, or a function rather? What does that mean?
Sean Moriarty 00:24:49 The way that Axon represents models is through Elixir structs. So we build a data structure that represents the actual computation that your model is gonna do, and then when you go to get predictions from that model or you go to train that model, we essentially translate that data structure into an actual function for you. So it’s kind of like additional layers in a way away from what the actual Nx function looks like. But in Axon, basically what you would do is you would just define an Elixir function and then you specify your inputs using the Axon input function and then you go through some of the other higher level Axon layer definition functions and that builds up that data structure for you.
Gavin Henry 00:25:36 Okay. And Axon would be a good fit for this versus for example, I’ve got some notes here, logistic regression or decision trees or support vector machines or random forests, they just seem to be buzzwords around Elixir and machine learning. So just wondering if any of those are something that we would use.
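The pattern of describing a model as data first and translating it into a function later can be illustrated outside Elixir. The Python sketch below is an analogy only: `Node`, `input_node`, `dense`, `sigmoid`, and `build` are hypothetical names invented for this example and are not Axon's API. It describes a tiny feed-forward fraud model with two input features and then turns that description into a `predict(params, x)` function.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One step of the model description (immutable, like an Elixir struct)."""
    op: str                # "input", "dense", or "sigmoid"
    parent: object = None  # the Node this step takes its input from
    units: int = 0         # output width


def input_node(features):
    return Node("input", None, features)


def dense(parent, units):
    return Node("dense", parent, units)


def sigmoid(parent):
    return Node("sigmoid", parent, parent.units)


def build(node):
    """Translate the description into a predict(params, x) function."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    chain.reverse()  # input first, output last

    def predict(params, x):
        dense_index = 0
        for step in chain:
            if step.op == "dense":
                weights, bias = params[dense_index]
                x = [sum(w * xi for w, xi in zip(row, x)) + b
                     for row, b in zip(weights, bias)]
                dense_index += 1
            elif step.op == "sigmoid":
                x = [1.0 / (1.0 + math.exp(-v)) for v in x]
        return x

    return predict


# Describe the model as data, then build the function from it.
model = sigmoid(dense(input_node(2), 1))
predict = build(model)

# Untrained, all-zero parameters: the output is sigmoid(0) = 0.5.
params = [([[0.0, 0.0]], [0.0])]
print(predict(params, [120.0, 1.0]))
```

Keeping the description separate from the parameters means the same structure can be inspected, trained, or used for prediction without being mutated, which suits a language where data is immutable.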
Sean Moriarty 00:25:55 Yeah, so in this case like you might find success with some of those models and as a good machine learning engineer, like one thing to do is to always test and continue to evaluate different models against your dataset because the last thing you want to do is like spend a bunch of money training complex deep learning models and maybe like a simple rule or a simpler model blows that deep learning model out of the water. So one of the things I like to do when I’m solving machine learning problems like this is basically create a competition and evaluate three to four, maybe five different models against my dataset and figure out which one performs best in terms of like accuracy, precision, and then also which one is the cheapest and fastest.
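The "competition" between candidate models comes down to scoring each one's predictions on the same held-out labels. Here is a minimal Python sketch, assuming two hypothetical candidates whose predictions are written out by hand; the standard definitions of accuracy, precision, and recall are used.

```python
def evaluate(y_true, y_pred):
    """Score binary predictions (1 = fraud) against the true labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical held-out labels and predictions from two candidate models.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
candidates = {
    "simple_rule": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "decision_tree": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
}

scores = {name: evaluate(y_true, preds) for name, preds in candidates.items()}
for name, score in scores.items():
    print(name, score)
```

Both candidates reach the same accuracy here, but one misses half of the fraud and the other raises a false alarm, so the better choice depends on which mistake is more expensive, along with training and serving cost.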
Gavin Henry 00:26:35 So the ones I just mentioned, I think they’re from the traditional machine learning world, is that right?
Sean Moriarty 00:26:41 That’s correct. Yep.
Gavin Henry 00:26:42 Yep. And Axon would be, yeah. Good. So you would do a sort of fight off as it were, between traditional and deep learning if you’ve got the time.
Sean Moriarty 00:26:50 Yep, that’s right. And in this case something like fraud detection would probably be pretty well suited for something like decision trees as well. And decision trees are just another traditional machine learning algorithm. One of the advantages is that you can kind of interpret them pretty easily but you know, I would maybe train a decision tree, maybe train a logistic regression model and then maybe also train a deep learning model and then compare those and find which one performs the best in terms of accuracy, precision, find which one is the easiest to deploy and then kind of go from there.
Gavin Henry 00:28:09 When I was doing my research for this example, because I was coming immediately from the rule-based mindset of how to try and tackle it, when we spoke about classifying an orange, you’d say right, if its color is orange or if it’s a circle, that’s where I came to for the fraud bit. When I saw decision trees I thought oh that’d be quite good because then you could say, right, if it’s not in the UK, if it’s greater than 200 pounds or if they’ve done five transactions in two minutes, that type of thing. Is that what a decision tree is?
Sean Moriarty 00:28:41 They essentially learn a bunch of rules to partition a data set. So like you know, one branch splits a data set into some number of buckets and it kind of grows from there. The rules are learned but you can actually physically interpret what those rules are. And so a lot of businesses prefer decision trees because you can tie a decision that was made by a model directly to the path that it took.
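The difference from hand-written rules is that the threshold is learned from the data. The Python sketch below is a deliberately reduced version of a decision tree: a single split (a "stump") on one feature, chosen by counting misclassifications. The transaction amounts and labels are made up for illustration, and real decision tree implementations typically use impurity measures such as Gini or entropy and split recursively.

```python
def best_split(values, labels):
    """Find the threshold t for the rule `value > t means fraud` with fewest errors."""
    best_threshold, best_errors = None, len(labels) + 1
    for t in sorted(set(values)):
        errors = sum((v > t) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


# Hypothetical transaction amounts and fraud labels (1 = fraud).
amounts = [12.0, 40.0, 75.0, 180.0, 250.0, 900.0]
labels = [0, 0, 0, 0, 1, 1]

threshold, errors = best_split(amounts, labels)
print(f"learned rule: amount > {threshold} means fraud ({errors} errors)")
```

The learned rule can be read back directly, which is the interpretability advantage described here: a flagged transaction can be traced to the exact branch that flagged it.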
Gavin Henry 00:29:07 Yeah, okay. And in this example we’re discussing could you run your data set through one of these and then through a deep learning model or would that be pointless?
Sean Moriarty 00:29:16 I wouldn’t necessarily do that. I mean, in that case you would be building essentially what’s called an ensemble model, but it would be a very strange ensemble model, like a decision tree into a deep learning model. Ensembles are pretty popular, at least in the machine learning competition world. Ensembles are essentially where you train a bunch of models and then you also take the predictions of those models and train a model on those predictions, and then it’s kind of like a Socratic method for machine learning models.
Gavin Henry 00:29:43 I was just thinking about something to whittle through the data set to get it sort of sorted out and then shove it into the complex bit that would tidy it up. But I suppose that’s what you do on the data set to begin with, isn’t it?
Sean Moriarty 00:29:55 Yeah. And so that’s common in machine learning competitions because you know like that extra 0.1% accuracy that you might get from doing that really does matter. That’s the difference between winning and losing the competition. But in a practical machine learning environment it might not necessarily make sense if it adds a bunch of additional things like computational complexity and then complexity in terms of deployment to your application.
Gavin Henry 00:30:20 Just as an aside, are there deep learning competitions like you have when they’re working on the latest password hashing type thing to figure out which way to go?
Sean Moriarty 00:30:30 Yeah, so if you go on Kaggle, there’s actually a ton of active competitions and they’re not necessarily deep learning focused. It’s really just open-ended. Can you use machine learning to solve this problem? So Kaggle has a ton of those and they’ve got a leaderboard and everything and they pay out cash prizes. So it’s pretty fun. Like I have done a few Kaggle competitions, not a ton recently because I’m a little busy, but it is a lot of fun and if people want to use Axon to compete in some Kaggle competitions, I would be more than happy to help.
Gavin Henry 00:30:59 Excellent. I’ll put that in the show notes. So the data we should start collecting, do we start with all of this data we know is true and then move forward to sort of live data that we want to decide is fraud? So what I’m trying to ask in a roundabout way here, when we do the feature engineering to say what we’re interested in is that what we’re always gonna be collecting to feed back into the thing that we created to decide whether it’s gonna be fraud or not?
Sean Moriarty 00:31:26 Yeah, so typically how you would solve this, and it’s a very complex problem, is you would have a baseline of features that you really care about but you would do some sort of version control. And this is where the concept of feature stores comes in, where you identify features to train your baseline models and then as time goes on, let’s say your data science team identifies additional features that you would like to add, maybe they take some other features away, then you would push those features out to new models, train those new models on the new features and then go from there. But it becomes kind of like a nightmare in a way, like a really challenging problem, because you can imagine if I have one model that is trained on the snapshot of features that I had today and then I have another model that’s trained on a snapshot of features from two weeks ago, then I have these systems that need to rectify, okay, at this point in time I need to send these features to this model and these new features to this model.
Sean Moriarty 00:32:25 So it becomes kind of a difficult problem. But if you just only care about training, getting this model over the fence today, then you would focus on just the features you identified today and then you know, continue improving that model based on those features. But in the machine learning deployment space, you’re always trying to identify new features, better features to improve the performance of your model.
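The routing problem described above (sending each model exactly the features it was trained on) can be sketched as a versioned lookup. This is a minimal Python illustration only; the version names, feature names, and the `features_for` helper are hypothetical and do not come from any particular feature store.

```python
# Hypothetical feature-set versions: which columns each model version was trained on.
FEATURE_SETS = {
    "v1": ["amount", "country"],
    "v2": ["amount", "country", "ip_subnet"],
}

def features_for(model_version, transaction):
    """Select, in training order, the features a given model version expects."""
    return [transaction[name] for name in FEATURE_SETS[model_version]]

transaction = {"amount": 250.0, "country": "GB", "ip_subnet": "203.0.113.0/24"}

print(features_for("v1", transaction))  # older model sees the older snapshot
print(features_for("v2", transaction))  # newer model also receives the new feature
```

Keeping the mapping in one place is what makes it possible to run models trained on different feature snapshots side by side.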
Gavin Henry 00:32:48 Yeah, I suppose if some new type of data comes out of the bank to help you classify something, you want to get that into your model or a new model like you said straight away.
Sean Moriarty 00:32:57 Exactly. Yeah.
Gavin Henry 00:32:58 So now we’ve got this data, what do we do with it? We need to get it into a form someone understands. So we’ve built our model which isn’t the function.
Sean Moriarty 00:33:07 Yep. So then what I would do is, so let’s say we’ve built our model, we have our raw data. Now the next thing we need to do is some sort of pre-processing to get that data into what we call a tensor or an Nx tensor. And so how that will probably be represented is I’ll have a table, maybe a CSV that I can load with something like Explorer, which is our data frame library that is built on top of the Polars project from Rust. So I have this data frame and that’ll represent like a table essentially of input. So each row of the table is one transaction and each column represents a feature. And then I will transform that into a tensor and then I can use that tensor to pass into a training pipeline.
Gavin Henry 00:33:54 And Explorer, we discussed that in show 588, that helps get the data from the CSV file into an Nx sort of data structure. Is that correct?
Sean Moriarty 00:34:04 That’s right, yeah. And then I might use Explorer to do other pre-processing. So for example, if I have categorical variables that are represented as strings, for example the country that a transaction was placed in, maybe that’s represented as the ISO country code and I want to convert that into a number because Nx does not speak in strings or any of those complex data structures. Nx only deals with numerical data types. And so I would convert that into a categorical variable either using one hot encoding or maybe just a single categorical number, like zero to 64, 0 to like 192 or however many countries there are in the world.
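The two encodings mentioned above, a single categorical number and one-hot encoding, look like this in a minimal Python sketch. The country codes are invented sample data, and the vocabulary is built from the data itself purely for illustration; in Elixir this step would be done with Explorer and Nx rather than plain lists.

```python
# Invented sample column of ISO country codes.
countries = ["GB", "US", "FR", "GB", "DE"]

# Build a vocabulary: each distinct code gets a stable integer index.
vocab = sorted(set(countries))
index = {code: i for i, code in enumerate(vocab)}

# Option 1: a single categorical number per row.
integer_encoded = [index[code] for code in countries]

# Option 2: one-hot encoding, one column per country with a single 1 per row.
one_hot = [
    [1 if i == index[code] else 0 for i in range(len(vocab))]
    for code in countries
]

print(vocab)
print(integer_encoded)
print(one_hot[0])
```

The integer form is compact but implies an ordering between countries that does not exist; the one-hot form avoids that at the cost of one column per category.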
Gavin Henry 00:34:47 So what would you do in our example with an IP address? Would you geolocate it to a country and then turn that country into an integer from one to what, 256 main countries or something?
Sean Moriarty 00:35:00 Yeah, so with something like an IP address, I might try to identify the ISP that that IP address originates from. I think something like an IP address I might try to enrich a little bit further than just the IP address. So take the ISP, maybe identify if it originates from a VPN or not. I think there might be services out there as well that identify the percentage of likelihood that an IP address is harmful. So maybe I take that harm score and use that as a feature rather than just the IP address. And you potentially could, let’s say, break the IP address into a subnet. So if I look at an IP address and say okay, I am gonna have all the /24s as categorical variables, then I can use that and then you can kind of derive features in that way from an IP address.
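Deriving the /24 subnet as a categorical feature can be sketched with Python's standard `ipaddress` module. The `enrich` helper, the feature names, and the tiny VPN lookup set are hypothetical stand-ins; a real ISP, VPN, or harm-score lookup would come from an external service that is not shown here.

```python
import ipaddress

# Hypothetical stand-in for an external VPN lookup service.
KNOWN_VPN_SUBNETS = {"198.51.100.0/24"}

def subnet_24(ip):
    """Return the /24 network containing an IPv4 address, as a string."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))

def enrich(ip):
    """Turn one raw IP address into several derived features."""
    subnet = subnet_24(ip)
    return {
        "ip_subnet_24": subnet,                      # categorical feature
        "is_vpn": int(subnet in KNOWN_VPN_SUBNETS),  # 0/1 flag from the lookup
    }

print(enrich("203.0.113.77"))
print(enrich("198.51.100.9"))
```

The subnet string would then be encoded as a number like any other categorical variable before it reaches the model.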
Gavin Henry 00:35:46 So the original feature of an IP address that you’ve selected at step one for example, might then become 10 different features because you’ve broken that down and enriched it.
Sean Moriarty 00:35:58 Exactly. Yeah. So if you start with an IP address, you might do some further work to create a ton of different additional features.
Gavin Henry 00:36:04 That’s a massive job isn’t it?
Sean Moriarty 00:36:05 There’s a common trope in machine learning that like 90% of the work is working with data and then you know, the fun stuff like training the model and deploying a model is not necessarily where you spend a lot of your time.
Gavin Henry 00:36:18 So the model, it’s a definition and a text file isn’t it? It’s not a physical thing you would download as a binary or you know, we run this and it spits out a thing that we would import.
Sean Moriarty 00:36:28 That’s right, yeah. So the actual model definition is code, and when I’m dealing with machine learning problems, I like to keep the model as code and then the parameters as data. So that would be the only binary file you would find. We don’t have any concept of model serialization in Elixir because, like I said, my principle or my thought is that your model is code and should stay as code.
Gavin Henry 00:36:53 Okay. So we’ve got our data set, let’s say it’s as good as it can be. We’ve got our modeling code, we’ve cleaned it all up with Explorer and got it into the format we need and now we’re feeding it into our model. What happens after that?
Sean Moriarty 00:37:06 Yeah, so then the next thing you would do is you would create a training pipeline or you would write a training loop. And the training loop is what’s going to apply that gradient descent that we described earlier in the podcast on your model’s parameters. So it’s gonna take the dataset and then I’m going to pass it through a definition of a supervised training loop in Axon, which uses the Axon.Loop API, conveniently named. And that essentially implements a functional version of training loops. If you’re familiar with Elixir, you can think of it as like a giant Enum.reduce and that takes your dataset and it generates initial model parameters and then it goes through the gradient descent process and continuously updates your model’s parameters for the number of iterations you specify. And it also tracks things like metrics like say accuracy, which in this case is kind of a useless metric for you to track because let’s say that I have this data set with a million transactions and 99% of them are legit, then I can train a model and it’ll be 99% accurate by just saying that every transaction is legit.
Sean Moriarty 00:38:17 And as we know that’s not a very useful fraud detection model because if it says everything’s legit then it’s not gonna catch any actual fraudulent transactions. So what I would really care about here is the precision and the number of true negatives, true positives, false positives, false negatives that it catches. And I would track those and I would train this model for five epochs, which is kind of like the number of times you’ve made it through your entire data set or your model has seen your entire data set. And then at the end I would end up with a trained set of parameters.
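The point about accuracy being misleading on imbalanced data is easy to check by hand. This minimal Python sketch uses an invented dataset of 1,000 transactions with 1% fraud and a "model" that says everything is legit. Recall (the share of actual fraud that was caught) is added here for illustration; it is not named in the conversation.

```python
def confusion(y_true, y_pred):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Invented data: 10 fraudulent transactions (1) and 990 legit ones (0).
y_true = [1] * 10 + [0] * 990
always_legit = [0] * 1000  # a "model" that never flags anything

tp, fp, tn, fn = confusion(y_true, always_legit)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0

print(f"accuracy={accuracy} precision={precision} recall={recall} missed_fraud={fn}")
```

The model scores 99% accuracy while catching none of the fraud, which is why the four confusion counts are the numbers worth tracking.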
Gavin Henry 00:38:50 So just to summarize that bit, see if I’ve got it correct. So we’re feeding in a data set that we know has got good transactions and bad credit card transactions and we’re testing whether it finds those, is that correct with the gradient descent?
Sean Moriarty 00:39:07 Yeah, so we are giving our model examples of the legit transactions and the fraudulent transactions and then we’re having it grade whether or not a transaction is fraudulent or legit. And then we are grading our model’s outputs based on the actual labels that we have and that produces a loss, which is an objective function and then we apply gradient descent to that objective function to minimize that loss and then we update our parameters in a way that minimizes those losses.
Gavin Henry 00:39:43 Oh it’s finally clicked. Okay, I get it now. So in the tabular data we’ve got the CSV file, we’ve got all the features we’re interested in with the transaction and then there’ll be some column that says this is fraud and this isn’t.
Sean Moriarty 00:39:56 That’s right. Yep.
Gavin Henry 00:39:57 So once that’s analyzed, the probability, if that’s correct, of what we’ve decided that transaction is, is then checked against that column that says it is or isn’t fraud and that’s how we’re training.
Sean Moriarty 00:40:08 That’s right, exactly. Yeah. So our model is outputting some probability. Let’s say it outputs 0.75 and that’s a 75% chance that this transaction is fraud. And then I look and that transaction’s actually legit, then I’ll update my model parameters according to whatever my gradient descent algorithm says. And so if you go back to that ocean example, my loss function, the values of the loss function are the depth of that ocean. And so I’m trying to navigate this complex loss function to find the deepest point or the minimal point in that loss function.
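The 0.75 example above can be traced through one gradient descent update. This is a minimal Python sketch of a one-feature logistic model with binary cross-entropy loss; the starting parameters are chosen by hand so that the first prediction comes out at about 0.75, and the learning rate is an arbitrary assumption. Axon automates all of this; nothing here is Axon's API.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict(params, x):
    """Probability of fraud from a one-feature logistic model."""
    return sigmoid(params["w"] * x + params["b"])

def loss(p, y):
    """Binary cross-entropy between prediction p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def gradient_step(params, x, y, lr=0.5):
    """One gradient descent update; (p - y) is the loss gradient w.r.t. the logit."""
    error = predict(params, x) - y
    return {"w": params["w"] - lr * error * x, "b": params["b"] - lr * error}

# Hand-picked parameters so the model outputs ~0.75 for x = 1.0.
params = {"w": 1.0, "b": math.log(3) - 1.0}
x, y = 1.0, 0  # the transaction is actually legit

before = predict(params, x)
updated = gradient_step(params, x, y)
after = predict(updated, x)

print(f"before={before:.3f} loss={loss(before, y):.3f}")
print(f"after={after:.3f} loss={loss(after, y):.3f}")
```

One step moves the prediction for this legit transaction away from "fraud" and lowers the loss, which is the "moving toward the deepest point" in the ocean analogy.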
Gavin Henry 00:40:42 And when you say you are looking at that output, is that another function in Axon or are you physically looking
Sean Moriarty 00:40:48 No, no. So actually like, I shouldn’t say I’m looking at it but it, it’s like an automated process. So the actual training process Axon takes care of for you.
Gavin Henry 00:40:57 So that’s the training. Yeah, so I was thinking exactly there’d be a lot of data to look at and go no, that was right, that was wrong.
Sean Moriarty 00:41:02 Yeah. Yeah, I guess you could do it by hand, but
Gavin Henry 00:41:06 Cool. So this obviously depends on the size of the dataset we would need to, I mean how’d you go about resourcing this type of task hardware wise? Is that something you’re familiar with?
Sean Moriarty 00:41:18 Yeah, so with something like this, the model you would train would actually probably be pretty inexpensive and you could probably train it on a commercial laptop. I guess I shouldn’t speak, because I don’t have access to like a billion transactions to see how long it would take to crunch through them. But you could train a model pretty quickly, and there are commercial and also open source fraud datasets out there. There’s an example of a credit card fraud dataset on Kaggle and there’s also one in the Axon repository that you can work through, and the dataset is actually pretty small. If you were training a larger model or you had to go through a lot of data, then you would more than likely need access to a GPU, and you can either have one on-prem or, if you have cloud resources, you can go and provision one in the cloud. And then with Axon, if you use one of the backends or compilers like EXLA, it’ll just do the GPU acceleration for you.
Gavin Henry 00:42:13 And the GPUs are used because they’re good at processing a tensor of data.
Sean Moriarty 00:42:18 That’s right, yeah. And GPUs have a lot of like specialized kernels that can process this information very efficiently.
Gavin Henry 00:42:25 So I guess a tensor is what the graphic cards used to display like a 3D image or something in games and et cetera.
Sean Moriarty 00:42:33 Yep. And that kind of relationship is very useful for deep learning practitioners.
Gavin Henry 00:42:37 So I’ve got my head around the dataset and, you know, other than working through an example myself with the dataset, I get that that could be something physical that you download from third parties that have spent a lot of time on it and it’s been sort of peer reviewed and things. What sort of things are you downloading from Hugging Face then through Bumblebee models?
Sean Moriarty 00:42:59 Hugging Face has in particular a lot of large language models that you can download for tasks like text classification and named entity recognition. Going to the transaction example, they might have a named entity recognition model that I could use to pull the entities out of a transaction description. So I could maybe use that as an additional feature for this fraud detection model. Like hey, this merchant is Adidas and I know that because I pulled that out of the transaction description. So that’s just an example of one of the pre-trained models you might download from, say, Hugging Face using Bumblebee.
Gavin Henry 00:43:38 Okay. I just wanted to understand what you physically download there. So in our example for fraud, are we trying to classify a row in that CSV as fraud or are we doing a regression task, as in we’re trying to reduce it to a yes or no, that’s fraud?
Sean Moriarty 00:43:57 Yeah, it depends on I guess what you want your output to be. So like one of the things you always have to do in machine learning is make a business decision on the other end of it. So a lot of like machine learning tutorials will just stop after you’ve trained the model and that’s not necessarily how it works in practice because I need to actually get that model to a deployment and then make a decision based on what my model outputs. So in this case, if we want to just detect fraud like yes, no fraud, then it would be like a classification problem and my outputs would be like a zero for legit and then a one for fraud. But another thing I could do is maybe assign a risk score to my actual dataset and that might be framed as a regression task. I would probably still frame it as like a classification task because I have access to labels that say yes fraud, no not fraud, but it really kind of depends on what your actual business use case is.
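The "business decision on the other end" can be as simple as thresholds applied to the model's output. In this minimal Python sketch the `decide` helper and both cut-off values are hypothetical; real thresholds would be chosen from the cost of missed fraud versus the cost of blocking legitimate customers.

```python
def decide(fraud_probability, block_at=0.9, review_at=0.5):
    """Map a model's fraud probability to a business action (thresholds are assumptions)."""
    if fraud_probability >= block_at:
        return "block"
    if fraud_probability >= review_at:
        return "review"
    return "approve"

for probability in (0.95, 0.75, 0.10):
    print(probability, "->", decide(probability))
```

The same classifier output supports either framing: a hard yes/no by using a single threshold, or a graded risk score by keeping the probability and tiering the response.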
Gavin Henry 00:44:56 So with regression and a risk factor there, when you described how you detect whether it’s an orange or an apple, you’re kind of saying I’m 80% sure it’s an orange with classification, wouldn’t that be one? Yes, it’s an orange or zero, it’s no, I’m a bit confused between classification and regression there.
Sean Moriarty 00:45:15 Yeah. Yeah. So regression is like dealing with quantitative variables. So if I wanted to predict the price of a stock after a certain amount of time, that would be a regression problem. Whereas if I’m dealing with qualitative variables like yes fraud, no fraud, then I would be dealing in classifications.
Gavin Henry 00:45:34 Okay, perfect. We touched on the training part, so we’re, we’re getting pretty close to winding up here, but the training part where we’re, I think you said fine tuning the parameters to our model, is that what training is in this example?
Sean Moriarty 00:45:49 Yeah, fine tuning is often used as a terminology when working with pre-trained models. In this case we’re, we’re really just training, updating the parameters. And so we’re starting with a baseline, not a pre-trained model. We’re starting from some random initialization of parameters and then updating them using gradient descent. But the process is identical to what you would do when dealing with a fine tuning, you know, case.
Gavin Henry 00:46:15 Okay, well, I’m just probably using the wrong words there. So a pre-trained model is probably like a function in Elixir where you can give it different parameters for it to do something and you’re deciding what the output should be?
Sean Moriarty 00:46:27 Yeah, so the way that the Axon API works is when you kick off your training loop, you call Axon.Loop.run. And that takes an initial state like an Enum.reduce would, and when you’re dealing with a pre-trained model, you would pass your pre-trained parameters into that run. Whereas if you’re dealing with just training a model from scratch, you would pass an empty map because you don’t have any parameters to start with.
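The "training loop as a giant reduce with an initial state" idea can be sketched with Python's `functools.reduce`. This is an analogy only, not Axon's implementation: the `run` and `step` helpers are hypothetical, the data is invented, and an empty dict here simply defaults missing parameters to 0.0, whereas a framework would typically generate initial parameters itself when none are supplied.

```python
import math
from functools import reduce

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def step(params, example, lr=0.1):
    """Reducer: fold one (x, label) example into the running parameters."""
    x, y = example
    w, b = params.get("w", 0.0), params.get("b", 0.0)  # empty map -> start at zero
    error = sigmoid(w * x + b) - y
    return {"w": w - lr * error * x, "b": b - lr * error}

def run(data, init_params, epochs):
    """Fold the dataset into the parameters, once per epoch."""
    return reduce(step, data * epochs, init_params)

# Invented one-feature dataset: negative x is legit (0), positive x is fraud (1).
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

from_scratch = run(data, {}, epochs=50)           # no parameters to start with
resumed = run(data, from_scratch, epochs=50)      # continue from trained parameters

print(from_scratch)
print(resumed)
```

Passing `from_scratch` back in as the initial state is the same move as starting from pre-trained parameters: the loop is identical, only the starting point differs.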
Gavin Henry 00:46:55 And that would be discovered through the learning aspect later on?
Sean Moriarty 00:46:58 Exactly. And then the output of that would be your model’s parameters.
Gavin Henry 00:47:02 Okay. And then if you wanted at that point, could you ship that as a pre-trained model for someone else to use or that just be always specific to you?
Sean Moriarty 00:47:09 Yep. So you could upload your model parameters to Hugging Face and then keep the code for that model definition. And then you would update that; maybe for the next million transactions you get in, maybe you retrain your model, or someone else wants to take that and you can ship that off for them.
Gavin Henry 00:47:26 So are the parameters the output of your learning? So if we go back to the example where you said you have your model in code and we don’t do like in Perl or Python, where you sort of freeze the runtime state of the model as it were, are the parameters the runtime state of all the learning that’s happened so far, and you can just kind of save that and pause that and pick it up another day?
Sean Moriarty 00:47:47 Yep. So then what I would do is I would just serialize my parameter map and then I would take the definition of my model, which is just code, and you would compile that. That’s kind of like a way of saying I compile that into a numerical definition. It’s a bad term if you’re not able to look directly at what’s happening. But I would compile that and that would give me a function for doing predictions and then I would pass my trained parameters into that model prediction function and then I could use that prediction function to get outputs on production data.
Gavin Henry 00:48:20 And that’s the sort of thing you could commit to your Git repository or something every now and again to back it up in production or however you choose to do that.
Sean Moriarty 00:48:28 Exactly, yep.
Gavin Henry 00:48:29 And what does, what would parameters look like in front of me on the screen?
Sean Moriarty 00:48:34 Yeah, so you would see an Elixir map with names of layers, and then each layer has its own parameter map with the name of a parameter that maps to a tensor, and that tensor would be a floating point tensor. You would just see probably a bunch of random numbers.
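The "model as code, parameters as data" split can be pictured with a nested map of layer name to parameter name to numbers. This minimal Python sketch uses nested lists in place of tensors and JSON in place of a binary format; the layer and parameter names, the numbers, and the two-layer architecture are all illustrative assumptions.

```python
import json
import math

# Parameters as data: layer name -> parameter name -> "tensor" (nested lists here).
params = {
    "dense_0": {"kernel": [[0.12, -0.48], [0.33, 0.07]], "bias": [0.0, 0.1]},
    "dense_1": {"kernel": [[0.9], [-0.6]], "bias": [0.05]},
}

def dense(x, layer):
    """Apply one dense layer: x . kernel + bias (kernel is [inputs][outputs])."""
    kernel, bias = layer["kernel"], layer["bias"]
    return [
        sum(x[i] * kernel[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]

def predict(params, features):
    """Model as code: dense -> ReLU -> dense -> sigmoid."""
    hidden = [max(0.0, h) for h in dense(features, params["dense_0"])]
    (logit,) = dense(hidden, params["dense_1"])
    return 1 / (1 + math.exp(-logit))

# Save the parameters, restore them later, and feed them to the same code.
saved = json.dumps(params)
restored = json.loads(saved)
probability = predict(restored, [1.0, 2.0])
print(probability)
```

Only `saved` needs to be stored or shipped; the `predict` function lives in version control with the rest of the application code.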
Gavin Henry 00:48:54 Okay. Now that’s making a clear picture in my head, so hopefully it’s helping out the listeners. Okay. So I’m gonna move on to some more general questions, but still around this example, is there just one type of neural network or we decided to do the gradient descent, is that the standard way to do this or is that just something applicable to fraud detection?
Sean Moriarty 00:49:14 So there are a ton of different types of neural networks out there and the decision of what architecture you use kind of depends on the problem. There’s just the basic feedforward neural network that I would use for this one because it’s cheap performance-wise and will probably do pretty well in terms of detecting fraud. And then there’s a convolutional neural network, which is often used for images, computer vision problems. There’s recurrent neural networks, which are not as popular now because of how popular transformers are. There are transformer models, which are massive models built on top of attention, which is a type of layer. It’s really a technique for learning relationships between sequences. There’s a ton of different architectures out there.
Gavin Henry 00:50:03 I think you mentioned quite a few of them in your book, so I’ll make sure we link to some of your blog posts on Dockyard as well.
Sean Moriarty 00:50:08 Yeah, so I try to go through some of the baseline ones, and then gradient descent is not the only way to train a neural network, but it’s the only way you’ll actually see in use in practice.
Gavin Henry 00:50:18 Okay. So for this fraud detection or anomaly detection example, are we trying to find anomalies in normal transactions? Are we classifying transactions as fraud based on training, or is that just the same thing and I’ve made that really complicated?
Sean Moriarty 00:50:34 It’s essentially the same exact problem just framed in different ways. So like the anomaly detection portion would only be, I would say useful in like if I didn’t have labels attached to my data. So I would use something like an unsupervised learning technique to do anomaly detection to identify transactions that might be fraudulent. But if I have access to the labels on a fraudulent transaction and not fraudulent transaction, then I would just use a traditional supervised machine learning approach to solve that problem because I have access to the labels.
Gavin Henry 00:51:11 So that comes back to our initial task, which you said is the most difficult part of all this is the quality of our data that we feed in. So if we spent more time labeling fraud, not fraud, we would do supervised learning.
Sean Moriarty 00:51:23 That’s right. Yeah. So I say that the best machine learning companies are companies that find a way to get their users or their data implicitly labeled without much effort. So the best example of this is the Google captchas where they ask you to identify
Gavin Henry 00:51:41 I was thinking about that when I was reading some of your stuff.
Sean Moriarty 00:51:43 Yep. So that’s, that’s like the prime example of they have a way to, it solves a business problem for them and also they get you to label their data for them.
Gavin Henry 00:51:51 And there’s third party services like that Amazon Mechanical Turk, isn’t it, where you can pay people to label for you.
Sean Moriarty 00:51:58 Yep. And now a common approach is to also use something like GPT-4 to label data for you and it might be cheaper and also better than some of the hand labelers you would get.
Gavin Henry 00:52:09 Because it’s got more information of what something would be.
Sean Moriarty 00:52:12 Yep. So if I was dealing with a text problem, I would probably roll with something like GPT-4 labels to save myself some time and then bootstrap a model from there.
Gavin Henry 00:52:21 And that’s commercial services I would guess?
Sean Moriarty 00:52:24 Yep, that’s correct.
Gavin Henry 00:52:25 So just to close off this section, quality of data is key. Spending that extra time on labeling, whether something is what you think it is, will help dictate where you want to go to back up your data. Either the model, which is code in Axon, and how far you’ve learned, which are the parameters. We can commit that to a Git repository. But what would that ongoing lifecycle or operational side of Axon involve once we put this workflow into production? You know, do we move from CSV files to an API to submit new data, or do we pull that in from a database, or you know, how do we do our ops to make sure it’s doing what it should be, and say everything dies, how do we recover, that type of normal thing? Do you have any experience on that?
Sean Moriarty 00:53:11 Yeah, it’s kind of an open-ended problem. The first thing I would do is I would wrap the model in what’s called an Nx.Serving, which is our inference abstraction. So the way it works is it implements dynamic batching. So if you have a Phoenix application, then it kind of handles the concurrency for you. So if you have a million or let’s say I’m getting a hundred requests at once overlapping within like a 10 millisecond timeframe, I don’t want to just call Axon.predict, my predict function, on one of those transactions at a time. I actually want to batch those so I can efficiently use my CPU or GPU’s resources. And so that’s what Nx.Serving would take care of for me. And then I would probably implement something like, maybe I use Oban, which is a job scheduling library in Elixir, and that would continuously pull data from whatever repository that I have and then retrain my model, and then maybe it recommits it back to Git or maybe I use S3 to store my model’s parameters and I continuously pull the most up-to-date model and update my serving in that way.
Sean Moriarty 00:54:12 The beauty of the Elixir and Erlang ecosystem is that there are like a hundred ways to solve these continuous deployment problems. And so,
Gavin Henry 00:54:21 No, it’s good to put a description on it. So Nx.Serving is kind of like your debounce in JavaScript where it tries to smooth everything down for you. And the requests you’re talking about there are real transactions coming through from the bank into your API and you’re trying to decide whether it should go ahead or not.
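Dynamic batching, as described above, groups requests that arrive close together so the model runs once per batch instead of once per request. This minimal Python sketch simulates the grouping offline over a list of arrival times; it is not how the Elixir serving is implemented, and the `batch_requests` helper, the 10 ms window, and the batch size limit are assumptions for illustration.

```python
def batch_requests(arrivals, window_ms=10, max_batch=4):
    """Group (arrival_ms, payload) pairs into batches.

    A batch is closed when it is full or when a request arrives
    window_ms or more after the first request in the batch.
    """
    batches, current, start = [], [], None
    for arrived_at, payload in sorted(arrivals, key=lambda r: r[0]):
        if current and (arrived_at - start >= window_ms or len(current) == max_batch):
            batches.append(current)
            current = []
        if not current:
            start = arrived_at  # the window opens with the first request
        current.append(payload)
    if current:
        batches.append(current)
    return batches

# Invented arrival times in milliseconds with transaction ids as payloads.
arrivals = [(0, "a"), (2, "b"), (3, "c"), (15, "d"), (16, "e")]
print(batch_requests(arrivals))
```

Each inner list would be stacked into one tensor and passed through the model in a single call, which is what keeps a CPU or GPU busy under concurrent load.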
Sean Moriarty 00:54:39 Yep, that’s right.
Gavin Henry 00:54:40 Yeah, start predicting if it’s fraud or potential fraud.
Sean Moriarty 00:54:42 Yeah, that’s right. And I’m not, um, super familiar with debounce so I don’t know if that’s
Gavin Henry 00:54:47 Oh no, it’s just something that came to mind. It’s where someone’s typing on a keyboard and you can slow it down. I think maybe I’ve misunderstood that, but yeah, it’s a way of smoothing out what’s coming in.
Sean Moriarty 00:54:56 Yeah. In a way it’s like a dynamic delay thing.
Gavin Henry 00:55:00 So we would pull new data, retrain the model to tweak our parameters and then save that somewhere every so often.
Sean Moriarty 00:55:07 Yep. And it’s kind of like a never ending life cycle. So over time you end up like logging your model’s outputs, you save some snapshot of the data that you have and then you’ll also obviously have people reporting fraud happening in, in real time as well. And you want to say, hey, did my model catch this? Did it not catch this? Why didn’t it catch this? And those are the examples you’re really gonna want to pay attention to. Like the ones where your model classified it as legit and it was actually fraud. And then the ones your model classified as fraud when it was actually legit.
Gavin Henry 00:55:40 You can do some workflow that cleans that up and alerts someone.
Sean Moriarty 00:55:43 Exactly, and you’ll continue training your model and then deploy it from there.
Gavin Henry 00:55:47 Okay, that’s, that’s a good summary. So, I think we’ve done a pretty great job of what deep learning is and what Elixir and Axon bring to the table in 65 minutes. But if there’s one thing you’d like a software engineer to remember from our show, what would you like that to be?
Sean Moriarty 00:56:01 Yeah, I think what I would like people to remember is that the Elixir machine learning ecosystem is much more complete and competitive with the Python ecosystem than I would say people presume. You can do a ton with a little in the Elixir ecosystem. So you don’t necessarily need to depend on external frameworks and libraries or external ecosystems and languages in the Elixir ecosystem. You can kind of live in the stack and punch above your weight, if you will.
Gavin Henry 00:56:33 Excellent. Was there anything we missed in our example or introduction that you’d like to add or anything at all?
Sean Moriarty 00:56:39 No, I think that’s pretty much it from me. If you want to learn more about the Elixir machine learning ecosystem, definitely check out my book Machine Learning in Elixir from the Pragmatic Bookshelf.
Gavin Henry 00:56:48 Sean, thank you for coming on the show. It’s been a real pleasure. This is Gavin Henry for Software Engineering Radio. Thank you for listening.
Sean Moriarty 00:56:55 Thanks for having me. [End of Audio]