
Using TensorFlow binary classification to identify email addresses

Programmatically identifying email addresses within text isn’t very complex. Most solutions rely on identifying some traits within the string: Does it contain an ‘@’ sign? Does it end with ‘.co.uk’, or ‘.com’, or any other typical domain ending? Using simple rules such as these, you can quite accurately identify email addresses within text.
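
For comparison, a rules-based check might look something like the sketch below. This is just an illustration of the traditional approach, not part of this project, and real-world validators are far more thorough:

```python
import re

# A rough, permissive email pattern: something before an '@', something after it,
# and a dot followed by at least two letters at the end.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")

def looks_like_email(text: str) -> bool:
    """Return True if the string matches a simple email-shaped pattern."""
    return EMAIL_PATTERN.match(text) is not None

print(looks_like_email("hello@example.com"))  # True
print(looks_like_email("hello, world!"))      # False
```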

This project is not trying to improve email identification or serve as a better alternative approach. Instead, this project was a fun way for me to practice using TensorFlow and creating machine learning models. It allowed me to apply my Python machine-learning knowledge to a quantifiable objective: Accurately identifying whether a given string is an email address.

Goals & Features

  • Create a Machine Learning model that can accept a string input and then classify it as one of two types: Email or not an email.
  • Assemble the appropriate data sets to train the model to recognise email addresses.
  • Allow the model to be saved, sparing the need to re-train it on each run.
  • Develop a better understanding of TensorFlow and Keras models.
  • Give it a name with an acronym (Because it just feels right when naming AI models). E.R.I.C: Email Reader In Code.

Tech Stack

Demonstration

A screen recording demonstrating the binary classification model being trained, classifying data, and then being saved.

In the above demonstration, I run the program a couple of times. The first time, you’ll see the model is trained against the given data. Five ‘epochs’ (passes over the data) are run while the model trains, and the final model is then used to predict whether the given strings are likely to be email addresses. It correctly determines that ‘[email protected]’ is an email address and ‘hello, world!’ is not. To learn more about how this training process works, be sure to read through the Diving Deeper section below.
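
If you’d like a feel for what that looks like in code, here is a minimal sketch of a character-level Keras classifier trained for five epochs. The architecture and the data are placeholders chosen for illustration, not necessarily what E.R.I.C uses internally:

```python
import tensorflow as tf

# Placeholder training data: strings labelled 1 (email) or 0 (not an email).
train_texts = tf.constant([
    "alice@example.com", "support@shop.co.uk", "hello, world!", "meeting at 3pm",
])
train_labels = tf.constant([1, 1, 0, 0])

# Character-level vectorisation, since the useful signal ('@', '.com', ...) lives
# at the character level. standardize=None keeps punctuation intact.
vectorise = tf.keras.layers.TextVectorization(
    split="character", standardize=None, output_mode="int", output_sequence_length=64
)
vectorise.adapt(train_texts)

model = tf.keras.Sequential([
    vectorise,
    tf.keras.layers.Embedding(input_dim=len(vectorise.get_vocabulary()) + 1, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # single output: probability of 'email'
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Five passes ('epochs') over the training data, as in the demonstration.
model.fit(train_texts, train_labels, epochs=5)

# Values near 1.0 suggest 'email'; values near 0.0 suggest 'not an email'.
print(model.predict(tf.constant(["jane.doe@example.org", "hello, world!"])))
```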

After the program finishes, I run it again. This demonstrates that the model can be loaded from the file system without needing to be retrained. This feature matters for larger projects, where the quantity of training data or the complexity of the model makes the training process take a long time. It also saves computational power, which reduces the environmental footprint of deploying the model.
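
The load-or-train pattern itself is straightforward; a sketch is below. The file name and the details of the training step are assumptions made for the example, not necessarily how this repository lays things out:

```python
import os
import tensorflow as tf

MODEL_PATH = "eric_model.keras"  # hypothetical file name

def build_and_train_model() -> tf.keras.Model:
    """Stand-in for the training step shown in the previous sketch."""
    texts = tf.constant(["alice@example.com", "hello, world!"])
    labels = tf.constant([1, 0])
    vectorise = tf.keras.layers.TextVectorization(
        split="character", standardize=None, output_mode="int", output_sequence_length=64
    )
    vectorise.adapt(texts)
    model = tf.keras.Sequential([
        vectorise,
        tf.keras.layers.Embedding(len(vectorise.get_vocabulary()) + 1, 16),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(texts, labels, epochs=5, verbose=0)
    return model

if os.path.exists(MODEL_PATH):
    # A saved model already exists, so load it and skip training entirely.
    model = tf.keras.models.load_model(MODEL_PATH)
else:
    # First run: train the model, then save it for every subsequent run.
    model = build_and_train_model()
    model.save(MODEL_PATH)
```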

Diving Deeper

How does the training work?

There are two sets of data used to create this model: Training data and testing data. Each set contains multiple strings of text, each of which is labelled as either 1 or 0 (1 meaning the corresponding string is an email, and 0 meaning it isn’t).

The machine learning model is fed the training data to ‘show’ it what an email address looks like, and what something that isn’t an email address looks like. After the model has had a chance to study the training data, it’s then fed the testing data. The testing data is structured in the same way, but it consists entirely of strings the model hasn’t ‘seen’ before, which reveals how accurately it predicts whether each string is an email or not.
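
Concretely, each set is just a collection of strings paired with 1/0 labels. The strings below are made-up placeholders rather than the project’s actual data:

```python
import tensorflow as tf

# Training data: the examples the model gets to 'study'.
train_texts = ["alice@example.com", "support@shop.co.uk", "hello, world!", "meeting at 3pm"]
train_labels = [1, 1, 0, 0]  # 1 = email, 0 = not an email

# Testing data: entirely different strings the model has never 'seen'.
test_texts = ["jane.doe@mail.org", "buy milk and eggs"]
test_labels = [1, 0]

train_ds = tf.data.Dataset.from_tensor_slices((train_texts, train_labels)).batch(2)
test_ds = tf.data.Dataset.from_tensor_slices((test_texts, test_labels)).batch(2)

# With a trained model (see the earlier sketch), the held-out set measures accuracy:
# loss, accuracy = model.evaluate(test_ds)
```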

Potential uses

This project was largely a very useful learning exercise. Because it’s so simple to program rules by hand to identify email addresses, it’s unlikely anyone would find much use in bloating that process with a machine learning model. I’ve considered building this into a quick API so people could access it from any program, but that would still be solving an issue that doesn’t exist for other developers. Because of this, I am happy to label this project in its current state as finished.

What is important to note is that this project can easily be adapted. Different data sets can be used to train the model to look for patterns other than email addresses. Potentially, this can be done with zero code modifications. If you’d like to use the model on more than just short strings, such as paragraphs, this would also be relatively simple to achieve when working from this starting point. This means the project could easily be used to run sentiment analysis on entire bodies of text: An example here would be using it to determine whether reviews are positive or negative.
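
As a made-up example of that adaptation, a review-sentiment data set could be dropped into the same pipeline; for full sentences you would likely also switch the vectoriser from character-level to word-level tokens:

```python
import tensorflow as tf

# Hypothetical replacement data set: same pipeline, different labels.
# Here 1 = positive review and 0 = negative review.
train_texts = tf.constant([
    "Absolutely loved it, would buy again",
    "Fast delivery and great quality",
    "Terrible experience, the item arrived broken",
    "Would not recommend this to anyone",
])
train_labels = tf.constant([1, 1, 0, 0])

# Word-level tokens suit full sentences better than the character-level
# split used for email strings.
vectorise = tf.keras.layers.TextVectorization(split="whitespace", output_mode="int")
vectorise.adapt(train_texts)
```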

Closing thoughts on this project

This was an invaluable learning experience. I’m very interested in machine learning and its potential uses in the real world. While ChatGPT is very cool, it can only have a limited impact on smaller businesses that don’t know how to integrate it. Broader machine learning, however, has the potential to impact businesses in cool and exciting ways: Automating reports and forecasts, analysing reviews, and processing big data sets. The list goes on. This project has given me an excellent foundation for building my machine-learning portfolio, and I’m looking forward to putting it to good use in the future.