Searching emails with Python & IMAP

Finding forgotten accounts with keywords

THE BASICS

Cyber hygiene is invaluable, but like many, I don’t remember every throwaway account I’ve made using an alternative email. The method outlined below uses IMAP to search emails for specific keywords. Of course, the keywords can be anything and can be used in many different ways.

I was inspired by chapter 16 of Al Sweigart’s book Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Unfortunately, some of the IMAPClient code in the book has become outdated since its release, so the code here differs in places. Sweigart’s book is fantastic and I highly recommend supporting his work. Everything below is written for Python 3.

I’m not a Python developer by trade. If you’re familiar with the style guide for Python you may notice some departures from it. I am also not responsible for any user issues.

The Python script uses IMAPClient, pyzmail, email, and a few other libraries to log into an email account, search the inbox, and make a list of all the unique addresses I’ve received emails from.

To further narrow down results you could also extract the email subjects and cross-reference them against keywords found in the typical boiler-plate language of confirmation/activation emails received when making an account. I have not added that extra layer.
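I haven’t built that layer, but a minimal sketch of the idea might look like the following. The keyword list and the looks_like_signup() helper are my own placeholders, not part of the script, and you would want to tune the keywords to the emails you actually receive.

```python
# Hypothetical keyword list; adjust it to match your own confirmation emails.
SIGNUP_KEYWORDS = ("confirm your", "verify your", "activate your", "welcome to")


def looks_like_signup(subject, keywords=SIGNUP_KEYWORDS):
    """Return True if the subject contains any keyword (case-insensitive)."""
    lowered = (subject or "").lower()
    return any(keyword in lowered for keyword in keywords)


# Made-up subjects standing in for ones pulled from a mailbox.
subjects = [
    "Please confirm your email address",
    "Your weekly newsletter",
    "Welcome to ExampleSite!",
]
matches = [s for s in subjects if looks_like_signup(s)]
print(matches)
```

Plain substring matching is crude, but it is usually enough to separate sign-up emails from newsletters.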

NOTES ON EMAIL SETTINGS

Most email providers allow their users to access their mail through the Internet Message Access Protocol (IMAP). However, some providers like Gmail have IMAP turned off by default, while others like Yahoo allow access without updating any settings.

If you’re using Gmail you may have difficulty connecting even after enabling IMAP. I recommend checking Secure Account. Google may have blocked access and listed your system as a security issue. Several hours after confirming it was me, I was able to access my account with the script.

If you’re still unable to connect you may also have to change your settings to allow access from less secure devices. Proceed at your own risk. It’s best to turn this back off after you’re done. If you’re using two-factor authentication you should look into application-specific passwords to be able to access your email. Both Gmail and Yahoo provide instructions for creating them.

After you’ve allowed IMAP you first need to figure out what the IMAP domain is for your email provider. I’ve listed a few common ones below:

  • Gmail: imap.gmail.com
  • Outlook.com/Hotmail.com: imap-mail.outlook.com and outlook.office365.com
  • Yahoo Mail: imap.mail.yahoo.com
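If you juggle several accounts, the list above can be kept in a small lookup table. This is only a sketch: the IMAP_HOSTS dictionary and the imap_host_for() helper are names I made up, and the hostnames are the ones listed above (for Outlook/Hotmail I picked the first of the two).

```python
# Hostnames taken from the list above; the mapping itself is an illustration.
IMAP_HOSTS = {
    "gmail.com": "imap.gmail.com",
    "outlook.com": "imap-mail.outlook.com",
    "hotmail.com": "imap-mail.outlook.com",
    "yahoo.com": "imap.mail.yahoo.com",
}


def imap_host_for(address):
    """Return the IMAP host for an email address, or None if unknown."""
    domain = address.rsplit("@", 1)[-1].lower()
    return IMAP_HOSTS.get(domain)


print(imap_host_for("someone@yahoo.com"))
```

An unknown domain returns None, which is your cue to look up the host in your provider’s help pages.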

THE SCRIPT

Libraries

import imapclient
import pyzmail
import pprint
import getpass
import imaplib
import datetime
import email

If you’re missing IMAPClient or pyzmail you can install them using the pip package manager:

pip install imapclient
pip install pyzmail36

If you’re using Python 3.6 you’ll need to install the fork of pyzmail above. If you’re using an earlier version you can simply remove the “36”.

Before moving on we’re going to add a single line of code:

imaplib._MAXLINE = 10000000

imaplib limits the length of a single response line. The limit was 10,000 bytes in older Python releases (newer releases default to 1,000,000), and for people with a lot of email a search response can exceed it. To keep our script from erroring out as a result, we’ll raise the limit to 10,000,000 bytes.

Now that we’ve imported the necessary modules it’s time to get cracking. First, let’s simply connect to our email by adding the following code to your file:

email_address = input("Enter Email Address:")
email_pass = getpass.getpass("Enter Password:")

imap_obj = imapclient.IMAPClient('imap.mail.yahoo.com', ssl=True)

imap_obj.login(email_address, email_pass)

print("SUCCESS")

Let’s break it down:

email_address = input("Enter Email Address:")
email_pass = getpass.getpass("Enter Password:")

You shouldn’t insert passwords directly into your code. In this case, we’re simply asking the user to input the email and password each time. We’re using the getpass module so the password isn’t echoed to the screen, which keeps it away from any potential prying eyes.
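If typing the password on every run gets tedious, one option is to check an environment variable first and only prompt when it is missing. This is a sketch of that idea, not part of the original script; the variable name IMAP_PASSWORD and the get_password() helper are my own inventions.

```python
import getpass
import os


def get_password(env_var="IMAP_PASSWORD"):
    """Read the password from an environment variable, else prompt for it."""
    password = os.environ.get(env_var)
    if password:
        return password
    # Fall back to the same hidden prompt the script uses.
    return getpass.getpass("Enter Password:")
```

The password still never appears in the source file, which is the point of the advice above.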

imap_obj = imapclient.IMAPClient('imap.mail.yahoo.com', ssl=True)

The imapclient.IMAPClient() constructor creates an IMAPClient object, which connects to the IMAP server at the given address (Yahoo’s IMAP server in this case). Most email providers require SSL or TLS, so we add ssl=True. We’ll use our newly created imap_obj with various IMAPClient methods, including the next line of code below:

imap_obj.login(email_address, email_pass) 

Next we simply pass the email and password provided by the user as strings into the login() method, which will attempt to log in.

Now run the script. If you receive an error you may want to try the following:

  • Double-check to make sure the email and password you used are correct
  • Make sure IMAP is allowed
  • Allow access from less secure devices
  • Check “Secure Account” if you’re using Gmail
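When working through that checklist it helps to see the server’s own error message instead of a traceback. The sketch below wraps the login call; try_login() is a hypothetical helper, and it assumes that IMAPClient’s login errors derive from imaplib.IMAP4.error, which is my understanding of the library but worth confirming for your version.

```python
import imaplib


def try_login(client, address, password):
    """Attempt a login and return (ok, detail) instead of raising."""
    try:
        client.login(address, password)
    except imaplib.IMAP4.error as exc:
        # Assumption: IMAPClient login failures are subclasses of this error.
        return False, str(exc)
    return True, "logged in"


# Usage with the objects from the script:
# ok, detail = try_login(imap_obj, email_address, email_pass)
# print("SUCCESS" if ok else "Login failed: " + detail)
```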

If all went well the script should print SUCCESS.

Mailboxes

Now that we can connect to the IMAP server, let’s see what mailboxes are available by adding the code below:

email_folders = imap_obj.list_folders()
pprint.pprint(email_folders)

Your script should output something like this:

[((b'\\Junk', b'\\HasNoChildren'), b'/', 'Bulk Mail'),
 ((b'\\Archive', b'\\HasNoChildren'), b'/', 'Archive'),
 ((b'\\Drafts', b'\\HasNoChildren'), b'/', 'Draft'),
 ((b'\\HasNoChildren',), b'/', 'Inbox')]

You may see more or fewer entries depending on which folders exist in your account.

The list_folders() method returns a list of tuples, each holding a folder’s flags, its delimiter, and its name. All we want is the folder’s full name so that we can later pass it to the select_folder() method. In this case that would be the Bulk Mail, Archive, Draft, and Inbox folder names. If you’re using Gmail some folder names may be preceded by “[Gmail]/”. This is part of the full folder name.

We can use indexing to make the output more readable by replacing the pprint function with the code below:

for folder in email_folders:
    # The folder name is the last element of each tuple
    print(folder[-1])

Which should in turn output something like this:

Bulk Mail
Archive
Draft
Inbox
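The flags in each tuple are useful too. As a small sketch, the helper below picks out folders carrying a given flag, which is handy when a provider names its spam folder something unexpected like Bulk Mail. The folders_with_flag() name is my own, and sample_folders simply mirrors the example output above so the snippet runs without a server.

```python
# Sample data shaped like the list_folders() output shown above.
sample_folders = [
    ((b'\\Junk', b'\\HasNoChildren'), b'/', 'Bulk Mail'),
    ((b'\\Archive', b'\\HasNoChildren'), b'/', 'Archive'),
    ((b'\\Drafts', b'\\HasNoChildren'), b'/', 'Draft'),
    ((b'\\HasNoChildren',), b'/', 'Inbox'),
]


def folders_with_flag(folders, flag):
    """Return the names of folders whose flag tuple contains `flag`."""
    return [name for flags, _delimiter, name in folders if flag in flags]


print(folders_with_flag(sample_folders, b'\\Junk'))
```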

Let’s go ahead and add the next line of code:

imap_obj.select_folder('Inbox', readonly=True)

We pass the name of the folder we want to search to the select_folder() method. In this case let’s just search through the Inbox folder. So we don’t accidentally delete or otherwise mess with anything, we’ll use the readonly=True parameter. Messages will also not be marked as read.

Searching a Folder

Now that we’ve selected a folder, let’s look at some emails by adding the code below:

email_uids = imap_obj.search(['ALL'])
print(email_uids)

Your output will be a long list of integers.

What a messy list of numbers. Each email is represented by a unique identification number (UID). UIDs are assigned in ascending order and are unique within a folder. These UIDs stand in for the emails themselves, and we can pass them to various methods to fetch email data. They are returned by the search() method as a list.

In the example above the ['ALL'] IMAP search key was used with the search() method. This returns all messages in the folder selected when we used the select_folder() method.

There are a number of IMAP search keys available to narrow down your results if everything is a little too broad for your tastes. For example, if we wanted to pull only emails received since May 3rd 2017 we could use the following code:

email_uids = imap_obj.search(['SINCE', datetime.date(2017, 5, 3)])
print(email_uids)

You could also use BEFORE instead of SINCE, or combine them like so to get all emails from May 3rd 2017 through May 3rd 2018:

email_uids = imap_obj.search(['SINCE', datetime.date(2017, 5, 3), 'BEFORE', datetime.date(2018, 5, 4)])
print(email_uids)

If you’ve looked at Chapter 16 of Automate the Boring Stuff you may notice that it uses a different date format, specifically day, month, year with hyphen delimiters, for example 03-MAY-2017. That format no longer works with newer releases of IMAPClient, which is why we’re using the datetime.date constructor.

Also, on the subject of dates: while SINCE includes the day provided, BEFORE does not. As a result I’ve had to use 'BEFORE', datetime.date(2018, 5, 4) to capture May 3rd.
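To avoid doing that off-by-one adjustment by hand, the extra day can be added in a small helper. This is just a sketch; inclusive_range() is a name I made up, and it returns a criteria list you can pass straight to search().

```python
import datetime


def inclusive_range(first_day, last_day):
    """Build search criteria covering first_day through last_day inclusive."""
    # BEFORE excludes the date given, so push it one day past last_day.
    day_after = last_day + datetime.timedelta(days=1)
    return ['SINCE', first_day, 'BEFORE', day_after]


criteria = inclusive_range(datetime.date(2017, 5, 3), datetime.date(2018, 5, 3))
print(criteria)

# Usage with the script's connection:
# email_uids = imap_obj.search(criteria)
```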

But Wait, That’s Not All!

There’s even more search key goodness to make use of. Here’s a handy list from Automate the Boring Stuff:

Search Key | Meaning
'ALL' | Returns all messages in the folder. You may run into imaplib size limits if you request all the messages in a large folder.
'BEFORE', date / 'ON', date / 'SINCE', date | These three search keys return, respectively, messages that were received by the IMAP server before, on, or after the given date.
'SUBJECT', 'string' / 'BODY', 'string' / 'TEXT', 'string' | Returns messages where string is found in the subject, body, or either, respectively. If string has spaces in it, then enclose it with double quotes: 'TEXT "search with spaces"'.
'FROM', 'string' / 'TO', 'string' / 'CC', 'string' / 'BCC', 'string' | Returns all messages where string is found in the “from” email address, “to” addresses, “cc” (carbon copy) addresses, or “bcc” (blind carbon copy) addresses, respectively. If there are multiple email addresses in string, then separate them with spaces and enclose them all with double quotes: 'CC "firstcc@example.com secondcc@example.com"'.
'SEEN' / 'UNSEEN' | Returns all messages with and without the \Seen flag, respectively. An email obtains the \Seen flag if it has been accessed with a fetch() method call (described later) or if it is clicked when you’re checking your email in an email program or web browser. It’s more common to say the email has been “read” rather than “seen,” but they mean the same thing.
'ANSWERED' / 'UNANSWERED' | Returns all messages with and without the \Answered flag, respectively. A message obtains the \Answered flag when it is replied to.
'DELETED' / 'UNDELETED' | Returns all messages with and without the \Deleted flag, respectively. Email messages deleted with the delete_messages() method are given the \Deleted flag but are not permanently deleted until the expunge() method is called (see “Deleting Emails” in the book). Note that some email providers, such as Gmail, automatically expunge emails.
'DRAFT' / 'UNDRAFT' | Returns all messages with and without the \Draft flag, respectively. Draft messages are usually kept in a separate Drafts folder rather than in the INBOX folder.
'FLAGGED' / 'UNFLAGGED' | Returns all messages with and without the \Flagged flag, respectively. This flag is usually used to mark email messages as “Important” or “Urgent”.
'LARGER', N / 'SMALLER', N | Returns all messages larger or smaller than N bytes, respectively.
'NOT', search-key | Returns the messages that search-key would not have returned.
'OR', search-key1, search-key2 | Returns the messages that match either the first or second search-key.

Automate the Boring Stuff With Python by Al Sweigart is licensed under CC BY 3.
Edited several rows for changes in input syntax and also clarity. See IMAPClient documentation for version history.

Let’s get into some more examples! However, please note I haven’t tested every IMAP Key with every email provider.

email_uids = imap_obj.search(['ANSWERED', 'BEFORE', datetime.date(2015, 5, 3)])
print(email_uids)

email_uids = imap_obj.search(['LARGER', 100])
print(email_uids)

email_uids = imap_obj.search(['FROM', 'exampleperson@email.com'])
print("You have received", len(email_uids), "email from exampleperson@email.com")

email_uids = imap_obj.search(['OR', 'FROM', 'example1@email.com', 'FROM', 'example2@email.com'])
print(email_uids)
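Since OR only takes two search keys, matching three or more senders means chaining ORs together. The sketch below builds that chain as a flat list in the same style as the example above; any_sender() is a hypothetical helper, and as with the other keys I haven’t tested it against every provider.

```python
def any_sender(addresses):
    """Build criteria matching mail from any of the given addresses."""
    addresses = list(addresses)
    if not addresses:
        raise ValueError("need at least one address")
    criteria = []
    # Each OR takes two keys: this FROM, and whatever follows it.
    for address in addresses[:-1]:
        criteria += ['OR', 'FROM', address]
    criteria += ['FROM', addresses[-1]]
    return criteria


print(any_sender(['example1@email.com', 'example2@email.com', 'example3@email.com']))

# Usage with the script's connection:
# email_uids = imap_obj.search(any_sender(my_addresses))
```

With two addresses this produces exactly the list used in the last example above.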

The Finale?

Hopefully the options shown in the previous section are helpful. For now I’m going to search my entire inbox as I want to have a full list to look over. I’ll go over two ways to go about this.

Method one:

emails_from_address = []

# Search once up front so the progress total matches the list we loop over
email_uids = imap_obj.search(['ALL'])
total = len(email_uids)

for progress_uid, i in enumerate(email_uids, start=1):
    print(progress_uid, "of", total, end="\r")

    temp_rawmessage = imap_obj.fetch([i], ['BODY[]', 'FLAGS'])
    temp_message = pyzmail.PyzMessage.factory(temp_rawmessage[i][b'BODY[]'])

    temp_fromraw = temp_message.get_addresses('from')
    if not temp_fromraw:
        # Skip messages with no parsable sender
        continue
    temp_from_address = temp_fromraw[0][1]

    # Only keep addresses we haven't seen yet
    if temp_from_address not in emails_from_address:
        emails_from_address.append(temp_from_address)

for i in emails_from_address:
    print(i)

imap_obj.logout()

In short, we’re iterating through every UID in the email_uids list and echoing the script’s current progress as we go. After parsing each message we check whether temp_from_address is already in the emails_from_address list. If it is, we don’t add it again. If it’s new, we append it. Messages with no parsable sender are skipped. At the end we print the unique addresses and log out of the IMAP server.
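Once you have the addresses, grouping them by domain makes forgotten accounts easier to spot, since one service often mails from several addresses. This is a sketch of that follow-up step, not part of the original script; count_domains() is my own name, and the sample list stands in for the collected addresses (or for every sender address if you also want counts).

```python
from collections import Counter


def count_domains(addresses):
    """Count how many addresses belong to each domain."""
    domains = (a.rsplit("@", 1)[-1].lower() for a in addresses if "@" in a)
    return Counter(domains)


# Made-up addresses standing in for the collected list.
sample = ["noreply@shop.example", "news@shop.example", "hello@forum.example"]
for domain, count in count_domains(sample).most_common():
    print(domain, count)
```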

temp_rawmessage = imap_obj.fetch([i], ['BODY[]', 'FLAGS'])

The fetch() method does what it says on the tin and fetches the actual email’s content. The first argument, [i], is the UID we’re passing from the email_uids list that we created earlier. The second, ['BODY[]', 'FLAGS'], lists the data items we want: BODY[] is the entire message in all its RFC 822 glory, and FLAGS is the message’s flags. The result comes back as a defaultdict keyed by UID. We’ll look more at this in a moment.

temp_message = pyzmail.PyzMessage.factory(temp_rawmessage[i][b'BODY[]'])

Using the pyzmail module we can turn the raw email data into a PyzMessage object, which allows us to easily pull certain data from the email. In this case we’re pulling the sender’s email address:

temp_fromraw = temp_message.get_addresses('from')
temp_from_address = temp_fromraw[0][1]

The get_addresses('from') method returns a list of (name, address) tuples for the sender. In this case I only wanted the address, hence the [0][1] index. You could also pass 'to', 'cc', or 'bcc'.

We can also get the subject and the body of the email using the following lines of code:

# Subject
print(temp_message.get_subject())

# HTML body
print(temp_message.html_part.get_payload().decode(temp_message.html_part.charset))

# Text body
print(temp_message.text_part.get_payload().decode(temp_message.text_part.charset))

Getting the body of the email is evidently a little more involved. Emails can be sent as HTML, plaintext, or both. A solution could be doing something like this:

if temp_message.text_part is not None:
    temp_email_body = temp_message.text_part.get_payload().decode(temp_message.text_part.charset)
else:
    temp_email_body = temp_message.html_part.get_payload().decode(temp_message.html_part.charset)
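For comparison, the standard library’s email package can make the same plain-text-first choice without pyzmail. The sketch below is an alternative I’m adding for illustration, not what the script uses; extract_body() is a made-up helper, and the sample message is built locally so the snippet runs without a mailbox.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_body(raw_bytes):
    """Return the plain-text body if present, else the HTML body, else ''."""
    message = message_from_bytes(raw_bytes, policy=policy.default)
    part = message.get_body(preferencelist=('plain', 'html'))
    if part is None:
        return ''
    return part.get_content()


# Build a sample message with both a plain-text and an HTML version.
sample = EmailMessage()
sample["From"] = "exampleperson@email.com"
sample["Subject"] = "Sample"
sample.set_content("Hello in plain text")
sample.add_alternative("<p>Hello in HTML</p>", subtype="html")

print(extract_body(bytes(sample)))
```

In the script, the bytes passed in would be the raw message returned by fetch().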

Method two:

We can also use the email library to parse our emails into something intelligible for us mere mortals.

email_uid = 5

raw_msg = imap_obj.fetch([email_uid], 'RFC822')
email_message = email.message_from_bytes(raw_msg[email_uid][b'RFC822'])

In the above example we pass 'RFC822' as the data parameter of the fetch() method. Then we look up the fetched message by its UID and the b'RFC822' key (note the bytes prefix) and pass the result to the message_from_bytes() function.

Our email_message variable is an email.message.Message object, which means we can request some information from it. For example:

print(email_message["from"])
print(email_message["to"])
print(email_message["subject"])
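Headers fetched this way come back as raw strings, so the From header includes the display name and a subject with non-ASCII characters may still be encoded. The sketch below shows one way to tidy both using the standard library; the raw message is a made-up sample so the snippet runs on its own.

```python
import email
from email.header import decode_header, make_header
from email.utils import parseaddr

# A made-up raw message standing in for raw_msg[email_uid][b'RFC822'].
raw = (
    b"From: Example Person <exampleperson@email.com>\r\n"
    b"To: me@example.com\r\n"
    b"Subject: =?utf-8?q?Caf=C3=A9_account_confirmation?=\r\n"
    b"\r\n"
    b"Body text\r\n"
)
email_message = email.message_from_bytes(raw)

# Split the From header into display name and bare address.
sender_name, sender_address = parseaddr(email_message["from"])

# Decode an RFC 2047 encoded subject into a normal string.
subject = str(make_header(decode_header(email_message["subject"])))

print(sender_address)
print(subject)
```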

You could also loop through numerous emails using the following:

for uid, message_data in imap_obj.fetch(email_uids, 'RFC822').items():
    email_message = email.message_from_bytes(message_data[b'RFC822'])
    print(email_message["from"])
    print(email_message["to"])
    print(email_message["subject"])

The fetch() method returns a dictionary keyed by UID, so calling .items() gives us both the UID and the email data, which we can conveniently pass through to message_from_bytes().
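Fetching every message in one call can be heavy on a large inbox, so it may be worth splitting the UID list into batches. This is a sketch of that idea; chunked() is a hypothetical helper and the batch size of 100 is an arbitrary choice.

```python
def chunked(uids, size=100):
    """Yield successive slices of `uids`, each at most `size` items long."""
    for start in range(0, len(uids), size):
        yield uids[start:start + size]


print(list(chunked([1, 2, 3, 4, 5], size=2)))

# Usage with the script's connection:
# for batch in chunked(email_uids):
#     for uid, message_data in imap_obj.fetch(batch, 'RFC822').items():
#         email_message = email.message_from_bytes(message_data[b'RFC822'])
#         print(email_message["from"])
```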

I hope something I’ve gone over has been helpful and has sparked an interest in doing something with Python and emails.