
Data Processing with Regular Expressions in Python 3

Written by Daina Andries | Jan 31, 2018 5:00:00 AM

Applying Regular Expressions

This is a tutorial on processing data with regular expressions using Python. It is also a reflection on the advantages and trade-offs that come into play when you use regular expressions.

Once you have identified and defined a set of patterns, you can strategically search and extract data from raw text according to those patterns. Regular expressions are powerful, but they are not right for every job. Gauging whether your regex pattern can scale and handle any and every scenario is hard. Regular expressions are also hard for humans to read and problematic when it comes to maintenance.

A cautionary quote from programmer Jamie Zawinski says it all:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

In other words, regular expressions are at their best when short, scalable, and used sparingly.

Situations where using regex definitely makes sense include jobs where you have a lot of unstructured text and your objective is to extract recurring, highly standardized patterns, such as email addresses, phone numbers, or server logs. In other cases, such as the one we are about to explore, the advisability of applying regular expressions is debatable.
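For instance, here is a minimal sketch of pulling email addresses out of free text with findall() (the sample string and the simplified email pattern are illustrative only, not part of the dataset used below):

import re

text = 'Contact jane.doe@example.com or support@example.org for details.'
# A deliberately simple email pattern; real-world email matching is more involved.
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
print(emails)  # ['jane.doe@example.com', 'support@example.org']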

This tutorial will walk through a series of iterations of regex patterns for grouping author names using Python 3.6. While names often do follow conventional, standardized patterns, there are multiple variations to consider. At the end of the tutorial, we will also consider criteria for evaluating when regular expressions are worthwhile for data processing.

Python 3 brings its own affordances to the table: unlike Python 2, its strings are Unicode by default, and shorthand classes like \w match non-ASCII word characters. As we will see below, this matters for our dataset, since it contains many non-ASCII characters.

 

References on Basics of Regex and Unicode

There is plenty of existing documentation on the basics of regular expressions online. Here is a quick reference for a refresher or an introduction to regex symbols, as well as the documentation for the Python 3.6 regex flavor. In addition, here is a primer on Unicode.

In the end, the best way to work through the docs is to practice and experiment.

 

Reformatting Author Names According to the Same Pattern

In the example below, we will be dealing with a sample of names of authors. The dataset we are processing comes from a database table and is therefore already structured; however, the names in the “Author” field are far from standardized. Sometimes a string contains multiple co-authors, while sometimes the names are inverted.

The ultimate goal: to pull and sort different name patterns in the raw text strings, so that each author name can eventually be reformatted according to a standardized pattern (e.g., First name, Middle name/initial, Last name, suffix, prefix) and separated from any co-authors by a delimiter.

Let’s connect to the database first:

In [1]:
import os
import json
import pandas as pd
import re
import pymysql
from sqlalchemy import create_engine

# Import SQL credentials
def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

fpath = find('sql_credentials.json', '/Users')
jstr = open(fpath)
data = json.load(jstr)

# Save SQL credentials into variables
Host = data['Host']
User = data['User']
Password = data['Password']
Database = data['Database']

# engine_setup_query = 'mysql+pymysql://USERNAME:PASSWORD@HOSTNAME/DATABASE_NAME'
engine_setup_query = 'mysql+pymysql://%s:%s@%s/%s?charset=utf8' % (User, Password, Host, Database)
# Connect to database
engine = create_engine(engine_setup_query, encoding = 'utf-8')

Next, we’ll write a SQL query to retrieve entries where the character strings under ‘Author’ contain commas or ampersands (&), making it likely (but not certain) that there is more than one author. Note that the % wildcards in the LIKE clauses are doubled (%%) so that the literal percent signs are not mistaken for parameter placeholders.

In [2]:
mult_authors = pd.read_sql('SELECT * FROM Book_firstTestSuccess WHERE Author LIKE "%%,%%" OR Author LIKE "%%&%%"', 
                           con=engine)
print(mult_authors[['Title', 'Author']][40:51])
                                          Title  \
40                       Andersen's Fairy Tales   
41  The Canterbury Tales: the Man of Law's Tale   
42    The Canterbury Tales: The Pardoner's Tale   
43                     A Canticle for Leibowitz   
44                             The Future of Us   
45                                 Go Ask Alice   
46                           Idylls of the King   
47               In the Time of the Butterflies   
48    Middle School: The Worst Years of My Life   
49                       The Revenger's Tragedy   
50        The Sisterhood of the Traveling Pants   

                                               Author  
40                           Andersen, Hans Christian  
41                                  Chaucer, Geoffrey  
42                                  Chaucer, Geoffrey  
43                              Walter M. Miller, Jr.  
44                    Asher, Jay and Mackler, Carolyn  
45               Anonymous, edited by Beatrice Sparks  
46                              Alfred, Lord Tennyson  
47                                     Alvarez, Julia  
48                   James Patterson & Chris Tebbetts  
49  Thomas Middleton, previously attributed to Cyr...  
50                                     Brashares, Ann

The sample listing above shows our author names are far from standardized. Let’s extract the Author column from the pandas dataframe as a list and experiment with some regex patterns. We’ll also remove leading and trailing whitespace from each string with Python’s strip() method.

In [3]:
authors = [author.strip() for author in mult_authors['Author']]

With some initial exploration we can identify different kinds of name patterns: inverted names, names separated by an ampersand, and names followed by a suffix. These are all patterns that could be defined in regex; the trade-off is that they could easily overlap, depending on how we write the pattern.

 

Suffixes in Names

As we can see from our sample, sometimes our commas are not separating co-authors, but simply separating names from a suffix. We can very easily extract all the strings containing the suffix “Jr.” from the list.

In [4]:
for author in authors:
    pattern = re.compile('Jr\.|Jr', re.IGNORECASE)
    junior = pattern.search(author)
    if junior != None:
        print(author)
Bill A. Mesce Jr., Steven G. Szilagyi
A. B. Guthrie, Jr.
Walter M. Miller, Jr.
Martin Luther King, Jr.
Martin Luther King, Jr.
Horatio Alger, Jr.
Martin Luther King, Jr.
Richard Henry Dana, Jr.
E. J. Dionne Jr., Norman J. Ornstein and Thomas E. Mann
Walter M. Miller, Jr.
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Lynn White, Jr.
Tom Blagden, JR.
Robert J. Schneller, Jr.
Sam Bass Warner, Jr.
Bienvenido M. Noriega, Jr.
Christopher H. Foreman, Jr.
Robert J. Schneller, Jr.
Robert J. Schneller, Jr.
Daniel H. Usner, Jr.
Lindsay G. Arthur, Jr.
Martin Luther King, Jr.
Martin Luther King, Jr.

Note the IGNORECASE flag passed to re.compile(). Without this flag, we would not have had “Tom Blagden, JR.” in our results.
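A quick illustration with one of the strings above:

print(re.search(r'Jr\.|Jr', 'Tom Blagden, JR.'))                 # None
print(re.search(r'Jr\.|Jr', 'Tom Blagden, JR.', re.IGNORECASE))  # a match object for 'JR.'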

Let’s use the same approach to find authors who are doctors:

In [5]:
doctors = []
for author in authors:
    pattern = re.compile('M\.D\.|MD|Ph\.D|Ph\.D\.|PhD|Dr\.', re.IGNORECASE)
    doctor = pattern.search(author)
    if doctor != None:
        doctors.append(author)
        print(author)
Olivier  Ameisen, M.D.
Loren A. Olson, M.D.
Mary Pipher, Ph.D.
Jerome Groopman, M.D.
Suzanne Zoglio, Ph.D.
Abigail Brenner, MD
Christiane Northrup, M.D.
Dr. Ross Donaldson MD, MPH
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Andrew Weil, M.D., Rosie Daley
Andrew Newberg, M.D., Eugene d'Aquili
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Andrew Weil, M.D.
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Thomas J. Stanley, William D. Danko, Ph.D.

It’s worth noting that if we didn’t place a backslash before each ‘.’, that symbol would match any single character occurring between the other characters specified. If we change the escaped ‘M\.D\.’ to the unescaped ‘M.D.’, for example, we will get results like ‘Tim Bauerschmidt, Ramie Liddle’, because ‘M.D.’ matches the ‘midt’ in ‘Bauerschmidt’ (remember that the IGNORECASE flag is turned on), and ‘Jodi Picoult, Audra McDonald, Cassandra Campbell, Ari Fliakos’, because it matches the ‘McDo’ in ‘McDonald’.
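To see this concretely with one of the strings from our list:

print(re.search('M.D.', 'Tim Bauerschmidt, Ramie Liddle', re.IGNORECASE))     # matches 'midt' in 'Bauerschmidt'
print(re.search(r'M\.D\.', 'Tim Bauerschmidt, Ramie Liddle', re.IGNORECASE))  # None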

Now that we have the results of our search for doctors stored in a separate list, let’s apply a pattern to each string in the doctors list to separate the doctors’ names from their degrees (and to determine whether there is more than one name in the string). To do this, we’ll switch from search() to the findall() method, which returns every non-overlapping match of the pattern in a list.

In [6]:
for doctor in doctors:
    pattern = r"""
        \w+          # one or more word characters
        \s+          # one or more whitespaces
        [\w.']+      # one or more word characters, periods, or apostrophes
        [\s?\w'.]+   # one or more whitespace characters, word characters, apostrophes, or periods (the ? is literal here)
        """
    regex = re.compile(pattern, re.VERBOSE)
    name = regex.findall(doctor)
    print(name)
['Olivier  Ameisen']
['Loren A. Olson']
['Mary Pipher']
['Jerome Groopman']
['Suzanne Zoglio']
['Abigail Brenner']
['Christiane Northrup']
['Ross Donaldson MD']
['His Holiness The Dalai Lama', 'Howard C. Cutler']
['Andrew Weil', 'Rosie Daley']
['Andrew Newberg', "Eugene d'Aquili"]
['His Holiness The Dalai Lama', 'Howard C. Cutler']
['Andrew Weil']
['His Holiness The Dalai Lama', 'Howard C. Cutler']
['Thomas J. Stanley', 'William D. Danko']

The only string that trips us up is “Ross Donaldson MD”, but the results are acceptable. We’ve captured names that are a combination of first and last, names with a middle name or initial, and a title longer than three words, “His Holiness The Dalai Lama”. We are able to handle this longer title thanks to the last part of the pattern, ‘[\s?\w'.]+’. This part essentially says that the characters inside the square brackets are the only permissible characters to match: whitespace (\s), “word characters” (\w), apostrophes, and periods. (Inside a character class, the ? is treated as a literal question mark rather than as a quantifier.) As long as we keep encountering characters from this set, we keep pushing forward until we run out of them or we hit a character not included in the brackets (such as a comma).
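A quick check of that literal question mark behavior, using a standalone string rather than one from the dataset:

print(re.findall(r"[\s?\w'.]+", 'Who? Me?'))  # ['Who? Me?'], the ? is matched as a literal character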

Can we apply the same pattern to our juniors and achieve a similar result?

In [7]:
juniors = []
for author in authors:
    pattern = re.compile('Jr\.|Jr', re.IGNORECASE)
    junior = pattern.search(author)
    if junior != None:
        juniors.append(author)

for junior in juniors:
    pattern = re.compile("\w+\s+[\w.']+[\s?\w'.]+")
    name = pattern.findall(junior)
    print(junior)
    print(name)
Bill A. Mesce Jr., Steven G. Szilagyi
['Bill A. Mesce Jr.', 'Steven G. Szilagyi']
A. B. Guthrie, Jr.
[]
Walter M. Miller, Jr.
['Walter M. Miller']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']
Horatio Alger, Jr.
['Horatio Alger']
Martin Luther King, Jr.
['Martin Luther King']
Richard Henry Dana, Jr.
['Richard Henry Dana']
E. J. Dionne Jr., Norman J. Ornstein and Thomas E. Mann
['Dionne Jr.', 'Norman J. Ornstein and Thomas E. Mann']
Walter M. Miller, Jr.
['Walter M. Miller']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Lynn White, Jr.
['Lynn White']
Tom Blagden, JR.
['Tom Blagden']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Sam Bass Warner, Jr.
['Sam Bass Warner']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Christopher H. Foreman, Jr.
['Christopher H. Foreman']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Daniel H. Usner, Jr.
['Daniel H. Usner']
Lindsay G. Arthur, Jr.
['Lindsay G. Arthur']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']

As the results above show, the answer is…not quite. A few issues come to light here:

(1) Our pattern doesn’t account for the possibility that a name might have two first initials and a last name.

(2) The last part of our pattern, which keeps pushing through spaces and “word characters” and allows us to capture longer names like “His Holiness The Dalai Lama”, does not distinguish between words that are part of a name and a word like “and”.

(3) If there is no comma between the suffix and the name, we will capture the suffix as well. If we are really bent on dropping the suffix, then this is a real problem, but our goal right now is really just to separate co-authors by name, so we are going to consider this result acceptable for now. 

Let’s fix the first issue:

In [8]:
for junior in juniors:
    pattern = re.compile("[\w.']+\s+[\w.']+[\s?\w'.]+")
    name = pattern.findall(junior)
    print(junior)
    print(name)
Bill A. Mesce Jr., Steven G. Szilagyi
['Bill A. Mesce Jr.', 'Steven G. Szilagyi']
A. B. Guthrie, Jr.
['A. B. Guthrie']
Walter M. Miller, Jr.
['Walter M. Miller']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']
Horatio Alger, Jr.
['Horatio Alger']
Martin Luther King, Jr.
['Martin Luther King']
Richard Henry Dana, Jr.
['Richard Henry Dana']
E. J. Dionne Jr., Norman J. Ornstein and Thomas E. Mann
['E. J. Dionne Jr.', 'Norman J. Ornstein and Thomas E. Mann']
Walter M. Miller, Jr.
['Walter M. Miller']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Lynn White, Jr.
['Lynn White']
Tom Blagden, JR.
['Tom Blagden']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Sam Bass Warner, Jr.
['Sam Bass Warner']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Christopher H. Foreman, Jr.
['Christopher H. Foreman']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Daniel H. Usner, Jr.
['Daniel H. Usner']
Lindsay G. Arthur, Jr.
['Lindsay G. Arthur']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']

At this point, we could continue trying to write the perfect regex that covers it all, or we could combine our regex with one of the many string methods Python has to offer. Methods like split() and replace() provide a quick fix for our ‘and’ issue. It is important, however, to include the surrounding whitespace in the string we replace. If we forget the spaces, we will also replace occurrences of ‘and’ inside words, like the ‘and’ in ‘George Sand’, with a comma and a space.
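For example, with a standalone string rather than one from our dataset:

print('George Sand and Victor Hugo'.replace('and', ', '))    # mangles the 'and' inside 'Sand' as well
print('George Sand and Victor Hugo'.replace(' and ', ', '))  # George Sand, Victor Hugo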

In [9]:
for junior in juniors:
    junior = junior.replace(' and ', ', ')
    pattern = re.compile("[\w.']+\s+[\w.']+[\s?\w'.]+")
    name = pattern.findall(junior)
    print(junior)
    print(name)
Bill A. Mesce Jr., Steven G. Szilagyi
['Bill A. Mesce Jr.', 'Steven G. Szilagyi']
A. B. Guthrie, Jr.
['A. B. Guthrie']
Walter M. Miller, Jr.
['Walter M. Miller']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']
Horatio Alger, Jr.
['Horatio Alger']
Martin Luther King, Jr.
['Martin Luther King']
Richard Henry Dana, Jr.
['Richard Henry Dana']
E. J. Dionne Jr., Norman J. Ornstein, Thomas E. Mann
['E. J. Dionne Jr.', 'Norman J. Ornstein', 'Thomas E. Mann']
Walter M. Miller, Jr.
['Walter M. Miller']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Lynn White, Jr.
['Lynn White']
Tom Blagden, JR.
['Tom Blagden']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Sam Bass Warner, Jr.
['Sam Bass Warner']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Christopher H. Foreman, Jr.
['Christopher H. Foreman']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Daniel H. Usner, Jr.
['Daniel H. Usner']
Lindsay G. Arthur, Jr.
['Lindsay G. Arthur']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']

Since our goal is ultimately to separate out the author names in the strings, this combination succeeds.

Before laying this to rest, we can ask the reverse question: does this modified pattern still work for our doctors? The answer is yes, it’s satisfactory for our purposes. While we now return ‘Dr. Ross Donaldson MD’ instead of ‘Ross Donaldson MD’, this still satisfies our goal of identifying one name (as opposed to mistaking one name for two, or two for one). The overarching issue we were wrangling with was suffix-related anyway, because suffixes are often separated from a name by a comma. A prefix like ‘Dr.’, or a suffix that is not set off by a comma, is unlikely to throw us off.

In [10]:
for doctor in doctors:
    doctor = doctor.replace(' and ', ', ')
    pattern = re.compile("[\w.']+\s+[\w.']+[\s?\w'.]+")
    name = pattern.findall(doctor)
    print(doctor)
    print(name)
Olivier  Ameisen, M.D.
['Olivier  Ameisen']
Loren A. Olson, M.D.
['Loren A. Olson']
Mary Pipher, Ph.D.
['Mary Pipher']
Jerome Groopman, M.D.
['Jerome Groopman']
Suzanne Zoglio, Ph.D.
['Suzanne Zoglio']
Abigail Brenner, MD
['Abigail Brenner']
Christiane Northrup, M.D.
['Christiane Northrup']
Dr. Ross Donaldson MD, MPH
['Dr. Ross Donaldson MD']
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
['His Holiness The Dalai Lama', 'Howard C. Cutler']
Andrew Weil, M.D., Rosie Daley
['Andrew Weil', 'Rosie Daley']
Andrew Newberg, M.D., Eugene d'Aquili
['Andrew Newberg', "Eugene d'Aquili"]
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
['His Holiness The Dalai Lama', 'Howard C. Cutler']
Andrew Weil, M.D.
['Andrew Weil']
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
['His Holiness The Dalai Lama', 'Howard C. Cutler']
Thomas J. Stanley, William D. Danko, Ph.D.
['Thomas J. Stanley', 'William D. Danko']

Name Inversion

As we saw in the sample taken from the original dataframe, our dataset includes inverted names as well. This is another conventional pattern for writing names that we can separate out with regex.

In [11]:
inversions = []
for author in authors:
    pattern = r"""
            ^(\w+)       # starts with a group of one or more word characters
            ,\s*         # comma with zero or more whitespaces
            ([.\w\s]+)$  # ends with a group of one or more word characters, periods, or whitespaces
            """
    regex = re.compile(pattern, re.VERBOSE | re.IGNORECASE)
    inversion = regex.search(author)
    if inversion != None:
        print(inversion.group())
        inversions.append(inversion.group())
Andersen, Hans Christian
Chaucer, Geoffrey
Chaucer, Geoffrey
Anonymous, edited by Beatrice Sparks
Alfred, Lord Tennyson
Alvarez, Julia
Brashares, Ann
Sonya, Unrein
Hughes, John
Ho, John
Reed,Todd
Hughes, John
Schmitz, Barbara
天童荒太, Arata Tendo
Gorga, Gemma
Alape, Arturo
Domènech, Laia
Marguerite, Porete
Upstairs, Downstairs

The results look good for this pattern, except that we picked up an anonymous writer followed by a comma and an editor along the way. We could deal with this in the regular expression itself, or we could note it as yet another possibility and handle it at a later stage of the data cleaning process, in keeping with the principle of using regex in moderation. “Upstairs, Downstairs” is also not a name, but that is the fault of the data, not of our regex pattern.
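If we did want to handle the editor case in the pattern itself, one option (a sketch, not used in the processing below) is a negative lookahead that rejects a match when the text after the comma begins with “edited by”:

pattern = re.compile(r'^(\w+),(?!\s*edited\s+by)\s*([.\w\s]+)$', re.IGNORECASE)
print(pattern.search('Anonymous, edited by Beatrice Sparks'))  # None
print(pattern.search('Chaucer, Geoffrey').group())             # Chaucer, Geoffrey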

Python 3 and Non-ASCII Characters

The ‘\w’ symbol is key for capturing non-ASCII characters in Python 3. You can also capture them with the non-whitespace symbol ‘\S’, but that will include other non-word characters like commas. If we had used a character set like [A-Za-z], we would have missed names like ‘Domènech, Laia’.
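A quick comparison:

print(re.findall(r'[A-Za-z]+', 'Domènech, Laia'))  # ['Dom', 'nech', 'Laia']
print(re.findall(r'\w+', 'Domènech, Laia'))        # ['Domènech', 'Laia']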

‘天童荒太, Arata Tendo’, on the other hand, is an apposition, not an inversion. By excluding whitespace and the listed ASCII characters at the boundaries of the match, we avoid picking up Romance or Slavic names that mix ASCII and non-ASCII characters, and limit ourselves to retrieving strings written in non-Latin scripts such as Japanese or Arabic, which we can set aside for special processing as well. As the results below show, these are likely to be appositions, that is, versions of the same name written in different ways.

In [12]:
special_char = []
for author in authors:
    pattern = r"""
            [^()a-z\s,.\-\d&;:'] # one character outside the excluded ASCII set (IGNORECASE makes a-z cover A-Z too)
            [\w]+                # one or more word characters
            [^()a-z\s,.\-\d&;:'] # one more character outside the excluded ASCII set
            """
    regex = re.compile(pattern, re.VERBOSE | re.IGNORECASE)
    char = regex.findall(author)
    if char != []:
        print(char)
        print(author)
        special_char.append(author)
['أمين', 'الزاوي']
Amin Zaoui, أمين الزاوي
['天童荒太']
天童荒太, Arata Tendo

Mining with More General Patterns

In the case of this tutorial’s exercise, the suffix patterns could be defined at a higher level of abstraction. We began the process by experimenting with expressions based on a sampling of names. We knew there were names with the suffix “Jr.” so we mined for juniors, but there were a lot of suffixes that we missed using this approach.

In [13]:
generalized_suffix = []
for author in authors:
    pattern = re.compile(',\s*([\w\.]+)$', re.IGNORECASE)
    suffix = pattern.search(author)
    if suffix != None:
        generalized_suffix.append(author)
        print(author)
A. B. Guthrie, Jr.
Walter M. Miller, Jr.
Alexandre Dumas, père
Martin Luther King, Jr.
Martin Luther King, Jr.
Horatio Alger, Jr.
Martin Luther King, Jr.
Richard Henry Dana, Jr.
Chaucer, Geoffrey
Chaucer, Geoffrey
Walter M. Miller, Jr.
Asher, Jay and Mackler, Carolyn
Alvarez, Julia
Brashares, Ann
Delany, Delany, Hearth
Sonya, Unrein
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Hughes, John
Lynn White, Jr.
Larry Alboher, D.C.
Tom Blagden, JR.
Phaik Gan, Lim
Ho, John
Joan Plummer Russell, P.
Robert J. Schneller, Jr.
Patty Ptak Kogutek, Ed.D.
Reed,Todd
John Whiteclay Chambers, II
Sam Bass Warner, Jr.
Bienvenido M. Noriega, Jr.
Curriculum Planning & Development Division, Ministry of Education, Singapore
Christopher H. Foreman, Jr.
Hughes, John
Schmitz, Barbara
Robert J. Schneller, Jr.
Robert J. Schneller, Jr.
Daniel H. Usner, Jr.
Gorga, Gemma
Garcia i Cornellà, Dolors
Diane P, Lando
Alape, Arturo
Domènech, Laia
Olivier  Ameisen, M.D.
Loren A. Olson, M.D.
Mary Pipher, Ph.D.
Marvin Thomas, MSW
Jerome Groopman, M.D.
Suzanne Zoglio, Ph.D.
Lindsay G. Arthur, Jr.
Abigail Brenner, MD
Christiane Northrup, M.D.
Marguerite, Porete
Martin Luther King, Jr.
Martin Luther King, Jr.
Upstairs, Downstairs
Dr. Ross Donaldson MD, MPH
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Wann, de Graaf, Naylor
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Andrew Weil, M.D.
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Thomas J. Stanley, William D. Danko, Ph.D.

By searching only for PhDs, MDs, and Jrs, we missed a lot of other, less common suffixes like “MSW”, “Ed.D.”, “D.C.”, and “père”.

A more generalized, abstract pattern will pick them up; the trade-off is it picks up inversions too!

In [14]:
suffixes = []
inversions = []
remainder = []
for general in generalized_suffix:
    general = general.replace(' & ', '; ').replace(' and ', '; ').split(';')
    if len(general) > 1 :
        remainder.append(general)
    else:
        pattern = r"""
                ^([\w]+)     # starts with a group of one or more word characters
                ,\s*         # comma with zero or more whitespaces
                ([.\w\s]+)$  # ends with a group of one or more word characters, periods, or whitespaces
                """
        regex = re.compile(pattern, re.VERBOSE | re.IGNORECASE)
        inversion = regex.findall(general[0])
        if inversion != []:
            inversions.append(inversion)
        else:
            suffixes.append(general)
    

print('Inversions: ', '\n')
for inversion in inversions:
    print(list(inversion[0]))
print('\n','Suffixes :', '\n')   
for suffix in suffixes:
    print(suffix)
print('\n', 'Left Over:', '\n')
for remaining in remainder:
    print(remaining)
Inversions:  

['Chaucer', 'Geoffrey']
['Chaucer', 'Geoffrey']
['Alvarez', 'Julia']
['Brashares', 'Ann']
['Sonya', 'Unrein']
['Hughes', 'John']
['Ho', 'John']
['Reed', 'Todd']
['Hughes', 'John']
['Schmitz', 'Barbara']
['Gorga', 'Gemma']
['Alape', 'Arturo']
['Domènech', 'Laia']
['Marguerite', 'Porete']
['Upstairs', 'Downstairs']

 Suffixes : 

['A. B. Guthrie, Jr.']
['Walter M. Miller, Jr.']
['Alexandre Dumas, père']
['Martin Luther King, Jr.']
['Martin Luther King, Jr.']
['Horatio Alger, Jr.']
['Martin Luther King, Jr.']
['Richard Henry Dana, Jr.']
['Walter M. Miller, Jr.']
['Delany, Delany, Hearth']
['Bienvenido M. Noriega, Jr.']
['Bienvenido M. Noriega, Jr.']
['Bienvenido M. Noriega, Jr.']
['Bienvenido M. Noriega, Jr.']
['Lynn White, Jr.']
['Larry Alboher, D.C.']
['Tom Blagden, JR.']
['Phaik Gan, Lim']
['Joan Plummer Russell, P.']
['Robert J. Schneller, Jr.']
['Patty Ptak Kogutek, Ed.D.']
['John Whiteclay Chambers, II']
['Sam Bass Warner, Jr.']
['Bienvenido M. Noriega, Jr.']
['Christopher H. Foreman, Jr.']
['Robert J. Schneller, Jr.']
['Robert J. Schneller, Jr.']
['Daniel H. Usner, Jr.']
['Garcia i Cornellà, Dolors']
['Diane P, Lando']
['Olivier  Ameisen, M.D.']
['Loren A. Olson, M.D.']
['Mary Pipher, Ph.D.']
['Marvin Thomas, MSW']
['Jerome Groopman, M.D.']
['Suzanne Zoglio, Ph.D.']
['Lindsay G. Arthur, Jr.']
['Abigail Brenner, MD']
['Christiane Northrup, M.D.']
['Martin Luther King, Jr.']
['Martin Luther King, Jr.']
['Dr. Ross Donaldson MD, MPH']
['His Holiness The Dalai Lama, Howard C. Cutler, M.D.']
['Wann, de Graaf, Naylor']
['His Holiness The Dalai Lama, Howard C. Cutler, M.D.']
['Andrew Weil, M.D.']
['His Holiness The Dalai Lama, Howard C. Cutler, M.D.']
['Thomas J. Stanley, William D. Danko, Ph.D.']

 Left Over: 

['Asher, Jay', ' Mackler, Carolyn']
['Curriculum Planning', ' Development Division, Ministry of Education, Singapore']

The results are not perfect. The suffix bucket still captures some inverted names, like “Phaik Gan, Lim” and “Garcia i Cornellà, Dolors”: because the part before the comma contains whitespace, the inversion pattern (which expects a single word before the comma) does not match, and these strings fall through to the suffix bucket. Using our generalized patterns alone, these names cannot be distinguished from a name followed by a suffix; the pattern is the same for both.
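To see why, compare two of these strings against the generalized suffix pattern from In [13]: both match it in exactly the same way.

pattern = re.compile(',\s*([\w\.]+)$', re.IGNORECASE)
print(pattern.search('Walter M. Miller, Jr.').group(1))  # Jr.
print(pattern.search('Phaik Gan, Lim').group(1))         # Lim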

Is It Worth It? General Takeaways for Data Processing

To summarize what we’ve done: we’ve identified categories of name patterns, and experimented with handling suffixes and prefixes, inverted names, and appositions including non-ASCII characters. We’ve also seen the limitations of regex patterns: they can be perfectly correct but still return unwanted results that follow the same pattern. Now for the bigger question, is regex right for the job?

I would say “yes” only under the following conditions:

(1) The dataset is large enough to make manual cleanup tedious, but small enough for manual checks.

(2) The data can be divided into non-overlapping “buckets” or categories. The buckets should not be too numerous.

(3) The job will be repeated on a regular basis.

If these conditions apply, having a few regex patterns ready to chisel away and organize incoming data into buckets will probably help speed up the data cleaning process. While the regex patterns above could be argued to be too specific to the given data, a few more iterations would yield insight into which patterns are reusable, scalable, and worth keeping, and which ones should be thrown out or revised to be more generalized.
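As a rough illustration, a bucketing step built from the patterns above might look something like the sketch below (the bucket() helper and its category labels are hypothetical, not part of the pipeline in this post):

import re

SUFFIX = re.compile(r',\s*([\w.]+)$', re.IGNORECASE)
INVERSION = re.compile(r'^(\w+),\s*([.\w\s]+)$', re.IGNORECASE)
NON_ASCII = re.compile(r"[^()a-z\s,.\-\d&;:']\w+[^()a-z\s,.\-\d&;:']", re.IGNORECASE)

def bucket(author):
    # Hypothetical helper: route a raw author string to a rough category
    # for later, more targeted cleanup. Multi-author strings would still
    # need to be split on ' and ', '&', or commas first, as shown above.
    author = author.strip()
    if NON_ASCII.search(author):
        return 'non_ascii'
    if INVERSION.search(author):
        return 'inversion'
    if SUFFIX.search(author):
        return 'suffix'
    return 'other'

print(bucket('Chaucer, Geoffrey'))      # inversion
print(bucket('Walter M. Miller, Jr.'))  # suffix
print(bucket('天童荒太, Arata Tendo'))    # non_ascii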

In our case with the author names, regex could be part of the solution, but definitely not THE solution by any means. The dataset is small enough (only about 300 names), and there are discernible categories. Nevertheless, these categories contained a lot of variations and these variations could often overlap. It would not be practical or advisable to build a process relying on regular expressions alone.
