This is a tutorial on processing data with regular expressions using Python. It is also a reflection on the advantages and trade-offs that come into play when you use regular expressions.
Once you have identified and defined a set of patterns, you can strategically search and extract data from raw text according to those patterns. Regular expressions are powerful, but they are not right for every job. Gauging whether your regex pattern can scale and handle any and every scenario is hard. Regular expressions are also hard for humans to read and problematic when it comes to maintenance.
A cautionary quote from programmer Jamie Zawinski says it all:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
In other words, regular expressions are at their best when short, scalable, and used sparingly.
Situations where using regex definitely makes sense include jobs where you have a lot of unstructured text and your objective is to extract recurring, highly standardized patterns, such as email addresses, phone numbers, or server logs. In other cases, such as the one we are about to explore, the advisability of applying regular expressions is debatable.
This tutorial will walk through a series of iterations of regex patterns for grouping author names using Python 3.6. While names often follow conventional, standardized patterns, there are multiple variations to consider. At the end of the tutorial, we will also consider criteria for evaluating when regular expressions are worthwhile for data processing.
Python 3 brings its own affordances to the table, in that it differs from Python 2 in the way it handles Unicode. As we will see below, handling Unicode will be important in the case of our dataset, since the dataset contains many non-ASCII characters.
There is plenty of existing documentation on the basics of regular expressions online. Here is a quick reference for a refresher or an introduction to regex symbols, as well as the documentation for the Python 3.6 regex flavor. In addition, here is a primer on Unicode.
In the end, the best way to work through the docs is to practice and experiment.
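For instance, here is a quick experiment worth running (the two name strings come from the dataset we are about to see): in Python 3, patterns compiled from str objects are Unicode-aware by default, so ‘\w’ matches accented and non-Latin word characters without any extra flags.

import re

# In Python 3, '\w' matches Unicode word characters by default.
print(re.findall(r'\w+', "Domènech, Laia"))
# ['Domènech', 'Laia']
print(re.findall(r'\w+', "天童荒太, Arata Tendo"))
# ['天童荒太', 'Arata', 'Tendo']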
In the example below, we will be dealing with a sample of names of authors. The dataset we are processing comes from a database table and is therefore already structured; however, the names in the “Author” field are far from standardized. Sometimes a string contains multiple co-authors, while sometimes the names are inverted.
The ultimate goal: to pull and sort different name patterns in the raw text strings, so that each author name can eventually be reformatted according to a standardized pattern (e.g., First name, Middle name/initial, Last name, suffix, prefix) and separated from any co-authors by a delimiter.
Let’s connect to the database first:
import os
import json
import pandas as pd
import re
import pymysql
from sqlalchemy import create_engine

# Import SQL credentials
def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

fpath = find('sql_credentials.json', '/Users')
jstr = open(fpath)
data = json.load(jstr)

# Save SQL credentials into variables
Host = data['Host']
User = data['User']
Password = data['Password']
Database = data['Database']

# engine_setup_query = 'mysql+pymysql://USERNAME:PASSWORD@HOSTNAME/DATABASE_NAME'
engine_setup_query = 'mysql+pymysql://%s:%s@%s/%s?charset=utf8' % (User, Password, Host, Database)

# Connect to database
engine = create_engine(engine_setup_query, encoding = 'utf-8')
Next, we’ll write a SQL query to retrieve entries where the character strings under ‘Author’ contain commas or ampersands (&), making it likely (but not certain) that there is more than one author.
mult_authors = pd.read_sql('SELECT * FROM Book_firstTestSuccess WHERE Author LIKE "%%,%%" OR Author LIKE "%%&%%"', con=engine)
print(mult_authors[['Title', 'Author']][40:51])
                                          Title  \
40                       Andersen's Fairy Tales
41  The Canterbury Tales: the Man of Law's Tale
42    The Canterbury Tales: The Pardoner's Tale
43                     A Canticle for Leibowitz
44                             The Future of Us
45                                 Go Ask Alice
46                           Idylls of the King
47               In the Time of the Butterflies
48    Middle School: The Worst Years of My Life
49                       The Revenger's Tragedy
50        The Sisterhood of the Traveling Pants

                                                Author
40                            Andersen, Hans Christian
41                                   Chaucer, Geoffrey
42                                   Chaucer, Geoffrey
43                               Walter M. Miller, Jr.
44                     Asher, Jay and Mackler, Carolyn
45                Anonymous, edited by Beatrice Sparks
46                               Alfred, Lord Tennyson
47                                      Alvarez, Julia
48                    James Patterson & Chris Tebbetts
49  Thomas Middleton, previously attributed to Cyr...
50                                      Brashares, Ann
The sample listing above shows our author names are far from standardized. Let’s extract the Author column from the pandas dataframe as a list and experiment with some regex patterns. As we extract each string, we’ll also remove any leading and trailing whitespace using the Python method strip().
authors = [author.strip() for author in mult_authors['Author']]
With some initial exploration we can identify different kinds of name patterns: inverted names, names separated by an ampersand, and names followed by a suffix. These are all patterns that could be defined in regex; the trade-off is that they could easily overlap, depending on how we write the pattern.
As we can see from our sample, sometimes our commas are not separating co-authors, but simply separating names from a suffix. We can very easily extract all the strings containing the suffix “Jr.” from the list.
for author in authors:
    pattern = re.compile('Jr\.|Jr', re.IGNORECASE)
    junior = pattern.search(author)
    if junior != None:
        print(author)
Bill A. Mesce Jr., Steven G. Szilagyi
A. B. Guthrie, Jr.
Walter M. Miller, Jr.
Martin Luther King, Jr.
Martin Luther King, Jr.
Horatio Alger, Jr.
Martin Luther King, Jr.
Richard Henry Dana, Jr.
E. J. Dionne Jr., Norman J. Ornstein and Thomas E. Mann
Walter M. Miller, Jr.
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Lynn White, Jr.
Tom Blagden, JR.
Robert J. Schneller, Jr.
Sam Bass Warner, Jr.
Bienvenido M. Noriega, Jr.
Christopher H. Foreman, Jr.
Robert J. Schneller, Jr.
Robert J. Schneller, Jr.
Daniel H. Usner, Jr.
Lindsay G. Arthur, Jr.
Martin Luther King, Jr.
Martin Luther King, Jr.
Note the IGNORECASE flag passed to re.compile(). Without this flag, we would not have had “Tom Blagden, JR.” in our results.
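A minimal illustration of the difference, using that same string from our data:

# Without IGNORECASE, 'Jr' only matches a capital 'J' followed by a lowercase 'r'.
print(re.compile('Jr\.|Jr').search('Tom Blagden, JR.'))
# None
print(re.compile('Jr\.|Jr', re.IGNORECASE).search('Tom Blagden, JR.'))
# a match object covering 'JR.'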
Let’s use the same approach to find authors who are doctors:
doctors = []
for author in authors:
    pattern = re.compile('M\.D\.|MD|Ph\.D|Ph\.D\.|PhD|Dr\.', re.IGNORECASE)
    doctor = pattern.search(author)
    if doctor != None:
        doctors.append(author)
        print(author)
Olivier Ameisen, M.D.
Loren A. Olson, M.D.
Mary Pipher, Ph.D.
Jerome Groopman, M.D.
Suzanne Zoglio, Ph.D.
Abigail Brenner, MD
Christiane Northrup, M.D.
Dr. Ross Donaldson MD, MPH
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Andrew Weil, M.D., Rosie Daley
Andrew Newberg, M.D., Eugene d'Aquili
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Andrew Weil, M.D.
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Thomas J. Stanley, William D. Danko, Ph.D.
It’s worth noting that if we didn’t place a backslash before each ‘.’, that symbol would match any character in regex. If we changed ‘M\.D\.’ to ‘M.D.’, for example, we would get results like ‘Tim Bauerschmidt, Ramie Liddle’ due to the ‘mid’ in ‘Bauerschmidt’ (we have the IGNORECASE flag turned on) and ‘Jodi Picoult, Audra McDonald, Cassandra Campbell, Ari Fliakos’ due to the ‘McD’ in ‘McDonald’.
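A quick check of that behavior (both strings come straight from the dataset):

# With the '.' unescaped it matches any character, so 'M.D.' finds 'midt' in
# 'Bauerschmidt' and 'McDo' in 'McDonald' once IGNORECASE is on.
loose = re.compile('M.D.', re.IGNORECASE)
print(loose.search('Tim Bauerschmidt, Ramie Liddle'))
# matches 'midt'
print(loose.search('Jodi Picoult, Audra McDonald, Cassandra Campbell, Ari Fliakos'))
# matches 'McDo'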
Now that we have the results of our search for doctors stored in a separate list, let’s apply a pattern to each string in the doctors list to separate the doctors’ names from their degrees (and determine whether there is more than one name in the string). To do this, we’ll switch from search() to the findall() method, which returns every non-overlapping match of the pattern in a string as a list.
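To see the difference between the two methods, here is a quick sketch on one of the doctor strings, using a simplified throwaway pattern (not the one we are about to build):

s = "Andrew Weil, M.D., Rosie Daley"
pattern = re.compile(r"[A-Z]\w+ \w+")

# search() stops at the first match and returns a match object (or None).
print(pattern.search(s).group())
# 'Andrew Weil'

# findall() keeps going and returns every non-overlapping match as a list of strings.
print(pattern.findall(s))
# ['Andrew Weil', 'Rosie Daley']

Now let’s apply findall() with a fuller pattern, commented with the help of re.VERBOSE: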
for doctor in doctors:
    pattern = r"""
        \w+          # one or more word characters
        \s+          # one or more whitespaces
        [\w.']+      # one or more word characters, periods or apostrophes
        [\s?\w'.]+   # one or more whitespaces, word characters, apostrophes or periods (a literal '?' is also allowed)
        """
    regex = re.compile(pattern, re.VERBOSE)
    name = regex.findall(doctor)
    print(name)
['Olivier Ameisen']
['Loren A. Olson']
['Mary Pipher']
['Jerome Groopman']
['Suzanne Zoglio']
['Abigail Brenner']
['Christiane Northrup']
['Ross Donaldson MD']
['His Holiness The Dalai Lama', 'Howard C. Cutler']
['Andrew Weil', 'Rosie Daley']
['Andrew Newberg', "Eugene d'Aquili"]
['His Holiness The Dalai Lama', 'Howard C. Cutler']
['Andrew Weil']
['His Holiness The Dalai Lama', 'Howard C. Cutler']
['Thomas J. Stanley', 'William D. Danko']
The only string that trips us up is “Ross Donaldson MD”, but the results are acceptable. We’ve captured names that are a combination of first and last, names with a middle name or initial, and even a title longer than three words, “His Holiness The Dalai Lama”. We are able to handle this longer title thanks to the last part of the pattern, ‘[\s?\w'.]+’. The square brackets define the set of permissible characters: whitespace (\s), “word characters” (\w), the apostrophe (') and the period (.). (The question mark inside the brackets is treated as a literal character rather than a quantifier, so it simply allows a literal ‘?’ to be matched.) As long as we keep encountering characters from this set, we keep pushing forward until we run out of them or hit a character not included in the brackets (such as a comma).
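We can confirm that the question mark inside the brackets is treated literally with a throwaway check (the test string is made up):

# Inside a character class, '?' is just another allowed character, not a quantifier.
print(re.findall(r"[\s?\w'.]+", "Cutler? M.D."))
# ['Cutler? M.D.']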
Can we apply the same pattern to our juniors and achieve a similar result?
juniors = []
for author in authors:
    pattern = re.compile('Jr\.|Jr', re.IGNORECASE)
    junior = pattern.search(author)
    if junior != None:
        juniors.append(author)

for junior in juniors:
    pattern = re.compile("\w+\s+[\w.']+[\s?\w'.]+")
    name = pattern.findall(junior)
    print(junior)
    print(name)
Bill A. Mesce Jr., Steven G. Szilagyi
['Bill A. Mesce Jr.', 'Steven G. Szilagyi']
A. B. Guthrie, Jr.
[]
Walter M. Miller, Jr.
['Walter M. Miller']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']
Horatio Alger, Jr.
['Horatio Alger']
Martin Luther King, Jr.
['Martin Luther King']
Richard Henry Dana, Jr.
['Richard Henry Dana']
E. J. Dionne Jr., Norman J. Ornstein and Thomas E. Mann
['Dionne Jr.', 'Norman J. Ornstein and Thomas E. Mann']
Walter M. Miller, Jr.
['Walter M. Miller']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Lynn White, Jr.
['Lynn White']
Tom Blagden, JR.
['Tom Blagden']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Sam Bass Warner, Jr.
['Sam Bass Warner']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Christopher H. Foreman, Jr.
['Christopher H. Foreman']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Daniel H. Usner, Jr.
['Daniel H. Usner']
Lindsay G. Arthur, Jr.
['Lindsay G. Arthur']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']
As the results above show, the answer is…not quite. A few issues come to light here:
(1) Our pattern doesn’t account for the possibility that a name might have two first initials and a last name.
(2) The last part of our pattern, which keeps pushing through spaces and “word characters” and allows us to capture longer names like “His Holiness The Dalai Lama”, does not distinguish between words that are part of a name and a word like “and”.
(3) If there is no comma between the suffix and the name, we capture the suffix as well. If we were bent on dropping the suffix, this would be a problem, but our goal right now is just to separate co-authors by name, so we will consider this result acceptable for now.
Let’s fix the first issue:
for junior in juniors:
    pattern = re.compile("[\w.']+\s+[\w.']+[\s?\w'.]+")
    name = pattern.findall(junior)
    print(junior)
    print(name)
Bill A. Mesce Jr., Steven G. Szilagyi
['Bill A. Mesce Jr.', 'Steven G. Szilagyi']
A. B. Guthrie, Jr.
['A. B. Guthrie']
Walter M. Miller, Jr.
['Walter M. Miller']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']
Horatio Alger, Jr.
['Horatio Alger']
Martin Luther King, Jr.
['Martin Luther King']
Richard Henry Dana, Jr.
['Richard Henry Dana']
E. J. Dionne Jr., Norman J. Ornstein and Thomas E. Mann
['E. J. Dionne Jr.', 'Norman J. Ornstein and Thomas E. Mann']
Walter M. Miller, Jr.
['Walter M. Miller']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Lynn White, Jr.
['Lynn White']
Tom Blagden, JR.
['Tom Blagden']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Sam Bass Warner, Jr.
['Sam Bass Warner']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Christopher H. Foreman, Jr.
['Christopher H. Foreman']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Daniel H. Usner, Jr.
['Daniel H. Usner']
Lindsay G. Arthur, Jr.
['Lindsay G. Arthur']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']
At this point, we could keep trying to write the perfect regex that covers it all, or we could combine our regex with one of the many methods Python has to offer. Python has very useful string methods like split() and replace() that provide a quick fix for our ‘and’ issue. One important thing to include in the string we replace, however, is the whitespace around the word. If we forget the spaces, we will replace occurrences of ‘and’ inside words, like the ‘and’ in ‘George Sand’, with a comma and a space.
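A quick illustration of why the surrounding spaces matter (the string here is made up for the purpose):

# Without the surrounding spaces, replace() also rewrites 'and' inside words.
print('George Sand and Victor Hugo'.replace('and', ', '))
# 'George S,  ,  Victor Hugo'  (the 'and' inside 'Sand' was replaced too)
print('George Sand and Victor Hugo'.replace(' and ', ', '))
# 'George Sand, Victor Hugo'

With the spaces in place, we can fold the replacement into our loop: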
for junior in juniors:
    junior = junior.replace(' and ', ', ')
    pattern = re.compile("[\w.']+\s+[\w.']+[\s?\w'.]+")
    name = pattern.findall(junior)
    print(junior)
    print(name)
Bill A. Mesce Jr., Steven G. Szilagyi
['Bill A. Mesce Jr.', 'Steven G. Szilagyi']
A. B. Guthrie, Jr.
['A. B. Guthrie']
Walter M. Miller, Jr.
['Walter M. Miller']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']
Horatio Alger, Jr.
['Horatio Alger']
Martin Luther King, Jr.
['Martin Luther King']
Richard Henry Dana, Jr.
['Richard Henry Dana']
E. J. Dionne Jr., Norman J. Ornstein, Thomas E. Mann
['E. J. Dionne Jr.', 'Norman J. Ornstein', 'Thomas E. Mann']
Walter M. Miller, Jr.
['Walter M. Miller']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Lynn White, Jr.
['Lynn White']
Tom Blagden, JR.
['Tom Blagden']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Sam Bass Warner, Jr.
['Sam Bass Warner']
Bienvenido M. Noriega, Jr.
['Bienvenido M. Noriega']
Christopher H. Foreman, Jr.
['Christopher H. Foreman']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Robert J. Schneller, Jr.
['Robert J. Schneller']
Daniel H. Usner, Jr.
['Daniel H. Usner']
Lindsay G. Arthur, Jr.
['Lindsay G. Arthur']
Martin Luther King, Jr.
['Martin Luther King']
Martin Luther King, Jr.
['Martin Luther King']
Since our goal is ultimately to separate out the author names in the strings, this combination succeeds.
Before laying this to rest, we could ask the reverse question: will this modified pattern work for our doctors? The answer is yes, it’s satisfactory for our purposes. While we now return ‘Dr. Ross Donaldson MD’ instead of the ‘Ross Donaldson MD’ we got above, this still satisfies our goal of identifying one name (as opposed to mistaking one name for two, or two for one). The overarching issue we were wrangling with was also more suffix-related, because suffixes are often separated from a name by a comma. A prefix like ‘Dr.’, or a suffix that is not set off by a comma, is not likely to throw us off in this case.
for doctor in doctors:
    doctor = doctor.replace(' and ', ', ')
    pattern = re.compile("[\w.']+\s+[\w.']+[\s?\w'.]+")
    name = pattern.findall(doctor)
    print(doctor)
    print(name)
Olivier Ameisen, M.D.
['Olivier Ameisen']
Loren A. Olson, M.D.
['Loren A. Olson']
Mary Pipher, Ph.D.
['Mary Pipher']
Jerome Groopman, M.D.
['Jerome Groopman']
Suzanne Zoglio, Ph.D.
['Suzanne Zoglio']
Abigail Brenner, MD
['Abigail Brenner']
Christiane Northrup, M.D.
['Christiane Northrup']
Dr. Ross Donaldson MD, MPH
['Dr. Ross Donaldson MD']
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
['His Holiness The Dalai Lama', 'Howard C. Cutler']
Andrew Weil, M.D., Rosie Daley
['Andrew Weil', 'Rosie Daley']
Andrew Newberg, M.D., Eugene d'Aquili
['Andrew Newberg', "Eugene d'Aquili"]
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
['His Holiness The Dalai Lama', 'Howard C. Cutler']
Andrew Weil, M.D.
['Andrew Weil']
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
['His Holiness The Dalai Lama', 'Howard C. Cutler']
Thomas J. Stanley, William D. Danko, Ph.D.
['Thomas J. Stanley', 'William D. Danko']
As could be seen from our sample of names taken from the original dataframe, our dataset includes inverted names as well. This is another conventional pattern for writing names that we can separate out with regex.
inversions = []
for author in authors:
    pattern = r"""
        ^(\w+)        # starts with a group of one or more word characters
        ,\s*          # comma with zero or more whitespaces
        ([.\w\s]+)$   # ends with a group of one or more word characters, whitespaces or periods
        """
    regex = re.compile(pattern, re.VERBOSE | re.IGNORECASE)
    inversion = regex.search(author)
    if inversion != None:
        print(inversion.group())
        inversions.append(inversion.group())
Andersen, Hans Christian
Chaucer, Geoffrey
Chaucer, Geoffrey
Anonymous, edited by Beatrice Sparks
Alfred, Lord Tennyson
Alvarez, Julia
Brashares, Ann
Sonya, Unrein
Hughes, John
Ho, John
Reed,Todd
Hughes, John
Schmitz, Barbara
天童荒太, Arata Tendo
Gorga, Gemma
Alape, Arturo
Domènech, Laia
Marguerite, Porete
Upstairs, Downstairs
Results look good for the pattern, other than that we picked up an anonymous writer followed by a comma and an editor along the way. We could deal with this in the regular expression itself, or we could take note of this as yet another possibility and deal with it at another stage of the data cleaning process, in keeping with the principle of using regex in moderation. “Upstairs, Downstairs” is also not a name, but this is the fault of the data and not our regex pattern.
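If we did want to handle it in the expression itself, one option would be a negative lookahead placed right after the comma. This is just a sketch, not part of the pipeline above:

# Refuse to treat ", edited by ..." as the second half of an inverted name.
no_editor = re.compile(r"^(\w+),(?!\s*edited\s+by)\s*([.\w\s]+)$", re.IGNORECASE)
print(no_editor.search('Anonymous, edited by Beatrice Sparks'))
# None
print(no_editor.search('Chaucer, Geoffrey').group())
# 'Chaucer, Geoffrey'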
The ‘\w’ symbol is key for capturing non-ASCII characters in Python 3. You can also capture them with the non-whitespace symbol ‘\S’, but this will include other non-word characters like commas. If we had used a character set like [A-Za-z], we would have missed names like ‘Domènech, Laia’.
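A quick comparison makes the point (both patterns are throwaway variants of the inversion pattern above, written only for this check):

# An ASCII-only class stops at the accented character; '\w' does not.
print(re.search(r'^([A-Za-z]+),\s*([A-Za-z\s]+)$', 'Domènech, Laia'))
# None
print(re.search(r'^(\w+),\s*([.\w\s]+)$', 'Domènech, Laia').group())
# 'Domènech, Laia'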
‘天童荒太, Arata Tendo’, on the other hand, is an apposition, not an inversion. By excluding whitespace and all other ASCII characters, we eliminate the likelihood of picking up Romance or Slavic languages that mix ASCII and non-ASCII characters, and can limit ourselves to retrieving just the special character sets of languages like Japanese and Arabic, setting these aside for special processing as well. As the results below show, these are likely to be appositions, or versions of the same name written in different ways.
special_char = []
for author in authors:
    pattern = r"""
        [^()a-z\s,.\-\d&;:']   # exclude all characters inside the brackets, including whitespace, up to a special character
        [\w]+                  # pick up one or more unicode/special characters, stop at the excluded character set
        [^()a-z\s,.\-\d&;:']   # exclude all characters inside the brackets, including whitespace, until the end
        """
    regex = re.compile(pattern, re.VERBOSE | re.IGNORECASE)
    char = regex.findall(author)
    if char != []:
        print(char)
        print(author)
        special_char.append(author)
['أمين', 'الزاوي']
Amin Zaoui, أمين الزاوي
['天童荒太']
天童荒太, Arata Tendo
In the case of this tutorial’s exercise, the suffix patterns could be defined at a higher level of abstraction. We began the process by experimenting with expressions based on a sampling of names. We knew there were names with the suffix “Jr.” so we mined for juniors, but there were a lot of suffixes that we missed using this approach.
generalized_suffix = []
for author in authors:
    pattern = re.compile(',\s*([\w\.]+)$', re.IGNORECASE)
    suffix = pattern.search(author)
    if suffix != None:
        generalized_suffix.append(author)
        print(author)
A. B. Guthrie, Jr.
Walter M. Miller, Jr.
Alexandre Dumas, père
Martin Luther King, Jr.
Martin Luther King, Jr.
Horatio Alger, Jr.
Martin Luther King, Jr.
Richard Henry Dana, Jr.
Chaucer, Geoffrey
Chaucer, Geoffrey
Walter M. Miller, Jr.
Asher, Jay and Mackler, Carolyn
Alvarez, Julia
Brashares, Ann
Delany, Delany, Hearth
Sonya, Unrein
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Bienvenido M. Noriega, Jr.
Hughes, John
Lynn White, Jr.
Larry Alboher, D.C.
Tom Blagden, JR.
Phaik Gan, Lim
Ho, John
Joan Plummer Russell, P.
Robert J. Schneller, Jr.
Patty Ptak Kogutek, Ed.D.
Reed,Todd
John Whiteclay Chambers, II
Sam Bass Warner, Jr.
Bienvenido M. Noriega, Jr.
Curriculum Planning & Development Division, Ministry of Education, Singapore
Christopher H. Foreman, Jr.
Hughes, John
Schmitz, Barbara
Robert J. Schneller, Jr.
Robert J. Schneller, Jr.
Daniel H. Usner, Jr.
Gorga, Gemma
Garcia i Cornellà, Dolors
Diane P, Lando
Alape, Arturo
Domènech, Laia
Olivier Ameisen, M.D.
Loren A. Olson, M.D.
Mary Pipher, Ph.D.
Marvin Thomas, MSW
Jerome Groopman, M.D.
Suzanne Zoglio, Ph.D.
Lindsay G. Arthur, Jr.
Abigail Brenner, MD
Christiane Northrup, M.D.
Marguerite, Porete
Martin Luther King, Jr.
Martin Luther King, Jr.
Upstairs, Downstairs
Dr. Ross Donaldson MD, MPH
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Wann, de Graaf, Naylor
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Andrew Weil, M.D.
His Holiness The Dalai Lama, Howard C. Cutler, M.D.
Thomas J. Stanley, William D. Danko, Ph.D.
By searching only for PhDs, MDs, and Jrs, we missed a lot of other, less common suffixes like “MSW”, “Ed.D.”, “D.C.”, and “père”.
A more generalized, abstract pattern will pick them up; the trade-off is that it picks up inversions too!
suffixes = []
inversions = []
remainder = []

for general in generalized_suffix:
    general = general.replace(' & ', '; ').replace(' and ', '; ').split(';')
    if len(general) > 1:
        remainder.append(general)
    else:
        pattern = r"""
            ^([\w]+)      # starts with a group of one or more word characters
            ,\s*          # comma with zero or more whitespaces
            ([.\w\s]+)$   # ends with a group of whitespaces, word characters and periods repeated one or more times
            """
        regex = re.compile(pattern, re.VERBOSE | re.IGNORECASE)
        inversion = regex.findall(general[0])
        if inversion != []:
            inversions.append(inversion)
        else:
            suffixes.append(general)

print('Inversions: ', '\n')
for inversion in inversions:
    print(list(inversion[0]))

print('\n', 'Suffixes :', '\n')
for suffix in suffixes:
    print(suffix)

print('\n', 'Left Over:', '\n')
for remaining in remainder:
    print(remaining)
Inversions:

['Chaucer', 'Geoffrey']
['Chaucer', 'Geoffrey']
['Alvarez', 'Julia']
['Brashares', 'Ann']
['Sonya', 'Unrein']
['Hughes', 'John']
['Ho', 'John']
['Reed', 'Todd']
['Hughes', 'John']
['Schmitz', 'Barbara']
['Gorga', 'Gemma']
['Alape', 'Arturo']
['Domènech', 'Laia']
['Marguerite', 'Porete']
['Upstairs', 'Downstairs']

Suffixes :

['A. B. Guthrie, Jr.']
['Walter M. Miller, Jr.']
['Alexandre Dumas, père']
['Martin Luther King, Jr.']
['Martin Luther King, Jr.']
['Horatio Alger, Jr.']
['Martin Luther King, Jr.']
['Richard Henry Dana, Jr.']
['Walter M. Miller, Jr.']
['Delany, Delany, Hearth']
['Bienvenido M. Noriega, Jr.']
['Bienvenido M. Noriega, Jr.']
['Bienvenido M. Noriega, Jr.']
['Bienvenido M. Noriega, Jr.']
['Lynn White, Jr.']
['Larry Alboher, D.C.']
['Tom Blagden, JR.']
['Phaik Gan, Lim']
['Joan Plummer Russell, P.']
['Robert J. Schneller, Jr.']
['Patty Ptak Kogutek, Ed.D.']
['John Whiteclay Chambers, II']
['Sam Bass Warner, Jr.']
['Bienvenido M. Noriega, Jr.']
['Christopher H. Foreman, Jr.']
['Robert J. Schneller, Jr.']
['Robert J. Schneller, Jr.']
['Daniel H. Usner, Jr.']
['Garcia i Cornellà, Dolors']
['Diane P, Lando']
['Olivier Ameisen, M.D.']
['Loren A. Olson, M.D.']
['Mary Pipher, Ph.D.']
['Marvin Thomas, MSW']
['Jerome Groopman, M.D.']
['Suzanne Zoglio, Ph.D.']
['Lindsay G. Arthur, Jr.']
['Abigail Brenner, MD']
['Christiane Northrup, M.D.']
['Martin Luther King, Jr.']
['Martin Luther King, Jr.']
['Dr. Ross Donaldson MD, MPH']
['His Holiness The Dalai Lama, Howard C. Cutler, M.D.']
['Wann, de Graaf, Naylor']
['His Holiness The Dalai Lama, Howard C. Cutler, M.D.']
['Andrew Weil, M.D.']
['His Holiness The Dalai Lama, Howard C. Cutler, M.D.']
['Thomas J. Stanley, William D. Danko, Ph.D.']

Left Over:

['Asher, Jay', ' Mackler, Carolyn']
['Curriculum Planning', ' Development Division, Ministry of Education, Singapore']
The results are not perfect. The suffix pattern captures some inverted names like “Phaik Gan, Lim” and “Garcia i Cornellà, Dolors”, since the last name contains whitespaces. These names are basically impossible to distinguish from a name with a suffix using our generalized pattern. The pattern for both is the same.
To summarize what we’ve done: we’ve identified categories of name patterns, and experimented with handling suffixes and prefixes, inverted names, and appositions including non-ASCII characters. We’ve also seen the limitations of regex patterns: they can be perfectly correct but still return unwanted results that follow the same pattern. Now for the bigger question, is regex right for the job?
I would say “yes” only under the following conditions:
(1) The dataset is large enough to make manual cleanup tedious, but small enough for manual checks.
(2) The data can be divided into non-overlapping “buckets” or categories. The buckets should not be too numerous.
(3) The job will be repeated on a regular basis.
If these conditions apply, having a few regex patterns ready to chisel away and organize incoming data into buckets will probably help speed up the data cleaning process. While the regex patterns above could be argued to be too specific to the given data, a few more iterations would yield insight into which patterns are reusable, scalable, and worth keeping, and which ones should be thrown out or revised to be more generalized.
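As a rough sketch of what that bucketing might look like with the patterns developed above (the bucket names, the ordering, and the fallback category are assumptions for illustration, not part of the original pipeline; the sketch reuses the authors list and the re module from earlier):

# Reuse the patterns from above to sort raw author strings into coarse buckets
# for later, bucket-specific cleanup. Order matters: the generalized suffix
# pattern overlaps with inversions, so inversions are checked first.
special_re   = re.compile(r"[^()a-z\s,.\-\d&;:'][\w]+[^()a-z\s,.\-\d&;:']", re.IGNORECASE)
inversion_re = re.compile(r"^(\w+),\s*([.\w\s]+)$", re.IGNORECASE)
suffix_re    = re.compile(r",\s*([\w\.]+)$", re.IGNORECASE)

def bucket(author):
    """Assign a raw author string to a coarse category for later cleanup."""
    if special_re.search(author):
        return 'non_latin'
    if inversion_re.search(author):
        return 'inverted'
    if suffix_re.search(author):
        return 'suffixed'
    return 'other'

buckets = {}
for author in authors:
    buckets.setdefault(bucket(author), []).append(author)

for name, group in buckets.items():
    print(name, len(group))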
In our case with the author names, regex could be part of the solution, but definitely not THE solution by any means. The dataset is small enough (only about 300 names), and there are discernible categories. Nevertheless, these categories contained a lot of variations and these variations could often overlap. It would not be practical or advisable to build a process relying on regular expressions alone.