Pattern Matching - Extraction

  • pulling substrings that match a pattern out of a string
  • strings can contain groups of patterns

Extraction Examples

  • extract an email address from a customer message
  • extract links or content from an HTML tag
  • look for valid textual input from a survey
  • extract brand name information from a product description
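
As a rough sketch of the first use case, the snippet below pulls an email address out of a message. The pattern and the sample message are assumptions made up for illustration; the pattern is deliberately simplified and will not cover every valid address.

```python
import re

# Simplified, illustrative email pattern (an assumption, not a full validator)
email_pattern = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'
message = 'Hi, please reply to jane.doe@example.com about my order.'

# re.search scans the message for the first place the pattern matches
match = re.search(email_pattern, message)
if match:
    print(match.group())  # jane.doe@example.com
```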

Extraction Groups

  • a group is defined by enclosing a pattern in parentheses
  • a pattern can contain several groups
  • groups are referred to by the order in which they were defined (starting at 1)
  • group zero refers to the entire match
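
A minimal sketch of the numbering rules above, using a made-up pattern and input string. Note that group zero is the text the whole pattern matched, which can be shorter than the input string.

```python
import re

# Two groups: letters, then digits (illustrative pattern and input)
match = re.search(r'([a-z]+)-(\d+)', 'item: abc-123!')

print(match.group(0))  # abc-123 (everything the pattern matched)
print(match.group(1))  # abc     (first group defined)
print(match.group(2))  # 123     (second group defined)
```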

Grouping Characters

( )                          # starts and ends a group of regex patterns
(?P<name> )                  # starts and ends a named group of regex patterns

Match Object Methods

group(group1, group2, ...)   # returns the string of the match specified by group
                             # group can be either an integer index or a string name
                             # if no arguments are passed, defaults to group zero
groups()                     # returns a tuple of all the subgroups of the match
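
The difference between group() and groups() can be sketched as follows. The date pattern and input string are assumptions chosen only for illustration.

```python
import re

# Three numbered groups: year, month, day (illustrative pattern)
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', 'released on 2024-03-15')

print(match.group())      # 2024-03-15 (no argument: group zero)
print(match.group(1))     # 2024
print(match.group(1, 3))  # ('2024', '15') (several arguments return a tuple)
print(match.groups())     # ('2024', '03', '15')
```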

Match Groups Example 1

import re
pattern = r'(\w*)ware'
        # group 1 captures the prefix in front of "ware"
for word in ['software','hardware','dinnerware']:
    # 3 words we want to extract from
    match = re.search(pattern, word)
    if match:
        print(match.group(1), match.group(0))
        # print the first group [match.group(1)] and the entire match [match.group(0)]
# soft software
# hard hardware
# dinner dinnerware

Match Groups Example 2

import re
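pattern = r'([A-Za-z ]+)\s?(\d+)'
        # first group: letters and spaces, second group: digits, optional whitespace character in between
        # [A-Za-z] is used because [A-z] would also match the punctuation between "Z" and "a"
for word in ['S100','Porsche911','dinnerware', 'Cyperdyne 101']:
    # products we want to extract from
    match = re.search(pattern, word)
    if match:
        print(f'Brand: {match.group(1)}\tModel: {match.group(2)}')
        # group(1) refers to the part with the letters, group(2) to the digits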
# Brand: S Model: 100
# Brand: Porsche Model: 911
# Brand: Cyperdyne Model: 101

Named Groups

  • instead of numbers, we can also refer to groups by name
  • group() accepts the string name as an argument
  • defining a group name consists of adding the following string after the opening parenthesis: ?P<name>

Named Match Groups Example

import re
pattern = r'(?P<brand>[A-Za-z ]+)\s?(?P<model>\d+)'
for word in ['S100','Porsche911','dinnerware', 'Cyperdyne 101']:
    match = re.search(pattern, word)
    if match:
        print(f'Brand: {match.group("brand")}\tModel: {match.group("model")}')
# Brand: S Model: 100
# Brand: Porsche Model: 911
# Brand: Cyperdyne Model: 101
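
As a small follow-up sketch, named groups are still numbered, so they can be read by name or by index. The groupdict() method of the match object, which is not covered above, returns all named groups as a dictionary. The pattern here is a simplified variant used only for illustration.

```python
import re

# Simplified variant of the brand/model pattern (illustrative)
pattern = r'(?P<brand>[A-Za-z]+)\s?(?P<model>\d+)'
match = re.search(pattern, 'Porsche911')

print(match.group('brand'))  # Porsche
print(match.group(1))        # Porsche (same group, accessed by index)
print(match.groupdict())     # {'brand': 'Porsche', 'model': '911'}
```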