Python Text Processing

  • a string is like a list, each character is indexed
  • text_variable[index] to access a character
  • strings know their length, use len(string_name)
  • can iterate through a string by index
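  • a minimal sketch of the indexing points above; the sample value 'hello' and the variable names are only illustrative:

```python
text = 'hello'

first = text[0]                 # 'h' (indexes start at 0)
last = text[len(text) - 1]      # 'o' (the last valid index is len(text) - 1)

# Walk the string by index, collecting (index, character) pairs.
pairs = []
for i in range(len(text)):
    pairs.append((i, text[i]))

print(len(text))   # 5
print(pairs)
```

    Asking for text[len(text)] would raise an IndexError, because the valid indexes run from 0 to len(text) - 1.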
  • Examples of functions you can call on strings using x = 'this is a Test '
    • 'must' know:
      • split--> x.split(' ') ['this', 'is', 'a', 'Test', ''] (x.split() with no argument gives ['this', 'is', 'a', 'Test'])
      • upper--> x.upper() 'THIS IS A TEST '
      • lower--> x.lower() 'this is a test '
      • replace--> x.replace('is', 'lol') 'thlol lol a Test '
      • find--> x.find('is') 2
      • strip--> x.strip() 'this is a Test'
    • 'good' to know:
      • startswith--> x.startswith('th') True
      • endswith--> x.endswith('end') False
      • title--> x.title() 'This Is A Test '
      • isalpha--> x.isalpha() False
      • isdigit--> '521'.isdigit() True
      • isspace--> ' '.isspace() True (works with spaces, tabs, or newlines)
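  • a short sketch for checking a few of the methods above in one place, using the same x; the variable names are only illustrative. It also shows two details that are easy to miss: splitting on ' ' keeps an empty piece after the trailing space, and find returns -1 when nothing matches:

```python
x = 'this is a Test '

# Splitting on a single space keeps the empty piece after the trailing space.
parts_on_space = x.split(' ')     # ['this', 'is', 'a', 'Test', '']
# Splitting with no argument splits on runs of whitespace and drops empty pieces.
parts_default = x.split()         # ['this', 'is', 'a', 'Test']

stripped = x.strip()              # 'this is a Test'
position = x.find('is')           # 2 (the 'is' inside 'this')
missing = x.find('zzz')           # -1 when the substring is absent

# None of these calls changed x itself; each one returned a new value.
print(repr(x))
```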
  • strings cannot be modified in place, so you usually build a new string instead
    • example:
      • raw_string = 'My phone number is 6508675309. Please call!'
        def just_number(text):      # named 'text' because 'str' would shadow the built-in type
            only_number = ''        # build a new string rather than deleting from the existing one
            for ch in text:
                if ch.isdigit():
                    only_number = only_number + ch
            return only_number
        print(just_number(raw_string))
        prints 6508675309
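  • the same filtering can also be written with str.join, which builds the new string in one step; a sketch, where digits_only is a hypothetical name and not part of the notes above:

```python
def digits_only(text):
    """Return a new string containing only the digit characters of text."""
    # join() glues together every character that passes the isdigit() test.
    return ''.join(ch for ch in text if ch.isdigit())


raw_string = 'My phone number is 6508675309. Please call!'
print(digits_only(raw_string))   # 6508675309
```

    Either way the original raw_string is left untouched; only a new string is produced.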
  • characters are just a giant enumeration (An enumeration is a complete, ordered listing of all the items in a collection. The term is commonly used in mathematics and computer science to refer to a listing of all of the elements of a set.)
    • big look up table
      • ASCII
      • ASCII2
      • Unicode (a much larger table whose first 128 entries match ASCII)
    • 'A' -> 'Z' are sequential
    • 'a' -> 'z' are sequential
    • '0' -> '9' are sequential
    • ord(ch) gives us the number associated with the character
    • functions which take strings (same example x as above):
      • len--> len(x) 15
      • ord--> ord('A') 65
      • hash--> hash(x) an integer such as 2466759895439727657 (the value for a string differs from one Python run to the next)
      • < --> 'abc' < 'zabc' True
      • == --> x == 'this is a Test ' True
      • in--> 'his' in x True
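  • because letters and digits are sequential in the table, simple arithmetic on ord values works; a sketch, where letter_position is a hypothetical helper and chr (the built-in inverse of ord) is an addition to the notes above:

```python
def letter_position(ch):
    """Return 0 for 'a', 1 for 'b', ... 25 for 'z' (upper or lower case)."""
    return ord(ch.lower()) - ord('a')


print(ord('A'))               # 65
print(ord('a'))               # 97
print(chr(ord('A') + 1))      # B  (chr turns a number back into a character)
print(letter_position('c'))   # 2
print(ord('7') - ord('0'))    # 7  (digit characters are sequential too)

# Comparisons with < go character by character using these numbers,
# so every uppercase letter sorts before every lowercase letter.
print('abc' < 'zabc')         # True
print('Zebra' < 'apple')      # True
```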
  • Python strings are immutable
    • once a string has been created you cannot set characters
    • to change a string:
      • create a new string holding the new value you want to give it via concatenation
        • see earlier example with the function that only returned the numbers
      • reassigning the string variable (that's allowed)
        • example:
          x = 'abc'
          x[1] = 'z'        # TypeError: 'str' object does not support item assignment
          x = 'azc'         # can reassign the string
      • often build up new string through concatenation
        • example:
          def main():
              s1 = 'CS106'
              s2 = 'A'
              s3 = 'I got an ' + s2 + ' in ' + s1 + s2
              print(s3)
          prints I got an A in CS106A
    • important consequence: if you pass a string to a function, you are guaranteed your string won't be changed
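  • a small sketch of what immutability means in practice; shout is a hypothetical function used only for illustration:

```python
def shout(text):
    # This rebinds the local name 'text' to a brand new string.
    # The caller's string is not affected.
    text = text.upper() + '!'
    return text


x = 'abc'
try:
    x[1] = 'z'                      # not allowed: strings are immutable
except TypeError as err:
    error_message = str(err)
    print(error_message)

x_changed = x[:1] + 'z' + x[2:]     # build a new string instead: 'azc'

original = 'hello'
louder = shout(original)
print(original)                     # hello  (unchanged by the call)
print(louder)                       # HELLO!
```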
  • many string algorithms use the 'loop and construct' method
    • 3 examples that give the same result:
      • def reverse_string(text):
            result = ''
            for i in range(len(text)):
                result = text[i] + result
            return result
      • def reverse_string_v2(text):
            result = ''
            for ch in text:
                result = ch + result
            return result
      • def reverse_string_v3(text):
            """
            This uses the slice operator in a special way. With no start, no end,
            and a step of -1, the slice reverses the string.
            """
            return text[::-1]
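  • one possible use of the 'loop and construct' reversal is a palindrome check; a sketch, where reverse_text and is_palindrome are hypothetical names chosen for this example:

```python
def reverse_text(text):
    """Build the reversed string one character at a time."""
    result = ''
    for ch in text:
        result = ch + result    # put each new character at the front
    return result


def is_palindrome(text):
    """Return True if text reads the same forwards and backwards."""
    return text == reverse_text(text)


print(reverse_text('stressed'))   # desserts
print(is_palindrome('racecar'))   # True
print(is_palindrome('python'))    # False
```

    The loop version and the slice text[::-1] give the same result; the loop makes the construction explicit, while the slice is shorter.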

Copyright © 2022