3.12 Lesson 3 Practice Exercises


Again, we are going to close out the lesson with a few practice exercises. They focus on the new Python concepts introduced in this lesson (regular expressions and higher-order functions) and on working with tabular data in pandas, as preparation for this lesson's homework assignment. In the homework assignment, you will also use geopandas, the Esri ArcGIS API for Python, and GDAL/OGR again to get more practice with these libraries. What was said in the introduction to the practice exercises of Lesson 2 holds here as well: don't worry if you have trouble finding the perfect solution on your own. Studying the solutions carefully is another way of learning and improving your skills. The solutions to the three practice exercises can again be found in the following subsections.

Practice Exercise 1: Regular Expressions (see Section 3.3)

Write a function that tests whether an entered string is a valid date using the format "YYYY-MM-DD". The function takes the string to test as a parameter and then returns True or False. The YYYY can be any 4-digit number, but the MM needs to be a valid 2-digit number for a month (with a leading 0 for January to September). The DD needs to be a number between 01 and 31 but you don’t have to check whether this is a valid number for the given month. Your function should use a single regular expression to solve this task.

Here are a few examples you can test your implementation with:

"1977-01-01"  -> True 

"1977-00-01"  -> False (00 not a valid month) 

"1977-23-01"  -> False (23 not a valid month) 

"1977-12-31"  -> True 

"1977-11-01asdf"  -> False (you need to make sure there are no additional characters after the date) 

"asdf1977-11-01"  -> False (you need to make sure there are no additional characters before the date) 

"9872-12-31"  -> True 

"0000-12-33"  -> False (33 is not a valid day) 

"0000-12-00"  -> False (00 not a valid day) 

"9872-15-31"  -> False (15 is not a valid month)

Practice Exercise 2: Higher Order Functions (see Section 3.4)

We mentioned that the higher-order function reduce(...) can be used to do things like testing whether all elements in a list of Booleans are True. This exercise has three parts:

  1. Given a list l containing only Boolean values as elements (e.g. l = [True, False, True]), use reduce(…) to test whether all elements in l are True. What would you need to change to test whether at least one element is True? (Hint: you will have to figure out which logical operator is the right one to use and then look up what it is called in the Python module operator; then figure out the right initial value for the third parameter of reduce(…).)
  2. Now instead of a list of Booleans, you have a list of integer numbers (e.g. l = [-4, 2, 1, -6]). Use a combination of map(…) and reduce(…) to check whether or not all numbers in the list are positive (> 0).
  3. Implement reduce(…) yourself and test it with the example from part 1. Your function myReduce(…) should have the three parameters f (function), l (list), and i (initial value). It should consist of a for-loop that goes through the elements of the list, and it is not allowed to use any other higher-order function (in particular not the actual reduce(…) function).

Practice Exercise 3: Pandas (see Section 3.8)

Below is an imaginary list of students and scores for three different assignments.

Students' Scores for Assignments 1, 2, and 3

     Name     Assignment 1   Assignment 2   Assignment 3
1    Mike     7              10             5.5
2    Lisa     6.5            9              8
3    George   4              3              7
4    Maria    7              9.5            4
5    Frank    5              5              5

Create a pandas data frame for this data (e.g. in a fresh Jupyter notebook). The column and row labels should be as in the table above.

Now, use pandas operations to add a new column to that data frame and assign it the average score over all assignments for each row.

Next, perform the following subsetting operations using pandas filtering with Boolean indexing:

  1. Get all students with an Assignment 1 score < 7 (show all columns)
  2. Get all students with Assignment 1 and Assignment 2 scores both > 6 (show all columns)
  3. Get all students with at least one score < 5 over all assignments (show all columns) 

    (Hint: an alternative to using the logical or (|) over all columns with scores is to call the .min(…) method of a data frame with the parameter "axis = 1" to get the minimum value over all columns for each row. This can be used here to first create a vector with the minimum score over all three assignments and then create a Boolean vector from it based on whether or not the value is <5. You can then use this vector for the Boolean indexing operation.)
     
  4. Get all students whose names start with 'M' and only show the name and average score columns

    (Hint: there is also a method called .map(…) that you can use to apply a function or lambda expression to each element of an individual column. Data frames offer the same operation for every cell, as .map(…) in pandas 2.1 and later and as .applymap(…) in earlier versions. The result is a new column or data frame containing the results of applying the function/expression to each element. This can be used here to create a Boolean vector based on whether or not the name starts with ‘M’ (string method startswith(…)). This vector can then be used for the Boolean indexing operation. Then you just have to select the columns of interest with the last part of the statement.)
     
  5. Finally, sort the table by name.

Lesson 3 Exercise 1 Solution

import re 

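# raw string (r'...') so that the backslashes in \d are passed to re unchanged
datePattern = re.compile(r'\d\d\d\d-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$') 

def isValidDate(s): 
    # match(...) returns a match object on success and None otherwise
    return datePattern.match(s) is not None 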

Explanation: Since we are using match(…) to compare the compiled pattern in variable datePattern to the string in parameter s given to our function isValidDate(…), we don’t have to worry about additional characters before the start of the date because match(…) will always try to match the pattern to the start of the string. However, we use $ as the last character in our pattern to make sure there are no additional characters following the date. That means the pattern has the form

“…-…-…$”

where the dots have to be replaced with some regular expression notation for the year, month, and day parts. The year part is easy, since we allow for any 4-digit number here. So we can use \d\d\d\d here, or alternatively the shorter \d{4} (remember that \d stands for the predefined class of all digits).

For the month, we need to distinguish two cases: either it is a 0 followed by one of the digits 1-9 (but not another 0) or a 1 followed by one of the digits 0-2. We therefore write this part as a case distinction (…|…) with the left part 0[1-9] representing the first option and the right part 1[0-2] representing the second option.

For the day, we need to distinguish three cases: (1) a 0 followed by one of the digits 1-9, (2) a 1 or 2 followed by any digit, or (3) a 3 followed by a 0 or a 1. Therefore we use a case-distinction with three options (…|…|…) for this part. The first part 0[1-9] is for option (1), the second part [12]\d for option (2), and the third part 3[01] for the third option.
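To check the pattern against the examples listed in the exercise, a small test loop like the one below can be used. This is only a sketch: the dictionary name examples and the function isValidDateStrict(…) are introduced here for illustration and are not part of the original solution. The pattern and isValidDate(…) are repeated so that the block runs on its own. The last lines illustrate a subtlety: $ also matches just before a newline at the very end of the string, so a date followed by a single newline character is still accepted, whereas re.fullmatch(…) requires the entire string to match.

```python
import re

# Pattern and function repeated from the solution so this block runs on its own.
datePattern = re.compile(r'\d\d\d\d-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$')

def isValidDate(s):
    return datePattern.match(s) is not None

# Hypothetical stricter variant: fullmatch(...) requires the entire string
# to match, so no trailing $ is needed in the pattern.
def isValidDateStrict(s):
    return re.fullmatch(r'\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])', s) is not None

# Expected results taken from the exercise description.
examples = {
    "1977-01-01": True,
    "1977-00-01": False,
    "1977-23-01": False,
    "1977-12-31": True,
    "1977-11-01asdf": False,
    "asdf1977-11-01": False,
    "9872-12-31": True,
    "0000-12-33": False,
    "0000-12-00": False,
    "9872-15-31": False,
}

for s, expected in examples.items():
    result = isValidDate(s)
    print(s, '->', result, '(ok)' if result == expected else '(unexpected)')

# A trailing newline is accepted by match(...) with $, but not by fullmatch(...).
print(isValidDate("1977-01-01\n"))        # True
print(isValidDateStrict("1977-01-01\n"))  # False
```

Whether the trailing-newline case matters depends on where the input comes from; for strings typed in by a user and stripped of whitespace, both versions behave the same.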

Lesson 3 Exercise 2 Solution


Part 1:

import operator 

from functools import reduce 

l = [True, False, True] 

r = reduce(operator.and_, l, True) 

print(r)  #  output will be False in this case 

To check whether or not at least one element is True, the call has to be changed to:

r = reduce(operator.or_, l, False) 
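As a sanity check, the two reduce(…) calls can be compared with Python's built-in functions all(…) and any(…), which give the same results for lists of Booleans. The following sketch is an addition for illustration (the variable names allTrue and anyTrue are made up here). It also shows why True and False are the right initial values: they are the results you get for an empty list.

```python
import operator
from functools import reduce

l = [True, False, True]

allTrue = reduce(operator.and_, l, True)   # logical "and" over all elements
anyTrue = reduce(operator.or_, l, False)   # logical "or" over all elements

print(allTrue, all(l))   # False False
print(anyTrue, any(l))   # True True

# For an empty list, reduce(...) simply returns the initial value.
print(reduce(operator.and_, [], True))    # True, same as all([])
print(reduce(operator.or_, [], False))    # False, same as any([])
```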

Part 2:

import operator 

from functools import reduce 

l = [-4, 2, 1, -6 ] 

r = reduce(operator.and_, map(lambda n: n > 0, l), True) 

print(r)   # will print False in this case 

We use map(…) with a lambda expression to check whether each individual element of the list is > 0. Then we apply the reduce(…) call from part 1 to the resulting Boolean values to check whether or not all of them are True. Note that in Python 3, map(…) returns an iterator rather than a list; reduce(…) accepts either.
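To make the two steps visible, the intermediate result of map(…) can be turned into a list and printed before it is passed to reduce(…). This is a small sketch for illustration; the variable names isPositive and allPositive are not part of the original solution.

```python
import operator
from functools import reduce

l = [-4, 2, 1, -6]

# Step 1: map each number to a Boolean; list(...) makes the values visible.
isPositive = list(map(lambda n: n > 0, l))
print(isPositive)   # [False, True, True, False]

# Step 2: combine the Booleans with logical "and".
print(reduce(operator.and_, isPositive, True))   # False

# A list with only positive numbers gives True.
allPositive = [3, 2, 1, 6]
print(reduce(operator.and_, map(lambda n: n > 0, allPositive), True))   # True
```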

Part 3:

import operator

l = [True, False, True] 

def myReduce(f, l, i): 
    intermediateResult = i 
    for element in l: 
        intermediateResult = f(intermediateResult, element) 
    return intermediateResult 

r = myReduce(operator.and_, l, True) 
print(r)  #  output will be False in this case

Maybe you were expecting that an implementation of reduce would be more complicated, but it’s actually quite simple. We set up a variable to always contain the intermediate result while working through the elements in the list and initialize it with the initial value provided in the third parameter i. When looping through the elements, we always apply the function given in parameter f to the intermediate result and the element itself and update the intermediate result variable with the result of this operation. At the end, we return the value of this variable as the result of the entire reduce operation.
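The following sketch traces how the intermediate result changes from one element to the next. It uses the same myReduce(…) with an added print statement and, as an assumption for illustration, sums a list of numbers with operator.add instead of combining Booleans. The result is then compared with the one from functools.reduce(…).

```python
import operator
from functools import reduce

def myReduce(f, l, i):
    intermediateResult = i
    for element in l:
        intermediateResult = f(intermediateResult, element)
        # extra print statement to show each update of the intermediate result
        print('after', element, '->', intermediateResult)
    return intermediateResult

numbers = [1, 2, 3, 4]
total = myReduce(operator.add, numbers, 0)   # intermediate results: 1, 3, 6, 10

print(total)                                       # 10
print(total == reduce(operator.add, numbers, 0))   # True
```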

Lesson 3 Exercise 3 Solution

import pandas as pd 

# create the data frame from a list of tuples 
data = pd.DataFrame( [('Mike',7,10,5.5),
     ('Lisa', 6.5, 9, 8),
     ('George', 4, 3, 7),
     ('Maria', 7, 9.5, 4),
     ('Frank', 5, 5, 5) ] )
     
# set column names
data.columns = ['Name', 'Assignment 1', 'Assignment 2', 'Assignment 3']

# set row names
data.index = range(1,len(data)+1)

# show table 
print(data)
 
# add column with averages
data['Average'] = (data['Assignment 1'] + data['Assignment 2'] + data['Assignment 3']) / 3
 
# part 1 (all students with Assignment 1 score < 7)
print(data[data['Assignment 1'] < 7])
 
# part 2 (all students with Assignment 1 and Assignment 2 scores both > 6)
print(data[(data['Assignment 1'] > 6) & (data['Assignment 2'] > 6)])

# part 3 (at least one assignment score < 5) 
print(data[data[['Assignment 1', 'Assignment 2', 'Assignment 3']].min(axis = 1) < 5])
 
# part 4 (name starts with M, only Name and Average columns)
print(data[data['Name'].map(lambda x: x.startswith('M'))][['Name', 'Average']])

# part 5 (sort the table by name)
print(data.sort_values(by = ['Name']))
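The statements for parts 3 and 4 pack several steps into one line. The sketch below splits them up so that the intermediate Boolean vectors described in the hints can be inspected. The variable names scoreColumns, minScores, belowFive, and startsWithM are introduced here only for illustration. Computing the average with .mean(axis = 1) is an alternative to adding up the three columns by hand, not the approach used in the solution above.

```python
import pandas as pd

# Same data as in the solution, with column and row labels set on creation.
data = pd.DataFrame(
    [('Mike', 7, 10, 5.5), ('Lisa', 6.5, 9, 8), ('George', 4, 3, 7),
     ('Maria', 7, 9.5, 4), ('Frank', 5, 5, 5)],
    columns=['Name', 'Assignment 1', 'Assignment 2', 'Assignment 3'],
    index=range(1, 6))

scoreColumns = ['Assignment 1', 'Assignment 2', 'Assignment 3']

# Alternative way to compute the row-wise average over the score columns.
data['Average'] = data[scoreColumns].mean(axis=1)

# Part 3 in steps: minimum score per row, then a Boolean vector, then indexing.
minScores = data[scoreColumns].min(axis=1)
belowFive = minScores < 5
print(belowFive)          # True only for rows 3 (George) and 4 (Maria)
print(data[belowFive])

# Part 4 in steps: Boolean vector from the Name column, then column selection.
startsWithM = data['Name'].map(lambda x: x.startswith('M'))
print(data[startsWithM][['Name', 'Average']])   # rows for Mike and Maria
```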

If any of these steps is unclear to you, please ask for further explanation on the forums.