Lesson 1: Python, ArcGIS Pro, and Multiprocessing
1.1 Overview and Checklist

Lesson 1 is two weeks in length. The goal is to get back into Python programming with Esri's arcpy package and ArcGIS Pro. After some refresher topics that include import, loops, and debugging, we will cover the concepts of parallel programming and multiprocessing and how they can be used to speed up time-consuming computations. For the assignment, you will apply multiprocessing to clip several feature classes in parallel and convert the script into a script tool that you can execute within Pro's geoprocessing environment.
As optional materials, we include some discussions on integrated development environments (IDEs) available for Python, Profiling, and Version Control management. The lessons in this course contain quite a lot of content, so feel absolutely free to skip these optional sections; you can always come back to check them out later.
Please refer to the Calendar for specific time frames and due dates. To finish this lesson, you must complete the activities listed below.
| Step | Activity | Access/Directions |
|---|---|---|
| 1 | Engage with Lesson 1 Content | Begin with 1.3 import, loops revisited, and some syntactic sugar |
| 2 | Programming Assignment and Reflection | Submit your modified code versions, the ArcGIS Pro toolbox (aprx), along with your short assignment reflection on any issues encountered or found challenging, and your line-by-line code explanation. |
| 3 | Quiz 1 | Complete the Lesson 1 Quiz. |
| 4 | Questions/Comments | Remember to visit the Lesson 1 Discussion Forum to post/answer any questions or comments pertaining to Lesson 1 |
List of Lesson 1 Downloads
All downloads and full instructions are available and included in the respective lesson sections and listed below.
Data
- USA.gdb.zip
- For section 1.6: DEM raster data (PASDA) and script
1.2 Optional - The Integrated Development Environment

An Integrated Development Environment (IDE) is essential to the developer experience and can often influence the developer's perspective of the programming language. A developer trying to learn a programming language using Microsoft Word will have a drastically different experience from one using a purpose-built IDE like PyCharm. A good IDE is one that lets you focus on your project, requires minimal management or setup to use, provides excellent debugging support, and offers code hints, IntelliSense, and documentation lookup at a minimum.
Since a prerequisite for this class is previous programming knowledge and experience, we assume that you have found and settled into a favorite or familiar IDE, and you can continue to use it for this class. For those who are still searching for a favorite, have limited Python development experience, or are coming from other languages with purpose-built IDEs, we provide a list of several of the more popular IDEs used for Python that you may explore and consider. Each has its pros and cons, and if you want to add to the list, please do so in the IDE discussion board in Canvas. We will frequently update this page from the discussion to help those who are just starting pick an IDE.
IDLE (Integrated Development and Learning Environment)
- Included with Python: Comes pre-installed with Python, so no additional installation is needed.
- Simple Interface: Easy to use, making it suitable for beginners.
- Basic Features: Provides essential features like syntax highlighting, basic debugging, and an interactive shell.
- Limited Functionality: Lacks advanced features found in other IDEs, making it less suitable for large projects.
PyCharm
- Comprehensive Features: Offers a wide range of features including code completion, debugging, and version control.
- Powerful debugging tools.
- Integrated versioning.
- Intelligent Code Assistance: Provides smart code navigation and refactoring.
- Integrated Tools: Includes tools for databases, web development, and more.
- Resource Intensive: Requires a powerful computer to run smoothly.
- Community and Professional Editions: The Community edition is free but lacks some advanced features available in the Professional edition.
- There is a free student version that includes 17 Professional edition IDEs and tools.
Visual Studio Code
- Lightweight and Fast: A code editor that is quick to install and run.
- Extensible: Supports a wide variety of extensions for different programming languages and tools.
- Requires third party extensions to provide/perform functionality.
- Requires manual setup of a debugging configuration for debugging.
- Extensions providing support are limited to supported Python versions.
- Integrated Terminal: Allows you to run commands directly from the editor.
- Built-in Git Support: Excellent integration with Git and other version control systems.
- Free: Completely free and open-source.
PyScripter
- Windows Only: Designed specifically for Windows.
- Lightweight: Not very resource-intensive.
- Comprehensive Python Support: Offers features like debugging, code completion, and syntax highlighting.
- Variable Explorer in the debugger truncates long values with "...", which can make troubleshooting difficult.
- Requires third party extensions to provide/perform functionality.
- Limited Cross-Platform Support: Not available on Mac or Linux.
- Python Environments: May have issues registering Python environments created in Pro.
- Python Version Dependent: 64-bit versions are limited to supported Python 3.x versions (will not work with Python 2.7).
Spyder
- Scientific Computing: Designed with data science and scientific computing in mind.
- Integrated with Anaconda: Comes as part of the Anaconda distribution, which is popular among data scientists.
- Interactive Console: Allows for interactive testing and debugging.
- Variable Explorer: Provides a graphical interface to inspect variables.
- Free and Open-Source: Available at no cost. There is a standalone version as well as a version installable through conda or pip.
- Can Be Unstable/Unreliable: Suffers from compatibility issues if the environment's package dependencies conflict with the versions needed for the project. Often buggy when used with environments created outside of conda, such as Pro's.
NotePad++
- Text Editor: Primarily a text editor, not a full-fledged IDE.
- Lightweight: Extremely fast and uses minimal system resources.
- Syntax Highlighting: Supports syntax highlighting for many programming languages.
- Limited Features: Lacks many advanced features like built-in debugging and project management.
- Requires third party extensions to provide/perform functionality.
- Free: Completely free and open-source.
1.3 import, loops revisited, and some syntactic sugar
To warm up a bit, let's briefly revisit a few Python features that you are already familiar with but for which there exist some forms or details that you may not yet know, starting with the Python "import" command. We are also going to expand on or introduce a few Python constructs that can be used to simplify code logic and clarify complex application process flow.
It is highly recommended that you try out these examples yourself and experiment with them to get a better understanding.
1.3.1 import
The form of the "import" command that you definitely should already know is
import <module name>

e.g.,
import arcpy

What happens here is that the module (either a module from the standard library, a module that is part of another package you installed, or simply another .py file in your project directory) is loaded, unless it has already been loaded before, and the name of the module becomes part of the namespace of the script that contains the import command. As a result, you can now access all variables, functions, or classes defined in the imported module by writing
<module name>.<variable or function name>

e.g.,
arcpy.Describe()

You can also use the import command like this instead:
import arcpy as ap

This form introduces a new alias for the module name, typically to save some typing when the module name is rather long, and instead of writing
arcpy.Describe()you would now use the ap to reference arcpy.
ap.Describe() Another approach of using “import” is to directly add content of a module (again either variables, functions, or classes) to the namespace of the importing Python script. This is done by using the form "from … import …" as in the following example:
from arcpy import Describe, Point, Polygon
...
Describe()

The difference is that now you can use the imported names directly in your code without having to use the module name (or an alias) as a prefix, as is done in the last line of the example above. However, be aware that if you are importing multiple modules, this can easily lead to name conflicts if, for instance, two modules contain functions with the same name. It can also make your code a little more difficult to read compared to:
arcpy.Describe(...)

This helps you (or another programmer) recognize that you’re using something defined in arcpy and not some other package that contains a Describe.
You can also use
from arcpy import *

to import all variables, functions, and class names from a module into your script if you don’t want to list all those you actually need. However, this can increase the likelihood of a name conflict and increase application size and overhead.
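To illustrate the kind of name conflict this can cause, here is a small sketch using two standard-library modules that both happen to define a function called open() (the modules are chosen purely for illustration):

from gzip import open   # 'open' now refers to gzip.open (and also shadows the built-in open)
from bz2 import open    # 'open' is silently rebound to bz2.open; gzip.open is no longer reachable by this name

# Any call to open(...) from here on uses bz2.open, which a reader of the code can easily miss.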
Lastly, you can import the script into itself so it essentially creates a preserved namespaced copy of the script. This is especially useful in multiprocessing contexts where function references need to be fully qualified and picklable. By self-importing, the script ensures that the multiprocessing subprocesses reference functions via the module namespace (myscript.function_name) rather than __main__.function_name, which can lead to issues within the ArcGIS Geoprocessing environment as a script tool, on Windows, or when using the multiprocessing setting 'spawn' start method.
if __name__ == '__main__':
    import myscript  # import the script into itself to preserve module.function namespacing for multiprocessing
    myscript.worker(...)
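To make this pattern more concrete, here is a minimal sketch (the module name myscript.py and the worker function are hypothetical) of how the self-import keeps worker references module-qualified when building processes:

# myscript.py (hypothetical file name; the script imports itself below)
import multiprocessing

def worker(value):
    print(f'processing {value}')

if __name__ == '__main__':
    import myscript  # self-import so workers are referenced as myscript.worker, not __main__.worker
    jobs = []
    for v in range(3):
        p = multiprocessing.Process(target=myscript.worker, args=(v,))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()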
1.3.2 loops and flow control statements

Let’s quickly revisit loops in Python. There are two kinds of loops in Python, the for loop and the while loop. You should know that the for loop is typically used when the goal is to go through a given set, or list of items, or do something a certain number of times. In the first case, the for loop typically looks like this
for item in list:
    # do something with item

while in the second case, the for loop is often used together with the range(...), len(...), or enumerate(...) functions to determine how many times the loop body should be executed:
for i in range(50):
    # do something 50 times

In contrast, the while loop has a condition that is checked before each iteration and if the condition becomes False, the loop is terminated and the code execution continues to the next line after the loop body.
import random
r = random.randrange(100) # produce random number between 0 and 99
attempt_count = 1
while r != 11:
    attempt_count += 1
    r = random.randrange(100)
print(f'This took {attempt_count} attempts')

Flow control statements
There are two flow control statements, break and continue, that can be used in combination with either a for loop or a while loop. The break command will automatically terminate the execution of the current loop and continue with the next line of code outside of the loop. If the loop is part of a nested loop, only the inner (nested) loop will be terminated and the outer loop will progress to its next execution. This means we can rewrite the program from above using a for loop rather than a while loop like this:
import random
attempt_count = 0
for i in range(1000):
    r = random.randrange(100)
    attempt_count += 1
    if r == 11:
        break  # terminate loop and continue after it
print(f'This took {attempt_count} attempts')

When the random number produced in the loop body is 11, the conditional if statement will evaluate to True. The break command will be executed and the program execution immediately leaves the loop and continues with the print statement after it.
When a continue command is encountered within the body of a loop, the current execution of the loop body is also immediately stopped. In contrast to the break command, the execution then continues with the next iteration of the loop body. Of course, the next iteration is only started if the while condition is still True in the case of a while loop. In the case of a for loop, it will continue if there are still remaining items in the iterable that we are looping through. To demonstrate this, the following code goes through a list of numbers and prints only those numbers that are divisible by 3 (without remainder).
l = [3, 7, 99, 54, 3, 11, 123, 444]
for n in l:
    if n % 3 != 0:  # test whether n is not divisible by 3 without remainder
        continue
    print(n)

This code uses the built-in modulo operator % to get the remainder of the division of n by 3. If this remainder is not 0, the continue command is executed and the next item in the list is tested. If the condition is False (meaning the number is divisible by 3 without a remainder), the execution continues as normal after the if-statement and prints the number.
If you have experience with programming languages other than Python, you may know that some languages have a "do … while" loop construct where the condition is only tested after each time the loop body has been executed so that the loop body is always executed at least once. Since we first need to create a random number before the condition can be tested, this example would actually be a little bit shorter and clearer using a do-while loop. Python does not have a built in do-while loop, but it can be simulated using a combination of while and break:
import random
attempt_count = 0
while True:
    r = random.randrange(100)
    attempt_count += 1
    if r == 11:
        break
print(f'This took {attempt_count} attempts')

A while loop with the condition True will in principle run forever. However, since we have the if-statement with the break, the execution will be terminated as soon as the random number generator rolls an 11.
As you saw in these examples, there are often multiple ways in which the loop constructs for, while, and the control commands break, continue, and if-else can be combined to achieve the same result. The one flow control statement we did not discuss in this section is pass, which does nothing and simply serves as a placeholder where a statement is syntactically required but no action is needed. pass is demonstrated in Lesson 4, so we do not discuss it here.
1.3.3 Expressions and the Ternary Operator
Expressions
Expressions should be familiar to us all; in Python, the mathematical symbols used in them are called operators, and most of the common ones are binary operators. A binary operator is an operator that works with two values, called operands. For example, in the expression 4 + 2, the + symbol is a binary operator that takes the two operands 4 and 2 and produces the result 6.
Binary operators can be used to form more complex expressions that evaluate to different kinds of results. For instance, arithmetic expressions evaluate to numbers, while boolean expressions evaluate to either True or False.
Here’s an example of an arithmetic expression that uses subtraction (-) and multiplication (*):
x = 25 - 2 * 3
All Python operators are organized into different precedence classes, determining in which order the operators are applied when the expression is evaluated, unless parentheses are used to explicitly change the order of evaluation. This operator precedence table shows the classes from lowest to highest precedence. The operator * for multiplication has a higher precedence than the - operator for subtraction, so the multiplication will be performed first and the result of the overall expression assigned to variable x is 19.
Here is an example for a boolean expression:
x = y > 12 and z == 3
The boolean expression (y > 12 and z == 3) on the right side of the assignment operator contains three binary operators. The two comparison operators, > and ==, each take two numbers and return a boolean value. The logical and operator takes two boolean values and returns a new boolean (True only if both input values are True, False otherwise). The precedence of and is lower than that of the two comparison operators, so the and will be evaluated last. If y has the value 6 and z the value 3, the value assigned to variable x by this expression will be False because the comparison on the left side of the and evaluates to False.
Another way of writing the above expression that highlights the operators can be done using ( ):
x = (y > 12) and (z == 3)
Ternary Operator
In addition to these binary operators, Python has a ternary operator. This is an operator that takes three operands as input in the format:
x if c else y
x, y, and c here are the three operands, and the if and else are the familiar decision operators demarcating the operands. While x and y can be values or expressions, the condition c needs to be a boolean value or expression. The operator (if and else) looks at the condition c and if c is True it evaluates to x, else it evaluates to y. So for example, in the following line of code:
p = 1 if x > 12 else 0
variable p will be assigned the value 1 if x is larger than 12, else p will be assigned the value 0. The ternary if-else operator is very similar to what we can do with an if or if-else statement, but with less code. For example, it could be written as:

p = 0
if x > 12:
    p = 1

The “x if c else y” operator is an example of a language construct that does not add anything principally new to the language, but enables writing things more compactly or more elegantly. That’s why such constructs are often called syntactic sugar. The nice thing about “x if c else y” is that, in contrast to the if-else statement, it is an operator that evaluates to a value and can be embedded directly within more complex expressions. For example, this one line uses the operator twice:
newValue = 25 + (10 if oldValue < 20 else 44) / 100 + (5 if useOffset else 0)

Using an if-else statement for this expression would have required at least five lines of code, which is perfectly okay! The ternary construct works well if the result can only be one of two values. If you have more than two possibilities, you will need a different decision structure such as an if-elif-else, an object literal, or a switch-case structure.
1.3.4 Optional - Match & Object Literal
Match

This section covers some more advanced constructs and is provided as additional information that may be useful as we get more familiar and comfortable with what we can do with Python. Other programming languages include a switch/case construct that executes code or assigns values based on a condition. Python introduced this as 'match' in version 3.10, but the same thing can also be done with a dictionary and the built-in dict.get() method. This construct replaces multiple elifs in the if/elif/else structure and provides an explicit means of setting values.
For example, what if we wanted to set a variable that could have 3 or more possible values? The long way would be to create an if, elif, else like so:
p = 0
for x in [1, 13, 12, 6]:
    if x == 1:
        p = 'One'
    elif x == 13:
        p = 'Two'
    elif x == 12:
        p = 'Three'
    print(p)

Output

One
Two
Three
Three

The elifs can get long depending on the number of possibilities, and it can become difficult to read or keep track of the conditionals. Using match, you can control the flow of the program by explicitly setting cases and the desired code that should be executed if that case matches the condition.
An example is provided below:
command = 'Hello, Geog 489!'
match command:
    case 'Hello, Geog 489!':
        print('Hello to you too!')
    case 'Goodbye, World!':
        print('See you later')
    case other:
        print('No match found')

Output

Hello to you too!

'Hello, Geog 489!' is a string assigned to the variable command. The interpreter will compare the incoming variable against the cases. When there is a True result, a 'match' between the incoming object and one of the cases, the code within that case's scope will execute. In the example, the first case equaled the command, resulting in 'Hello to you too!' being printed. Applied to the previous example:
for x in [1, 13, 12, 6]:
    match x:
        case 1:
            p = 'One'
        case 13:
            p = 'Two'
        case 12:
            p = 'Three'
        case other:
            p = 'No match found'
    print(p)
Output
One
Two
Three
No match found
A variation of the match construct can be created with a dictionary. With the dict.get(...) lookup method, you can also include a default value for when a value does not match any of the keys, in a much more concise way:
possible_values_dict = {1: 'One', 13: 'Two', 12: 'Three'}
for x in [1, 13, 12, 6]:
    print(possible_values_dict.get(x, 'No match found'))
Output
One
Two
Three
No match found

In the example above, 1, 13, and 12 are keys in the dictionary and their values were returned for the print statement. Since 6 is not present in the dictionary, the result is the default value of 'No match found'. This default value return is helpful when compared to the dict['key'] retrieval method since it does not raise a KeyError exception, stopping the script or requiring added code to be written to handle the KeyError, as shown below.
possible_values_dict = {1: 'One', 13: 'Two', 12: 'Three'}
for x in [1, 13, 12, 6]:
    print(possible_values_dict[x])
Output
One
Two
Three
Traceback (most recent call last):
File "C:\...\CourseCode.py", line 20, in <module>
print(possible_values_dict[x])
~~~~~~~~~~~~~~~~~~~~^^^
KeyError: 6

Dictionaries are a very powerful data structure in Python and can even be used to execute functions as values using the .get(...) construct above. For example, let's say we have different tasks that we want to run depending on a string value. This construct will look like the code below:
task = 'monthly'
getTask = {'daily': lambda: get_daily_tasks(),
'monthly': lambda: get_monthly_tasks(),
'taskSet': lambda: get_all_tasks()}
getTask.get(task)()
The .get() method will return the lambda for the matching key passed in. The empty () after the .get(task) then executes the function that was returned in the .get(task) call. .get() takes a second, default parameter that is returned if there is no key match. You can set the second parameter to be a function, or a value.
getTask.get(task, get_all_tasks)()

If the first parameter (the key) is not found, .get() will return the function set as the default parameter for execution. Be careful to keep the returned value the same type (a callable), or else you may get an error when the trailing () tries to execute it.
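As a small illustration of that warning (the task names here are hypothetical), compare a callable default with a plain value default:

getTask = {'daily': lambda: 'daily tasks'}

# Safe: the default is also callable, so the trailing () always has something to execute
print(getTask.get('weekly', lambda: 'no tasks')())   # prints: no tasks

# Not safe: a plain string default cannot be called
# getTask.get('weekly', 'no tasks')()                # TypeError: 'str' object is not callable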
1.4 Functions revisited
From GEOG 485 or similar previous experience, you should be familiar with defining simple functions that take a set of input parameters and potentially return some value. When calling such a function that requires parameters, you have to provide values (or expressions that evaluate to some value) for each of these parameters. These values are then accessible under the names of the respective parameters within the body of the function.
However, as you know from working with different tool functions provided by arcpy or other Python modules, there can also be optional parameters. You can use the names of such parameters to explicitly provide a value for them when calling the function (also known as named or keyword parameters), or you can provide values in the order the parameters are defined, including empty placeholder values for parameters you do not want to set (known as positional parameters).
In this section, we will show you how to write functions that take an arbitrary number of required and optional parameters, and we will discuss in some more detail how different kinds of values are passed as parameters to a function.
1.4.1 Functions with keyword arguments
Many examples in the arcpy documentation use the positional parameter method, which is simply providing the parameter values in the order the parameters appear in the function declaration; each value is assigned to the parameter name at that position. The first positional parameter is assigned the first value given within the parentheses (...) when the function is called, and so on. To skip an optional parameter, you often pass an empty string ("") or None in its position so that subsequent values still line up with the parameters you do want to set. Below is a simple function with two required parameters to demonstrate: the first parameter provides the last name of a person, and the second parameter provides a form of address.
def greet(lastName, formOfAddress):
    return f'Hello {formOfAddress} {lastName}!'

print(greet('Smith', 'Mrs.'))

Output: Hello Mrs. Smith!

Note how the first value used in the function call, "Smith", is assigned to the first positional parameter lastName and the second value, "Mrs.", to the second positional parameter formOfAddress.
The parameter list of a function definition can also contain one or more optional parameters. These are created by using a parameter name and setting a default value in the function declaration.
<argument name>=<default value>
Calling a function while explicitly assigning values to parameter names is referred to as using keyword arguments, written with the same notation:
<name>=<value/expression>
Here is a new version of our greet function that now supports English and Spanish, but with English being the default language:
def greet(lastName, formOfAddress, language='English'):
    greetings = {'English': 'Hello', 'Spanish': 'Hola'}
    return f'{greetings[language]} {formOfAddress} {lastName}!'

print(greet('Smith', 'Mrs.'))
print(greet('Rodriguez', 'Sr.', language='Spanish'))

Output: Hello Mrs. Smith! Hola Sr. Rodriguez!
Compare the two different ways in which the function is called in the last two lines. In the first call, we do not provide a value for the "language" parameter, so the default value "English" is used when looking up the proper greeting in the dictionary stored in variable greetings. In the second call, the value 'Spanish' is provided for the keyword argument "language", so this is used instead of the default value and the person is greeted with "Hola" instead of "Hello". Keyword arguments can also be used like positional arguments, meaning the second call could also have been written:
print(greet('Rodriguez', 'Sr.', 'Spanish'))

without the “language=” before the value.
Things get more interesting when there are several keyword arguments, so let’s add another one for the time of day:
def greet(lastName, formOfAddress, language = 'English', timeOfDay = 'morning'):
    greetings = {'English': {'morning': 'Good morning', 'afternoon': 'Good afternoon'},
                 'Spanish': {'morning': 'Buenos dias', 'afternoon': 'Buenas tardes'}}
    return f'{greetings[language][timeOfDay]}, {formOfAddress} {lastName}!'

print(greet('Smith', 'Mrs.'))
print(greet('Rodriguez', 'Sr.', language='Spanish', timeOfDay='afternoon'))

Output: Good morning, Mrs. Smith! Buenas tardes, Sr. Rodriguez!
Since we now have four different forms of greetings depending on two parameters (language and time of day), we store these in a dictionary in variable greetings that for each key (= language) contains another dictionary for the different times of day. For simplicity, we left it at two times of day, namely “morning” and “afternoon.” In the return statement, we first use the variable language as the key to get the inner dictionary based on the given language and then directly follow up with using variable timeOfDay as the key for the inner dictionary.
The two ways we are calling the function in this example are the two extreme cases of (a) providing none of the keyword arguments, in which case default values will be used for both of them (first call), and (b) providing values for both of them (second call). However, we could now also just provide a value for the time of day if we want to greet an English person in the afternoon:
print(greet('Rogers', 'Mrs.', timeOfDay='afternoon'))

Output: Good afternoon, Mrs. Rogers!
This is an example in which we have to use the prefix “timeOfDay=” because if we leave it out, it will be treated like a positional parameter and used for the parameter ‘language’ instead which will result in an error when looking up the value in the dictionary of languages. For similar reasons, keyword arguments must always come after the positional arguments in the definition of a function and in the call. However, when calling the function, the order of the keyword arguments doesn’t matter, so we can switch the order of ‘language’ and ‘timeOfDay’ in this example:
print(greet('Rodriguez', 'Sr.', timeOfDay='afternoon', language='Spanish'))

It is also possible in Python to have function definitions that only use optional keyword arguments.
1.4.2 Functions with an arbitrary number of parameters
Let us continue with the “greet” example, but let’s make it a bit simpler again with a single parameter for picking the language, and instead of using last name and form of address we just go with first names. However, we now want to be able to greet not only a single person but arbitrarily many persons, like this:
greet('English', 'Jim', 'Michelle')

Output: Hello Jim! Hello Michelle!

greet('Spanish', 'Jim', 'Michelle', 'Sam')

Output: Hola Jim! Hola Michelle! Hola Sam!
To achieve this, the parameter list of the function needs to end with a special parameter that has a * symbol in front of its name. If you look at the code below, you will see that this parameter is treated like a list in the body of the function:
def greet(language, *names):
    greetings = {'English': 'Hello', 'Spanish': 'Hola'}
    for n in names:
        print(f'{greetings[language]} {n}!')

What happens is that all values given to the function from the position of the parameter with the * onward are collected into a tuple (which can be looped over just like a list) and assigned to that parameter. This way you can provide as many parameters as you want with the call and the function code can iterate through them in a loop. Please note that for this example we changed things so that the function directly prints out the greetings rather than returning a string.
We also changed language to a positional parameter because if you want to use keyword arguments in combination with an arbitrary number of parameters, you need to write the function in a different way. You then need to provide another special parameter starting with two stars ** and that parameter will be assigned a dictionary with all the keyword arguments provided when the function is called. Here is how this would look if we make language a keyword parameter again:
def greet(*names, **kwargs):
    greetings = {'English': 'Hello', 'Spanish': 'Hola'}
    language = kwargs['language'] if 'language' in kwargs else 'English'
    for n in names:
        print(f'{greetings[language]} {n}!')

If we call this function as
greet('Jim', 'Michelle')

the output will be:
Hello Jim! Hello Michelle!
And if we use
greet('Jim', 'Michelle', 'Sam', language='Spanish')

we get:
Hola Jim! Hola Michelle! Hola Sam!
All non-keyword parameters are again collected (as before, in a tuple) and assigned to variable names. All keyword parameters are placed in a dictionary using the name appearing before the equal sign as the key, and the dictionary is assigned to variable kwargs. To really make the 'language' keyword argument optional, we check whether something is stored under the key 'language' in the dictionary (this is an example of using the ternary "... if ... else ..." operator). If yes, we use the stored value and assign it to variable language; otherwise, we use 'English' as the default value. Inside the loop, language is then used to get the correct greeting from the dictionary in variable greetings while looping through the names.
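As an aside, the same default handling can be written a bit more compactly with the dict.get(...) method from section 1.3.4; this is simply an equivalent variant of the function above:

def greet(*names, **kwargs):
    greetings = {'English': 'Hello', 'Spanish': 'Hola'}
    language = kwargs.get('language', 'English')  # fall back to 'English' if the keyword was not provided
    for n in names:
        print(f'{greetings[language]} {n}!')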
1.4.3 Variables: local vs. global, mutable vs. immutable
When making the transition from a beginner to an intermediate or advanced Python programmer, it is important to understand the intricacies of variable scope. First of all, we can distinguish between global and local variables within a Python script. Global variables are defined at the top level of the script, outside of any function (including within the if __name__ == "__main__": block), or by using the global keyword in front of the variable name inside a function. They can be accessed from anywhere in the script, and they exist and keep their values as long as the script is loaded.
In contrast, local variables are defined inside a function (its parameters, or variables first assigned within the function body) and can only be accessed within the body of that function. The local variables are destroyed once the function is done executing.
Here are a few examples to illustrate the concepts of global and local variables and how to use them in Python.
def doSomething(x): # parameter x is a local variable of the function
    count = 1000 * x # local variable count is introduced
    return count
y = 10 # global variable y is introduced
print(doSomething(y))
print(count) # this will result in an error
print(x) # this will also result in an error

This example introduces one global variable, y, and two local variables, x and count, both part of the function doSomething(…). x is a parameter of the function, while count is introduced in the body of the function. When the function is called, the local variable x is created and assigned the value that is currently stored in global variable y, so the integer number 10. Then the body of the function is executed, and an assignment is made to variable count. Since this variable hasn’t been introduced in the function body before, a new local variable is created and assigned the value 10000. After executing the return statement, both x and count are discarded. Hence, the two print statements at the end of the code lead to errors because they try to access variables that do not exist anymore.
Now let’s change the example to the following:
def doSomething():
    count = 1000 * y # global variable y is accessed here
    return count
y = 10
print(doSomething())

This example shows that global variable y can also be directly accessed from within the function doSomething(): when Python encounters a variable name that is neither the name of a parameter of that function nor has been introduced via an assignment previously in the body of that function, it will look for that variable among the global variables. However, the first version using a parameter instead is usually preferable because then the code in the function doesn’t depend on how you name and use variables outside of it. That makes it much easier to, for instance, re-use the same function in different projects.
So maybe you are wondering whether it is also possible to change the value of a global variable from within a function, not just read its value? One attempt to achieve this could be the following:
def doSomething():
    count = 1000
    y = 5
    return count * y
y = 10
print(doSomething())
print(y) # output will still be 10 here

However, if you run the code, you will see that the last line still produces the output 10, so the global variable y hasn't been changed by the assignment inside the function. That is because the rule is that if a name is encountered on the left side of an assignment in a function, it is considered a local variable. Since this is the first time an assignment to y is made in the body of the function, a new local variable with that name is created at that point and will shadow the global variable with the same name until the end of the function has been reached. Instead, you explicitly have to tell Python that a variable name should be interpreted as the name of a global variable by using the keyword ‘global’, like this:
def doSomething():
    count = 1000
    global y # tells Python to treat y as the name of the global variable
    y = 5    # as a result, global variable y is assigned a new value here
    return count * y
y = 10
print(doSomething())
print(y) # output will now be 5 here

With the global statement, we are telling Python that y in this function should refer to the global variable y. As a result, the assignment y = 5 changes the value of the global variable called y and the output of the last line will be 5. While it's good to know how these things work in Python, we again want to emphasize that accessing global variables from within functions should be avoided as much as possible. Passing values via parameters and returning values is usually preferable because it keeps different parts of the code as independent of each other as possible and provides insight into what data the function is working with.
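For comparison, here is a minimal sketch of how the same computation could be written without touching a global variable, by passing y in and returning both results:

def doSomething(y):
    count = 1000
    y = 5                   # only the local y changes here
    return count * y, y     # return both values instead of modifying a global

y = 10
product, y = doSomething(y)
print(product)  # 5000
print(y)        # 5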
So after talking about global vs. local variables, what is the issue with mutable vs. immutable mentioned in the heading? There is an important difference in passing values to a function depending on whether the value is from a mutable or immutable data type. All values of primitive data types like numbers and boolean values in Python are immutable, meaning you cannot change any part of them. On the other hand, we have mutable data types like lists and dictionaries for which it is possible to change their parts: You can, for instance, change one of the elements in a list or what is stored under a particular key in a given dictionary without creating a completely new object.
What about strings and tuples? You may think these are mutable objects, but they are actually immutable. While you can access a single character from a string or element from a tuple, you will get an error message if you try to change it by using it on the left side of the equal sign in an assignment. Moreover, when you use a string method like replace(…) to replace all occurrences of a character by another one, the method cannot change the string object in memory for which it was called but has to construct a new string object and return that to the caller.
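A quick example makes this concrete:

s = 'banana'
print(s[0])             # 'b' -- reading a character is fine
# s[0] = 'B'            # would raise: TypeError: 'str' object does not support item assignment
t = s.replace('a', 'o') # replace() builds and returns a new string
print(s)                # banana -- the original string is unchanged
print(t)                # bonono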
Why is that important to know in the context of writing functions? Because mutable and immutable data types are treated differently when provided as a parameter to functions as shown in the following two examples:
def changeIt(x):
    x = 5 # this does not change the value assigned to y
y = 3
changeIt(y)
print(y) # will print out 3

As we already discussed above, the parameter x is treated as a local variable in the function body. We can think of it as being assigned a copy of the value that variable y contains when the function is called. As a result, the value of the global variable y doesn’t change and the output produced by the last line is 3. But it only works like this for immutable objects, like numbers in this case! Let’s do the same thing for a list:
def changeIt(x):
    x[0] = 5 # this will change the list y refers to
y = [3, 5, 7]
changeIt(y)
print(y) # output will be [5, 5, 7]

The output [5, 5, 7] produced by the print statement in the last line shows that the assignment inside the function changed the list object that is stored in global variable y. How is that possible? Well, for values of mutable data types like lists, assigning the value to function parameter x cannot be conceived as creating a copy of that value and, as a result, having the value appear twice in memory. Instead, x is set up to refer to the same list object in memory as y. Therefore, any change made with the help of either variable x or y will change the same list object in memory. When variable x is discarded after the function body has been executed, variable y will still refer to that modified list object. Maybe you have already heard the terms “call-by-value” and “call-by-reference” in the context of assigning values to function parameters in other programming languages. What happens for immutable data types in Python works like “call-by-value,” while what happens for mutable data types works like “call-by-reference.” If you feel like learning more about the details of these concepts, check out this article on Parameter Passing.
While the reasons behind these different mechanisms are very technical and related to efficiency, this means it is actually possible to write functions that take parameters of mutable type as input and modify their content. This is common practice (in particular for class objects, which are also mutable) and not generally considered bad style because it is based on function parameters and the code in the function body does not have to know anything about what happens outside of the function. Nevertheless, returning a new object as the return value of the function rather than changing a mutable parameter is often preferable, as in the small sketch below.
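As a quick sketch of that alternative, the changeIt() example from above could return a new list instead of modifying the one passed in:

def changed(values):
    return [5] + values[1:]   # build and return a new list; the argument is left untouched

y = [3, 5, 7]
y2 = changed(y)
print(y)   # [3, 5, 7] -- unchanged
print(y2)  # [5, 5, 7]

This brings us to the last part of this section.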
1.4.4 Multiple return values
It happens quite often that you want to return multiple things as the result of a function, but a function can only have one return value. To work around this, we simply return a tuple, list, or dictionary with the different components you want to return. For instance, to return the four coordinates describing the bounding box of a polygon, we can return a tuple with the four coordinates. Python has a useful mechanism to help with unpacking this single return value by allowing us to assign the elements of a tuple (or other sequences like lists) to several variables in a single assignment. Given a tuple (12, 3, 2, 2) assigned to t, instead of writing
t = (12, 3, 2, 2)
top = t[0]
left = t[1]
bottom = t[2]
right = t[3]

you can write

top, left, bottom, right = t

and it will have the exact same effect. The following example illustrates how this can be used with a function that returns a tuple of multiple return values. For simplicity, the function computeBoundingBox() in this example only returns a fixed tuple rather than computing the actual tuple values from a polygon given as an input parameter.
def computeBoundingBox():
    return (12, 3, 41, 32)
top, left, bottom, right = computeBoundingBox() # assigns the four elements of the returned tuple to individual variables
print(top) # output: 12

Or as a dictionary, which can also provide some insight into calculated results:
def computeBoundingBox(x):
    if x == 1:
        return {'top': 12, 'left': 3, 'bottom': 41, 'right': 32, 'success': True}
    else:
        return {'top': 12, 'left': 3, 'bottom': 41, 'right': 32, 'success': False}
bbox = computeBoundingBox(1) # returns the result as a dictionary
if bbox.get('success'): # if the process succeeded
    print(bbox['top']) # output: 12
1.4.5 The if __name__ == "__main__": conditional

The __name__ variable
You may have seen this line of code in other scripts and wonder what it is for. The conditional if __name__ == "__main__": plays a very important role when we are importing other functions as modules and is used to determine whether code should be executed or not. I don't expect you to fully understand this concept right away; the important thing here is to know that it exists and serves an important purpose.
A note on the terminology used in this section: when we say the script is "run directly" or "directly executed", we are referring to the starting script from which all other functionality and processing begins. It is the script being run, started, or passed to the interpreter, e.g., c:/path-to-env/python.exe "script.py".
When a Python script is executed, the interpreter process sets a few special variables, and one of them is __name__. If the script is being run directly, __name__ is set to "__main__". If the script is being imported as a module in another script, __name__ is set to the name of the script/module.
Why use if __name__ == "__main__":?
This construct allows you to include code that you want to run only when the script is executed directly, and not executed when the script is imported by another script as a module. It is especially useful for including script function test code, setting exclusive script function arguments for one-off executions, and ensuring multiprocessing works correctly.
Consider the following two scripts:
script_A.py
import sys
def greet():
    print("Hello from script_A!")

def update(script_name):
    print(f"updated: {script_name}")

if __name__ == "__main__":
    print("Running script_A directly.")
    greet()
    update(__name__)

    # Print the sys modules to view the set names
    for module_name, module in sys.modules.items():
        if hasattr(module, '__name__'):  # Check if the module has a __name__ attribute
            print(f"Script A Module Name: {module_name}, __name__: {module.__name__}")
script_B.py
import sys
import script_A
print("Importing script_A at the top level...")
script_A.greet()
if __name__ == "__main__":
    print("Running script_B directly.")
    script_A.update(f"from {__name__}")

    # Print the script's sys modules to view the set names
    for module_name, module in sys.modules.items():
        if hasattr(module, '__name__'):  # Check if the module has a __name__ attribute
            print(f"Script B Module Name: {module_name}, __name__: {module.__name__}")
When script_A.py is run directly, it produces the following output (Note that this is a shortened list, yours will be much longer):
Running script_A directly.
Hello from script_A!
updated: __main__
...
Script A Module Name: abc, __name__: abc
Script A Module Name: io, __name__: io
Script A Module Name: __main__, __name__: __main__
Script A Module Name: _stat, __name__: _stat
Script A Module Name: stat, __name__: stat
...
When script_B.py is run directly, it imports script_A and produces this output:
Importing script_A at the top level...
Hello from script_A!
Running script_B directly.
updated: from __main__
...
Script B Module Name: script_A, __name__: script_A
Script B Module Name: __main__, __name__: __main__
...
Multiprocessing Safety
Using the if __name__ == "__main__": block is critical when working with multiprocessing, which we will discuss in more detail in a later section, so do not stress too much about what the multiprocessing code is doing; focus instead on where the code sits within the script. Consider this script:
safe_multiprocessing.py
import multiprocessing
def worker_function(number):
    print(f"Worker {number} is working.")

if __name__ == "__main__":
    print("Starting multiprocessing safely.")
    processes = []
    for i in range(3):
        process = multiprocessing.Process(target=worker_function, args=(i,))
        processes.append(process)
        process.start()

    for process in processes:
        process.join()
Running this script produces the following output; the multiprocessing code executes safely because it sits inside the __main__ block, and code in this block only runs in the process that executes the script directly:
Starting multiprocessing safely.
Worker 0 is working.
Worker 1 is working.
Worker 2 is working.
If we wanted to use safe_multiprocessing's worker_function in another script, we would be able to do so safely without executing the multiprocessing code block, since the script's __name__ would be set to 'safe_multiprocessing' when imported into the other script. The imported script's (safe_multiprocessing, in this case) if __name__ == '__main__': condition would evaluate to False and prevent the code within it from executing. We will see another variant of using a function within this conditional block in section 1.6.5.1, when we convert our HiHo Cherry-O game into a multiprocessing script.
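For example, a second (hypothetical) script could reuse the worker like this without triggering the process-creation code:

# another_script.py (hypothetical file name)
import safe_multiprocessing   # on import, the module's __name__ is 'safe_multiprocessing', not '__main__'

safe_multiprocessing.worker_function(42)   # runs once here; the guarded multiprocessing block is skipped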
Common Pitfalls
If you omit the if __name__ == "__main__": block in a multiprocessing script, you may encounter infinite recursion or a RuntimeError. For example:
import multiprocessing
def worker_function(number):
    print(f"Worker {number} is working.")

# top-level multiprocessing code (not safe)
processes = []
for i in range(3):
    process = multiprocessing.Process(target=worker_function, args=(i,))
    processes.append(process)
    process.start()

for process in processes:
    process.join()
Running this script can cause Python to repeatedly re-import the script and execute the unprotected top-level code each time it imports, leading to an infinite loop or crash.
If the script is imported as a module, any top-level code (code not within a function, class, or conditional) will execute when it is read in. If you forget to comment out test code, it may create a logical error by changing variables and paths. Even if unintended execution doesn’t cause issues, not using if __name__ == "__main__": makes the script harder to maintain in terms of compartmentalization. Developers expect reusable code (functions, classes) at the top, with executable code guarded inside if __name__ == "__main__": for testing or setting variables when the script is executed directly.
Key Takeaways
- The if __name__ == "__main__": construct separates script functionality from module functionality.
- It keeps reusable code in functions and classes, allowing a script to be run standalone or imported as a module.
- It prevents unintended execution of code during import and ensures safe multiprocessing.
1.5 Working with Python and arcpy in ArcGIS Pro
Now that we’re all warmed up with some Python, we’ll start getting familiar with Python 3 in ArcGIS Pro by exploring how we write code and deploy tools, just as we did when we started out in GEOG 485.
We’ll cover the conda environment that ArcGIS Pro uses in more detail in Lesson 2, but for now it might be helpful to think of conda as a box or container that Python 3 and all of its parts sit inside. In order to access Python 3, we’ll need to open the conda box, and to do that we will need a command prompt with administrator privileges.
Since version 2.3 of ArcGIS Pro, it is no longer possible to modify the default Python environment (see here for details). If you already have a working Pro + PyScripter setup (e.g. from GEOG 485) and it is at least Pro version 3.1, you can keep using this installation for this class. Otherwise, I'd recommend you work with the newest version, so you will first have to create a clone of Pro's default Python environment and make it the active environment of ArcGIS. In the past, students sometimes had problems with the cloning operation that we were able to solve by running Pro in admin mode.
Therefore, we recommend that before performing the following steps, you exit Pro and restart it in admin mode by doing a right-click -> Run as administrator. Then go back to "Project" -> "Python", click on "Manage Environments", and then click on "Clone Default" in the Manage Environments dialog that opens up. Installing the clone will take some time (you can watch the individual packages being installed within the "Manage Environments" window, and you may be prompted to restart ArcGIS Pro to effect your changes); when it's done, the new environment "arcgispro-py3-clone" (or whatever you choose to call it - but we'll be assuming it's called the default name) can be activated by clicking on the button on the left.
Do so and also note down the path where the cloned environment has been installed appearing below the name. You can highlight the path under the environment name and ctrl+c to copy it to your clipboard. It should be something like C:\Users\<username>\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone or C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3-clone. Then click the OK button.
Important: In the past, the cloned environment would most likely become unusable when you update Pro to a newer main version (e.g. from 2.9 to 3.x or 3.1 to 3.x). Once you have cloned the environment successfully, please do not update your Pro installation before the end of the class, unless you are willing to do the cloning again. There is a function in V3.x and later of Pro that tries to update your active environment, but might not always work as expected and can leave your curated environments invalid.
Now back at the package manager, the new Python environment should appear under "Active Environment" (where arcgispro-py3 is shown in the figure below). This might take 15+ minutes, so you'll need to be patient while the dependencies resolve and install. You can hover over the process and it will open a command window that you can watch as it works through everything.
[Figure: ArcGIS Pro package manager showing the active Python environment]
The package manager will show you a list of packages that will have to be installed and ask you to agree to the terms and conditions. After doing that, the installation will start and probably take a while. You may also get a "User Access Control" window popup asking if you want conda_uac.exe to make changes to your device; it is OK to choose Yes.
The next step is to map your IDE to the python.exe in the cloned environment. Each IDE is a little different, and if you are unsure how it is done, instructions can be found on the web with a simple search like 'Map python.exe to <your IDE>'. Use the path you noted in the earlier step to map to your cloned environment.
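Once the IDE is pointed at the cloned environment, a quick sanity check is to run a few lines like the following from that interpreter (the exact clone path will differ on your machine):

import sys
print(sys.executable)   # should end in ...\envs\arcgispro-py3-clone\python.exe
import arcpy            # should import without errors if the mapping is correct
print(arcpy.GetInstallInfo()['Version'])   # reports the ArcGIS Pro version the environment belongs to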
Next up, we'll go over creating a script tool in Pro.
1.5.1 Making a Script Tool
We'll use a simple raster calculation process for our script tool that finds all cells over 3500 meters in an elevation raster and makes a new raster that codes all those cells as 1. The remaining values in the new raster are coded as 0. By now, you’re probably familiar with this type of “map algebra” operation, which is common in site selection and other GIS scenarios.
Just in case you’ve forgotten, the expression Raster(inRaster) tells arcpy that it needs to treat your inRaster variable as a raster dataset so that you can perform map algebra on it. If you didn't do this, the script would treat inRaster as just a literal string of characters (the path) instead of a raster dataset.
# This script uses map algebra to find values in an elevation raster greater than 3500 (meters).
import arcpy
from arcpy.sa import *
# Specify the input raster
inRaster = "C:/Data/Elevation/foxlake"
cutoffElevation = 3500
# Check out the Spatial Analyst extension
arcpy.CheckOutExtension("Spatial")
# Make a map algebra expression and save the resulting raster
outRaster = Raster(inRaster) > cutoffElevation
outRaster.save("C:/Data/Elevation/foxlake_hi_10")
# Check in the Spatial Analyst extension now that you're done
arcpy.CheckInExtension("Spatial")
You can probably easily work out what this script is doing but, just in case, the main points to remember on this script are:
- Notice the lines of code that check out the Spatial Analyst extension before doing any map algebra and check it back in after finishing. Because each line of code takes some time to run, avoid putting unnecessary code between checkout and checkin. This allows others in your organization to use the extension if licenses are limited. The extension automatically gets checked back in when your script ends, thus some of the Esri code examples you will see do not check it in. However, it is a good practice to explicitly check it in, just in case you have some long code that needs to execute afterward, or in case your script crashes and against your intentions "hangs onto" the license.
- inRaster begins as a string, but is then used to create a Raster object once you run Raster(inRaster). A Raster object is a special object used for working with raster datasets in ArcGIS. It's not available in just any Python script: you can use it only if you import the arcpy module (and, as in this script, arcpy.sa) at the top of your script.
- cutoffElevation is a number variable that you declare early in your script and then use later on when you build the map algebra expression for your outRaster.
- The expression outRaster = Raster(inRaster) > cutoffElevation is saying, in plain terms, "Make a new raster and call it outRaster. Do this by taking all the cells of the raster dataset at the path of inRaster that are greater than the number assigned to the variable cutoffElevation."
- outRaster is also a Raster object, but you have to call the method outRaster.save() in order to make it permanent on disk. The save() method takes one argument, which is the path to which you want to save.
Copy the code above into a file called Lesson1A.py (or similar as long as it has a .py extension) in your favorite IDE or text editor and then save it.
Next, we'll convert the script to a Tool.
1.5.1.1 Converting the script to a tool
Now, let’s convert this script to a script tool in ArcGIS Pro to familiarize ourselves with the process, and we’ll examine the differences between ArcGIS Desktop and ArcGIS Pro when it comes to working with script tools (hint: there aren’t any, other than the interface looking slightly different).
We’ll get started by opening ArcGIS Pro. You will be prompted to sign in (use your Penn State ArcGIS Online account, which you should already have) and create a project when Pro starts.
Signing in to ArcGIS Pro is an important, new development for running code in Pro as compared to Desktop. As you may be aware, Pro operates with a different licensing structure, such that it will regularly "phone home" to Esri's license servers to check that you have a valid license. With Desktop, once you had installed it and set up your license, you could run it for the 12 months the license was valid, online or offline, without any issues. As Pro will regularly check in with Esri, we need to be mindful that if our code stops working due to an extension not being licensed error or due to a more generic licensing issue, we should check that Pro is still signed in. For nearly everyone, this won't be an issue as you'll generally be using Pro on an Internet connected computer, and you won't notice the licensing checks. If you take your computer offline for an extended period, you will need to investigate Esri's offline licensing options.
Projects are Pro’s way of keeping all of your maps, layouts, tasks, data, toolboxes etc. organized. If you’re coming from Desktop, think of it as an MXD with a few more features (such as allowing multiple layouts for your maps).
Choose to create a new project using the Blank template, give it a meaningful name and put it in a folder appropriate for your local machine (things will look slightly different in version 3.0 of Pro: simply click on the Map option under New Project there if you are using that version).
You will then have Pro running with your own toolbox already created. In the figure below, I’ve clicked on the Toolboxes to expand it to show the toolbox, which has the same name as my project.
If we right-click on our toolbox we can choose to create a New > Script.
A window will pop up allowing us to enter a name for our script (“Lesson1A”) and a label for our script (“Geog 489 Lesson 1A”), and then we’ll use the file browse icon to locate the script file we saved earlier. In new versions of Pro (2.9 and 3.0), the script file now has to be selected in a new tab called "Execution" located below "Parameters". If your script isn’t showing up in that folder, or you get a message that says “Container is empty” press F5 on your keyboard to refresh the view.
We won’t choose to “Import Script” or define any parameters (yet) or investigate validation (yet). When we click OK, we’ll have our script tool created in Pro. We’re not going to run our script tool (yet) as it’s currently expecting to find the foxlake DEM data in C:\data\elevation and write the results back to that folder, which is not very convenient. It also has the hardcoded cutoff of 3500 embedded in the code. You can download the FoxLake DEM.
To make the script more user-friendly, we’re going to make a few changes to allow us to pick the location of the input and output files, as well as allow the user to input the cutoff value. Later we’ll also use validation to check whether that cutoff value falls inside the range of values present in the raster and, if not, we’ll change it.
We can edit our script from within Pro, but if we do that it opens in Notepad, which isn't the best environment for coding. You can use Notepad if you like, but I'd suggest opening the script again in your favorite text editor (I like Notepad++) or just using Spyder.
If you want, you can change this preferred editor by modifying Pro’s geoprocessing options (see http://pro.arcgis.com/en/pro-app/help/analysis/geoprocessing/basics/geoprocessing-options.htm). To access these options in Pro, click Home -> Options -> Geoprocessing Options. Here, you can also choose an option to automatically validate tools and scripts for Pro compatibility (so you don’t need to run the Analyze Tools for Pro manually each time).
We're going to make a few changes to our code now, swapping out the hardcoded input raster path and output path as well as the hardcoded cutoffElevation value. We're also setting up an outPath variable and setting it to arcpy.env.workspace.
You might recall from GEOG 485 or your other experience with Desktop that the default workspace in Desktop is usually default.gdb in your user path. Pro is smarter than that and sets the default workspace to be the geodatabase of your project. We'll take advantage of that to put our output raster into our project workspace. Note the difference in how we retrieve the two parameters, inRaster and cutoffElevation. It's ok for us to get the path as Text, but we don't want to get the number in cutoffElevation as Text because we need it to be a number.
To simplify the programming, we’ll specify a different parameter type in Pro and let that be passed through to our script. To make that happen, we’ll use GetParameter instead of GetParameterAsText.
# This script uses map algebra to find values in an
# elevation raster greater than 3500 (meters).
import arcpy
from arcpy.sa import *
# Specify the input raster
inRaster = arcpy.GetParameterAsText(0)
cutoffElevation = arcpy.GetParameter(1)
outPath = arcpy.env.workspace
# Check out the Spatial Analyst extension
arcpy.CheckOutExtension("Spatial")
# Make a map algebra expression and save the resulting raster
outRaster = Raster(inRaster) > cutoffElevation
outRaster.save(outPath+"/foxlake_hi_10")
# Check in the Spatial Analyst extension now that you're done
arcpy.CheckInExtension("Spatial")
Once you have made those changes, save the file and we’ll go back to our script tool in Pro and update it to use the parameters we’ve just defined. Right click on the script tool within the toolbox and choose Properties and then click Parameters. The first parameter we defined (remember Python counts from 0) was the path to our input raster (inRaster), so let's set that up. Click in the text box under Label and type “Input Raster” and when you click into Name you’ll see that Name is already automatically populated for you. Next, click the Data Type (currently String) and change it to “Raster Dataset” and we’ll leave the other values with their defaults.
Click the next Label text box below your first parameter (currently numbered with a *) and type “Cutoff Value” and change the Data Type to Long (which is a type of number) and we’ll keep the rest of the defaults here too. The final version should look as in the figure below.
Click OK and then we’ll run the tool to test the changes we made by double-clicking it. Use the file icon alongside our Input Raster parameter to navigate to your foxlake raster (which is the FoxLake digital elevation model (DEM) in your Lesson 1 data folder) and then enter 3500 into the cutoff value parameter and click OK to run the tool.
The tool should have executed without errors and placed a raster called foxlake_hi_10 into your project geodatabase.
If it doesn’t work the first time, verify that:
- you have supplied the correct input and output paths;
- your path name contains forward slashes (/) or double backslashes (\\), not single backslashes (\);
- the Spatial Analyst Extension is available. To check this, go to Project -> Licensing and check under Esri Extensions;
- you do not have any of the datasets open in ArcGIS;
- the output data does not exist yet. If you want to be able to overwrite the output, you need to add the line arcpy.env.overwriteOutput = True immediately after import arcpy (see the short sketch below).
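For reference, here is a minimal sketch (not the full assignment script) of how the top of the script might look with overwriting enabled; the rest of the script stays as shown above:
# Enable overwriting of existing outputs right after importing arcpy
import arcpy
from arcpy.sa import *

arcpy.env.overwriteOutput = True  # lets a re-run replace an existing foxlake_hi_10

# ... the rest of the script (parameters, map algebra, save) continues unchanged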
1.5.1.2 Optional - Adding tool validation code
Now let's expand on the user friendliness of the tool by using the validator methods to ensure that our cutoff value falls within the minimum and maximum values of our raster (otherwise performing the analysis is a waste of resources).
The purpose of the validation process is to allow us to have some customizable behavior depending on what values we have in our tool parameters. For example, we might want to make sure a value is within a range as in this case (although we could do that within our code as well), or we might want to offer a user different options if they provide a point feature class instead of a polygon feature class, or different options if they select a different type of field (e.g. a string vs. a numeric type).
The Esri help for Tool Validation gives a longer list of uses and also explains the difference between internal validation (what Desktop & Pro do for us already) and the validation that we are going to do here which works in concert with that internal validation.
You will notice in the help that Esri specifically tells us not to do what I’m doing in this example – running geoprocessing tools. The reason for this is they generally take a long time to run. In this case, however, we’re using a very simple tool which gets the minimum & maximum raster values and therefore executes very quickly. We wouldn’t want to run an intersection or a buffer operation for example in the ToolValidator, but for something very small and fast such as this value checking, I would argue that it’s ok to break Esri’s rule. You will probably also note that Esri hints that it’s ok to do this by using Describe to get the properties of a feature class and we’re not really doing anything different except we’re getting the properties of a raster.
So how do we do it? Go back to your tool (either in the Toolbox for your Project, Results, or the Recent Tools section of the Geoprocessing sidebar), right click and choose Properties and then Validation.
You will notice that we have a pre-written, Esri-provided class definition here. We will talk about how class definitions look in Python in Lesson 4 but the comments in this code should give you an idea of what the different parts are for. We’ll populate this template with the lines of code that we need. For now, it is sufficient to understand that different methods (initializeParameters(), updateParameters(), etc.) are defined that will be called by the script tool dialog to perform the operations described in the documentation strings following each line starting with def.
Take the code below and use it to overwrite what is in your ToolValidator:
import arcpy

class ToolValidator(object):
    """Class for validating a tool's parameter values and controlling
    the behavior of the tool's dialog."""

    def __init__(self):
        """Setup arcpy and the list of tool parameters."""
        self.params = arcpy.GetParameterInfo()

    def initializeParameters(self):
        """Refine the properties of a tool's parameters. This method is
        called when the tool is opened."""

    def updateParameters(self):
        """Modify the values and properties of parameters before internal
        validation is performed. This method is called whenever a parameter
        has been changed."""

    def updateMessages(self):
        """Modify the messages created by internal validation for each tool
        parameter. This method is called after internal validation."""
        ## Remove any existing message on the cutoff value parameter
        self.params[1].clearMessage()
        if self.params[1].value is not None:
            ## Get the raster path/name from the first [0] parameter as text
            inRaster1 = self.params[0].valueAsText
            ## Determine the minimum value of the raster and store it in a variable
            elevMINResult = arcpy.GetRasterProperties_management(inRaster1, "MINIMUM")
            ## Determine the maximum value of the raster and store it in a variable
            elevMAXResult = arcpy.GetRasterProperties_management(inRaster1, "MAXIMUM")
            ## Convert those values to floating point numbers
            elevMin = float(elevMINResult.getOutput(0))
            elevMax = float(elevMAXResult.getOutput(0))
            ## Calculate a new cutoff value if the user-specified value isn't within the raster's range
            if self.params[1].value < elevMin or self.params[1].value > elevMax:
                cutoffValue = elevMin + ((elevMax - elevMin) / 100 * 90)
                self.params[1].value = cutoffValue
                self.params[1].setWarningMessage("Cutoff Value was outside the range [" + str(elevMin) + ", " + str(elevMax) + "] of the supplied raster, so a 90% value was calculated")
Our logic here is to take the raster supplied by the user and determine the min and max values so that we can evaluate whether the cutoff value supplied by the user falls within that range. If that is not the case, we're going to do a simple mathematical calculation to find the value 90% of the way between the min and max values and suggest that as a default to the user (by putting it into the parameter). We’ll also display a warning message to the user telling them that the value has been adjusted and why their original value doesn’t work.
As you look over the code, you’ll see that all of the work is being done in the bottom function updateMessages(). This function is called after the updateParameters() and the internal arcpy validation code have been executed. It is mainly intended for modifying the warning or error messages produced by the internal validation code. The reason why we are putting all our validation code here is because we want to produce the warning message and there is no entirely simple way to do this if we already perform the validation and potentially automatic adjustment of the cutoff value in updateParameters() instead. Here is what happens in the updateMessages() function:
We start by clearing any previous messages with self.params[1].clearMessage(). Then we check whether the user has entered a value into the cutoff value parameter (self.params[1]). If they haven't, we don't do anything (for efficiency). If the user has entered a value (i.e., the value is not None), we get the raster name from the first parameter (self.params[0]) and extract it as text (because we want to use the content as a path). We then call the arcpy GetRasterProperties function twice, once to get the minimum value and again to get the maximum value of the raster, and convert both results to floating point numbers.
Once we've done that, we do a little bit of checking to see if the value the user supplied is within the range of the raster. If it is not, we do some simple math to calculate a value that falls 90% of the way into the range and update the parameter (self.params[1].value) with the number we calculated. Finally, we produce the warning message informing the user of the automatic value adjustment.
Now let’s test our Validator. Click OK and return to your script in the Toolbox, Results or Geoprocessing window. Run the script again. Insert the name of the input raster again. If you didn’t make any mistakes entering the code there won’t be a red X by the Input Raster. If you did make a mistake, an error message will be displayed there, showing you the usual arcpy / geoprocessing error message and the line of code that the error is occurring on. If you have to do any debugging, exit the script, return to the Toolbox, right click the script and go back to the Tool Validator and correct the error. Repeat as many times as necessary.
If there were no errors, we should test out our validation by putting a value into our Cutoff Value parameter that we know to be outside the range of our data. If you choose a value < 2798 or > 3884, you should see a yellow warning triangle appear that displays our warning message, and you will also note that the value in Cutoff Value has been updated to our 90% value.
We can change the value to one we know works within the range (e.g. 3500), and now the tool should run.
1.6 Performance and how it can be improved
Now that we are back into the groove of writing arcpy code and creating script tools, we want to look at a topic that didn't play a big role in our introductory class, GEOG 485, but that is very important to programming. We are going to address the question of how we can improve the performance and reliability of our Python code when dealing with more complicated tasks that require a larger number of operations on a greater number of datasets and/or more memory. To do this, we're going to look at both 64-bit processing and multiprocessing. We're going to start investigating these topics using a simple raster script to process LiDAR data from Penn State's main campus and surrounding area. In later sections, we will also look at a vector data example using different data sets for Pennsylvania.
The raster data consists of 808 tiles which are all individually zipped, 550MB zipped in total. The individual .zip files can be downloaded from PASDA directly.
Previously, PASDA provided access via FTP, but unfortunately that ability has been removed. However, we recommend you use a little Python script we put together that uses BeautifulSoup (which we'll look at more in Lesson 2) to download the files. The script will also automatically extract the individual .zip files. For this, you have to do the following:
- Download and unzip the script.
- At the beginning of the main function, two folders are specified, the first one for storing the .zip files and the second for storing the extracted raster files. These are currently set to C:\temp and C:\temp\unzipped, but you may not have the permissions to write to these folders. We therefore recommend that you edit these variables to have the LiDAR files downloaded and extracted to folders within your Windows user's home directory, e.g. C:\Users\<username>\Documents\489 and C:\Users\<username>\Documents\489\unzipped (assuming that the folder 489 already exists in your Documents folder). The script uses a wildcard on line 66 that tells Python to only download tile files with 227 in the file name, not all of them. This is ok for running the code examples in this lesson, but if you want to do test runs with more or all of the files, you can edit this line so that it reads wildcard_pattern = "zip" (because "zip" exists in all the filenames).
- Run the script, and you should see the downloaded .zip files and extracted raster files appear in the specified target folders.
Doing any GIS processing with these LiDAR files is definitely a task to be handled by scripting, and any performance benefits we can gain when we're processing that many tiles will be worthwhile. The question you might be asking is: why don't we just join all of the tiles together and process them at once? Because we'd run out of memory very quickly, and if something went wrong we would need to start over. Processing small tiles one (or a few) at a time uses less memory, and if one tile fails we still have all of the others and only need to restart that tile.
Below is our simple raster script, which gets our list of tiles and then, for every tile in the list, fills the DEM, creates flow direction and flow accumulation rasters to derive a stream raster (to determine where the water might flow), and lastly converts the stream raster to a polyline feature class. This is a simplified version of the sort of analysis you might undertake to prepare data prior to performing a flood study. I've restricted the processing to a subset of those tiles for testing and performance reasons, using only tiles with 227 in the name, but more tiles can be included by modifying the wildCardList variable in the script.
If you used the download script above, you already have the downloaded raster files ready. You can move them to a new folder or keep them where they are. In any case, you will need to make sure that the workspace set via arcpy.env.workspace in the script below points to the folder containing the extracted raster files. If you obtained the raster files in some other way, you may have to unzip them to a folder first.
Let’s look over the code now.
# Setup _very_ simple timing.
import time
process_start_time = time.time()

import arcpy
from arcpy.sa import *

arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\489\PSU_LiDAR'

## If our rasters aren't in our filter list then drop them from our list.
def filter_list(fileList, filterList):
    return [i for i in fileList if any(j in i for j in filterList)]

# Ordinarily we would want all of the rasters; I'm filtering by a small set for testing & efficiency.
# I did this by manually looking up the tile index for the LiDAR and determining an area of interest
# tiles ending in 227, 228, 230, 231, 232, 233, 235, 236
wildCardList = set(['227'])  ##,'228','230','231','232','233','235','236'])

# Get a list of rasters in my folder
rasters = arcpy.ListRasters("*")
new_rasters = filter_list(rasters, wildCardList)

# For all of our rasters
for raster in new_rasters:
    raster_start_time = time.time()
    # Now that we have our list of rasters
    ## Note also for performance we're not saving any of the intermediate rasters - they will exist only in memory
    ## Fill the DEM to remove any sinks
    try:
        FilledRaster = Fill(raster)
        ## Calculate the Flow Direction (how water runs across the surface)
        FlowDirRaster = FlowDirection(FilledRaster)
        ## Calculate the Flow Accumulation (where the water accumulates in the surface)
        FlowAccRaster = FlowAccumulation(FlowDirRaster)
        ## Convert the Flow Accumulation to a Stream Network
        ## We're setting an arbitrary threshold of 100 cells flowing into another cell to set it as part of our stream
        ## http://pro.arcgis.com/en/pro-app/tool-reference/spatial-analyst/identifying-stream-networks.htm
        Streams = Con(FlowAccRaster, 1, "", "Value > 100")
        ## Convert the Raster Stream network to a feature class
        output_Polyline = raster.replace(".img", ".shp")
        arcpy.CheckOutExtension("Spatial")
        arcpy.sa.StreamToFeature(Streams, FlowDirRaster, output_Polyline)
        arcpy.CheckInExtension("Spatial")
    except:
        print("Errors occurred")
        print(arcpy.GetMessages())
        arcpy.AddMessage("Errors occurred")
        arcpy.AddMessage(arcpy.GetMessages())

# Output how long the whole process took.
arcpy.AddMessage("--- %s seconds ---" % (time.time() - process_start_time))
print("--- %s seconds ---" % (time.time() - process_start_time))

We have set up some very simple timing functionality in this script using the time() function defined in the module time of the Python standard library. The function gives you the current time and, by calling it at the beginning and end of the program and then taking the difference in the very last line of the script, we get an idea of the runtime of the script.
Later in the lesson, we will go into more detail about properly profiling code, where we will examine the performance of a whole program as well as individual instructions. For now, we just want an estimate of the execution time. Of course, it’s not going to be very precise as it will depend on what else you’re doing on your PC at the same time, and we would need to run a number of iterations to remove any inconsistencies (such as the delay when arcpy loads for the first time etc.). On my PC, that code runs in around 40 seconds. Your results will vary depending on many factors related to the performance of your PC (we'll review some of them in the Speed Limiters section) but you should test out the code to get an idea of the baseline performance of the algorithm on your PC.
Near the top of the script, we have a simple function, filter_list(), to filter our list of rasters to just those we want to work with (centered on the PSU campus). This function might look a little different from what you have seen before - that's because we're using a list comprehension, which we'll examine in more detail in Lesson 2. So don't worry about understanding exactly how this works at the moment. It basically says to return a list with only those file names from the original list that contain one of the numbers in the wildcard list.
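If the comprehension syntax looks cryptic, here is the same filtering logic written as an explicit loop, purely for illustration (the one-line version in the script behaves the same way):
def filter_list(fileList, filterList):
    # Keep a file name only if at least one of the filter strings appears in it
    keptFiles = []
    for i in fileList:
        if any(j in i for j in filterList):
            keptFiles.append(i)
    return keptFiles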
We set up some environment variables and our wildcard list (used by our function for filtering) - where you will notice I have commented out most of the list for speed during testing - and then we get our list of rasters, filter it, and iterate over the remaining rasters with the central for-loop, performing the spatial analysis tasks mentioned earlier. There is some basic error checking wrapped around those tasks, and then lastly there is a message and print function with the total time. I've included both print and AddMessage just in case you wanted to test the code as a script tool in ArcGIS.
Feel free to run the script now and see what total computation time you get from the print statement in the last line of the code.
Once we’ve examined the theory of 64-bit processing and parallelization and worked through a simple example using the Hi-ho Cherry-O game from GEOG 485, we’ll come back to the raster example above and convert it to running in parallel using the Python multiprocessing package instead of sequentially and we will further look at an example of multiprocessing using vector data.
1.6.1 32-bit vs. 64-bit processing
32-bit software or hardware can only directly represent 2^32 distinct values and, hence, can only address up to a maximum of 4GB of memory (2^32 bytes = 4294967296 bytes). If the file system of your operating system is limited to 32-bit integers as well, this also means you cannot have any single file larger than 4GB either in memory or on disk (you can still page or chain larger files together though).
64-bit architectures don't have this limit. Instead, you can access up to 16 terabytes of memory, though this is only a limit of current chip architectures, which "only" use 44 address bits; it will change over time as software and hardware architectures evolve. Technically, with a 64-bit architecture you could address 16 exabytes of memory (2^64 bytes) and, without wanting to paraphrase Bill Gates, that is probably more than we'll need for the foreseeable future.
There most likely won't be any innate performance benefits to be gained by moving from 32-bit to 64-bit unless you need that extra memory. While, in principle, you can move larger amounts of data between memory and CPU per unit of time with 64-bit, this typically doesn't result in significantly improved execution times because of caching and other optimization techniques used by modern CPUs. However, if we start using programming models where we run many tasks at once, we might want more than 4GB allocated to those processes. For example, if you had 8 tasks that each needed 500MB of RAM, that's very close to the 4GB limit in total (500MB * 8 = 4000MB). If you had a machine with more processors (e.g. 64), you would very easily hit the 32-bit 4GB limit, as you would only be able to allocate 62.5MB of RAM per processor from your code.
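If you are unsure whether the Python interpreter you are running is 32-bit or 64-bit, a quick check using only the standard library (a minimal sketch, nothing arcpy-specific) is:
import struct
import sys

# Size of a pointer ("P") in bits: 32 for a 32-bit interpreter, 64 for a 64-bit one
print(struct.calcsize("P") * 8)

# sys.maxsize is also much larger on 64-bit builds (2**63 - 1) than on 32-bit builds (2**31 - 1)
print(sys.maxsize)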
Even with hardware architectures and operating systems mainly being 64-bit these days, a lot of software is still only available in 32-bit versions. 64-bit operating systems are designed to be backwards compatible with 32-bit applications, and if there is no real expected benefit for a particular piece of software, its developer may just as well decide to stick with 32-bit and avoid the effort and cost it would take to make the change to 64-bit or to support multiple versions of the software. ArcGIS Desktop is an example of software that was only available as 32-bit but has since been replaced by ArcGIS Pro, which is 64-bit.
1.6.3 Parallel processing
You have probably noticed, if you have a relatively modern PC (anything from the last several years), that when you open Windows Task Manager (from the bottom of the list when you press CTRL-ALT-DEL), click the Performance tab, right-click on the CPU graph and choose Change Graph to -> Logical Processors, you have a number of processors (or cores) within your PC. These are actually "logical processors" within your main processor, but they function as though they were individual processors - and we'll just refer to them as processors here for simplicity.
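You can query the same count of logical processors from Python itself using the multiprocessing package that we will be working with later in this lesson:
import multiprocessing

# Number of logical processors visible to Python (should match what Task Manager shows)
print(multiprocessing.cpu_count())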
Now, because we have multiple processors, we can run multiple tasks in parallel at the same time instead of one at a time. There are two ways that we can run tasks at the same time: multithreading and multiprocessing. We'll look at the differences between them in the following sections, but it's important to know that arcpy doesn't support multithreading, although it does support multiprocessing. In addition, there is a third form of parallelization called distributed computing, which involves distributing the task over multiple computers, that we will also briefly talk about.
1.6.3.1 Multithreading
Multithreading is based on the notion of "threads" for a number of tasks that are executed within the same memory space. The advantage of this is that because the memory is shared between the threads they can share information. This results in a much lower memory overhead because information doesn't need to be duplicated between threads. The basic logic is that a single thread starts off a task and then multiple threads are spawned to undertake sub-tasks. At the conclusion of those sub-tasks all of the results are joined back together again. Those threads might run across multiple processors or all on the same one depending on how the operating system (e.g. Windows) chooses to prioritize the resources of your computer. In the example of the PC above which has 4 processors, a single-threaded program would only run on one processor while a multi-threaded program would run across all of them (or as many as necessary).
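arcpy doesn't support multithreading, so we won't use this approach in the lesson, but here is a minimal, generic Python illustration of the shared-memory pattern described above (the partial_sum() function and the numbers are made up purely for illustration):
import threading

results = [0, 0]  # shared memory: both threads write into the same list

def partial_sum(index, start, end):
    # Each thread handles a sub-task and stores its result in the shared list
    results[index] = sum(range(start, end))

t1 = threading.Thread(target=partial_sum, args=(0, 0, 5000))
t2 = threading.Thread(target=partial_sum, args=(1, 5000, 10000))
t1.start()
t2.start()
t1.join()  # wait for both sub-tasks to finish ...
t2.join()
print(sum(results))  # ... then join the results back together

Note that, because of Python's Global Interpreter Lock, threads like these don't actually speed up pure CPU work; the point here is only to show the shared-memory pattern.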
1.6.3.2 Multiprocessing
Multiprocessing achieves broadly the same goal as multi-threading which is to split the workload across all of the available processors in a PC. The difference is that multiprocessing tasks cannot communicate directly with each other as they each receive their own allocation of memory. That means there is a performance penalty as information that the processes need must be stored in each one. In the case of Python a new copy of python.exe (referred to as an instance) is created for each process that you launch with multiprocessing. The tasks to run in multiprocessing are usually organized into a pool of workers which is given a list of the tasks to be completed. The multiprocessing library will assign each task to a worker (which is usually a processor on your PC) and then once a worker completes a task the next one from the list will be assigned to that worker. That process is repeated across all of the workers so that as each finishes a task a new one will be assigned to them until there are no more tasks left to complete.
You might have heard of the MapReduce framework which underpins the Hadoop parallel processing approach. The use of the term map might be confusing to us as GIS folks as it has nothing to do with our normal concept of maps for displaying geographical information. Instead in this instance map means to take a function (as in a programming function) and apply it once to every item in a list (e.g. our list of rasters from the earlier example).
The reduce part of the name is similar: we apply a function to a list of results and combine them into a single value (e.g. taking the number of turns from each of our 10,000 Hi-Ho! Cherry-O games and reducing them to a single average).
The two elements map and reduce work harmoniously to solve our parallel problems. The map part takes our one large task (which we have broken down into a number of smaller tasks and put into a list) and applies whatever function we give it to the list (one item in the list at a time) on each processor (which is called a worker). Once we have a result, that result is collected by the reduce part from each of the workers and brought back to the calling function. There is a more technical explanation in the Python documentation.
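To make the map and reduce terminology concrete, here is a tiny illustration using plain Python built-ins (no multiprocessing involved yet); the square() function is just an example:
# "Map": apply a function to every item in a list
def square(x):
    return x * x

squared = list(map(square, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# "Reduce": combine the mapped results into a single value
total = sum(squared)  # 30
print(squared, total)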
Multiprocessing in Python
Multiprocessing has been available in Python for some time and it’s a reasonably complicated concept so we will do our best to simplify it here. We’ll also provide a list of resources at the end of this section for you to continue exploring if you are interested. The multiprocessing package of Python is part of the standard library and has been available since around Python 2.6. The multiprocessing library is required if you want to implement multiprocessing and we import it into our code just like any other package using:
import multiprocessing

Using multiprocessing isn't as simple as switching from 32-bit to 64-bit. It does require some careful thought about which processes we can run in parallel and which need to run sequentially. There are also issues around file sharing and file locking, performance penalties where sometimes multiprocessing is slower due to the time taken to set up and tear down the multiprocessing pool, and some tasks that do not support multiprocessing. We'll cover all of these issues in the following sections and then we'll convert our simple, sequential raster processing example into a multiprocessing one to demonstrate all of these concepts.
1.6.3.3 Distributed processing
Distributed processing is a type of parallel processing that instead of (just) using each processor in a single machine will use all of the processors across multiple machines. Of course, this requires that you have multiple machines to run your code on but with the rise of cloud computing architectures from providers such as Amazon, Google, and Microsoft this is getting more widespread and more affordable. We won't cover the specifics of how to implement distributed processing in this class but we have provided a few links if you want to explore the theory in more detail.
In a nutshell what we are doing with distributed processing is taking our idea of multiprocessing on a single machine and instead of using the 4 or however many processors we might have available, we're accessing a number of machines over the internet and utilizing the processors in all of them. Hadoop is one method of achieving this and others include Amazon's Elastic Map Reduce, MongoDB and Cassandra. GEOG 865 has cloud computing as its main topic, so if you are interested in this, you may want to check it out.
1.6.4 Speed limiters
With all of these approaches to speeding up our code, what are the elements which will cause bottlenecks and slow us down?
Well, there are a few – these include the time to set up each of the processes for multiprocessing. Remember earlier we mentioned that because each process doesn’t share memory it needs a copy of the data to use. This will need to be copied to a memory location. Also as each process runs its own Python.exe instance, it needs to be launched and arcpy needs to be imported for each instance (although fortunately, multiprocessing takes care of this for us). Still, all of that takes time to start so our code won’t appear to do much at first while it is doing this housekeeping - and if we're not starting a lot of processes then we won't see enough of a speed up in processing to make up for those start-up time costs.
Another thing that can slow us down is the speed of our RAM. Access times for RAM used to be measured in nanoseconds, but now are measured in megahertz (MHz). The method of calculating the speed isn't especially important, but if you're moving large files around in RAM, or performing calculations that require getting a number out of RAM, adding, subtracting, multiplying, etc. and then putting the result into another location in RAM, and you're doing that millions of times, very small delays will quickly add up to seconds or minutes. Another speedbump is running out of RAM. While we can allocate more than 4GB per process using 64-bit programming, if we don't have enough RAM to complete all of the tasks that we might launch, then our operating system will start swapping between RAM (which is fast) and our hard disk (which isn't - even if it's one of the solid state types - SSDs).
Speaking of hard disks, it's very likely that we're loading and saving data to them, and as our disks are slower than our RAM and our processors, that is going to cause a delay. The less we need to load and save data the better, so good multiprocessing practice is to keep as much data as possible in RAM (see the caveat above about running out of RAM). The speed of disks is governed by a couple of factors: the speed that the motor spins (unless it is an SSD), the seek time, and the amount of cache that the disk has. Here is how these elements all work together to speed up (or slow down) your code. The hard disk receives a request for data from the operating system, which it then goes looking for. The seek time refers to how long it takes the disk to position the read head over the segment of disk the data is located on, which is a function of motor speed as well. Then, once the file is found, it needs to be loaded into memory - the cache - and then this is sent through to the process that needed the data. When data is written back to your disk, the reverse process takes place. The cache is filled (as memory is faster than disk) and then the cache is written to the disk. If the file is larger than the cache, the cache gets topped up as it starts to empty until the whole file is written. A slow spinning hard disk motor or a small amount of cache can both slow down this process.
It’s also possible that we’re loading data from across a network connection (e.g. from a database or remotely stored files) and that will also be slow due to network latency – the time it takes to get to and from the other device on the network with the request and the result.
We can also be slowed down by inefficient code, for example, using too many loops, an inefficient if / else / elif statement that we evaluate too many times, or a mathematical function that is slower than its alternatives. We'll examine these sorts of coding bottlenecks - or at least how to identify them - when we look at code profiling later in the lesson.
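As a small preview of what profiling will show us later, the standard library's timeit module can compare two ways of doing the same job; for example (a minimal sketch, and the exact numbers will differ on your PC), summing with an explicit loop versus the built-in sum():
import timeit

loop_version = """
total = 0
for i in range(10000):
    total += i
"""

builtin_version = "total = sum(range(10000))"

# Time each snippet 1000 times; the built-in version is typically noticeably faster
print(timeit.timeit(loop_version, number=1000))
print(timeit.timeit(builtin_version, number=1000))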
1.6.5 First steps with Multiprocessing
From the brief description in the previous section, you might have realized that there are generally two broad types of tasks: those that are input/output (I/O) heavy, which require a lot of data to be read, written or otherwise moved around; and those that are CPU (or processor) heavy, which require a lot of calculations to be done. Because getting data is the slowest part of our operation, I/O heavy tasks do not demonstrate the same improvement in performance from multiprocessing as CPU heavy tasks. When there are more CPU-based tasks to do, the benefit comes from splitting that workload among a range of processors so that they can share the load.
The other thing that can slow us down is output to the console or screen - though in multiprocessing code we tend to avoid printing anyway because the output can get messy. Think about two print statements executing at exactly the same time: you're likely to get the content of both intermingled, leading to a very difficult to understand message or an illogical order of messages. Even so, updating the screen with print statements comes with a cost.
To demonstrate this, try this sample piece of code that sums the numbers from 0-10000.
# Setup _very_ simple timing.
import time
start_time = time.time()

sum = 0
for i in range(0, 10000):
    sum += i
    print(sum)

# Output how long the process took.
print(f"--- {time.time() - start_time} seconds ---")
If I run it with the print function in the loop the code takes 0.046 seconds to run on my PC.
4278 4371 4465 4560 4656 4753 4851 4950 --- 0.04600026321411133 seconds ---
If I comment the print(sum) function out, the code runs in 0.0009 seconds.
--- 0.0009996891021728516 seconds ---
In Penn State's GEOG 485 course, we simulated 10,000 runs of the children's game Cherry-O to determine the average number of turns it takes. If we printed out the results, the code took a minute or more to run. If we skipped all but the final print statement, the code ran in less than a second. We'll revisit that Cherry-O example as we experiment with moving code from the single processor paradigm to multiprocessing. We'll start with it as a simple example and then move on to two GIS themed examples - one raster (using our raster calculation example from before) and one vector.
If you did not take GEOG 485, you may want to have a quick look at the Cherry-O description.
The following is the original Cherry-O code:
# Simulates 10K games of Hi Ho! Cherry-O
# Setup _very_ simple timing.
import time
start_time = time.time()

import random

spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
turns = 0
totalTurns = 0
cherriesOnTree = 10
games = 0

while games < 10000:
    # Take a turn as long as you have more than 0 cherries
    cherriesOnTree = 10
    turns = 0
    while cherriesOnTree > 0:
        # Spin the spinner
        spinIndex = random.randrange(0, 7)
        spinResult = spinnerChoices[spinIndex]
        # Print the spin result
        # print ("You spun " + str(spinResult) + ".")
        # Add or remove cherries based on the result
        cherriesOnTree += spinResult
        # Make sure the number of cherries is between 0 and 10
        if cherriesOnTree > 10:
            cherriesOnTree = 10
        elif cherriesOnTree < 0:
            cherriesOnTree = 0
        # Print the number of cherries on the tree
        # print ("You have " + str(cherriesOnTree) + " cherries on your tree.")
        turns += 1
    # Print the number of turns it took to win the game
    # print ("It took you " + str(turns) + " turns to win the game.")
    games += 1
    totalTurns += turns

print("totalTurns " + str(float(totalTurns) / games))
# lastline = raw_input(">")
# Output how long the process took.
print("--- %s seconds ---" % (time.time() - start_time))
We've added in our very simple timing from earlier and this example runs for me in about .33 seconds (without the intermediate print functions). That is reasonably fast and you might think we won't see a significant improvement from modifying the code to use multiprocessor mode but let's experiment.
The Cherry-O task is a good example of a CPU bound task. We’re limited only by the calculation speed of our random numbers, as there is no I/O being performed. It is also a parallel task, as none of the 10,000 runs of the game are dependent on each other. All we need to know is the average number of turns; there is no need to share any other information.
Our logic here could be to have a function cherryO() which plays a single game and returns the number of turns to the calling function. We can add each returned value to a variable in the calling function and, when we're done, divide by the number of games (e.g. 10,000) to get our average.
1.6.5.1 Converting from sequential to multiprocessing
So with that in mind, let us examine how we can convert a simple program like Cherry-O from sequential to multiprocessing.
There are a couple of basic steps we need to add to our code in order to support multiprocessing. The first is that our code needs to import multiprocessing, the Python package which, as you will have guessed from the name, enables multiprocessing support. We'll add that as the first line of our code.
The second thing our code needs to have is a __main__ method defined. We’ll add that into our code at the very bottom with:
if __name__ == '__main__':
    mp_handler()
With this, we make sure that the code in the body of the if-statement is only executed for the main process we start by running our script file in Python, not for the subprocesses we will create when using multiprocessing, which also load this file. Otherwise, this would result in an infinite creation of subprocesses, sub-subprocesses, and so on. Next, we need to have that mp_handler() function we are calling defined. This is the function that will set up our pool of processors and also assign (map) each of our tasks onto a worker (usually a processor) in that pool.
Our mp_handler() function is very simple. It has two main lines of code based on the multiprocessing module:
The first instantiates a pool with a number of workers (usually our number of processors or a number slightly less than our number of processors). There’s a function to determine how many processors we have, multiprocessing.cpu_count(), so that our code can take full advantage of whichever machine it is running on. That first line is:
with multiprocessing.Pool(multiprocessing.cpu_count()) as myPool:
    ...  # code for setting up the pool of jobs
You have probably already seen this notation from working with arcpy cursors. This with ... as ... statement creates an object of the Pool class defined in the multiprocessing module and assigns it to variable myPool. The parameter given to it is the number of processors on my machine (which is the value that multiprocessing.cpu_count() is returning), so here we are making sure that all processor cores will be used. All code that uses the variable myPool (e.g., for setting up the pool of multiprocessing jobs) now needs to be indented relative to the "with" and the construct makes sure that everything is cleaned up afterwards. The same could be achieved with the following lines of code:
myPool = multiprocessing.Pool(multiprocessing.cpu_count())
... # code for setting up the pool of jobs
myPool.close()
myPool.join()
Here the Pool variable is created without the with ... as ... statement. As a result, the statements in the last two lines are needed for telling Python that we are done adding jobs to the pool and for cleaning up all sub-processes when we are done to free up resources. We prefer to use the version using the with ... as ... construct in this course.
The next line that we need in our code after the with ... as ... line is for adding tasks (also called jobs) to that pool:
res = myPool.map(cherryO, range(10000))
What we have here is the name of another function, cherryO(), which is going to be doing the work of running a single game and returning the number of turns as the result. The second parameter given to map() contains the parameters that should be given to the calls of the cherryO() function as a simple list. So this is how we pass the data to be processed to the worker function in a multiprocessing application. In this case, the worker function cherryO() does not really need any input data to work with. What we are providing is simply the number of the game this call of the function is for, so we use the range from 0-9,999 for this. That means we will have to introduce a parameter into the definition of the cherryO() function for playing a single game. While the function will not make any use of this parameter, the number of elements in the list (10000 in this case) will determine how many times cherryO() will be run in our multiprocessing pool and, hence, how many games will be played to determine the average number of turns. In the final version, we will replace the hard-coded number with a variable called numGames. Later in this part of the lesson, we will show you how you can use a different function called starmap(...) instead of map(...) that works for worker functions that take more than one argument, so that we can pass different parameters to them.
Python will now run the pool of calls of the cherryO() worker function by distributing them over the number of cores that we provided when creating the Pool object. The returned results (the number of turns for each game played) will be collected in a single list, and we store this list in the variable res. We'll average those turns per game using the Python statistics library and its mean() function.
To prepare for the multiprocessing version, we’ll take our Cherry-O code from before and make a couple of small changes. We’ll define function cherryO() around this code (taking the game number as parameter as explained above) and we’ll remove the while loop that currently executes the code 10,000 times (our map range above will take care of that) and we’ll therefore need to “dedent“ the code.
Here’s what our revised function will look like :
def cherryO(game):
    spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
    turns = 0
    cherriesOnTree = 10
    # Take a turn as long as you have more than 0 cherries
    while cherriesOnTree > 0:
        # Spin the spinner
        spinIndex = random.randrange(0, 7)
        spinResult = spinnerChoices[spinIndex]
        # Print the spin result
        # print ("You spun " + str(spinResult) + ".")
        # Add or remove cherries based on the result
        cherriesOnTree += spinResult
        # Make sure the number of cherries is between 0 and 10
        if cherriesOnTree > 10:
            cherriesOnTree = 10
        elif cherriesOnTree < 0:
            cherriesOnTree = 0
        # Print the number of cherries on the tree
        # print ("You have " + str(cherriesOnTree) + " cherries on your tree.")
        turns += 1
    # Return the number of turns it took to win the game
    return turns
1.6.5.2 Putting it all together
Now let's put it all together. We've made a couple of other changes to our code, including defining a variable at the very top called numGames = 10000 to define the size of our range.
# Simulates 10K games of Hi Ho! Cherry-O
# Setup _very_ simple timing.
import time
import os
import sys
start_time = time.time()

import multiprocessing
from statistics import mean
import random

numGames = 10000

def cherryO(game):
    spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
    turns = 0
    cherriesOnTree = 10
    # Take a turn as long as you have more than 0 cherries
    while cherriesOnTree > 0:
        # Spin the spinner
        spinIndex = random.randrange(0, 7)
        spinResult = spinnerChoices[spinIndex]
        # Print the spin result
        # print ("You spun " + str(spinResult) + ".")
        # Add or remove cherries based on the result
        cherriesOnTree += spinResult
        # Make sure the number of cherries is between 0 and 10
        if cherriesOnTree > 10:
            cherriesOnTree = 10
        elif cherriesOnTree < 0:
            cherriesOnTree = 0
        # Print the number of cherries on the tree
        # print ("You have " + str(cherriesOnTree) + " cherries on your tree.")
        turns += 1
    # Return the number of turns it took to win the game
    return turns

def mp_handler():
    # Set the python exe. Make sure pythonw.exe is used for running the worker processes, even when this is
    # run as a script tool, or it may launch n number of Pro applications.
    multiprocessing.set_executable(os.path.join(sys.exec_prefix, 'pythonw.exe'))
    # Report which executable is used (swap print() for arcpy.AddMessage() if you run this as a script tool with arcpy imported)
    print(f"Using {os.path.join(sys.exec_prefix, 'pythonw.exe')}")
    with multiprocessing.Pool(multiprocessing.cpu_count()) as myPool:
        ## The Map part of the MapReduce is on the right of the = and the Reduce part on the left where we are aggregating the results to a list.
        turns = myPool.map(cherryO, range(numGames))
        # Uncomment this line to print out the list of total turns (but note this will slow down your code's execution)
        # print(turns)
        # Use the statistics library function mean() to calculate the mean of turns
        print(mean(turns))

if __name__ == '__main__':
    mp_handler()
    # Output how long the whole process took.
    print("--- %s seconds ---" % (time.time() - start_time))

You will also see that we have the list of results returned on the left side of the = in the myPool.map() line. We're taking all of the returned results and putting them into a list called turns (feel free to add a print or type statement here to check that it's a list). Once all of the workers have finished playing the games, we use the Python statistics function mean(), which we imported at the very top of our code (right after multiprocessing), to calculate the mean of our list in the variable turns. The call to mean() acts as our reduce, as it takes our list and returns the single value that we're really interested in.
When you have finished writing the code in PyScripter, you can run it.
We will use the environment's Python Command prompt to run this script. There are two quick ways to start a command window in your environment:
In your Windows start menu:
search for "Python Command Prompt" and it should result in a "Best match". After opening, be sure to verify that it opened the environment you want to work in (details below).

Or, you can navigate to it by clicking "All" to switch to the application list view.

Scroll down the list and expand the ArcGIS folder to list all ArcGIS applications installed.

Scroll down and open the Python Command Prompt.

This is a shortcut to open a command window in the activated python environment. Once opened, you should see the environment name in parentheses followed by the full path to the python environment.
We could dedicate an entire class to operating system commands that you can use in the command window but Microsoft has a good resource at this Windows Commands page for those who are interested.
We just need a couple of the commands listed there:
- cd : change directory. We use this to move around our folders. Full help at this Commands/cd page.
- dir : list the files and folders in my directory. Full help at this Commands/dir page.
We’ll change the directory to where we saved the code from above (e.g. mine is in c:\489\lesson1) with the following command:
cd c:\489\lesson1
Before you run the code for the first time, we suggest you change the number of games to a much smaller number (e.g. 5 or 10) just to check everything is working fine so you don’t spawn 10,000 Python instances that you need to kill off. In the event that something does go horribly wrong with your multiprocessing code, see the information about the Windows taskkill command below. To now run the Cherry-O script (which we saved under the name cherry-o.py) in the command window, we use the command:
python cherry-o.py
You should now get the output from the different print statements, in particular the average number of turns and the time it took to run the script. If everything went ok, set the number of games back to 10000 and run the script again.
It is useful to know that there is a Windows command that can kill off all of your Python processes quickly and easily. Imagine having to open Task Manager and manually kill them off, answer a prompt and then move to the next one! The easiest way to access the command is by pressing your Windows key, typing taskkill /im python.exe and hitting Enter, which will kill off every task called python.exe. It’s important to only use this when absolutely necessary, as it will usually also stop your IDE from running and any other Python processes that are legitimately running in the background. The full help for taskkill is at the Microsoft Windows IT Pro Center taskkill page.
Look closely at the images below, which show a four processor PC running the sequential and multiprocessing versions of the Cherry-O code. In the sequential version, you’ll see that the CPU usage is relatively low (around 50%) and there are two instances of Python running (one for the code and (at least) one for PyScripter).
In the multiprocessing version, the code was run from the command line instead (which is why it’s sitting within a Windows Command Processor task) and you can see the CPU usage is pegged at 100% as all of the processors are working as hard as they can and there are five instances of Python running.
This might seem odd as there are only four processors, so what is that extra instance doing? Four of the Python instances, the ones all working hard, are the workers, the fifth one that isn’t working hard is the master process which launched the workers – it is waiting for the results to come back from the workers. There isn’t another Python instance for PyScripter because I ran the code from the command prompt – therefore, PyScripter wasn’t running. We'll cover running code from the command prompt in the Profiling section.




On this four processor PC, this code runs in about 1 second and returns an answer of between 15 and 16. That is about three times slower than my sequential version which ran in 1/3 of a second. If instead I play 1M games instead of 10K games, the parallel version takes 20 seconds on average and my sequential version takes on average 52 seconds. If I run the game 100M times, the parallel version takes around 1,600 seconds (26 minutes) while the sequential version takes 2,646 seconds (44 minutes). The more games I play, the better the performance of the parallel version. Those results aren’t as fast as you might expect with 4 processors in the multiprocessor version but it is still around half the time taken. When we look at profiling our code a bit later in this lesson, we’ll examine why this code isn’t running 4x faster.
When moving the code to a much more powerful PC with 32 processors, there is a much more significant performance improvement. The parallel version plays 100M games in 273 seconds (< 5 minutes) while the sequential version takes 3136 seconds (52 minutes) which is about 11 times slower. Below you can see what the task manager looks like for the 32 core PC in sequential and multiprocessing mode. In sequential mode, only one of the processors is working hard – in the middle of the third row – while the others are either idle or doing the occasional, unrelated background task. It is a different story for the multiprocessor mode where the cores are all running at 100%. The spike you can see from 0 is when the code was started.


Let's examine some of the reasons for these speed differences. The 4-processor PC's CPU runs at 3GHz while the 32-processor PC runs at 2.4GHz; the extra cycles that the 4-processor CPU can perform per second make it a little quicker at math. The reason the multiprocessor code runs much faster on the 32-processor PC than the 4-processor PC is straightforward enough - there are 8 times as many processors (although it isn't 8 times faster, it is close at roughly 6x: about 26 minutes versus 5 minutes for the 100M-game run). So while each individual processor is a little slower on the larger PC, because there are so many more, it catches up (but not quite to 8x faster due to each processor being a little slower).
Memory quantity isn’t really an issue here as the numbers being calculated are very small, but if we were doing bigger operations, the 4-processor PC with just 8GB of RAM would be slower than the 32-processor PC with 128GB. The memory in the 32-processor PC is also faster at 2.13 GHz versus 1.6GHz in the 4-processor PC.
So the takeaway message here is: if you have a lot of tasks that are largely the same but independent of each other, you can save a significant amount of time by utilizing all of the resources within your PC with the help of multiprocessing. The more powerful the PC, the more time that can potentially be saved. However, the caveat is that, as already noted, multiprocessing is generally only faster for CPU-bound processes, not I/O-bound ones.
1.6.5.3 Multiprocessing Variants
Python includes several different methods for executing processes in parallel. Each method behaves a little differently, and it is important to know some of these differences in order to get the most performance gain out of multiprocessing and to avoid inadvertently introducing logical errors into your code. The table below compares the available methods to summarize their capabilities. Things to consider when choosing an appropriate method for your task are whether the method is blocking, whether it accepts functions with a single argument or multiple arguments, and in what order the results are returned.
| Variant | Blocking | Ordered | Iterative | Accepts Multiple Arguments | Description |
|---|---|---|---|---|---|
| Pool.map | Yes | Yes | No | No | Applies a function to all items in the input iterable, returning results in order. |
| Pool.map_async | No | Yes | No | No | Similar to Pool.map, but returns a result object that can be checked later. |
| Pool.imap | No | Yes | Yes | No | Returns an iterator that yields results in order as they become available. |
| Pool.imap_unordered | No | No | Yes | No | Returns an iterator that yields results as they become available, order not guaranteed. |
| Pool.starmap | Yes | Yes | No | Yes | Applies a function to arguments provided as tuples, returning results in order. |
| Pool.starmap_async | No | Yes | No | Yes | Similar to Pool.starmap, but returns a result object that can be checked later. |
| Pool.apply | Yes | Yes | No | Yes | Runs a single callable function and blocks until the result is available. |
| Pool.apply_async | No | Yes | No | Yes | Runs a single callable function asynchronously and returns a result object. |
For this class, we will focus on pool.starmap() and also describe pool.apply_async() to highlight some of their capabilities and implementations.
map
The method of multiprocessing that we will be using is pool.starmap(), a variant of the map() method that we covered earlier in the lesson; the "star" refers to argument unpacking (as in *args), meaning each item in the job list is a tuple that gets unpacked into the function's arguments. The pool distributes the items in the list across its worker processes and collects the result from each call in a list. A minimal sketch follows the bullet points below.
- Syntax: pool.map(func, iterable)
- Purpose: Applies a function to all items in a given iterable (e.g., list) and returns a list of results.
- Blocking: This method is blocking, meaning it waits for all processes to complete before moving on to the next line of code.
- Synchronous: Processes tasks synchronously and in order.
- Multiple Arguments: pool.map() passes a single argument to the function; if your function takes multiple arguments, use pool.starmap(), where each item in the iterable is a tuple of arguments.
- Usage: Often used when you have a list of tasks to perform and want the results in the same order as the input.
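To make this concrete, here is a minimal, self-contained sketch (the function add and the jobs list are made up for illustration) showing how pool.starmap() unpacks each tuple in the job list into the function's arguments and returns the results in order:
import multiprocessing

def add(x, y):
    # toy worker function: receives the two values unpacked from one job tuple
    return x + y

if __name__ == '__main__':
    jobs = [(1, 10), (2, 20), (3, 30)]           # one parameter tuple per task
    with multiprocessing.Pool(processes=2) as pool:
        results = pool.starmap(add, jobs)        # blocks until all workers are done
    print(results)                               # [11, 22, 33] -- same order as jobs
Because starmap() is blocking, the print statement only runs once every job has completed.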
What if you wanted to run different functions in parallel? You can be more explicit by using the pool.apply_async() method to execute different functions in parallel. This multiprocessing variant is useful when performing various tasks that do not depend on maintaining return order, do not interfere with each other, and do not depend on each other's results. Some examples are copying a feature class to multiple places, performing data operations, executing maintenance routines, and much more.
apply_async
Instead of using the map construct, you assign each process to a variable, start the task, and then call .get() when you are ready for the results.
- Syntax: pool.apply_async(func, args=(), kwds={})
- Purpose: Schedules a single function to be executed asynchronously.
- Non-Blocking: This method is non-blocking, meaning it returns immediately with an ApplyResult object. It schedules the function to be executed and allows the main program to continue executing without waiting for the function to complete.
- Asynchronous: Processes tasks asynchronously, allowing for other operations to continue while waiting for the result.
- Multiple Arguments: Handles functions with a single argument or multiple arguments.
- Usage: It is used when you need to execute a function asynchronously and do not need to wait for the result immediately. You can collect the results later using the .get() method on the ApplyResult object. The .get() method retrieves the result of the function call that was executed asynchronously. If the function has already completed, it returns the result immediately. If the function is still running, it waits until the function completes and then returns the result.
For example, this is how you can have three different functions working at the same time while you execute other processes. You can call get() on any of the tasks when your code needs the result of that process, or you can put all of the tasks into a list. When you put them in a list, the time it takes to complete is governed by the longest-running process. Note that the parameters need to be passed as a tuple: a single parameter is passed as (arg, ), and multiple parameters are passed as (arg1, arg2, arg3).
with multiprocessing.Pool() as pool:
    p1 = pool.apply_async(functionA, (param1,))         # starts the functionA process
    p2 = pool.apply_async(functionB, (param1, param2))  # starts the functionB process
    p3 = pool.apply_async(functionC, (param1,))         # starts the functionC process
# run other code if desired while p1, p2, p3 executes.
    # we need the result from p3, so block further execution and wait for it to finish
    functionC_result = p3.get()
...
# get the results from p1 and p2 from the processes as an ordered list.
# when we call .get() on the task, it becomes blocking so it will wait here until the last process in the list is
# done executing.
    order_results = [p1.get(), p2.get()]

When the list assigned to order_results is created and .get() is called for each process, the results are stored in the list and can be retrieved by indexing or a loop.
# After the processes are complete, iterate over the results and check for errors.
for r in order_results:
    if r['errorMsg'] is not None:
        print(f'Task {r["name"]} failed with: {r["errorMsg"]}')
    else:
        ...

What if the process run time is directly related to the amount of data being processed? For example, performing a calculation along each street at 20-foot intervals for an entire county. Most likely, the dataset will have a wide range of street lengths: short street segments will take milliseconds to compute, but the longer streets (those that are miles long) may take several minutes to calculate. You may not want to wait until the longest-running process is complete to start working with the results, since your short streets will be waiting and a number of your processors could sit idle until the last long calculation is done. By using the pool.imap_unordered or pool.imap methods, you can get the best performance gain since they work as iterators and return completed results as they finish. Until the job list is exhausted, the iterator ensures that no processor sits idle, allowing many of the quicker calculations to complete and return while the longer processes continue. The syntax should look familiar since it is a simple for loop:
for i in pool.imap_unordered(split_street_function, range(10)):
    print(i)

We will focus on the pool.starmap() method for the examples and for the assignment since we will simply be applying the same function to a list of rasters or vectors. The multiple methods for multiprocessing are further described in the Python documentation here, and it is worth reviewing/comparing them further if you have extra time at the end of the lesson.
1.6.6 Arcpy multiprocessing examples
Let's look at a couple of examples using ArcGIS functions. There are a number of caveats or gotchas when using multiprocessing with ArcGIS, and it is important to cover them up front because they could result in hundreds of Pro sessions opening and locking up your PC, and they affect the ways in which we can write our code.
Esri describe a number of best practices for multiprocessing with arcpy. These include:
- Use the “memory“ (Pro) or "in_memory" (legacy, but still works) workspaces to store temporary results because as noted earlier memory is faster than disk.
- Avoid writing to file geodatabase (FGDB) data types and GRID raster data types. These data formats can often cause schema locking or synchronization issues. That is because file geodatabases and GRID raster types do not support concurrent writing – that is, only one process can write to them at a time. You might have seen a version of this problem in arcpy previously if you tried to modify a feature class in Python that was open in ArcGIS. That problem is magnified if you have an FGDB and you’re trying to write many feature classes to it at once. Even if all of the featureclasses are independent, you can only write them to the FGDB one at a time.
So, bearing these two points in mind, we should make use of memory workspaces wherever possible and avoid writing to FGDBs (in our worker functions at least – we could still use them in our master function to merge a number of shapefiles or even individual FGDBs back into a single source).
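As a small illustration of the first point, the snippet below copies an intermediate result into the "memory" workspace instead of a file geodatabase. The paths and the selection query are hypothetical and just stand in for whatever intermediate data your worker produces:
import arcpy

src = r"C:\489\USA.gdb\Roads"          # hypothetical input feature class
tmp = r"memory\roads_subset"           # in-memory intermediate, never written to disk

# make a layer with a (hypothetical) selection and copy it to the memory workspace
arcpy.management.MakeFeatureLayer(src, "roads_lyr", "TYPE = 'Interstate'")
arcpy.management.CopyFeatures("roads_lyr", tmp)
# ... further geoprocessing against tmp ...
arcpy.management.Delete(tmp)           # free the memory when finished
Because nothing is written to the FGDB until the very end of the workflow, workers never compete for a schema lock on it.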
1.6.6.1 Multiprocessing with raster data
There are two types of operations with rasters that can easily (and productively) be implemented in parallel: operations that are independent components in a workflow, and raster operations which are local, focal, or zonal – that is, they work on a small portion of a raster such as a pixel or a group of pixels.
Esri’s Clinton Dow and Neeraj Rajasekar presented on multiprocessing with arcpy back at the 2017 User Conference, and their slides included a number of useful graphics illustrating these two categories of raster operations; we have reproduced them here as they are still appropriate and relevant.
An example of an independent workflow would be if we calculate the slope, aspect, and some other derivatives of a raster and then produce a weighted sum or other statistic. Each of the operations is performed independently on our raster up until the final operation, which relies on each of them (see the first image below). Therefore, the independent operations can be parallelized and sent to a worker, and the final task (which could also be done by a worker) aggregates or summarizes the result. This is what we can see in the second image: each of the tasks is assigned to a worker (even though two of the workers are using a common dataset) and then Worker 4 completes the final task. You can probably imagine a more complex version of this workflow where it is scaled up to process many elevation and land-use rasters, performing many slope, aspect, and reclassification calculations with the results being combined at the end.
An example of the second type of raster operation is a case where we want to make a mathematical calculation on every pixel in a raster such as squaring or taking the square root. Each pixel in a raster is independent of its neighbors in this operation so we could have multiple workers processing multiple tiles in the raster and the result is written to a new raster. In this example, instead of having a single core serially performing a square root calculation across a raster (the first image below) we can segment our raster into a number of tiles, assign each tile to a worker and then perform the square root operation for each pixel in the tile outputting the result to a single raster which is shown in the second image below.
Let's return to the raster coding example that we used to build our ArcGIS Pro tool earlier in the lesson. That simple example processed a list of rasters and completed a number of tasks on each raster. Based on what you have read so far, I expect you have realized that this is also a pleasingly parallel problem.
Bearing in mind the caveats about parallel programming from above and the process that we undertook to convert the Cherry-O program, let's begin.
Our first task is to identify the parts of our problem that can work in parallel and the parts which we need to run sequentially.
The best place to start with this can be with the pseudocode of the original task. If we have documented our sequential code well, this could be as simple as copying/pasting each line of documentation into a new file and working through the process. We can start with the text description of the problem and build our sequential pseudocode from there and then create the multiprocessing pseudocode. It is very important to correctly and carefully design our multiprocessing solutions to ensure that they are as efficient as possible and that the worker functions have the bare minimum of data that they need to complete the tasks, use in_memory workspaces, and write as little data back to disk as possible.
Our original task was:
Get a list of raster tiles
For every tile in the list:
    Fill the DEM
    Create a slope raster
    Calculate a flow direction raster
    Calculate a flow accumulation raster
    Convert those stream rasters to polygon or polyline feature classes.
You will notice that I’ve formatted the pseudocode just like Python code with indentations showing which instructions are within the loop.
As this is a simple example we can place all the functionality within the loop into our worker function as it will be called for every raster. The list of rasters will need to be determined sequentially and we’ll then pass that to our multiprocessing function and let the map element of multiprocessing map each raster onto a worker to perform the tasks. We won’t explicitly be using the reduce part of multiprocessing here as the output will be a featureclass but reduce will probably tidy up after us by deleting temporary files that we don’t need.
Our new pseudocode will then look like:
Get a list of raster tiles
For every tile in the list:
    Launch a worker function with the name of a raster

Worker:
    Fill the DEM
    Create a slope raster
    Calculate a flow direction raster
    Calculate a flow accumulation raster
    Convert those stream rasters to polygon or polyline feature classes.
Bear in mind that not all multiprocessing conversions are this simple. We need to remember that user output can be complicated because multiple workers might be attempting to write messages to our screen at once, and that can cause those messages to get garbled and confused. A workaround for this problem is to use Python’s logging module, which is much better at handling messages than us manually using print statements. We haven't implemented logging in the sample solution for this script, but feel free to briefly investigate it to supplement the print and arcpy.AddMessage functions with calls to the logging module. The Python Logging Cookbook has some helpful examples.
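If you do want to experiment with logging, one simple approach (sketched below with hypothetical names) is to give each worker process its own log file so that messages from different workers can never interleave; the QueueHandler recipe in the Logging Cookbook is the more robust option:
import logging
import multiprocessing

def worker(raster_name):
    # one log file per worker process, named after the process id
    logger = logging.getLogger("raster_worker")
    if not logger.handlers:   # configure the logger only once per process
        handler = logging.FileHandler(f"worker_{multiprocessing.current_process().pid}.log")
        handler.setFormatter(logging.Formatter("%(asctime)s %(processName)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    logger.info("started processing %s", raster_name)
    # ... fill, slope, flow direction, flow accumulation ...
    logger.info("finished processing %s", raster_name)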
As an exercise, attempt to implement the conversion from sequential to multiprocessing. You will probably not get everything right since there are a few details that need to be taken into account such as setting up an individual scratch workspace for each call of the worker function. In addition, to be able to run as a script tool the script needs to be separated into two files with the worker function in its own file. But don't worry about these things, just try to set up the overall structure in the same way as in the Cherry-O multiprocessing version and then place the code from the sequential version of the raster example either in the main function or worker function depending on where you think it needs to go. Then check out the solution linked below.
Click here for one way of implementing the solution
When you run this code, do you notice any performance differences between the sequential and multiprocessor versions?
The sequential version took 96 seconds on the same 4-processor PC we were using in the Cherry-O example, while the multiprocessing version completed in 58 seconds. Again, that is not 4 times faster as we might hope, but nearly twice as fast is a good improvement. For reference, the 32-processor PC from the Cherry-O example processed the sequential code in 110 seconds and the multiprocessing version in 40 seconds. We will look in more detail at the individual lines of code and their performance when we examine code profiling, but you might also find it useful to watch the CPU usage tab in Task Manager to see how hard (or not) your PC is working.
1.6.6.2 Multiprocessing with vector data
The best practices of multiprocessing that we introduced earlier are even more important when we are working with vector data than they are with raster data. The geodatabase locking issue is likely to become much more of a factor, as we typically use more vector data than raster data and feature classes are more often stored in geodatabases.
The example we’re going to use here involves clipping a feature layer by polygons in another feature layer. A sample use case of this might be if you need to segment one or several infrastructure layers by state or county (or even a smaller subdivision). If I want to provide each state or county with a version of the roads, sewer, water or electricity layers (for example) this would be a helpful script. To test out the code in this section (and also the first homework assignment), you can again use the data from the USA.gdb geodatabase (Section 1.5) we provided. The application then is to clip the data from the roads, cities, or hydrology data sets to the individual state polygons from the States data set in the geodatabase.
To achieve this task, one could run the Clip tool manually in ArcGIS Pro but if there are a lot of polygons in the clip data set, it will be more effective to write a script that performs the task. As each state/county is unrelated to the others, this is an example of an operation that can be run in parallel.
Let us examine the code’s logic and then we’ll dig into the syntax. The code has two Python files. This is important because, when we want to be able to run it as a script tool in ArcGIS, the worker function for running the individual tasks must be defined in its own module file, not in the main script file for the script tool that contains the multiprocessing code calling the worker function. The first file, called scripttool.py, imports arcpy, multiprocessing, and the worker code contained in the second Python file, called multicode.py. It contains the definition of the main function mp_handler(), responsible for managing the multiprocessing operations, similar to the Cherry-O multiprocessing version. It uses two script tool parameters: the file containing the polygons to use for clipping (variable clipper) and the file to be clipped (variable tobeclipped). Furthermore, the file includes the definition of an auxiliary function get_install_path(), which is needed to determine the location of the Python interpreter for running the subprocesses when running the code as a script tool in ArcGIS; you don't have to worry about the content of this function. The main function mp_handler() calls the worker(...) function located in the multicode file, passing it the files to be used and other information needed to perform the clipping operation. This will be further explained below. The code for the first file, including the main function, is shown below.
import os, sys
import arcpy
import multiprocessing
from multicode import worker
# Input parameters
clipper = arcpy.GetParameterAsText(0) if arcpy.GetParameterAsText(0) else r"C:\489\USA.gdb\States"
tobeclipped = arcpy.GetParameterAsText(1) if arcpy.GetParameterAsText(1) else r"C:\489\USA.gdb\Roads"
def mp_handler():
try:
# Create a list of object IDs for clipper polygons
arcpy.AddMessage("Creating Polygon OID list...")
clipperDescObj = arcpy.Describe(clipper)
field = clipperDescObj.OIDFieldName
idList = []
with arcpy.da.SearchCursor(clipper, [field]) as cursor:
for row in cursor:
id = row[0]
idList.append(id)
arcpy.AddMessage(f"There are {len(idList)} object IDs (polygons) to process.")
        # Create a task list with parameter tuples for each call of the worker function. Tuples consist of the clipper,
        # tobeclipped, field, and oid values.
jobs = []
for id in idList:
# adds tuples of the parameters that need to be given to the worker function to the jobs list
jobs.append((clipper,tobeclipped,field,id))
arcpy.AddMessage(f"Job list has {len(jobs)} elements.")
# Create and run multiprocessing pool.
# Set the python exe. Make sure the pythonw.exe is used for running processes, even when this is run as a
# script tool, or it will launch n number of Pro applications.
multiprocessing.set_executable(os.path.join(sys.exec_prefix, 'pythonw.exe'))
arcpy.AddMessage(f"Using {os.path.join(sys.exec_prefix, 'pythonw.exe')}")
# determine number of cores to use
cpuNum = multiprocessing.cpu_count()
arcpy.AddMessage(f"there are: {cpuNum} cpu cores on this machine")
# Create the pool object
with multiprocessing.Pool(processes=cpuNum) as pool:
arcpy.AddMessage("Sending to pool")
# run jobs in job list; res is a list with return values of the worker function
res = pool.starmap(worker, jobs)
# If an error has occurred within the workers, report it
# count how many times False appears in the list (res) with the return values
failed = res.count(False)
if failed > 0:
arcpy.AddError(f"{failed} workers failed!")
arcpy.AddMessage("Finished multiprocessing!")
except Exception as ex:
arcpy.AddError(ex)
if __name__ == '__main__':
mp_handler()
Let's now have a close look at the logic of the two main functions which will do the work. The first one is the mp_handler() function shown in the code section above. It takes the input variables and has the job of processing the polygons in the clipping file to get a list of their unique IDs, building a job list of parameter tuples that will be given to the individual calls of the worker function, setting up the multiprocessing pool and running it, and taking care of error handling.
The second function is the worker function called by the pool (named worker in this example) located in the multicode.py file (code shown below). This function takes the name of the clipping feature layer, the name of the layer to be clipped, the name of the field that contains the unique IDs of the polygons in the clipping feature layer, and the feature ID identifying the particular polygon to use for the clipping as parameters. This function will be called from the pool constructed in mp_handler().
The worker function will then make a selection from the clipping layer. This has to happen in the worker function because all parameters given to that function in a multiprocessing scenario need to be of a simple type that can be "pickled." Pickling data means converting it to a byte stream, that is, a sequence of bytes that represents the object in a format that can be stored or transmitted. As feature classes are much more complicated than that, containing both spatial and non-spatial data, they cannot be readily converted to such a simple representation. That means feature classes cannot be "pickled," and any selections that might have been made in the calling function are not shared with the worker functions.
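You can see the difference for yourself with the pickle module from the standard library: the simple parameter tuple we pass to each worker round-trips through a byte stream without any trouble (the paths and OID below are just placeholders), whereas trying the same thing with an arcpy layer or cursor object would typically fail with a pickling error:
import pickle

params = (r"C:\489\USA.gdb\States", r"C:\489\USA.gdb\Roads", "OBJECTID", 7)
blob = pickle.dumps(params)      # serialize the tuple to a byte stream
print(pickle.loads(blob))        # deserializes back to the original tuple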
We need to think about creative ways of getting our data shared with our sub-processes. In this case, that means we’re not going to do the selection in the master module and pass the polygon to the worker module. Instead, we’re going to create a list of feature IDs that we want to process and we’ll pass an ID from that list as a parameter with each call of the worker function that can then do the selection with that ID on its own before performing the clipping operation. For this, the worker function selects the polygon matching the OID field parameter when creating a layer with MakeFeatureLayer_management() and uses this selection to clip the feature layer to be clipped. The results are saved in a shapefile including the OID in the file's name to ensure that the names are unique.
import arcpy
def worker(clipper, tobeclipped, field, oid):
"""
This is the function that gets called and does the work of clipping the input feature class to one of the
polygons from the clipper feature class. Note that this function does not try to write to arcpy.AddMessage() as
nothing is ever displayed.
param: clipper
param: tobeclipped
param: field
param: oid
"""
try:
# Create a layer with only the polygon with ID oid. Each clipper layer needs a unique name, so we include oid in the layer name.
query = f"{field} = {oid}"
tmp_flayer = arcpy.MakeFeatureLayer_management(clipper, f"clipper_{oid}", query)
# Do the clip. We include the oid in the name of the output feature class.
outFC = fr"c:\489\output\clip_{oid}.shp"
arcpy.Clip_analysis(tobeclipped, tmp_flayer, outFC)
# uncomment for debugging
        # arcpy.AddMessage(f"finished clipping: {oid}")
return True # everything went well so we return True
    except Exception:
        # Some error occurred, so return False
        # print("error condition")
        return False
Having covered the logic of the code, let's review the specific syntax used to make it all work. While you’re reading this, try visualizing how this code might run sequentially first – that is one polygon being used to clip the to-be-clipped feature class, then another polygon being used to clip the to-be-clipped feature class and so on (maybe through 4 or 5 iterations). Then once you have an understanding of how the code is running sequentially try to visualize how it might run in parallel with the worker function being called 4 times simultaneously and each worker performing its task independently of the other workers.
We’ll start with exploring the syntax within the mp_handler(...) function.
The mp_handler(...) function begins by determining the name of the field that contains the unique IDs of the clipper feature class using the arcpy.Describe(...) function (lines 16 and 17). The code then uses a Search Cursor to get a list of all of the object (feature) IDs from within the clipper polygon feature class (lines 20 to 23). This gives us a list of IDs that we can pass to our worker function along with the other parameters. As a check, the total count of that list is printed out (line 25).
Next, we create the job list with one entry for each call of the worker() function we want to make (lines 30 to 34). Each element in this list is a tuple of the parameters that should be given to that particular call of worker(). This list will be required when we set up the pool by calling pool.starmap(...). To construct the list, we simply loop through the ID list and append a parameter tuple to the list in variable jobs. The first three parameters will always be the same for all tuples in the job list; only the polygon ID will be different. In the homework assignment for this lesson, you will adapt this code to work with multiple input files to be clipped. As a result, the parameter tuples will vary in both the values for the oid parameter and for the tobeclipped parameter.
To prepare the multiprocessing pool, we first specify which executable should be used each time a worker is spawned (line 41). Without this line, a new instance of ArcGIS Pro would be launched by each worker, which is clearly less than ideal. Instead, this line uses the built-in sys.exec_prefix variable and joins that path with the pythonw.exe executable. The next line prints the path of the interpreter being used so you can verify that it is the right one.
The code then sets up the size of the pool using the maximum number of processors in lines 45-49 (as we have done in previous examples) and then, using the starmap() method of Pool, calls the worker function worker(...) once for each parameter tuple in the jobs list (line 52).
Any outputs from the worker function will be stored in variable res as a list. These are the boolean values returned by the worker() function, True to indicate that everything went ok and False to indicate that the operation failed. If there is at least one False value in the list, an error message is produced stating the exact number of worker processes that failed (lines 57 to 58).
Let's now look at the code in our worker function worker(...) in the multicode file. As we noted in the logic section above, it receives four parameters: the full paths of the clipping and to-be-clipped feature classes, the name of the field that contains the unique IDs in the clipper feature class, and the OID of the polygon it is to use for the clipping.
Notice that the MakeFeatureLayer_management(...) function in line 18 is used to create an in-memory feature layer referencing just the selected polygon from the original clipper feature class. This use of a memory layer is important in three ways: first, performance – memory layers are faster; second, a memory layer helps prevent any chance of file locking (although not if we were writing back to the file); third, selections only work on layers, so even if we wanted to, we couldn’t get away without creating this layer.
The call of MakeFeatureLayer_management(...) also includes an SQL query string defined one line earlier in line 17 to create the layer with just the polygon that matches the oid that was passed as a parameter. The name of the layer we are producing here should be unique; this is why we’re adding {oid} to the name in the first parameter.
Now with our selection held in our memory, uniquely named feature layer, we perform the clip against our to-be-clipped layer (line 22) and store the result in outFC which we defined earlier in line 21 to be a hardcoded folder with a unique name starting with "clip_" followed by the oid. To run and test the code, you will most likely have to adapt the path used in variable outFC to match your PC.
The process then returns from the worker function and will be supplied with another oid. This will repeat until a call has been made for each polygon in the clipping feature class.
We are going to use this code as the basis for our Lesson 1 homework project. Have a look at the Assignment Page for full details.
You can test this code out by running it in a number of ways. If you run it from ArcGIS Pro as a script tool, the ternary operator will use the input values from GetParameterAsText(); if it is executed outside the Pro environment (from an IDE or from the command line), the hardcoded paths will be used, since arcpy.GetParameterAsText(i) returns an empty string (which evaluates as False) when no parameter is supplied.
The final thing to remember about this code is that it has a hardcoded output path defined in variable outFC in the worker() function - which you will want to change, create and/or parameterize etc. so that you have some output to investigate. If you do none of these things then no output will be created.
When the code runs it will create a shapefile for every unique object identifier in the "clipper" shapefile (there are 51 in the States data set from the sample data) named using the OID (that is clip_1.shp - clip_59.shp).
1.6.6.3 The if __name__ == "__main__": revisited
In the examples we've used in the previous two sections, the worker function script has been separated from the main caller script and protected against possible infinite recursion or namespace conflicts by importing the worker script as a module (module.function()). However, what if you are limited to using a single script file, such as a Pro script tool that needs to be condensed for easy distribution? If we just add the worker function to the main script and try to reference that function as usual (function()), the multiprocessing module will import the entire main script and execute any top-level code (code not inside the if __name__ == "__main__": block or inside functions and classes) for each new process. Besides the risk of infinite recursion, this can create conflicts in some custom Python environments such as Pro, leading to script failure, and it could possibly lock your PC up trying to open a rapidly multiplying number of Pro instances (one or more for each process started).
Remember that Python sets special variables when a script is executed. One of these is __name__, which is set to "__main__" when the script is run directly, or set to the script's name (module) when it is being imported. By importing the main script name within the if __name__ == "__main__": block, you ensure that the multiprocessing module correctly references functions that are within the imported script using the standard <main_script_name>.<function>() syntax and prevents the imported main script's code within the if __name__ == "__main__": from executing for each process.
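A minimal sketch of that single-file pattern might look like the following (the file name single_file_tool.py and the worker are hypothetical); the script imports itself inside the __main__ guard so the pool references the worker through the module name:
# single_file_tool.py
import multiprocessing

def worker(value):
    # top-level worker function; no side effects run at import time
    return value * value

if __name__ == '__main__':
    # import this very script as a module so the worker is referenced as
    # single_file_tool.worker rather than a bare name defined in "__main__"
    import single_file_tool
    with multiprocessing.Pool() as pool:
        results = pool.map(single_file_tool.worker, range(10))
    print(results)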
1.7 Debugging and profiling
Debugging and profiling are important skills for any serious programmer – debugging helps you step through your code, analyze the contents of variables (watches), and set breakpoints to check code progress. Profiling runs code to provide an in-depth breakdown of the execution times of individual lines or blocks of code, to identify performance bottlenecks (e.g. slow I/O, inefficient loops, etc.).
In this section, we will first examine debugging techniques and processes before investigating code profiling.
1.7.1 Debugging
As you may remember from GEOG 485, the simplest method of debugging is to embed print statements in your code, either to determine how far your code gets through a loop or to print out the contents of a variable. However, when you begin to create more complex scripts, the debugger provides a more robust and detailed method of troubleshooting. This involves using the tools or features of your IDE to create watches for checking the contents of variables and breakpoints for stepping through your code. We will provide a generic overview of the techniques, including setting watches, breakpoints, and stepping through code. Don’t focus on the specifics of the interface as we do this; instead, it is more important to understand the purpose of each of the different methods of debugging, and you should take the time to understand your own IDE's debugging process.
The best way to explain the aspects of debugging is to work through an example. This time, we'll look at some code that tries to calculate the factorial of an integer (the integer is hard-coded to 5 in this case). In mathematics, a factorial is the product of an integer and all positive integers below it. Thus, 5! (or "5 factorial") should be 5 * 4 * 3 * 2 * 1 = 120.
The code below attempts to calculate a factorial through a loop that increments the multiplier by 1 until it reaches the original integer. This is a valid approach since 1 * 2 * 3 * 4 * 5 would also yield 120.
# This script calculates the factorial of a given
# integer, which is the product of the integer and
# all positive integers below it.
number = 5
multiplier = 1
while multiplier < number:
    number *= multiplier
    multiplier += 1
print(number)
Even if you can spot the error, follow along with the steps below to get a feel for the debugging process and the PyScripter Debug toolbar.
- Open PyScripter and copy the above code into a new script.
- Save your script as debugger_walkthrough.py. You can optionally run the script, but you won't get a result.
- Click View > Toolbars and ensure Debug is checked. You should see a toolbar like this:
Many IDEs have debugging toolbars like this, and the tools they contain are pretty standard: a way to run the code, a way to set breakpoints, a way to step through the code line by line, and a way to watch the value of variables while stepping through the code. We'll cover each of these in the steps below.
- Move your cursor to the left of line 5 (number = 5) and click. If you are in the right area, you will see a red dot next to the line number, indicating the addition of a breakpoint. A breakpoint is a place where you want your code to stop running so you can examine it line by line using the debugger. Often you'll set a breakpoint deep in the middle of your script so you don't have to examine every single line of code. In this example, the script is very short, so we're putting the breakpoint right at the beginning. The breakpoint is represented by a circle next to the line of code, and this is common in other debuggers too. Note that F5 is the shortcut key for this command.
- Press the Debug file button. This runs your script up to the breakpoint. In the Python Interpreter console, note that the debugfile() function is run on your script rather than the normal runfile() function. Also, instead of the normal >>> prompt, you should now see a [Dbg]>>> prompt. The cursor will be on that same line in PyScripter's Editor pane, which causes that line to be highlighted.
- Click the Step over next function call button or the Step into subroutine button. This executes the current line of your code, in this case the number = 5 line. Both buttons execute the highlighted statement, but it is important to note that they behave differently when the statement includes a call to a function. The Step over next function call button will execute all the code within the function, return to your script, and then pause at the script's next line; you'd use this button when you're not interested in debugging the function code, just the code of your main script. The Step into subroutine button, on the other hand, is used when you do want to step through the function code one line at a time. The two buttons produce the same behavior for this simple script. You'll want to experiment with them later in the course when we discuss writing our own functions and modules.
- Before going further, click the Variable window tab in PyScripter's lower pane. Here, you can track what happens to your variables as you execute the code line by line. The variables will be added automatically as they are encountered. At this point, you should see a globals {} dictionary, which contains variables from the Python packages, and a locals {} dictionary that will contain variables created by your script. We will be looking at the locals dictionary, so you can disregard the globals. Expanding the locals dictionary, you should see some built-in variables (__<name>__), which we can ignore for now. The "number" variable should be listed, with a type of int. Expanding the + will expose more of the variable's properties.
- Click the Step button again. You should now see that the "multiplier" variable has been added in the Variable window, since you just executed the line that initializes that variable, as called out in the image.

- Click the Step button a few more times to cycle through the loop. Go slowly, and use the Variable window to understand the effect that each line has on the two variables. (Note that the keyboard shortcut for the Step button is F8, which you may find easier than clicking the GUI.) You can set a watch on a variable by placing the cursor on the variable and pressing Alt+W, or by right-clicking in the Watches pane and selecting Add Watch At Cursor. This isolates the variable in the Watches window so you can follow its value as the code executes.

- Step through the loop until "multiplier" reaches a value of 10. It should be obvious at this point that the loop has not exited at the desired point. Our intent was for it to quit when "number" reached 120.
Can you spot the error now? The fact that the loop has failed to exit should draw your attention to the loop condition. The loop will only exit when "multiplier" is greater than or equal to "number." That is obviously never going to happen as "number" keeps getting bigger and bigger as it is multiplied each time through the loop.
In this example, the code contained a logical error. It re-used the variable for which we wanted to find the factorial (5) as a variable in the loop condition, without considering that the number would be repeatedly increased within the loop. Changing the loop condition to the following would cause the script to work:
while multiplier < 5:
Even better than hard-coding the value 5 in this line would be to initialize a variable early and set it equal to the number whose factorial we want to find. The number could then get multiplied independent of the loop condition variable.
- Click the Stop button in the Debug toolbar to end the debugging session. We're now going to step through a corrected version of the factorial script, but you may notice that the Variable window still displays a list of the variables and their values from the point at which you stopped executing. That's not necessarily a problem, but it is good to keep in mind.
- Open a new script, paste in the code below, and save the script as debugger_walkthrough2.py

# This script calculates the factorial of a given
# integer, which is the product of the integer and
# all positive integers below it.
number = 5
loopStop = number
multiplier = 1
while multiplier < loopStop:
    number *= multiplier
    multiplier += 1
print(number)

- Step through the loop a few times as you did above. Watch the values of the "number" and "multiplier" variables, but also the new "loopStop" variable. This variable allows the loop condition to remain constant while "number" is multiplied. Indeed, you should see "loopStop" remain fixed at 5 while "number" increases to 120.
- Keep stepping until you've finished the entire script. Note that the usual >>> prompt returns to indicate you've left debugging mode.
In the above example, you used the Debug toolbar to find a logical error that caused an endless loop in your code. Some IDEs allow you to change values while you are debugging so you can test your code with different values, or correct a variable's value so you can continue the current debugging session. This is very useful when working with slices, pandas DataFrames, web requests, classes, and much more.
I encourage you to practice using the Debug toolbar in the script-writing assignments that you receive in this course. We try to be as responsive as possible for students who are stuck, but there may be times when it takes a day or two before we can assist; often, a simple walkthrough of your own code using the debugger will reveal the problem. Using the debugger can save you a lot of time and headache.
1.7.2 Optional - Profiling
We have experimented with some very simple code profiling in previous examples by introducing the time() function into our code, using it to record the start and end times and check the overall performance.
While that gives us a high-level view of the performance of our code, we do not know where specific bottlenecks exist within it. For example, what is the slowest part of our algorithm: the file reading, the file writing, or the calculations in between? When we know where these bottlenecks are, we can investigate ways of removing them, or at least use faster techniques to speed up our code.
In this section, we will focus on basic profiling that looks at how long each function call takes – not each individual line of code. This basic type of code profiling is built into most IDEs. However, we also provide some complementary materials that explain how to visualize profiling results and how to profile each line of code individually. These parts are entirely optional because they are quite a bit more complex and require the installation of additional software and packages. It is possible that you will run into some technical/installation issues, so our recommendation is to come back and try the steps described in these optional subsections only if you are done with the lesson and homework assignment and still have time left.
1.7.2.1 Basic code profiling
This section uses the Spyder IDE for the demonstration (you do not have to install Spyder) and our basic raster code from earlier in the lesson. The IDE you are using may also provide profiling, and I encourage you to research it and attempt to profile your own script if you have time left at the end of this lesson. Spyder has a Profiler pane which is accessible from the View -> Panes menu. You may need to manually load your code into the Profiler using the folder icon. The Spyder help for the Profiler is here if you'd like to read it (Profiler — Spyder 5 documentation (spyder-ide.org)), but we explain the important parts below.
Once you load your code, the Profiler automatically starts profiling it and displays the message "Profiling, please wait..." at the top of the Profiler window. You will need to be a little patient as Spyder runs through all of your code to perform the timing (or, alternatively, reduce the sample size in these raster examples). You probably remember that we recommended running multiprocessing code from the command line rather than inside Spyder. However, using the built-in profiler to load and run multiprocessing code works as long as the worker function is defined in its own module, as we did for the vector clipping example so it could be used as a script tool. If this is not the case, you will receive an error that certain elements in the code cannot be "pickled," which, as you might remember from our multiprocessing discussion, means those objects cannot be converted to a basic type. We didn't split the multiprocessing version of the raster example into two separate modules, so here we will only look at the non-multiprocessing version and profile that. We won't have this issue when we use other profilers in the following optional sections, and in the homework assignment you will work with the vector clipping multiprocessing example, which has been set up in a way that allows for profiling with the Spyder profiler.
Once the Profiler has completed you will see a set of results like the ones below.
Looking over those results, you will see a list of functions together with the time each has taken. The important column to examine is the Local Time column, which shows how long each function took to execute in us (microseconds), ms (milliseconds), seconds, minutes, etc. The Total Time column shows the cumulative time for each of those processes that was run (e.g. if your code was running in a function). You can sort by any of the columns, but arranging the Total Time column in ascending order gives you a logical starting point, as the times will be arranged from shortest to longest running. There is no way to order the results to see the order in which your code ran. So you will see (depending on your code arrangement) overwriteOutput followed by the time function, then filterList, etc.
The next column to look at is the Calls column which has the count of how many times each of those functions was launched. So a high value in Local Time might be indicative of either a large number of calls to a very fast function or a small number of calls to a very slow function.
In my timings, there aren’t any obvious places to look for performance improvements, although the code could be fractionally faster (but less of a good team player) if I didn’t check the extension back in; my .replace() method and print calls also add a small amount of time to the execution.
What we are doing with this type of profiling is examining the sum of the functions and methods which were called during the code execution and how long they took, in total, for the number of times that they were called. It is possible to identify inefficiencies with this sort of profiling, particularly in combination with debugging. Am I calling a slow function too often? Can I find a faster function to do the same job (for example some mathematical functions are significantly faster than others and achieve the same result and often there exist approximations that give almost the exact result but are much faster to compute)?
It is worth pointing out here that the results from Spyder’s Profiler are actually the output from the cProfile package of the Python standard library, which is essentially wrapped around our script to calculate the statistics we see above. You could import this package into your own code and use it there directly, but we will focus on using its functionality from the IDE, which is usually more convenient and presents the results in a more readily understood format.
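If you ever do want to call cProfile yourself rather than through the IDE, a minimal sketch looks like this (crunch() is just a made-up stand-in for your script's real work, and the output file name is arbitrary):
import cProfile
import pstats

def crunch():
    # placeholder workload so the profiler has something to measure
    return sum(i * i for i in range(1_000_000))

if __name__ == "__main__":
    cProfile.runctx("crunch()", globals(), locals(), "profile_run.prof")
    stats = pstats.Stats("profile_run.prof")
    stats.strip_dirs().sort_stats("cumulative").print_stats(10)   # ten most expensive calls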
You might be thinking that these results aren’t really that readily understood and that it would be easier if there were a graphical visualization of the timings. Luckily there is, and if you want to learn more about it, the following optional sections on code profiling with visualization are a good starting point. In addition, there is another optional section that explains how you can do more detailed profiling, looking at each individual line rather than complete functions. However, we recommend that you skip or only skim through these optional sections on your first pass through the lesson materials and come back when you have the time.
1.7.2.2 Optional complementary materials: Code profiling with visualizations
As we said at the beginning of this section, this section is provided for interest only. Please feel free to skip over it; you can loop back to it at the end of the lesson if you have free time, or after the end of the class. Be warned: installing the required extensions and using them is a little complicated – but this is an advanced class, so don't let that deter you.
We are going to need to download some software called QCacheGrind, which reads a tree-type file (like a family tree). Unfortunately, QCacheGrind doesn’t natively support the profile files we are going to be creating, so we will also need a converter (pyprof2calltree), written in Python. Our workflow is going to be:
- Download & install QCacheGrind (we only need to do this once)
- Use pip (a Python package manager) to install our converter (we only need to do this once, too)
- Run our function and line profiling and save the output to files
- Convert those output profile files using our converter
- Open the converted files in QCacheGrind
- ...
- Conquer Python (okay maybe not, but at least have a better understanding of our code’s performance)
Installing QCacheGrind
Download QCacheGrind and unzip it to a folder. QCacheGrind can be run by double-clicking on the qcachegrind executable in the folder you’ve just unzipped it to. Don’t do that just yet though; we’ll come back to it once we’ve done the other steps in our workflow and have some profile files to visualize.
Installing the Converter - pyprof2calltree
Now we’re going to install our converter using the Python Preferred Installer Program, pip. If you would like to learn more about pip, the Python 3 Help (Key Terms) has a full explanation. You will also learn more about Python package managers in the next lesson. Pip is included by default with the Python installation you have, but we have to access it from the Python Command Prompt.
As we mentioned in Section 1.6.5.2, there should be a shortcut within the ArcGIS program group on the start menu called "Python Command Prompt" on your PC that opens a command window running within the conda environment indicating that this is Python 3 (py3). You actually may have several shortcuts with rather similar sounding names, e.g. if you have both ArcGIS Pro and ArcGIS Desktop installed, and it is important that you pick the right one from ArcGIS Pro using Python 3.
In the event that there isn’t a shortcut, you can start Python from a standard Windows command prompt by typing:
"%PROGRAMFILES%\ArcGIS\Pro\bin\Python\Scripts\propy"
The instructions above mirror Esri's help for running Python.
Open your Python command prompt and you should be in the folder C:\Users\<username>\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\ or C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\, depending on your version of ArcGIS Pro. This is the default folder when the command prompt opens (you can see it at the prompt). Then type (and hit Enter):
Scripts\pip install pyprof2calltree
Pip will then download the converter that we need from a repository of hosted Python packages and install it. You should see a message saying “Successfully installed pyprof2calltree-1.4.3“, although your version number may be higher and that’s ok. If you receive an error message about permissions during the pyprof2calltree installation, close out of the Python command prompt and reopen the program with administrative privileges (usually right-clicking on the program and selecting "Run as administrator"; in Windows 10, you might need to "Open File Location" and then right-click to "Run as administrator").
After running commands in Python command prompt, you will probably also see an information message stating:
“You are using pip version 9.0.3, however version 10.0.1 is available. You should consider upgrading via the 'python -m pip install --upgrade pip' command.“
You can ignore this message, we’re going to leave pip at its current version as that is what came with the ArcGIS Pro Python distribution and we know that it works.
1.7.2.3 Optional complementary materials: Code profiling with visualizations (continued)
This section is provided for interest only. Please feel free to skip over it; you can loop back to it at the end of the lesson if you have free time, or after the end of the class.
Creating profile files
Now that we have the QCacheGrind viewer and the converter installed, we can create some profile files, and we'll do that using IPython (available on another tab in Spyder). Think of IPython a little like the interactive Python window in ArcGIS with a few more bells and whistles. We’ll use IPython very shortly for line-by-line profiling, so as a bridge to that task, let's use it to perform some simpler profiling.
Open the IPython pane in Spyder and then change to the folder where your code is located. This should be as easy as looking at the top of the Spyder window, selecting and copying the folder name where your code is, then clicking in the IPython window, typing cd, pasting in the folder name, and hitting Enter (as seen below).
This might look like:
cd c:\Users\YourName\Documents\GEOG489\Lesson1\
We could (but won't) run our code from IPython by typing:
run Lesson1B_basic_raster_for_mp.py
and hitting enter and our code will run and display its output in the IPython window.
That is somewhat useful, but more useful is the ability to use additional packages within IPython for more detailed profiling. Let’s create a function-level profile file using IPython for that raster code and then we’ll run it through the converter and then visualize it in QCacheGrind.
To create that function-level profile, we’ll use the built-in profiler magic prun.
We’ll access it using a "magic" command in IPython – magic commands are prefixed with % and give us shorthand access to extra functionality, a little like import does for Python code. %prun wraps the standard library's profiler (cProfile), so there is nothing extra to install for it.
Let's start by asking prun what parameters it accepts (the ? on the end tells IPython to show us the built-in help):
%prun?
If you scroll back up through the IPython console, you will be able to see all of the options for prun. You can compare our command below to that list to break down the various options and experiment with others if you wish. Notice the very last line of the help which states:
If you want to run complete programs under the profiler's control, use "%run -p [prof_opts] filename.py [args to program]" where prof_opts contains profiler specific options as described here.
That is what we’re going to do – use run with the prun options.
cd "C:\Users\YourName"
%run -p -T profile_run.txt -D profile_run.prof Lesson1B_basic_raster_for_mp
There are a couple of important things to note in these commands. The first is the double quotes around the full path name in the cd command – these are important: just in case there is a space in your path, the double quotes encapsulate it so your folder is found correctly. The other thing is the casing of the various parameters (remember Python is case-sensitive and so are a lot of the built-in tools).
It could take a little while to complete our profiling as our code will run through from start to end. We can check that our code is running by opening the Windows Task Manager and watching the CPU usage which is probably at 100% on one of our cores.
While our code is running we'll see the normal output with the timing print functions we had implemented earlier. When the run command completes we'll see a few lines of output that look like:
%run -p -T profile_run.txt -D profile_run.prof Lesson1B_basic_raster_for_mp

*** Profile stats marshalled to file 'profile_run.prof'.
*** Profile printout saved to text file 'profile_run.txt'.

         3 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 ...
        1    0.000    0.000    0.000    0.000 SSL.py:677(Session)
        1    0.000    0.000    0.000    0.000 cookiejar.py:1753(LoadError)
        1    0.000    0.000    0.000    0.000 socks.py:127(ProxyConnectionError)
        1    0.000    0.000    0.000    0.000 _conditional.py:177(cryptography_has_mem_functions)
        1    0.000    0.000    0.000    0.000 _conditional.py:196(cryptography_has_x509_store_ctx_get_issuer)
        1    0.000    0.000    0.000    0.000 _conditional.py:210(cryptography_has_evp_pkey_get_set_tls_encodedpoint)
These summary outputs will also be written to a text file that we can open in a text editor (or in Spyder) and our profile file which we will convert and open in QCacheGrind. Writing that output to a text file is useful because there is too much of it to fit within IPython’s window buffer, and you won’t be able to get back to the output right at the start of the execution. If you open the profile_run.txt file in Spyder you’ll see the full output.
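If you prefer to inspect the same numbers programmatically rather than scrolling through the text file, the .prof file written by the -D option can also be read back with Python's standard-library pstats module. A minimal optional sketch (assuming profile_run.prof is in your current working directory):

import pstats

# Load the dumped profile and print the ten entries with the largest cumulative time
stats = pstats.Stats('profile_run.prof')
stats.sort_stats('cumulative').print_stats(10)

This is purely optional – the text file and the QCacheGrind views below contain the same information.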
Convert output profile files with pyprof2calltree
We’ll run the converter using some familiar Python commands and the convert function within the IPython window:
from pyprof2calltree import convert, visualize
convert('profile_run.prof','callgrind.profile_run')
Open the converted files in QCacheGrind and inspect graphs
The converted output file can now be opened in QCacheGrind. Open QCacheGrind from the folder you installed it into earlier by double-clicking its icon. Click the folder icon in the top left of QCacheGrind or choose File ->Open from the menu and open the callgrind.profile_run file we just created, which should be in the same folder as your source code.
What we have now is a complicated and detailed interface and visualization of every function that our code called but in a more graphically friendly format than the original text file. We can sort by time, number of times a function was called, the function name and the location of that code (our own, within the arcpy library or another library) in the left-hand pane of the interface.
In the list, you will see a lot of built-in functions (things that Python does behind the scenes or that arcpy has it do – calculations, string functions etc.) but you will also see the names of some of the arcpy functions that we used such as FlowAccumulation(...) or StreamToFeature(...). If you double-click on one of them and click on the Call Graph tab in the lower pane, you will see a graphical representation of where the tasks were called from. If you double-click on a function's box in the Call Graph pane, the view shifts its focus to that function so you can trace its callers and callees in turn.
In the example below, we can see that FlowAccumulation(...) is the slowest of our tasks, taking about 43% of the execution time. If we can find a way to speed up (or eliminate) this process, we'll make our code more efficient. If we can't – that's ok too – we'll just have to accept that our code takes a certain amount of time to run.
Spend a little time clicking around in the interface and exploring your code – don’t worry too much about going down the rabbit hole of optimizing your code – just explore. Check out functions whose name you recognize such as those raster ones, or ListRasters(). Experiment with examining the content of the different tabs and seeing which modules call which functions (Callers, All Callers and Callee Map). Click down deep into one of the modules and watch the Callee Map change to show each small task being undertaken. If you get too far down the rabbit hole you can use the up arrow near the menu bar to find your way back to the top level modules.
If you’re interested, feel free to run through the process again with your multiprocessing code and see the differences. As a quick reminder, the IPython commands are (although your filenames might be different and be sure to double-check that you're in the correct folder if you get file not found errors):
%run -p -T profile_run_mp.txt -D profile_run_mp.prof Lesson1B_basic_raster_using_mp
from pyprof2calltree import convert, visualize
convert('profile_run_mp.prof','callgrind.profile_run_mp')
If you load that newly created file into QCacheGrind, you'll note it looks a little different – like an Escher drawing or a geometric pattern. That is the representation of the multiprocessing functions being run. Feel free to explore around here as well – you will notice that the functions we previously saw are harder to find – or invisible.
I haven't forgotten about my promise from earlier in the lesson to review the reasons why the Cherry-O code is only about 2-3x faster in multiprocessing mode than the 4x that we would have hoped.
Feel free to run both versions of your Cherry-O code against the profiler and you'll notice that most of the time is taken up by some code described as {method 'acquire' of '_thread.lock' objects} which is called a small number of times. This doesn't give us a lot of information but does hint that perhaps the slower performance is related to something to do with handling multiprocessing objects.
Remember back to our brief discussion about pickling objects which was required for multiprocessing?
It's the culprit, and the following optional section on line profiling will take a closer look at this issue. However, as we said before, line profiling adds quite a bit of complexity, so feel free to skip this section entirely or get back to it after you have worked through the rest of the lesson.
1.7.2.4 Optional complementary materials: Line profiling
1.7.2.4 Optional complementary materials: Line profiling mrs110As we said at the beginning of this section, this section is provided for interest only. Please feel free to skip over it and you can loop back to it at the end of the lesson if you have free time or after the end of the class. Be warned: installing the required extensions and using them is a little complicated - but this is an advanced class, so don't let this deter you.
Before we begin this optional section, you should know that line profiling is slow – it adds a lot of overhead to our code execution and it should only be done on functions that we know are slow and we're trying to identify the specifics of why they are slow. Due to that overhead, we cannot rely on the absolute timing reported in line profiling, but we can rely on the relative timings. If a line of code is taking 50% of our execution time and we can reduce that to 40, 30, or 20% (or better) of our total execution time, then we have been successful.
With that warning about the performance overhead of line profiling, we’re going to install our line profiler (which isn’t the same one that Spyder or ArcGIS Pro would have us install) again using pip (although you are welcome to experiment with those too).
Setting Permissions
Before we start installing packages we will need to adjust operating system (Windows) permissions on the Python folder of ArcGIS Pro, as we will need the ability to write to some of the folders it contains. This will also help us if we inadvertently attempt to create profile files in the Python folder as the files will be created instead of producing an error that the folder is inaccessible (but we shouldn't create those files there as it will add clutter).
Open up Windows Explorer and navigate to the c:\Program Files\ArcGIS\Pro\bin folder. Select the Python folder, right-click on it, and select Properties. Select the Security tab and click the Advanced button. In the new window that opens, select Change Permissions, select the Users group, uncheck the "Include inheritable permissions from this object's parent" box or choose Disable Inheritance (depending on your version of Windows), and select Add (or Make Explicit) on the dialog that appears.

Then click the Edit button and select all permissions other than Full control (the first one) and Take ownership (the last one) – click OK, and then click Apply on the parent window to apply changes. You can click OK on all the rest of the windows. It may take a few minutes to update all of the permissions.
Once again, open your Python command prompt (if it isn't already open). When it opens, you should be in the default folder of your ArcGIS Pro Python environment – either C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3 or C:\Users\<username>\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone – which you can see at the prompt. If your version of Pro shows the arcgispro-py3-clone path, you will have to use that path instead in some of the following commands where the kernprof program is used.
Then type:
scripts\pip install line_profiler
Pip will then download from a repository of hosted Python packages the line_profiler that we need and then install it.
If you receive an error that "Microsoft Visual C++ 14.0 is required," visit Visual Studio Downloads and download the package for "Visual Studio Community 2017" which will download Visual Studio Installer. Run Visual Studio Installer, and under the "Workloads" tab, you will select two components to install. Under Windows, check the box in the upper right hand corner of "Desktop Development with C++," and, under Web & Cloud, check the box for "Python development." After checking the boxes, click "Install" in the lower right hand corner. After installing those, open the Python command prompt again and enter:
scripts\pip install misaka
If that works, then install the line_profiler with...
scripts\pip install line_profiler
You should see a message saying "Successfully installed line_profiler 2.1.2" although your version number may be higher and that’s okay.
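If you want to reassure yourself that the install really worked before moving on, a quick optional check from the same Python command prompt is to ask pip about the package (the exact version reported will vary):

scripts\pip show line_profiler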
Now that the line profiler is installed we can run it. There are two modes for running it: function mode, where we supply a Python file, the function we want to profile, and that function's parameters (given as parameters to the profiler); and module mode, where we simply supply the module name.
Function-level line profiling is very useful when you want to test just a single function, or if you're doing multiprocessing as we saw above. Module-level line profiling is a useful first pass to identify those functions that might be slowing things down, which is why we took a similar approach with our higher-level profiling earlier.
Now we can dive into function-level profiling to find the specific lines which might be slowing down our code and then optimize or enhance them and then perform further function-level line profiling (or module-level profiling) again to test our improvements.
We will start with module-level profiling using our single processor Cherry-O code, look at our non-multiprocessing raster example code that did the analysis of the Penn State campus, and finally move on to the multiprocessing Cherry-O code. You may notice a few little deviations in the code that is being used in this section compared to the versions presented in Section 1.5 and 1.6. These deviations are really minimal and have no effect on how the code works and the insights we gain from the profiling.
Our line profiler is run through a script called kernprof (named after its author, Robert Kern), which works as a wrapper around the standard cProfile and line_profiler tools.
We need to make some changes to our code so that the line profiler knows which functions we wish it to interrogate. The first of those changes is to wrap a function definition around our code (so that we have a function to profile instead of just a single block of code). The second change is to add a decorator – a piece of meta-information attached to a function. In the case of our profiler, we use the @profile decorator to tell kernprof which functions to examine (we can decorate as many as we like), and it will ignore any function without the decorator. Note that @profile is only defined while your code is running under the profiler, so if you run the script normally the decorator will raise a NameError – in that case, comment it out.
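If you would rather not comment the decorator in and out every time, one commonly used pattern (a small optional sketch, not something the lesson requires) is to define a do-nothing stand-in for profile whenever the script is not running under the profiler:

# kernprof defines "profile" for us while profiling; when running the script
# normally that name doesn't exist, so we create a pass-through decorator
# to keep the @profile lines from raising a NameError.
try:
    profile
except NameError:
    def profile(func):
        return func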
We’ve made those changes to our original Cherry-O code below so you can see them for yourselves. Check out line 8 for the decorator, line 9 for the function definition (and note how the code is now indented within the function) and line 53 where the function is called. You might also notice that I reduced the number of games back down to 10001 from our very large number earlier. Don’t forget to save your code after you make these changes.
# Simulates 10K games of Hi Ho! Cherry-O
# Setup _very_ simple timing.
import time
start_time = time.time()

import random

@profile
def cherryo():
    spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
    turns = 0
    totalTurns = 0
    cherriesOnTree = 10
    games = 0

    while games < 10001:
        # Take a turn as long as you have more than 0 cherries
        cherriesOnTree = 10
        turns = 0
        while cherriesOnTree > 0:

            # Spin the spinner
            spinIndex = random.randrange(0, 7)
            spinResult = spinnerChoices[spinIndex]

            # Print the spin result
            #print ("You spun " + str(spinResult) + ".")

            # Add or remove cherries based on the result
            cherriesOnTree += spinResult

            # Make sure the number of cherries is between 0 and 10
            if cherriesOnTree > 10:
                cherriesOnTree = 10
            elif cherriesOnTree < 0:
                cherriesOnTree = 0

            # Print the number of cherries on the tree
            #print ("You have " + str(cherriesOnTree) + " cherries on your tree.")

            turns += 1
        # Print the number of turns it took to win the game
        #print ("It took you " + str(turns) + " turns to win the game.")
        games += 1
        totalTurns += turns

    print ("totalTurns "+str(float(totalTurns)/games))
    #lastline = raw_input(">")

    # Output how long the process took.
    print ("--- %s seconds ---" % (time.time() - start_time))

cherryo()
We could try to run the profiler and the other code from within IPython but that often causes issues such as unfound paths, files, etc., as well as making it difficult to convert our output to a nice readable text file. Instead, we’ll use our Python command prompt and then we’ll run the line profiler using (note: the "-l" is a lowercase L and not the number 1):
python "c:\program files\arcgis\pro\bin\python\envs\arcgispro-py3\lib\site-packages\kernprof.py" -l “c:\users\YourName\CherryO.py”
When the profiling completes (and it will be fast in this simple example), you’ll see our normal code output from our print functions and the summary from the line profiler:
Wrote profile results to CherryO.py.lprof
This tells us the profiler has created an lprof file in our current directory called CherryO.py.lprof (or whatever our input code was called).
The profile files will be saved wherever your Python command prompt path is pointing. Unless you've changed the directory, the Python command prompt will most likely be pointing to C:\Program Files\ArcGIS\Pro\Bin\Python\envs\arcgispro-py3, and the files will be saved in that folder.
The profile files are binary files that will be impossible for us to read without some help from another tool. So to rectify that we’ll run that .lprof file through the line_profiler (which seems a little confusing because you would think we just created that file with the line_profiler and we did, but the line_profiler can also read the files it created) and then pipe (redirect) the output to a text file which we’ll put back in our code directory so we can find it more easily.
To achieve this, we run the following command in our Python command window:
..\python -m line_profiler CherryO.py.lprof > "c:\users\YourName\CherryO.profile.txt"
This command will instruct Python to run the line_profiler (which is some Python code itself) to process the .lprof file we created. The > will redirect the output to a text file at the provided path instead of displaying the output to the screen.
We can then open the resulting output file which should be back in our code folder from within Spyder and read the results. I’ve included my output for reference below and they are also in the CherryO.Profile PDF.
Timer unit: 4.27655e-07 s
Total time: 3.02697 s
File: c:\users\YourName\Lesson 1\CherryO.py
Function: cherryo at line 8
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def cherryo():
10 1 5.0 5.0 0.0 spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
11 1 2.0 2.0 0.0 turns = 0
12 1 1.0 1.0 0.0 totalTurns = 0
13 1 1.0 1.0 0.0 cherriesOnTree = 10
14 1 1.0 1.0 0.0 games = 0
15
16 10002 36775.0 3.7 0.5 while games < 10001:
17 # Take a turn as long as you have more than 0 cherries
18 10001 25568.0 2.6 0.4 cherriesOnTree = 10
19 10001 27091.0 2.7 0.4 turns = 0
20 168060 464529.0 2.8 6.6 while cherriesOnTree > 0:
21
22 # Spin the spinner
23 158059 4153276.0 26.3 58.7 spinIndex = random.randrange(0, 7)
24 158059 487698.0 3.1 6.9 spinResult = spinnerChoices[spinIndex]
25
26 # Print the spin result
27 #print "You spun " + str(spinResult) + "."
28
29 # Add or remove cherries based on the result
30 158059 460642.0 2.9 6.5 cherriesOnTree += spinResult
31
32 # Make sure the number of cherries is between 0 and 10
33 158059 458508.0 2.9 6.5 if cherriesOnTree > 10:
34 42049 112815.0 2.7 1.6 cherriesOnTree = 10
35 116010 325651.0 2.8 4.6 elif cherriesOnTree < 0:
36 5566 14506.0 2.6 0.2 cherriesOnTree = 0
37
38 # Print the number of cherries on the tree
39 #print "You have " + str(cherriesOnTree) + " cherries on your tree."
40
41 158059 445969.0 2.8 6.3 turns += 1
42 # Print the number of turns it took to win the game
43 #print "It took you " + str(turns) + " turns to win the game."
44 10001 29417.0 2.9 0.4 games += 1
45 10001 31447.0 3.1 0.4 totalTurns += turns
46
47 1 443.0 443.0 0.0 print ("totalTurns "+str(float(totalTurns)/games))
48 #lastline = raw_input(">")
49
50 # Output how long the process took.
51 1 3723.0 3723.0 0.1 print ("--- %s seconds ---" % (time.time() - start_time))

What we can see here are the individual times to run each line of code: the numbers on the left are the code line numbers, the number of times each line was run (Hits), the time each line took in total (Hits * Time Per Hit), the time per hit, the percentage of time those lines took and, for reference, the line of code alongside.
The first thing that jumps out at me is that the random number selection (line 23) takes the longest time and is called the most – if we can speed this up somehow we can improve our performance.
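Purely as an illustration of the kind of experiment the profiler invites (this is not part of the lesson code, and whether it actually helps is exactly what you would re-profile to find out), you could try letting random.choice() pick from the spinner list directly instead of generating an index first:

import random

spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]

# current pattern: generate an index, then look it up
spinResult = spinnerChoices[random.randrange(0, 7)]

# alternative pattern to profile against it
spinResult = random.choice(spinnerChoices)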
Let's move on to the other examples of our code to see some more profiling results.
First we’ll look at the sequential version of our raster processing code before coming back to look at the multiprocessing examples as they’re a little special.
As before with the Cherry-O example we’ll need to wrap our code into a function and use the @profile decorator (and of course call the function). Attempt to make these changes on your own and if you get stuck you can find my code sample here if you want to check your work against mine.
We’ll run the profiler again, produce our output file and then convert it to text and review the results using:
python "c:\program files\arcgis\pro\bin\python\envs\arcgispro-py3\lib\site-packages\kernprof.py" -l "c:\users\YourName\Lesson1B_basic_raster_for_mp.py"
and then
..\python -m line_profiler Lesson1B_basic_raster_for_mp.py.lprof > "c:\users\YourName\Lesson1B_basic_raster_for_mp.py_profile.txt"
If we investigate these outputs (my outputs are here) we can see that the FlowAccumulation(...) calculation is again the slowest, just as we saw when we were doing the module-level calculations. In this case, because we're predominantly using arcpy functions, we're not seeing as much granularity or resolution in the results. That is, we don't know why FlowAccumulation(...) is so slow, but I'm sure you can see that in some other circumstances you could identify multiple arcpy functions which could achieve the same result – and choose the most efficient.
Next, we’ll look at the multiprocessing example of the Cherry-O example to see how we can implement line profiling into our multiprocessing code. As we noted earlier, multiprocessing and profiling are a little special as there is a lot going on behind the scenes and we need to very carefully select what functions and lines we’re profiling as some things cannot be pickled.
Therefore what we need to do is use the line profiler in its API mode. That means instead of using the line profiling outside of our code, we need to embed it in our code and put it in a function between our map and the function we’re calling. This will give us output for each process that we launch. Now if we do this for the Cherry-O example we’re going to get 10,000 files – but thankfully they are small so we’ll work through that as an example.
The point to reiterate before we do that is the Cherry-O code runs in seconds (at most) – once we make these line profiling changes the code will take a few minutes to run.
We’ll start with the easy steps and work our way up to the more complicated ones:
import line_profiler
Now let’s define a function within our code to sit between the map function and the called function (cherryO in my version).
We’ll break down what is happening in this function shortly and that will also help to explain how it fits into our workflow. This new function will be called from our mp_handler() function instead of our original call to cherryO and this new function will then call cherryO.
Our new mp_handler function looks like:
def mp_handler():
    myPool = multiprocessing.Pool(multiprocessing.cpu_count())
    ## The Map function part of the MapReduce is on the right of the = and the Reduce part on the left where we are aggregating the results to a list.
    turns = myPool.map(worker,range(numGames))
    # Uncomment this line to print out the list of total turns (but note this will slow down your code's execution)
    #print(turns)
    # Use the statistics library function mean() to calculate the mean of turns
    print(mean(turns))
Note that our call to myPool.map now has worker(...) not cherryO(...) as the function being spawned multiple times. Now let's look at this intermediate function that will contain the line profiler as well as our call to cherryO(...).
def worker(game):
    profiler = line_profiler.LineProfiler(cherryO)
    call = 'cherryO('+str(game)+')'
    turns = profiler.run(call)
    profiler.dump_stats('profile_'+str(game)+'.lprof')
    return(turns)

The first line of our new function is setting up the line profiler and instructing it to track the cherryO(...) function.
As before, we pass the variable game to the worker(...) function and then pass it through to cherryO(...) so it can still perform as before. It’s also important that, when we call cherryO(...), we record the value it returns into a variable turns – so we can return that to the calling function so our calculations work as before. You will notice we’re not just calling cherryO(...) and passing it the variable though – we need to pass the variable a little differently as the profiler can only support certain picklable objects. The most straightforward way to achieve that is to encode our function call into a string (call) and then have the profiler run that call. If we don’t do this the profiler will run but no results will be returned.
Just before we send that value back we use the profiler’s function dump_stats to write out the profile results for the single game to an output file.
Don’t forget to save your code after you make these changes. Now we can run through a slightly different (but still familiar) process to profile and export our results, just with different file names. To run this code we’ll use the Python command prompt:
python c:\users\YourName\CherryO_mp.py
Notice how much longer the code now takes to run – this is another reason to wrap the line profiling in its own function. That means that we don’t need to leave it in production code; we can just change the function calls back and leave the line profiling code in place in case we want to test it again.
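One lightweight way to handle that switch (a sketch only – the PROFILE_RUNS flag is an invention of this example, not part of the lesson code) is to keep both call paths inside mp_handler() and flip between them in place of the single map line:

PROFILE_RUNS = False   # set to True when you want the per-game .lprof files

if PROFILE_RUNS:
    turns = myPool.map(worker, range(numGames))    # profiled path via worker()
else:
    turns = myPool.map(cherryO, range(numGames))   # normal, unprofiled path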
It's also possible you'll receive several error messages when the code runs, but the lprof files are still created.
Once our code completes, you will notice we have those 10,000 lprof files (which is overkill as they are probably all largely the same). Examine a few of the files if you like by converting them to text files and viewing them in your favorite text editor or Spyder using the following in the Python command prompt:
python -m line_profiler profile_1.lprof > c:\users\YourName\profile_1.txt
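If you would like to convert a handful of them in one go instead, a small optional helper script along these lines should work (it assumes the .lprof files are in the current folder and that the python on your path has line_profiler installed):

import glob
import subprocess

# Convert the first five profile files to matching .txt files
for lprof in sorted(glob.glob('profile_*.lprof'))[:5]:
    txt_name = lprof.replace('.lprof', '.txt')
    with open(txt_name, 'w') as out:
        subprocess.run(['python', '-m', 'line_profiler', lprof], stdout=out)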
If you examine one of those files, you’ll see results similar to:
Timer unit: 4.27655e-07 s
Total time: 0.00028995 s
File: c:\users\obrien\Dropbox\Teaching_PSU\Geog489_SU_1_18\Lesson 1\CherryO_MP.py
Function: cherryO at line 25
Line # Hits Time Per Hit % Time Line Contents
==============================================================
25 def cherryO(game):
26 1 11.0 11.0 1.6 spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
27 1 9.0 9.0 1.3 turns = 0
28 1 8.0 8.0 1.2 totalTurns = 0
29 1 8.0 8.0 1.2 cherriesOnTree = 10
30 1 9.0 9.0 1.3 games = 0
31
32 # Take a turn as long as you have more than 0 cherries
33 1 9.0 9.0 1.3 cherriesOnTree = 10
34 1 9.0 9.0 1.3 turns = 0
35 16 41.0 2.6 6.0 while cherriesOnTree > 0:
36
37 # Spin the spinner
38 15 402.0 26.8 59.3 spinIndex = random.randrange(0, 7)
39 15 38.0 2.5 5.6 spinResult = spinnerChoices[spinIndex]
40
41 # Print the spin result
42 #print "You spun " + str(spinResult) + "."
43
44 # Add or remove cherries based on the result
45 15 34.0 2.3 5.0 cherriesOnTree += spinResult
46
47 # Make sure the number of cherries is between 0 and 10
48 15 35.0 2.3 5.2 if cherriesOnTree > 10:
49 4 8.0 2.0 1.2 cherriesOnTree = 10
50 11 24.0 2.2 3.5 elif cherriesOnTree < 0:
51 cherriesOnTree = 0
52
53 # Print the number of cherries on the tree
54 #print "You have " + str(cherriesOnTree) + " cherries on your tree."
55
56 15 32.0 2.1 4.7 turns += 1
57 # Print the number of turns it took to win the game
58 1 1.0 1.0 0.1 return(turns)

Arguably we're not learning anything that we didn't know from the sequential version of the code – we can still see the randrange() function is the slowest or most time consuming (by percentage) – however, if we didn't have the sequential version and wanted to profile our multiprocessing code this would be a very important skill.
The same steps to modify our code above would be used if we were performing this line profiling on arcpy (or any other) multiprocessing code: the same type of intermediate function would be required, we would need to pass and return parameters (if necessary), and we would also need to reformat the function call so that it is picklable. The output from the line profiler is delivered in a different format from the module-level profiling we were doing before and, therefore, isn't suitable for loading into QCacheGrind. I'd suggest that isn't a big loss, as we're looking at a much smaller number of lines of code, so a graphical representation matters less.
Returning to our ongoing discussion about the less than anticipated performance improvement between our sequential and multiprocessing Cherry-O code, what we can infer by comparing the line profile output of the sequential version of our code and the multiprocessing version is that pretty much all of the steps take the same proportion of time. So if we're doing nearly everything in about the same proportions, but 4 times as many of them (using our 4 processor PC example) then why isn't the performance improvement around 4x faster? We'd expect that setting up the multiprocessing environment might be a little bit of an overhead so maybe we'd be happy with 3.8x or so.
That isn't the case though so I did a little bit of experimenting with calculating how much time it takes to pickle those simple integers. I modified the mp_handler function in my multiprocessor code so that instead of doing actual work selecting cherries, it pickled the 1 million integers that would represent the game number. That function looked like this (nothing else changed in the code):
import pickle

def mp_handler():
    myPool = multiprocessing.Pool(multiprocessing.cpu_count())
    ## The Map function part of the MapReduce is on the right of the = and the Reduce part on the left where we are aggregating the results to a list.
    #turns = myPool.map(worker,range(numGames))
    #turns = myPool.map(cherryO,range(numGames))
    t_start=time.time()
    for i in range(0,numGames):
        pickle_data = pickle.dumps(range(numGames))
    print ("cPickle data took",time.time()-t_start)
    t_start=time.time()
    pickle.loads(pickle_data)
    print ("cPickle data took",time.time()-t_start)

What I learned from this experimentation was that the pickling took around 4 seconds - or about 1/4 of the time my code took to play 1M Cherry-O games in multiprocessing mode - 16 seconds (it was 47 seconds for the sequential version).
A simplistic analysis suggests that the pickling comprises about 25% of my execution time (your results might vary). If I subtract the time taken for the pickling, my code would have run in 12 seconds, and 47s ÷ 12s = 3.916 - nearly the 4x improvement we would have anticipated. So the takeaway message here reinforces some of the implications of implementing multiprocessing that we discussed earlier: there is an overhead to multiprocessing, and lots of small calculations, as in this case, aren't the best application for it because we lose some of the performance benefit to that overhead. Still, an almost tripling of performance (47s / 16s) is worth the effort.
Your coding assessment for this lesson will have you modify some code (as mentioned earlier) and profile some multiprocessing analysis code which is why we’re not demonstrating it specifically here. See the lesson assignment page for full details.
1.7.2.5 A last word on profiling
1.7.2.5 A last word on profiling mrs110Before we move on from profiling a few important points need to be made. As you might have worked out for yourself by this point, profiling is time-consuming and you really should only undertake it if you have a very slow piece of code or one where you will be running it thousands of times or more and a small performance improvement is likely to be beneficial in the long run. To put that another way, if you spend a day profiling your code to improve its performance and you reduce the execution time from ten minutes to five minutes but you only run that code once then I would argue you haven’t used your time productively. If your code is already fast enough that it executes in a reasonable amount of time – that is fine.
Do not get caught in the trap of beginning to optimize and profile your code too early, particularly before the code is complete. You may be focusing on a slow piece of code that will only be executed once or twice and the performance improvement will not be significant compared to the execution of the other 99% of the code.
We have to accept that some external libraries are inefficient and if we need to use them, then we must accept that they take as long as they do to get the job done. It is also possible that those libraries are extremely efficient and take as long as they do because the task they are performing is complicated. There isn’t any point attempting to speed up the arcpy.da cursors for example as they are probably as fast as they are likely to be in the near future. If that is the slowest part of our code, we may have to accept that.
1.8 Optional - Version Control Systems, Git, and GitHub
1.8 Optional - Version Control Systems, Git, and GitHub mrs110Version control systems
Software projects can often grow in complexity and expand to include multiple developers. Version Control Systems (VCS) are designed to record changes to data and encourage team collaboration. Data, often software code, can be backed up to prevent data loss and track changes made. VCS are tools to facilitate teamwork and the merging of different contributors' changes. Version control [1] is also known as "revision control." Version control tools like Git can help development teams or individuals manage their projects in a logical, procedural way without needing to email copies of files around and worry about who made what changes in which version.
Differences between centralized VCS and distributed VCS
Centralized VCS, like Subversion (SVN), Microsoft Team Foundation Server (TFS) and IBM ClearCase, all use a centralized, client-server model for data storage and to varying degrees discourage “branching” of code (discussed in more detail below). These systems instead encourage a file check-out, check-in process and often have longer “commit cycles” where developers work locally with their code for longer periods before committing their changes to the central repository for back-up and collaboration with others. Centralized VCS have a longer history in the software development world than DVCS, which are comparatively newer. Some of these tools are difficult to compare solely on their VCS merits because they perform more operations than just version control. For example, TFS and ClearCase are not just VCS software, but integrate bug tracking and release deployment as well.
Distributed VCS (DVCS) like Git (what we’re focusing on) or Mercurial (hg), all use a decentralized, peer-to-peer model where each developer can check out an entire repository to their local environment. This creates a system of distributed backup where if any one system becomes unavailable, the code can be reconstructed from a different developer’s copy of the repository. This also allows off-line editing of the repository code when a network connection to a central repository is unavailable. As well, DVCS software encourages branching to allow developers to experiment with new functionality without “breaking” the main “trunk” of the code base.
A hybrid VCS might use the concept of a central main repository that can be branched by multiple developers using DVCS software, but where all changes are eventually merged back to the main trunk code repository. This is generally the model used by online code repositories like GitHub or Bitbucket.
Basics of Git
Git is a VCS that stores and tracks source code in a repository. A variety of data about code projects is tracked such as what changes were made, who made them, and comments about the changes [3]. Past versions of a project can be accessed and reinstated if necessary. Git uses permissions to control what changes get incorporated in the master repository. In projects with multiple people, one user will be designated as the project owner and will approve or reject changes as necessary.
Changes to the source code are handled by branches, merges, and commits. Branching, sometimes called forking, lets a developer copy a code repository (or part of a repository) for development in parallel with the main trunk of the code base. This is typically done to allow multiple developers to work separately and then merge their changes back into a main trunk code repository.
Although Git is commonly used on code projects with multiple developers, the technology can be applied to any number of users (including one) working on any types of digital files. More recently, Git has gained in popularity since it is used as the back end for GitHub among other platforms. Although other VCS exist, Git is frequently chosen since it is free, open source, and easily implemented.
Dictionary
Git has a few key terms to know moving forward [2]:
Repository
(n.) - place where the history of work is stored
Clone
(n.) - a copy you make of someone else's repository which you may or may not intend to edit
Fork
(v.) - the act of copying someone else's repository, usually with the intent of making your own edits
Branch
(n.) - similar to a clone, but a branch is a copy of a repository created by forking a project. The intent with a branch is to make edits that result in either reconciling the branch to the parent repository or having the branch become a new separate repository.
Merge
(v.) - integrating changes from one branch into another branch
Commit
(n.) - an individual change to a file or set of files. It's somewhat similar to hitting the "save" button.
Pull
(v.) - integrating others' changes into your local copy of files
Pull request
(n.) - a request from another developer to integrate their changes into the repository
Push
(v.) - sending your committed changes to a remote repository
Basic Git progression
A Git repository begins as a folder, either one that already exists or one that is created specifically to house the repository. For the cleanest approach, this folder will only contain folders and files that contribute to one particular project. When a folder is designated as a repository, Git adds one additional hidden subfolder called .git that houses several folders and files; the repository may also contain two text files called .gitignore and .gitmodules, as highlighted in Figure 1.29

These file components handle all of the version control and tracking as the user commits changes to Git. If the user does not commit their changes to Git, the changes are not “saved” in the version control system. Because of this, it’s best to commit changes at fairly frequent intervals. The committed changes are only active on one particular user’s computer at this point. If the user is working on a branch of another repository, they will want to pull changes from the master repository fairly often to make sure they’re working on the most recent version of the code. If a conflict arises when the branch and the master have both changed in the same place in different ways, the user can work through how to resolve the conflict. When the user wants to integrate their changes with the master repository, the user will create a pull request to the owner of the repository. The owner will then review the changes made and any conflicts that exist, and either choose to accept the pull request to merge the edits into the master repository or send the changes back for additional work. These workflow steps may happen hundreds or thousands of times throughout the lifetime of a code project.
On its own, Git operates off a command line interface; users perform all actions by typing commands. Although this method is perfectly fine, visualizing what’s going on with the project can be a bit hard. To help with that, multiple GUI interfaces have been created to visualize and thus simplify the version control process, and some IDEs include built-in version control hooks.
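For reference, a typical command-line session might look like the following (illustrative only – the repository and file names are made up):

git clone https://github.com/yourname/example-repo.git
cd example-repo
git checkout -b my-feature          # create and switch to a new branch
git add my_script.py                # stage a changed file
git commit -m "Describe the change" # record the change locally
git pull origin master              # bring in others' changes first
git push origin my-feature          # publish the branch for a pull request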
Resources:
[1] https://en.wikipedia.org/wiki/Version_control
[2] https://github.com/kansasgis/GithubWebinar_2015
[3] https://en.wikipedia.org/wiki/Git
1.8.1 Introduction to Online VCS and GitHub
1.8.1 Introduction to Online VCS and GitHub mrs110Introduction to Online VCS
Some popular online hosting solutions for VCS and DVCS code repositories include: GitHub, Bitbucket, Google Code and Microsoft CodePlex. These online repositories are often used as the main trunk repositories for open-source projects with many developers who may be geographically dispersed. For the purposes of this class, we will focus on GitHub.
Introduction to GitHub
GitHub takes all of Git’s version control components, adds a graphical user interface to repositories, change history, and branch documentation, and adds several social components. Users can add comments, submit issues, and get involved in the bug tracking process. Users can follow other GitHub users or specific projects to be notified of project updates. GitHub can either be used entirely online or with an application download for easily managing and syncing local and online repositories. Optional (not required for class): Click here for the desktop application download.
The following exercise will cover the basics of Git and how they’re used in the GitHub website.
Git Exercise in GitHub
- Go to GitHub's website and sign up for an account using any email and password you like. If you already have a GitHub account, feel free to use that.
- Follow the instructions posted at the GitHub Guides Hello World page.
GitHub's change log
GitHub has the ability to display everything that changed with every commit. Take a look at GitHub's Kansasgis/NG911 page. If you click on one of the titles of one of the commits, it displays whatever basic description the developer included of the changes and then as you scroll down, you can see every code change that occurred - red highlighting what was removed and green highlighting what was added. If you mouse over the code, a plus sign graphic shows up, and users can leave comments and such.
Resolving conflicts on GitHub
Conflicts occur if two branches being merged have had different changes in the same places. Git automatically flags conflicts and will not complete the merge like normal; instead, the user will be notified that the conflicts must be resolved. Some conflicts can be resolved inside GitHub, and other types of conflicts have to be resolved in the Git command line [4]. Due to the complexity of resolving conflicts in the command line, it’s best to plan ahead and silo projects as much as possible to avoid conflicts.
Git adds three different markers to the code to flag conflicts:
<<<<<<< HEAD – This marker indicates the beginning of the conflict in the base branch. The code from the base branch is located directly under this marker.
======= – This marker divides the base branch code from the other branch.
>>>>>>> BRANCH-NAME – This marker will have the name of the other branch next to it and indicates the end of the conflict.
Here’s a full example of how Git flags a conflict between branches:
<<<<<<< HEAD
myString = "Monty Python and the Holy Grail is the best."
=======
myString = "John Cleese is hilarious."
>>>>>>> cleese-branch
To resolve the conflict, the user needs to pick what myString will equal. Possible resolution options:
Keeping the base branch -
myString = "Monty Python and the Holy Grail is the best."
Using the other branch -
myString = "John Cleese is hilarious."
Combining branches, in this case combining the options -
myString = "Monty Python and the Holy Grail is the best. John Cleese is hilarious."
GitHub has an interface that can be activated for resolving basic conflicts by clicking on the “Resolve Conflicts” button under the “Pull Requests” tab. This interface steps through each conflict and the user must decide which version to take, keep their changes, use the other changes, or work out a way to integrate both sets of changes. Inside the GitHub interface, the user must also remove the Git symbols for the conflict. The user steps through every conflict in that particular file to decide how to resolve the conflict and then will eventually click on the “Mark as resolved” button. The next file in the project with conflicts will show up and the user will repeat all of the steps until the conflicts are resolved. At this point, the user will click “Commit merge” and then “Merge pull request.”
For more complex types of conflicts like one branch deleting a file that the other keeps, the resolution has to take place in the Git command line. This process can hopefully be avoided, but basic instructions are available at GitHub Help: Resolving a merge conflict using the command line.
Resources:
[4] Resolving a merge conflict on GitHub
1.8.2 Open source and large companies
1.8.2 Open source and large companies mrs110GitHub is a great fit for managing open source code projects since, with a free account, all repositories are available on the internet at large. For example, the open source GIS software QGIS (see Lesson 4) is housed on GitHub at GitHub's qgis/QGIS page. Take a look at the repository.
On the front page, you can see in the dashboard statistics that (at the time of this writing) there have been over 40,000 commits, 50 branches, 100 releases, and 250 contributors to the QGIS project. Users worldwide can now contribute their ideas, bugs, and code improvements to a central location that can be managed with standard version control workflows.
Some software companies that have traditionally been protective about their code have adopted GitHub to open certain projects. Esri is rather active on GitHub at GitHub's Esri page including the documentation and samples for the ArcGIS API for Python (see Lesson 3). Microsoft also is present at the GitHub Microsoft page with the tagline “Open source, from Microsoft with love.”
1.8.3 GitHub and Python
1.8.3 GitHub and Python mrs110While GitHub is open to all digital files and any programming languages, Python is a great fit for use in GitHub for multiple reasons. Unlike other, heavier programming languages, Python doesn’t require extensive libraries with complex dlls and installation structures to get the job done.
Creating Python repositories is as simple as adding the .py files, and then the project can be shared, documented, and updated as needed. GitHub is also a great place to find both Python snippets and entire modules to use. For basic purposes, users can copy/paste just the portions of code off another project they want to try. Otherwise, users can fork an entire repository and tweak it as necessary to fit their purposes.
1.8.4 GitHub's README file
1.8.4 GitHub's README file mrs110GitHub strongly recommends that every repository contain a README.txt or README.md file. This file will act as the “home page” for the project and is displayed on the repository page after files and folders are listed. This document should contain specific information about the project, how to use it, licensing, and support.
Text files will show up without formatting, so many users choose to use an .md (markdown) file instead. Markdown notation will be interpreted to show various formatting components like font size, bold, italics, embedded links, numbered lists, and bullet points.
For more information on markdown formatting, visit GitHub Guide's Mastering Markdown page. We will also use Markdown in Lesson 3, in the context of Jupyter notebooks, and provide a brief introduction there.
1.8.5 Gists and GeoJson
1.8.5 Gists and GeoJson mrs110Gists
While all free GitHub accounts are required to publish public repositories, all accounts have the ability to create Gists. Gists are single page repositories in GitHub, so they don't support projects with folder structures or multiple files. Since Gists are a single page repository, they are good for storing code snippets or one page projects. Gists can be public or private, even with a free account.
To create a Gist in GitHub, log into GitHub and then click on the plus sign in the upper right hand corner. In the options presented, choose "New gist." Enter a description of the Gist (in figure 1.30 "Delete if Exists Shortcut" is the description) as well as the filename with extension (in figure 1.30 this is DeleteIfExists.py). Enter code or notes in the large portion of the screen or import the code by using the "Add File" button. You have two options for saving your Gist- either "Create secret gist" or "Create public gist."
"Secret" Gists are only mostly secret since they use the internet philosophy of difficult-to-guess urls. If you create a secret Gist, you can still share the Gist with anyone by sending them the url, but there are no logins required to view the Gist. Along this same philosophy, if someone stumbles across the url, they will be able to see the Gist.
For more information about Gists, see the official GitHub documentation at About Gists page on Github's website.
GeoJson
For GIS professionals, Gists are additionally useful since a Gist can be a single GeoJson file. GeoJson files are essentially a text version of geographic data in json formatting. Other developers can instantly access your GeoJson data and incorporate it from GitHub into their online mapping applications without needing to get a hard copy of the shapefile or geodatabase feature class or rely on some kind of map server. GitHub will automatically display GeoJson files as a map whether the file is a Gist or a part of a larger repository. For example, take a look at GitHub's lyzidiamond/learn-geojson page. At first, you'll see the GeoJson file interpreted as a map. If you click the "Raw" button located on the upper right-ish side of the map, you will see what the GeoJson file looks like in text form. GeoJson can be easily used in Python since, after reading in the file (for example with the standard json module), Python can work with the data as one giant dictionary of nested dictionaries and lists.
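To give a sense of how easily that works, here is a minimal sketch, assuming a local GeoJson file named hypothetical_points.geojson (the file name is made up for illustration):

import json

# Read the GeoJson file into nested Python dictionaries and lists
with open("hypothetical_points.geojson") as f:
    gj = json.load(f)

# A FeatureCollection keeps its features in a list under the "features" key
for feature in gj["features"]:
    print(feature["geometry"]["type"], feature["properties"])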
1.8.6 GitHub Conclusion
1.8.6 GitHub Conclusion mrs110Using GitHub in This Course
In GEOG 489, using GitHub to store the sample code and exercise code from the lessons can be a great way to practice and gain experience with a new software tool. Using GitHub is not required, and we don't recommend that you store your completed projects there. GitHub is an encouraged platform for students to learn since many organizations use GitHub or other VCS.
Conclusion
Git and GitHub provide fast and convenient ways to track projects, whether the project is by one individual or a team of software developers. Although GitHub has many complex features available, it’s easily accessible for individual and small projects that need some kind of tracking mechanism. In addition to version control, GitHub provides users with a social platform for project management as well as the ability for users to create Gists and store GeoJson.
Lesson 1 Assignment
Lesson 1 Assignment jmk649Part 1 – Multiprocessing Script
We are going to use the arcpy vector data processing code from Section 1.6.6.2 (download Lesson1_Assignment_initial_code) as the basis for our Lesson 1 programming project. The code is already in multiprocessing mode, so you will not have to write multiprocessing code on your own from scratch, but you still will need a good understanding of how the script works. If you are unclear about anything the script does, please ask on the course forums. This part of the assignment is for getting back into the rhythm of writing arcpy-based Python code and practicing creating a script tool with ArcGIS Pro. Your task is to extend our vector data clipping script by doing the following:
- Modify the code to handle a parameterized output folder path (still using unique output filenames for each shapefile) defined in a third input variable at the beginning of the main script file. One way to achieve this task is by adding another (5th) parameter to the worker() function to pass the output folder information along with the other data.
To realize the modified code versions in this part, all main modifications have to be made to the input variables and within the code of the worker() and mp_handler() functions. Of course, we will also look at code quality, so make sure the code is readable and well documented. There are a few hints that may be helpful after we talk about Part 2.
Part 2 – Single File Multiprocessing Script Tool
In a single script file, (combining the mp_handler code and the worker function into one script) expand the code so that it can handle multiple input featureclasses to be clipped (still using a single polygon clipping feature class).
- The input variable tobeclipped should now take a list of feature class names rather than a single name.
- The worker function should, as before, perform the operation of clipping a single input file (not all of them!) to one of the features in the clipper feature class.
- The main change you will have to make here will be in the main code where the jobs are created.
- The names of the output files produced should have the format:
clip_<oid>_<name of the input feature class>.shp
For instance, clip_0_Roads.shp produced by clipping the Roads featureclass (found in the USA.gdb file geodatabase) to the State oid '0'.
- Ensure that the multiprocessing method obtains its own exclusive worker function.
To realize the modified code versions in this part, it is important to remember how to avoid infinite recursion, the purpose of the if __name__ == '__main__': conditional, how namespace/module imports work, and the use of the module.function() syntax.
Successful delivery of the above requirements is sufficient to earn 95% on the project. The remaining 5% is reserved for efforts that go "over and above" the minimum requirements. Over and above points may be earned by adding further geoprocessing operations (e.g. reprojection) to the worker() function, or other enhancements as you see fit, such as returning a dictionary of results from the workers and parsing them to print success/failure messages, or trying a different multiprocessing method from the table in section 1.6.5.3.
You will have to submit several versions of the modified script for this assignment:
- (A) The modified single-input-file script version from Part 1.
- (B) The single file version multiple-input-files script tool version from Part 2 (within the .atbx)
- (C) Potentially a third version if you made substantial modifications to the code for "over and above" points. If you created a new script tool for this, make sure to include the .atbx file as well.
Hint 1:
When you adapt the worker() function, I strongly recommend that you do some tests with individual calls of that function first before you run the full multiprocessing version. For this, you can, for instance, utilize what we learned about the if __name__ == '__main__': conditional for the multicode script, or comment out the pool code and instead call worker() directly from the loop that produces the job list, meaning all calls will be made sequentially rather than in parallel. This makes it easier to detect errors compared to running everything in multiprocessing mode right away. Similarly, it could be a good idea to view the variables in the debugger or add print statements placed in the job list to make sure that the correct values will be passed to the worker function.
Hint 2 (concerns Part 2):
When changing to the multiple-input-files version, you will not only have to change the code that produces the name of the output files in variable outFC by incorporating the name of the input feature class, you will also have to do the same for the name of the temporary layer that is being created by MakeFeatureLayer_management() to make sure that the layer names remain unique. Otherwise, some worker calls will fail because they try to create a layer with a name that is already in use.
To get the basename of a feature class without file extension, you can use a combination of the os.path.basename() and os.path.splitext() functions defined in the os module of the Python standard library. The basename() function will remove the leading path (so e.g., turn "C:\489\data\Roads.shp" into just "Roads.shp"). The expression os.path.splitext(filename)[0] will give you the filename without file extension. So for instance "Roads.shp" will become just "Roads". (Using [1] instead of [0] will give you just the file extension but you won't need this here.)
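As a quick illustration of those two functions, using the path from the example above:

import os

fc_path = r"C:\489\data\Roads.shp"
name_with_ext = os.path.basename(fc_path)        # "Roads.shp"
name_only = os.path.splitext(name_with_ext)[0]   # "Roads"
print(name_only)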
Hint 3 (concerns Part 2):
Once you have the script working in the IDE, it is time to move it into a script tool. You will have to import your script into itself in order to ensure that each process in the pool can find the worker function when the script is executed as a script tool. Refer to sections 1.3.1 and 1.6.6.3 for this requirement and for how to prevent infinite recursion.
Hint 4 (concerns Part 2):
You will also have to use the "Multiple value" option for the input parameter you create for the to-be-clipped feature class list in the script tool interface. If you then use GetParameterAsText(...) for this parameter in your code, you will get a single string(!) with the names/paths of the feature classes the user picked separated by semicolons, not a list of name/path strings. You can either use the string methods .split(...) and .strip(...) to turn this single string into a usable list of paths, or you can use GetParameter(...), which will provide you with a list of geoprocessing value objects that you can then cast to strings (str(...)) for pickling. It can save you a lot of time if you add some arcpy.AddMessage(...) statements to print these parameters out so you can see what your variables contain. Be sure to verify the output results! A rough sketch of the first approach follows below.
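Here is that sketch (the parameter index 0 and the exact cleanup are assumptions – adapt them to your own tool):

import arcpy

# GetParameterAsText() returns ONE semicolon-separated string for a
# "Multiple value" parameter, e.g. "C:\USA.gdb\Roads;C:\USA.gdb\Cities"
raw = arcpy.GetParameterAsText(0)
tobeclipped = [p.strip("' ") for p in raw.split(";")]
arcpy.AddMessage("Input feature classes: " + str(tobeclipped))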
Deliverable
Submit a single .zip file to the corresponding drop box on Canvas; the zip file should contain:
- Your modified code files and ArcGIS Pro toolbox files (up to three different versions as described above). Please organize the files cleanly, e.g., using a separate subfolder for each version.
- A separate line-by-line code explanation of your script tool's code. This explanation should elaborate on the what/why of the code line to demonstrate that you understand the code's purpose.
- A short write-up of any issues that you encountered during the assignment.
- Think back to the beginning of Section 1.6.6 and include a brief discussion of any changes to the processing workflow and/or the code that might be necessary if we wanted to write our output data to geodatabases and briefly comment on possible issues (using pseudocode or a simple flowchart if you wish).
- A description of what you did for "over and above" points (if anything).