Exercise 1

Warning

Please note that we provide assignment feedback only for students enrolled in the course at the University of Helsinki.

Start your assignment

You can start working on your copy of Exercise 1 by accepting the GitHub Classroom assignment.

Exercise 1 is due by the start of lecture in week 2.

You can also take a look at the open course copy of Exercise 1 in the course GitHub repository (does not require logging in). Note that you should not try to make changes to this copy of the exercise, but rather only to the copy available via GitHub Classroom.

General hints for Exercise 1

Sigma notation

Sigma notation is used to indicate values that should be added together or summed. It is a shorthand notation to indicate the sum of \(N\) values, where each value in the set can be represented by the subscript \(i\). The example below shows how sigma notation works,

\[\sum_{i=1}^{N} x_{i} = \sum_{i} x_{i} = \sum x_{i} = x_{1} + x_{2} + ... + x_{N}\]

which is the sum of \(N\) values, each indicated by \(x\).

As you can see, the complete form explicitly states the starting and ending values for the sum (\(i = 1\) and \(N\)), but often these values are not written as it can be assumed that all \(N\) values will be summed.

Problem 1

Function docstrings

Say what? Yeah, a docstring. Docstrings are short texts that describe a function, script, or other code component in Python. They are generally given on the line below the def statement for a function or at the top of a script file, and start and end with triple quotes """. The docstring is helpful because you can use the help() function to find out how a function works, for example. Since we’re using Jupyter notebooks, we can take advantage of a special ? feature that can provide even a bit more information than the help() function. To use the ?, you simply add it after the name of some function, variable, or other object. Let’s check out some examples.

In [1]: def square(number):
   ...:     """Returns the supplied number squared"""
   ...:     return number * number
   ...: 

In [2]: number = 4

In [3]: print(number, "squared is", square(number))
4 squared is 16
In [4]: help(square)
Help on function square in module __main__:

square(number)
    Returns the supplied number squared
In [5]: square?
Signature: square(number)
Docstring: Returns the supplied number squared
File:      ~/build/IntroQG/site/<ipython-input-1-8ef02da09d14>
Type:      function

Problem 2

Skipping rows reading data with Pandas

Remember that if you give a number to skiprows in the pd.read_csv() function (e.g., pd.read_csv(fp, skiprows = 0)) the number you give is the number of rows you want to skip when reading the file. Instead, if you give a list (e.g., pd.read_csv(fp, skiprows = [0])), Pandas will skip the rows with that index (i.e., the first row).

Dropping NA values

When using the Pandas .dropna() function there are two options to drop NaN values from a DataFrame. Let’s assume we have a DataFrame called df. We could drop our NaN values from all rows where the column df['Column name'] has a NaN value by typing either

df = df.dropna(subset='Column name'])

or

df.dropna(subset='Column name'], inplace=True)

In the first case we’re assigning the output from the .dropna() function to the DataFrame df directly, whereas in the second case we include the inplace parameter to drop the NaN values and update the df DataFrame at the same time. Both do the same thing, but the second option might produce warnings about the index values.

Volcanoes above sea level

Just a note: I assumed elevations above sea level were greater than or equal to 0.0 in Part 2 of Problem 2, so you should use that same condition for selecting elevations to remove.

Column name in Part 5

Another note: I assumed that the column name for where you would store your mean elevations is 'Mean elevation', so you should use that when creating your DataFrame.

Problem 3

Plotting a horizontal line (for the bar plots)

We haven’t seen how to add a horizontal line to a plot, so I hope the explanation below will help. Basically, we want to add a line that goes across the bar plot and shows the mean elevation of all volcanoes globally. We can do this using the ax.plot() function and giving two pairs of points (x:sub:1, y:sub:1) and (x:sub:2, y:sub:2) - the end points of the line. To do this in ax.plot() we give two lists. The first is the list of x points [x1, x2], and the second the y points [y1, y2]. Putting this together you should do something like:

ax.plot([x1, x2], [y1, y2])

where you replace the x1, x2, y1, and y2 values with numbers for the end points of the line. In this case, the x value should start at -1 and end at 20 (going across all the 19 bars), and the y values should both be the global mean elevation. You calculated that earlier. For the formatting, you can refer to the earlier lesson on plotting with Pandas.

Creating arrays of numbers between two values

As you may recall from this week’s lesson on using NumPy, we can use NumPy to create NumPy arrays of values between a starting and ending value. Consider the simple example below using the np.linspace() method:

In [6]: import numpy as np

In [7]: number_array = np.linspace(0.0, 8000.0, 9)

In [8]: print(number_array)
[   0. 1000. 2000. 3000. 4000. 5000. 6000. 7000. 8000.]

Here you can see we start with 0.0, end with 8000.0, and produce an array of 9 equally spaced values that includes the starting and ending numbers. This is probably the easiest way to create most arrays of this kind.

Gaussian troubles

For the Gaussian function there are a few things to keep in mind:

  1. It should take a mean value, a standard deviation, and a list or array of values at which the Gaussian probability should be calculated. At each elevation the probability should be calculated and stored.

  2. To be more clear about part 1, it would be a good suggestion to create an empty NumPy array where the probability values can be stored. It should be the same size as x_array within your gaussian() function, and one way to create such an array would be using the np.zeros() function. np.zeros() will create an array with values equal to zero of whatever size you like. For example, if we want 10 values, we could do the following:

    In [9]: import numpy as np
    
    In [10]: array = np.zeros(10)
    
    In [11]: print(array)
    [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    

    This would yield an array of 10 values equal to zero. If you use the length of x_array you could create a similar array in your function to ensure that for each value in x_array you will have a corresponding probability. Another possibility for doing the same kind of thing can be found in the hint below about adding values to a list.

  3. If you create an array of zeros as suggested in 2, then you should also use a for loop to go over the length of your values in x_array and calculate the probability at each value in x_array. This means using a for loop of the form below:

    for i in range(len(x_array)):
        prob[i] = 1 + 1 / x_array[i]
    

    where you should replace the equation above with the actual equation for calculating the normal distribution. The key thing is using the index value i to be able to calculate the probability at each value in x_array.

Creating and appending to lists

This is mostly a reminder of something we had seen back in Lesson 2 of the Geo-Python course. When you are calculating the values for the normal distribution, one option is to create an empty list and append the calculated values to the list, calculating one value for each age in an age list/array from 0-8000m by 8 m. In the example below we can do the same kind of thing with elevations from 0-8000m in steps of 1000m, assuming you have created the NumPy array number_array as shown the previous hint:

In [12]: dummyList = []

In [13]: for i in range(len(number_array)):
   ....:     dummyList.append(number_array[i]**2.0)
   ....: 

In [14]: print(dummyList)
[0.0, 1000000.0, 4000000.0, 9000000.0, 16000000.0, 25000000.0, 36000000.0, 49000000.0, 64000000.0]

As you can see, dummyList ends up with the same number of values as number_array (see previous hint), with one calculated value in dummyList for each corresponding value in number_array.

Adding columns to a DataFrame

Just a reminder that you can add new values to a pandas DataFrame by assiging values to a new column name in the DataFrame. If we consider the example below for a DataFrame called df you could add a new column called ‘Column 2’ by typing

df['Column 2'] = 1.0

This would create a new column called 'Column 2' that has all of the rows equal to 1.0.