Chapter 4 Python’s numpy and pandas

4.1 numpy

NumPy stands for Numerical Python, and it is a fundamental package for numerical analysis. See the documentation for the installation procedure; most likely you will install it with pip or conda.

Conventionally, this package is imported under the alias np, although you can choose any name you like:

import numpy as np

Here is how to turn a Python list into a numpy array:

import numpy as np

mydata = [1, 2, 3.5, -1] # a conventional python list
mydata
## [1, 2, 3.5, -1]
arr = np.array(mydata) # rewriting as numpy array
arr
## array([ 1. ,  2. ,  3.5, -1. ])

It doesn’t look much better or more convenient, mostly because one uses numpy for performance rather than convenience: an array stores its values in one compact, typed block of memory and typically needs less memory than a list.
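To make the memory point concrete, here is a rough sketch that compares the size of an array's data buffer with an estimate of the list's footprint. The variable names are illustrative, and the list estimate is only approximate (it counts every element object, and exact sizes depend on the Python build).

```python
import sys
import numpy as np

values = [1, 2, 3.5, -1]
arr = np.array(values)

# Each float64 element occupies 8 bytes in the array's data buffer.
bytes_per_element = arr.itemsize
buffer_bytes = arr.nbytes  # itemsize * number of elements

# A list stores references to separate Python objects, so a rough estimate
# of its footprint is the list itself plus every element object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

print(bytes_per_element, buffer_bytes, list_bytes)
```

For four values the difference is negligible in practice; it starts to matter for arrays with millions of elements.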

4.1.1 Array properties

Let’s see some properties of the array object.

arr.dtype
## dtype('float64')
arr.shape
## (4,)
arr.ndim
## 1

How about multidimensional data?

mydata2 = [[1, 2, 3.5, -1, 0], [9, 7, 4, 2.2, 5]]

arr2 = np.array(mydata2)
arr2
## array([[ 1. ,  2. ,  3.5, -1. ,  0. ],
##        [ 9. ,  7. ,  4. ,  2.2,  5. ]])
arr2.shape
## (2, 5)
np.reshape(arr2, (5, 2))
## array([[ 1. ,  2. ],
##        [ 3.5, -1. ],
##        [ 0. ,  9. ],
##        [ 7. ,  4. ],
##        [ 2.2,  5. ]])
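Reshaping is easy to confuse with transposing. The following sketch, using the same values as arr2, suggests the difference: reshape refills a new shape with the elements in their original row-by-row order, while the transpose swaps rows and columns. Neither changes the shape of the original array.

```python
import numpy as np

arr2 = np.array([[1, 2, 3.5, -1, 0], [9, 7, 4, 2.2, 5]])

reshaped = arr2.reshape(5, 2)   # method form of np.reshape(arr2, (5, 2))
inferred = arr2.reshape(-1, 2)  # -1 lets numpy work out the number of rows
transposed = arr2.T             # swaps rows and columns instead

print(reshaped.shape, inferred.shape, transposed.shape)
print(np.array_equal(reshaped, transposed))  # same shape, different layout
print(arr2.shape)                            # the original is still (2, 5)
```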

4.1.2 Element-wise operations

Just like in base R, arithmetic on arrays is element-wise:

arr2 * 2
## array([[ 2. ,  4. ,  7. , -2. ,  0. ],
##        [18. , 14. ,  8. ,  4.4, 10. ]])
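arr2[0] + arr[1] # arr[1] is the single value 2.0, which is added to every element of the first row
## array([3. , 4. , 5.5, 1. , 2. ])
arr2[0] > arr[1] # every element of the first row is compared with 2.0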
## array([False, False,  True, False, False])
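The last two examples combine a row with a single number. As a small sketch of the other common cases, the code below combines two rows of equal length element by element, and adds a one-dimensional array to every row of a two-dimensional one (numpy calls this broadcasting). The name offsets is purely illustrative.

```python
import numpy as np

arr2 = np.array([[1, 2, 3.5, -1, 0], [9, 7, 4, 2.2, 5]])

# Two rows of the same length are combined element by element.
row_sum = arr2[0] + arr2[1]
row_greater = arr2[0] > arr2[1]

# A 1-D array with one value per column is applied to every row.
offsets = np.array([10, 20, 30, 40, 50])
shifted = arr2 + offsets

print(row_sum)
print(row_greater)
print(shifted)
```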

4.1.3 Careful with changes

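It’s important to remember that numpy is designed to be memory efficient and avoids copying data unless asked to. Assigning an array to a new name does not create a copy: both names refer to the same object, so a change made through one name is visible through the other. The same holds for slices, which are views onto the original data. An independent array requires an explicit x.copy(). For example: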

x = np.array([1, 2, 3, 4, 5])
y = x
y[2] = 100
x
## array([  1,   2, 100,   4,   5])
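When the original has to stay intact, one option is an explicit copy. The sketch below contrasts a copy with a slice; the names independent and window are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

independent = x.copy()  # a real copy with its own data
independent[0] = -1     # x is not affected

window = x[1:3]         # a basic slice is a view onto x's data
window[0] = 200         # this also changes x[1]

print(x)
print(np.shares_memory(x, window), np.shares_memory(x, independent))
```

After running this, x holds 200 in its second position but still starts with 1, because only the slice shares memory with it.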

4.1.4 Indexing

Indexing in arrays is similar to lists and uses square brackets. In two-dimensional arrays, you specify the row first, then the column.

arr2
## array([[ 1. ,  2. ,  3.5, -1. ,  0. ],
##        [ 9. ,  7. ,  4. ,  2.2,  5. ]])
arr2[0, 2]
## np.float64(3.5)
arr2[0][2]
## np.float64(3.5)
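Beyond single elements, the same bracket syntax accepts slices and boolean masks. A brief sketch with the values of arr2 (variable names are illustrative):

```python
import numpy as np

arr2 = np.array([[1, 2, 3.5, -1, 0], [9, 7, 4, 2.2, 5]])

third_column = arr2[:, 2]      # every row, column 2
part_of_row = arr2[1, 1:4]     # row 1, columns 1 to 3 (the end is excluded)
large_values = arr2[arr2 > 3]  # boolean mask; the result is one-dimensional

print(third_column)
print(part_of_row)
print(large_values)
```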

4.1.5 Functions

Numpy has some useful functions for arrays, e.g.,

x = [1, 2, 3, 4, 5]

np.mean(x)
## np.float64(3.0)
np.sum(x)
## np.int64(15)
np.sin(x)
## array([ 0.84147098,  0.90929743,  0.14112001, -0.7568025 , -0.95892427])

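These functions can also be used for two-dimensional arrays; just don’t forget to specify the axis. With axis = 0 the operation runs down the rows and returns one value per column; with axis = 1 it runs across the columns and returns one value per row. Without an axis, all elements are used.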

x = np.array([[1, 2, 3.5, -1, 0], [9, 7, 4, 2.2, 5]])

np.sum(x)
## np.float64(32.7)
np.sum(x, axis = 0)
## array([10. ,  9. ,  7.5,  1.2,  5. ])
np.sum(x, axis = 1)
## array([ 5.5, 27.2])
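A quick way to check which axis was collapsed is to look at the shape of the result. The sketch below applies two other reductions to the same array; the variable names are illustrative.

```python
import numpy as np

x = np.array([[1, 2, 3.5, -1, 0], [9, 7, 4, 2.2, 5]])

column_means = np.mean(x, axis=0)  # collapses the rows: one value per column
row_maxima = np.max(x, axis=1)     # collapses the columns: one value per row

print(x.shape, column_means.shape, row_maxima.shape)
print(column_means)
print(row_maxima)
```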

4.2 pandas

Pandas is usually among the first packages a data scientist loads in a Python script. Again, you will need to use pip or conda to install pandas. Conventionally, we load this package under the alias pd:

import pandas as pd

4.2.1 Series

A Series in pandas is similar to a one-dimensional array in numpy, and you can think of it as one column in a data table. For example, imagine a community with 2 Carolina Chickadees, 5 American Goldfinches, 3 American Robins, 1 American Crow, 1 House Finch, and 1 Cooper’s Hawk.

abundances = pd.Series([2, 5, 3, 1, 1, 1])
abundances
## 0    2
## 1    5
## 2    3
## 3    1
## 4    1
## 5    1
## dtype: int64

On its own, the object above is not very helpful: we don’t know which species these values correspond to! We can fix this by supplying an index of names.

abundances = pd.Series([2, 5, 3, 1, 1, 1], index = ["CACH", "AMGO", "AMRO", "AMCR", "HOFI", "COHA"])
abundances
## CACH    2
## AMGO    5
## AMRO    3
## AMCR    1
## HOFI    1
## COHA    1
## dtype: int64

We can reference these values or their subsets by position, as in a regular list, or by index name:

abundances[:3]
## CACH    2
## AMGO    5
## AMRO    3
## dtype: int64
abundances["AMRO"]
## np.int64(3)
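Plain square brackets accept both positions and labels, which can become ambiguous. As a sketch of a more explicit alternative, pandas provides .iloc for positions and .loc for labels:

```python
import pandas as pd

abundances = pd.Series(
    [2, 5, 3, 1, 1, 1],
    index=["CACH", "AMGO", "AMRO", "AMCR", "HOFI", "COHA"],
)

first_three = abundances.iloc[:3]           # by position
robin = abundances.loc["AMRO"]              # by a single label
finches = abundances.loc[["AMGO", "HOFI"]]  # by a list of labels

print(first_three)
print(robin)
print(finches)
```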

Index names act as keys for the values, so it is best to keep them unique, although pandas does not strictly enforce this. In that sense, named series are similar to basic Python dictionaries:

pd.Series({"CACH" : 2, "AMGO" : 5, "AMRO" : 3, "AMCR" : 1, "HOFI" : 1, "COHA" : 1})
## CACH    2
## AMGO    5
## AMRO    3
## AMCR    1
## HOFI    1
## COHA    1
## dtype: int64
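The dictionary analogy goes a little further. The sketch below (with illustrative variable names) shows that a membership test on a series checks its index, much like checking the keys of a dictionary, and that a series can be converted back to a dictionary.

```python
import pandas as pd

counts = {"CACH": 2, "AMGO": 5, "AMRO": 3, "AMCR": 1, "HOFI": 1, "COHA": 1}
abundances = pd.Series(counts)

has_robin = "AMRO" in abundances      # "in" tests the index labels, not the values
labels = list(abundances.index)       # labels keep the order of the dictionary keys
back_to_dict = abundances.to_dict()   # convert the series back to a dictionary

print(has_robin)
print(labels)
print(back_to_dict)
```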

We can also index a series based on its values, e.g., return only those values that are at or above some threshold:

abundances[abundances >= 2]
## CACH    2
## AMGO    5
## AMRO    3
## dtype: int64
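Conditions can also be combined. A minimal sketch, assuming the same abundances as above: each condition goes in its own parentheses and they are joined with & (and) or | (or) rather than Python's and/or. The last line uses element-wise arithmetic to turn counts into proportions.

```python
import pandas as pd

abundances = pd.Series(
    [2, 5, 3, 1, 1, 1],
    index=["CACH", "AMGO", "AMRO", "AMCR", "HOFI", "COHA"],
)

# Species with at least 2 but fewer than 5 individuals.
mid_range = abundances[(abundances >= 2) & (abundances < 5)]

# Species observed exactly once.
singletons = abundances[abundances == 1]

# Share of each species in the whole community.
proportions = abundances / abundances.sum()

print(mid_range)
print(singletons)
print(proportions)
```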