Chapter 4 Python’s numpy and pandas
4.1 numpy
NumPy stands for Numerical Python, and it is the basic package for numerical analysis in Python. See the documentation for the installation procedure; most likely you will install it with pip or conda.
Conventionally, this package is imported under the alias np in scripts (import numpy as np), but of course you can call it whatever you want.
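The central object in numpy is the array. Below, a plain Python list is printed first, followed by the same values converted to a numpy array: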
## [1, 2, 3.5, -1]
## array([ 1. , 2. , 3.5, -1. ])
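The two outputs above are consistent with building a plain list and passing it to np.array. A minimal sketch follows; the variable names `values` and `a` are assumptions, not taken from the original:

```python
import numpy as np

# A plain Python list mixing ints and a float (the name `values` is hypothetical)
values = [1, 2, 3.5, -1]

# np.array converts the list; mixed int/float input is upcast to float64
a = np.array(values)

print(values)    # the list keeps its original ints and float
print(repr(a))   # the array shows every element as a float
```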
The array doesn’t look much better or more convenient than the list. That is mostly because one uses numpy for performance (arrays store elements of a single type compactly, so they use less memory) rather than for convenience.
4.1.1 Array properties
Let’s see some properties of the array object.
## dtype('float64')
## (4,)
## 1
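The three values above match the dtype, shape, and ndim attributes of a one-dimensional float array. A small sketch, assuming the array is the four-element one created earlier (the name `a` is an assumption):

```python
import numpy as np

a = np.array([1, 2, 3.5, -1])

print(a.dtype)  # element type shared by all values: float64
print(a.shape)  # length along each dimension: (4,)
print(a.ndim)   # number of dimensions: 1
```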
How about multidimensional data?
## array([[ 1. , 2. , 3.5, -1. , 0. ],
## [ 9. , 7. , 4. , 2.2, 5. ]])
## (2, 5)
## array([[ 1. , 2. ],
## [ 3.5, -1. ],
## [ 0. , 9. ],
## [ 7. , 4. ],
## [ 2.2, 5. ]])
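The outputs above look like a two-row array, its shape, and the same data reshaped to five rows and two columns. A sketch under that assumption (the names `a2` and `reshaped` are hypothetical):

```python
import numpy as np

# A nested list becomes a two-dimensional array: 2 rows, 5 columns
a2 = np.array([[1, 2, 3.5, -1, 0],
               [9, 7, 4, 2.2, 5]])
print(a2.shape)  # (2, 5)

# reshape keeps the elements in row-major order and only changes the layout
reshaped = a2.reshape(5, 2)
print(reshaped)
```

Note that reshape requires the total number of elements to stay the same (here 2 × 5 = 5 × 2 = 10).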
4.1.2 Element-wise operations
Just as in base R, arithmetic on arrays is element-wise:
## array([[ 2. , 4. , 7. , -2. , 0. ],
## [18. , 14. , 8. , 4.4, 10. ]])
## array([3. , 4. , 5.5, 1. , 2. ])
## array([False, False, True, False, False])
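The original expressions are not shown, but the three outputs above are consistent with multiplying the whole array by 2, adding 2 to its first row, and comparing the first row against a threshold. The sketch below assumes exactly that; the threshold 3 and all variable names are assumptions:

```python
import numpy as np

a2 = np.array([[1, 2, 3.5, -1, 0],
               [9, 7, 4, 2.2, 5]])

doubled = a2 * 2      # every element multiplied by 2
shifted = a2[0] + 2   # 2 added to each element of the first row
mask = a2[0] > 3      # element-wise comparison yields a boolean array

print(doubled)
print(shifted)
print(mask)
```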
4.1.3 Careful with changes
It’s important to remember that operations on numpy objects are designed to be memory efficient. Therefore, if you modify a slice of an array, or another variable that refers to the same array, the change is reflected in the original object (use the .copy() method when you need an independent array):
## array([ 1, 2, 100, 4, 5])
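A minimal sketch of the behavior behind the output above: a slice is a view onto the same memory, so writing to it changes the original, while an explicit .copy() is independent. The variable names here are assumptions:

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])

# Slicing returns a view, not a new array
subset = original[1:4]
subset[1] = 100           # writes through to original[2]
print(original)           # the third element is now 100

# An explicit copy owns its own data
safe = original.copy()
safe[0] = -1              # original[0] is left untouched
print(original[0], safe[0])
```

np.shares_memory(original, subset) can be used to check whether two arrays are backed by the same data.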
4.1.4 Indexing
By the way, arrays are indexed with square brackets, just like lists, and counting starts at 0. In two-dimensional arrays, you specify the row first, then the column.
## array([[ 1. , 2. , 3.5, -1. , 0. ],
## [ 9. , 7. , 4. , 2.2, 5. ]])
## np.float64(3.5)
## np.float64(3.5)
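Both 3.5 values above can be obtained by picking row 0, column 2, either with a single pair of brackets or with two chained ones. A sketch assuming that is what was run (the names `x` and `y` are hypothetical):

```python
import numpy as np

a2 = np.array([[1, 2, 3.5, -1, 0],
               [9, 7, 4, 2.2, 5]])

x = a2[0, 2]    # row 0, column 2 in one indexing step
y = a2[0][2]    # first take row 0, then element 2 of that row

print(x, y)
```

The comma form is usually preferred because it avoids creating the intermediate row.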
4.1.5 Functions
Numpy has some useful functions for arrays, e.g.,
## np.float64(3.0)
## np.int64(15)
## array([ 0.84147098, 0.90929743, 0.14112001, -0.7568025 , -0.95892427])
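The outputs above are what np.mean, np.sum, and np.sin return for the integers 1 through 5. A sketch under that assumption (the array name `b` is hypothetical):

```python
import numpy as np

b = np.array([1, 2, 3, 4, 5])

average = np.mean(b)   # arithmetic mean of all elements
total = np.sum(b)      # sum of all elements
sines = np.sin(b)      # sine applied element-wise (input in radians)

print(average, total)
print(sines)
```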
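These functions can also be used for two-dimensional arrays. Without an axis argument they aggregate over all elements. With axis=0 they aggregate down the rows and return one value per column; with axis=1 they aggregate across the columns and return one value per row.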
## np.float64(32.7)
## array([10. , 9. , 7.5, 1.2, 5. ])
## array([ 5.5, 27.2])
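The three results above match np.sum applied to the two-row array with no axis, with axis=0, and with axis=1. A short sketch, with variable names chosen here for illustration:

```python
import numpy as np

a2 = np.array([[1, 2, 3.5, -1, 0],
               [9, 7, 4, 2.2, 5]])

grand_total = np.sum(a2)          # all ten elements
per_column = np.sum(a2, axis=0)   # collapses the rows: 5 values
per_row = np.sum(a2, axis=1)      # collapses the columns: 2 values

print(grand_total)
print(per_column)
print(per_row)
```

A useful way to remember it: the axis you pass is the one that disappears from the shape of the result.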
4.2 pandas
Pandas is often the first package a data scientist loads in a Python script. Again, you will need to use pip or conda to install pandas. Conventionally, we load this package under the alias pd (import pandas as pd).
4.2.1 Series
A Series in pandas is similar to an array in numpy, and you can imagine it as one column of a data table. For example, suppose we have a community with 2 Carolina Chickadees, 5 American Goldfinches, 3 American Robins, 1 American Crow, 1 House Finch, and 1 Cooper’s Hawk.
## 0 2
## 1 5
## 2 3
## 3 1
## 4 1
## 5 1
## dtype: int64
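The output above is what pd.Series gives when it is passed a plain list of counts without names. A minimal sketch (the variable name `counts` is an assumption):

```python
import pandas as pd

# Counts of the six species, in the order listed in the text
counts = pd.Series([2, 5, 3, 1, 1, 1])

# With no index given, pandas labels the values 0, 1, 2, ...
print(counts)
```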
On its own, the object above is not too helpful: we just don’t know which species these values correspond to! We can fix that by supplying an index of names.
abundances = pd.Series([2, 5, 3, 1, 1, 1], index = ["CACH", "AMGO", "AMRO", "AMCR", "HOFI", "COHA"])
abundances
## CACH 2
## AMGO 5
## AMRO 3
## AMCR 1
## HOFI 1
## COHA 1
## dtype: int64
We can reference these values or their subsets by position, as in a regular list, or by index name:
## CACH 2
## AMGO 5
## AMRO 3
## dtype: int64
## np.int64(3)
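The two outputs above correspond to taking the first three entries by position and one entry by its label. The sketch below assumes those were the selections; it uses .iloc for the positional part because that makes the intent explicit, which may differ from the original code:

```python
import pandas as pd

abundances = pd.Series([2, 5, 3, 1, 1, 1],
                       index=["CACH", "AMGO", "AMRO", "AMCR", "HOFI", "COHA"])

first_three = abundances.iloc[:3]   # positional slice, like a list
amro = abundances["AMRO"]           # lookup by index name

print(first_three)
print(amro)
```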
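Note that the index names should be unique: pandas does not forbid duplicate labels, but a lookup by a duplicated label then returns several values instead of one. With unique names, a named series is similar to a basic Python dictionary, and it can be created from one: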
## CACH 2
## AMGO 5
## AMRO 3
## AMCR 1
## HOFI 1
## COHA 1
## dtype: int64
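The series above can be built directly from a dictionary, with keys becoming index names and values becoming the data. A sketch assuming that construction (the names `abundance_dict` and `from_dict` are hypothetical):

```python
import pandas as pd

# Dictionary keys must be unique, which gives unique index names for free
abundance_dict = {"CACH": 2, "AMGO": 5, "AMRO": 3,
                  "AMCR": 1, "HOFI": 1, "COHA": 1}

from_dict = pd.Series(abundance_dict)
print(from_dict)
```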
We also can index series based on value, e.g., return only those values that are larger than some threshold:
## CACH 2
## AMGO 5
## AMRO 3
## dtype: int64
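The result above is consistent with keeping only species that have more than one individual. A sketch of such a filter; the threshold 1 and the name `common` are assumptions:

```python
import pandas as pd

abundances = pd.Series([2, 5, 3, 1, 1, 1],
                       index=["CACH", "AMGO", "AMRO", "AMCR", "HOFI", "COHA"])

# The comparison produces a boolean series; indexing with it keeps the True rows
common = abundances[abundances > 1]
print(common)
```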