Useful Python packages for science
I’ve done 5 years of science research and development using mainly Python, and have used the major scientific computing libraries. While this topic may seem intimidating, the actual usage of each library is typically pretty brief (a few lines of code), so you don’t need to be an expert in each one; you just need a general sense of what each library does so you can reach for it when you need it. Here’s a brief overview with some of the most common code explained.
Installation
All of these are heavily used scientific computing Python packages, so they’re available through pip or conda (the two main Python package managers).
- Pip or Conda? I use pip as much as I can since it’s simple, and then I use Conda when I can’t find something on pip.
pip install numpy pandas matplotlib
# Just type em out, no need for commas

conda install -c conda-forge numpy pandas matplotlib
# Same, but specify conda-forge for the channel (-c) because cf is cool.
For arrays: Numpy
Numpy is the library you use to turn data into arrays in Python. It is open source on Github. You can use these arrays to perform linear algebra. Before PyTorch/TensorFlow and tensors came along, Numpy was the only good way to create arrays in Python. Now, for AI, you’ll probably use tensors more, but arrays are still great to know how to use.
General Usage
import numpy as np
# Everyone imports numpy as ’np’ for efficiency lol
Creating Arrays
- A Numpy array is called an ndarray
- An ndarray is processed much faster than a normal list
- An array holds one type of data, e.g. floats, so it can’t store stuff like words and numbers together. For that, use a Pandas dataframe.
- Once created, the size of the array can’t change, but the items within the array are mutable, i.e. you can change them
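To make those properties concrete, here’s a minimal sketch (the variable names are just examples):

import numpy as np

a = np.array([1.0, 2.0, 3.0])  # one dtype for the whole array
print(a.dtype)                 # float64

a[0] = 9.5                     # the items are mutable...
print(a)                       # [9.5 2.  3. ]

b = np.append(a, 4.0)          # ...but the size is fixed, so "appending" builds a new array
print(b)                       # [9.5 2.  3.  4. ]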
a1D = np.array([1, 2, 3, 4])
# Creates a 1D array… and so on

a2D = np.array([[1, 2], [3, 4]])
# Create a 2D array

a3D = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
# Create a 3D array

- You can create arrays using np.ones(), np.zeros(), np.arange(), np.linspace(), np.eye(), and more functions for specific behaviors; a few of these are sketched below.
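Here’s a quick sketch of what those helpers give you (outputs shown in the comments):

import numpy as np

np.ones((2, 3))        # 2x3 array filled with 1.0
np.zeros(5)            # array([0., 0., 0., 0., 0.])
np.arange(0, 10, 2)    # array([0, 2, 4, 6, 8]), like range() but returns an array
np.linspace(0, 1, 5)   # array([0., 0.25, 0.5, 0.75, 1.]), 5 evenly spaced points
np.eye(3)              # 3x3 identity matrix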
Indexing Arrays
- a1D[:3] returns array([1, 2, 3]), the 0th-2nd elements of the array
- a2D[0, 1] returns 2, the element in row 0, column 1
- a2D.shape gives (2, 2), the dimensions of the array
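Putting those together into a runnable snippet:

import numpy as np

a1D = np.array([1, 2, 3, 4])
a2D = np.array([[1, 2], [3, 4]])

print(a1D[:3])    # [1 2 3]
print(a2D[0, 1])  # 2
print(a2D.shape)  # (2, 2)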
For loading data: Pandas
Pandas is a great all-around data analysis toolkit for scientific computing. Free and open-source on Github. I pretty much use it to load external files. For AI, this is crucial, since you’re often loading external datasets. It also does a great job at a variety of tasks like handling missing data, merging datasets together, converting other Python/Numpy datasets into Pandas Dataframes, and more, but I mostly use the same three lines.
It’s pretty easy to install Pandas: since it’s a heavily used Python package, it’s available through pip and conda.
General Usage
import pandas as pd
# Are you seeing a theme?

data = {'column1': [1, 2, 3], 'column2': [4, 5, 6]}
# An example dict

df_dict = pd.DataFrame(data)
# Create a dataframe (df) from the data dictionary declared above

df_csv = pd.read_csv('file.csv')
# Create a df from a csv file

df_xlsx = pd.read_excel('file.xlsx')
# Create a df from an Excel file

df['column1']
# Select the data in column1

df[['column1', 'column2']]
# Select the data in columns 1 and 2

- You can do a lot with selecting which columns and rows you want, omitting data, replacing bad data with NaNs, and handling missing data

df.isnull()
# Check if there are any NULL values in the df

df.fillna(0, inplace=True)
# Replace missing values with a 0

df.to_csv('filename.csv', index=False)
# Save the data into a file of your choice
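The intro also mentions merging datasets together, which the snippets above don’t show. Here’s a minimal sketch using pd.merge (the column names are hypothetical):

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'temp': [20.1, 21.5, 19.8]})
right = pd.DataFrame({'id': [1, 2, 3], 'pressure': [101, 99, 100]})

merged = pd.merge(left, right, on='id')  # join the two dfs on the shared 'id' column
print(merged)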
For scientific algorithms: Scipy
Scipy is a package built on Numpy that adds a lot of built-in functions for integration, optimization, interpolation, signal processing, linear algebra, Fourier transforms, eigenvalue problems, and more. Yes, as you can already see, there is some overlap between the different packages. Free and open source on Github.
Scipy isn’t an everyday sort of package; you’ll use it more for specific scenarios, so there’s not really any code you should memorize. Instead, just be aware of what kinds of things Scipy can do, and know that you don’t need to reinvent the wheel every time you want to integrate or optimize an equation.
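For a taste of what that looks like, here’s a minimal sketch using scipy.integrate.quad and scipy.optimize.minimize_scalar (the functions being integrated and minimized are just examples):

import numpy as np
from scipy import integrate, optimize

area, error = integrate.quad(np.sin, 0, np.pi)             # integrate sin(x) from 0 to pi
print(area)                                                # ~2.0

result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)  # minimize (x - 3)^2
print(result.x)                                            # ~3.0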
For making visualizations: Matplotlib
The go-to plotting tool for making data visualizations. Free and open source on Github. You can make the following types of plots:
- Pairwise: plot(x,y), scatter(x,y), bar(x,height), stem(x,y), fill_between(x,y1,y2), stackplot(x,y), stairs(values)
- Statistical: hist(x), boxplot(x), errorbar(x,y,yerr,xerr), violinplot(D), eventplot(D), hist2d(x,y), hexbin(x,y,C), pie(x), ecdf(x)
- Gridded: imshow(Z), pcolormesh(X,Y,Z), contour(X,Y,Z), contourf(X,Y,Z), barbs(X,Y,U,V), quiver(X,Y,U,V), streamplot(X,Y,U,V)
- Plus irregularly gridded data, 3D plots, and volumetric data
General Usage
import matplotlib.pyplot as plt
# This one’s a mouthful, but you’ll typically see “plt” everywhere

fig, ax = plt.subplots()
# Create the figure

ax.plot([1, 2, 3, 4], [1, 4, 2, 3])
# Plot some data on the Axes

plt.show()
# Show the figure
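The same pattern works for the other plot types listed above. As a quick sketch (the data here is random, just for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(x, y)     # pairwise: scatter plot
ax2.hist(x, bins=20)  # statistical: histogram
plt.show()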
Editing the plot
[Image: matplotlib_commands.png]
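As a minimal sketch of a few common editing calls (axis labels, title, legend, grid, and saving); this is my own selection and may not match everything in the image:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 2, 3], label='my data')

ax.set_xlabel('x values')            # label the x axis
ax.set_ylabel('y values')            # label the y axis
ax.set_title('My plot')              # add a title
ax.legend()                          # show the labeled line in a legend
ax.grid(True)                        # add a background grid
fig.savefig('my_plot.png', dpi=300)  # save the figure to a file
plt.show()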
For more visualizations: Seaborn
Check out Seaborn if you need more advanced viz than matplotlib. I’ve never used it, but I’ve heard it’s cool.