# Pandas for CIS 1100

This is an abridged version of the documentation for `pandas`

that should contain most of the functions you will need to use for the course. We will leave out certain information (e.g., additional keyword arguments) that aren’t relevant to the course. For more complete and accurate information, refer to the official documentation.

**A couple notes while reading …**

- Throughout this guide, we will refer to
`pandas`

as`pd`

, a generic`DataFrame`

as`df`

, and a generic`Series`

as`s`

. This is important, as certain functions can only be called at certain levels (library vs.`DataFrame`

vs.`Series`

). - Keyword arguments are optional and will have default values that we show.
- Where relevant, we will provide data types for parameters.
- Remember that most functions will return a
*copy*of the`DataFrame`

/`Series`

rather than modify the existing one.

## Cheat sheet

### Reading in data

Function | Description |
---|---|

`pd.read_csv(filepath, sep=',', index_col=0)` |
Reads in a csv file and turns it into a `DataFrame` . |

### Viewing data

Function | Description |
---|---|

`df.col` |
Accesses the column named `col` in `df` as a `Series` . |

`df['col']` |
Accesses the column named `col` in `df` as a `Series` . |

`df[['col1', 'col2', ...]]` |
Accesses a list of columns (here, `['col1', 'col2', ...]` ) as a `DataFrame` . |

`df.iloc[i]` |
Accesses row `i` in `df` as a `DataFrame` . |

`df[start:stop:step]` |
Accesses the rows `start` (inclusive) to `stop` (exclusive) using step size `step` . |

`df.head(n=5)` |
Shows the first `n` rows. |

`df.tail(n=5)` |
Shows the last `n` rows. |

### Modifying data

Function | Description |
---|---|

`df.rename(columns=None)` |
Renames columns of `df` with a given dictionary mapping old names to new names. |

`df.drop(columns=None)` |
Deletes (drops) the specified columns. |

`df.dropna()` |
Deletes (drops) all rows that have at least one missing value. |

`df.fillna(value=0)` |
Replaces all missing values with `value` . |

### Describing data

Function | Description |
---|---|

`df.shape` |
A tuple of the dimensions of `df` of the form `(num_rows, num_cols)` . |

`df.info()` |
Prints out summary information about a `DataFrame` , including the number of columns, non-null values, and datatypes. |

`df.dtypes` |
A `Series` containing the data type of each column. |

`df.columns` |
A `Series` containing all column labels. |

`df.describe()` |
Generates a `DataFrame` with descriptive statistics (count, mean, standard deviation, median, quartiles, etc.). |

`df.empty` |
A `bool` that indicates whether an entire `DataFrame` (or `Series` ) is empty. |

`df.count()` |
Counts the number of non-null elements in each column. |

`df.value_counts(normalize=False)` |
Counts the number of each value that appears in each column. |

`df.nunique()` |
Counts the number of unique values in each column. |

`df.isna()` |
Checks if each cell in a `DataFrame` is null and returns a boolean index. |

`df.min(axis=0)` |
Finds the minimum of each row or column. |

`df.max(axis=0)` |
Finds the maximum of each row or column. |

`df.std(axis=0)` |
Finds the standard deviation of each row or column. |

`df.mean(axis=0)` |
Finds the mean/average of each row or column. |

`df.median(axis=0)` |
Finds the median of each row or column. |

`df.sum(axis=0)` |
Finds the sum of each row or column. |

### Strings (and other sequences)

Function | Description |
---|---|

`s.str.len()` |
Gets the length of each element (this works on lists and other sequences too). |

`s.str.upper()` |
Makes a `Series` uppercase. |

`s.str.lower()` |
Makes a `Series` lowercase. |

`s.str.get(i)` |
Extracts the `i` th index of each element. |

`s.str.strip()` |
Removes whitespace at the beginning and end of every element. |

`s.str.split(sep=None)` |
Divides `str` into a list divided by `sep` . |

`s.str.contains(pat)` |
Checks if each element contains a given string `pat` . |

`s.str.count(pat)` |
Counts the number of times each element contains a given string `pat` . |

`s.str.replace(pat, repl, regex=False)` |
Replaces all instances of `pat` with `repl` . |

`df.explode(column)` |
For list-like columns, separates each value of each list into its own row, copying other elements. |

### Dates and times

Function | Description |
---|---|

`pd.to_datetime(arg, format=None)` |
Converts `arg` into a `pandas` `datetime` object, which makes it easier to work with dates and times. |

### Plotting

Function | Description |
---|---|

`df.plot(x=None, y=None, kind='line', title=None, xlabel=None, ylabel=None)` |
Plots the given `DataFrame` or `Series` . |

## Details

### Reading in data

`pd.read_csv(filepath, sep=',', index_col=0)`

Reads in a csv file and turns it into a `DataFrame`

.

**Parameters:**

`filepath`

(`str`

): File path to csv (make sure to include folders!)`index_col`

(`False`

,`str`

, or`int`

): Specifies the index with the column name or index. We primarily use`index_col=False`

to not include an index at all.

**Returns:** New `DataFrame`

from csv.

### Viewing data

`df.col`

Accesses the column named `col`

in `df`

as a `Series`

.

**Note:** `col`

must be a proper name in Python (i.e., you should be able to name a variable `col`

—meaning no spaces, can’t start with a number, etc.).

`df['col']`

Accesses the column named `col`

in `df`

as a `Series`

.

**Note:** Unlike the `df.col`

syntax, `col`

here just needs to be a valid string.

`df[['col1', 'col2', ...]]`

Accesses a list of columns (here, `['col1', 'col2', ...]`

) as a `DataFrame`

.

`df.iloc[i]`

Accesses row `i`

in `df`

as a `DataFrame`

.

**Note:** `df[i]`

does **not** do the same thing.

`df[start:stop:step]`

Accesses the rows `start`

(inclusive) to `stop`

(exclusive) using step size `step`

. This behaves the same way as ordinary sequence slicing (e.g., `step`

is optional).

`df.head(n=5)`

Shows the first `n`

rows.

**Parameters:**

`n`

(`int`

): Number of rows to show

**Returns:** New `DataFrame`

with specified rows.

`df.tail(n=5)`

Shows the last `n`

rows.

**Parameters:**

`n`

(`int`

): Number of rows to show

**Returns:** New `DataFrame`

with specified rows.

### Modifying data

`df.rename(columns=None)`

Renames columns of `df`

with a given dictionary mapping old names to new names.

**Parameters:**

`columns`

(`Dict[str, str]`

): A dictionary mapping old names (usually strings) to new names.

**Returns:** New `DataFrame`

with renamed columns.

**Notes:**

- By default, keys in
`columns`

that are not actual columns in`df`

will be ignored. - Although
`columns`

is a keyword argument, for our purposes it is required.

`df.drop(columns=None)`

Deletes (drops) the specified columns.

**Parameters:**

`columns`

(`List[str]`

): A list of column names to drop.

**Returns:** New `DataFrame`

without dropped columns.

**Notes:**

- By default, if you try to drop a nonexistent column, it will throw an error.
- Although
`columns`

is a keyword argument, for our purposes it is required.

`df.dropna()`

Deletes (drops) all rows that have at least one missing value.

**Returns:** New `DataFrame`

without dropped rows.

`df.fillna(value=0)`

Replaces all missing values with `value`

.

**Parameters:**

`value`

: Value to replace missing values with. We will primarily have this be a scalar (`str`

,`int`

, etc.) but advanced uses can assign different values to different columns, etc.

**Returns:** New `DataFrame`

with filled values.

### Describing data

`df.shape`

A tuple of the dimensions of `df`

of the form `(num_rows, num_cols)`

.

`df.info()`

Prints out summary information about a `DataFrame`

, including the number of columns, non-null values, and datatypes.

**Returns:** `None`

`df.dtypes`

A `Series`

containing the data type of each column.

`df.columns`

A `Series`

containing all column labels.

`df.describe()`

Generates a `DataFrame`

with descriptive statistics (count, mean, standard deviation, median, quartiles, etc.).

**Returns:** A `Series`

containing descriptive statistics for each column.

**Note:** Also works with `Series`

.

`df.empty`

A `bool`

that indicates whether an entire `DataFrame`

(or `Series`

) is empty.

`df.count()`

Counts the number of non-null elements in each column.

**Returns:** A `Series`

with the non-null element counts for each column.

`df.value_counts(normalize=False)`

Counts the number of each value that appears in each column.

**Parameters:**

`normalize`

(`bool`

): Whether to use raw counts (an integer number) or the proportion of values (a value from 0 to 1).

**Returns:** A `Series`

containing the value counts for each column.

**Notes:**

- By default the method sorts values in descending order.
- For our purposes, it’s easiest if you only call this on a
`Series`

, as calling it on a`DataFrame`

may result in multi-indexing, which we do not cover.

`df.nunique()`

Counts the number of unique values in each column.

**Returns:** A `Series`

with the number of unique values for each column.

`df.isna()`

Checks if each cell in a `DataFrame`

is null and returns a boolean index.

**Returns:** A `DataFrame`

of the same shape as `df`

, where each value is `True`

if the corresponding value in `df`

is `NaN`

and `False`

otherwise.

`df.min(axis=0)`

Finds the minimum of each row or column.

**Parameters:**

`axis`

(`0`

or`1`

): The axis to find the minimum along.`0`

means finding the minimum of each column.`1`

means finding the minimum of each row.

**Returns:** A scalar (single number) if applied on a `Series`

or a `Series`

with the minimum for each row/column if applied on a `DataFrame`

.

`df.max(axis=0)`

Finds the maximum of each row or column.

**Parameters:**

`axis`

(`0`

or`1`

): The axis to find the maximum along.`0`

means finding the maximum of each column.`1`

means finding the maximum of each row.

**Returns:** A scalar (single number) if applied on a `Series`

or a `Series`

with the maximum for each row/column if applied on a `DataFrame`

.

`df.std(axis=0)`

Finds the standard deviation of each row or column.

**Parameters:**

`axis`

(`0`

or`1`

): The axis to find the standard deviation along.`0`

means finding the standard deviation of each column.`1`

means finding the standard deviation of each row.

**Returns:** A scalar (single number) if applied on a `Series`

or a `Series`

with the standard deviation for each row/column if applied on a `DataFrame`

.

`df.mean(axis=0)`

Finds the mean/average of each row or column.

**Parameters:**

`axis`

(`0`

or`1`

): The axis to find the mean along.`0`

means finding the mean of each column.`1`

means finding the mean of each row.

**Returns:** A scalar (single number) if applied on a `Series`

or a `Series`

with the mean for each row/column if applied on a `DataFrame`

.

`df.median(axis=0)`

Finds the median of each row or column.

**Parameters:**

`axis`

(`0`

or`1`

): The axis to find the median along.`0`

means finding the median of each column.`1`

means finding the median of each row.

**Returns:** A scalar (single number) if applied on a `Series`

or a `Series`

with the median for each row/column if applied on a `DataFrame`

.

`df.sum(axis=0)`

Finds the sum of each row or column.

**Parameters:**

`axis`

(`0`

or`1`

): The axis to find the sum along.`0`

means finding the sum of each column.`1`

means finding the sum of each row.

**Returns:** A scalar (single number) if applied on a `Series`

or a `Series`

with the sum for each row/column if applied on a `DataFrame`

.

### Strings (and other sequences)

`s.str.len()`

Gets the length of each element (this works on lists and other sequences too).

**Returns:** A `Series`

containing the length of each element.

`s.str.upper()`

Makes a `Series`

uppercase.

**Returns:** A new `Series`

with every element uppercase.

**Note:** Analogous to `str.upper()`

`s.str.lower()`

Makes a `Series`

lowercase.

**Returns:** A new `Series`

with every element lowercase.

**Note:** Analogous to `str.lower()`

`s.str.strip()`

Removes whitespace at the beginning and end of every element.

**Returns:** A new `Series`

with every element stripped.

**Note:** Analogous to `str.strip()`

`s.str.split(sep)`

Divides `str`

into a list divided by `sep`

.

**Parameters:**

`sep`

(`str`

): The separator string to divide each element using. If this is not provided (i.e.,`None`

) the default is to split by whitespace. Note that there are also options to make this a regex.

**Returns:** A new `Series`

of lists containing the divided `str`

s.

**Note:** Analogous to `str.split(sep)`

`s.str.get(i)`

Extracts the `i`

th index of each element.

**Parameters:**

`i`

(`int`

): The index to extract.

**Returns:** A new `Series`

with just the `i`

th element of each original sequence.

**Note:** Analogous to `str[i]`

(but importantly *not* slicing).

`s.str.contains(pat)`

Checks if each element contains a given string `pat`

.

**Parameters:**

`pat`

(`str`

): String to check membership of. Note that this can be a “regex,” which you don’t need to know about but may appear occasionally in the course.

**Returns:** A `Series`

of `bool`

s that indicate whether each element contains `pat`

.

`s.str.count(pat)`

Counts the number of times each element contains a given string `pat`

.

**Parameters:**

`pat`

(`str`

): String to count. Note that this can be a “regex,” which you don’t need to know about but may appear occasionally in the course.

**Returns:** A `Series`

of counts that `pat`

appears in each element.

`s.str.replace(pat, repl, regex=False)`

Replaces all instances of `pat`

with `repl`

.

**Parameters:**

`pat`

(`str`

): Pattern to find.`repl`

(`str`

): String to replace`pat`

with.`regex`

(`bool`

): Determines whether`pat`

and`repl`

can be regex expressions. Unless we explicitly tell you otherwise, you can leave this alone.

**Returns:** A new `Series`

with replaced values.

`df.explode(column)`

For list-like columns, separates each value of each list into its own row, copying other elements.

**Parameters:**

`column`

(`str`

or`List[str]`

): Either the column name to “explode” or a list of column names to explode. For the most part, stick to just a single column.

**Returns:** A `DataFrame`

with exploded rows.

**Note:** If applied to a `Series`

, you don’t need to specify any parameters.

### Dates and times

`pd.to_datetime(arg, format=None)`

Converts `arg`

into a `pandas`

`datetime`

object, which makes it easier to work with dates and times.

**Parameters:**

`arg`

(`int`

,`str`

,`Series`

,`DataFrame`

, etc.): The thing to convert into a`datetime`

object. While`arg`

can be a wide range of types, we will primarily use it as a`Series`

.`format`

(`str`

): How you want to format the`datetime`

object. For the most part, you should use`'mixed'`

, which tries to automatically guess how the dates/times are formatted in the original data. There are more advanced options available.

**Returns:** The same type as `arg`

(e.g., a `Series`

), with dates converted into `datetime`

objects.

### Plotting

`df.plot(x=None, y=None, kind='line', title=None, xlabel=None, ylabel=None)`

Plots the given `DataFrame`

or `Series`

.

**Parameters:**

`x`

(label like`int`

or`str`

): The variable to use for the $x$-axis. By default, this is the index.`y`

(label or list of labels): The variable(s) to use for the $y$-axis. By default, this is all numerical columns.`kind`

(`str`

): Type of plot. A full list of options is provided below.`title`

(`str`

): Title for the plot.`xlabel`

(`str`

): Label for the $x$-axis.`ylabel`

(`str`

): Label for the $y$-axis.

**Returns:** The “plot” object (don’t worry about the specifics).

**Note:** There are *many* more optional parameters that you can use to modify the plot. Refer to the official documentation for this.

**Types of plots:**

`'line'`

: line plot (default)`'bar'`

: vertical bar plot`'barh'`

: horizontal bar plot`'hist'`

: histogram`'box'`

: boxplot`'kde'`

: Kernel Density Estimation plot`'density'`

: same as ‘kde’`'area'`

: area plot`'pie'`

: pie plot`'scatter'`

: scatter plot (DataFrame only)`'hexbin'`

: hexbin plot (DataFrame only)