pandas
What is pandas?
Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.
Series
Creating a Series
You can create a Series by passing a list or a dictionary to pd.Series().
The Series is a one-dimensional array holding data of any type.
label
The label is the index of the element in the Series.
using list
import pandas as pd
output
0    1
This label can be used to access a specified value.
print(myvar[0])
With the index argument, you can name your own labels.
myvar = pd.Series(a, index=["x", "y", "z"])
output
x    1
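As a minimal sketch of the steps above, the following builds a Series from a list, reads a value by its default label, and then assigns custom labels. The names a and myvar come from the text; the list values after the first, and the names labeled, first, and second, are made up for illustration.

```python
import pandas as pd

# Illustrative values; only the first value (1) appears in the output shown above.
a = [1, 7, 2]

# Without an index argument, the labels default to 0, 1, 2, ...
myvar = pd.Series(a)
first = myvar[0]          # label-based access with the default integer labels

# With the index argument, each value gets a custom label.
labeled = pd.Series(a, index=["x", "y", "z"])
second = labeled["y"]     # access by custom label

print(first, second)
```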
using dictionary
A dictionary stores key/value pairs.
When you pass a dictionary to pd.Series(), its keys become the labels of the Series.
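A short sketch of a Series built from a dictionary may help here. The dictionary contents and the variable names below are hypothetical and were chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: day names as keys, calorie counts as values.
calories = {"day1": 420, "day2": 380, "day3": 390}

# The dictionary keys become the labels of the Series.
by_day = pd.Series(calories)

# Passing index= keeps only the listed keys, in that order.
subset = pd.Series(calories, index=["day1", "day2"])

print(by_day["day2"])
print(list(subset.index))
```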
DataFrame
A DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.
As an example:
import pandas as pd
output
calories    [420, 380, 390]
Locate a row
Pandas uses the loc attribute to return one or more specified rows.
print(myvar2.loc[0])
output
calories    420
This example returns a Pandas Series.
Locate multiple rows
print(df.loc[[0, 1]])
output
   calories  duration
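The following sketch puts the DataFrame and loc examples together in one runnable piece. The calories values follow the output shown earlier; the duration values and the names row and rows are assumptions.

```python
import pandas as pd

# "calories" follows the output shown above; the "duration" values are assumed.
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)

row = df.loc[0]          # a single label returns a Series
rows = df.loc[[0, 1]]    # a list of labels returns a DataFrame

print(type(row).__name__, type(rows).__name__)
```

Note the double brackets in df.loc[[0, 1]]: the inner pair is a Python list of labels, which is why the result stays a DataFrame.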
Read CSV
Load the CSV into a DataFrame:
import pandas as pd
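A minimal sketch of loading CSV data might look like the following. To keep it self-contained, it reads from an in-memory string rather than a file such as data.csv; the column names and values are assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the column names and values are assumptions.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,
"""

# read_csv accepts a path such as "data.csv" or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.head())
```

The last row leaves Calories blank, so pandas stores it as a missing value (NaN).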
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
- Empty cells
- Data in wrong format
- Wrong data
- Duplicates
Empty Cells
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.
Example
Return a new DataFrame with no empty cells:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
By default, the dropna() method returns a new DataFrame and does not change the original.
If you want to change the original DataFrame, use the inplace = True argument. With dropna(inplace = True), no new DataFrame is returned; instead, all rows containing NULL values are removed from the original DataFrame.
Example
Remove all rows with NULL values:
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())
Another way of dealing with empty cells is to insert a new value instead.
The fillna() method allows us to replace empty cells with a value:
df.fillna(130, inplace = True)
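To replace empty values in only one column, select that column and assign the result back to it. Calling fillna() with inplace = True on a selected column is unreliable in recent pandas versions, because the change may not reach the original DataFrame:
df["Calories"] = df["Calories"].fillna(130)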
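The options for empty cells can be compared side by side in a small sketch. The DataFrame below is made up so that the example runs without data.csv; the fill value 130 is the one used in the text, and the names dropped, filled, and one_col are hypothetical.

```python
import numpy as np
import pandas as pd

# Small made-up DataFrame with one empty cell in each column.
df = pd.DataFrame({
    "Duration": [60, np.nan, 45],
    "Calories": [409.1, 479.0, np.nan],
})

dropped = df.dropna()      # new DataFrame; df itself is unchanged
filled = df.fillna(130)    # every empty cell becomes 130

# Fill one column only by assigning the result back to that column.
one_col = df.copy()
one_col["Calories"] = one_col["Calories"].fillna(130)

print(len(df), len(dropped))
```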
Data of Wrong Format
Cells that hold dates as text can be converted to real date values. Pandas has a to_datetime() function for this:
df['Date'] = pd.to_datetime(df['Date'])
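As a hedged sketch of the conversion, the following uses a made-up DataFrame with two date strings and one missing date. The column name Date matches the text; the values are assumptions, and all dates use the same format, since mixed formats may need extra arguments.

```python
import pandas as pd

# Made-up rows: two ISO-formatted date strings and one missing value.
df = pd.DataFrame({
    "Date": ["2020-12-01", "2020-12-02", None],
    "Duration": [60, 45, 30],
})

# Strings become datetime values; the missing entry becomes NaT.
df["Date"] = pd.to_datetime(df["Date"])

# Rows without a usable date can then be removed.
df = df.dropna(subset=["Date"])

print(df.dtypes["Date"])
```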
Wrong Data
Replace Value
Set “Duration” = 45 in row 7:
df.loc[7, 'Duration'] = 45
To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries for legal values, and replace any values that are outside of the boundaries.
Loop through all values in the “Duration” column.
If the value is higher than 120, set it to 120:
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
Remove Row
Delete rows where “Duration” is higher than 120:
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
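The two loops above can also be written without explicit iteration, which is usually shorter and faster on large data sets. This is one possible alternative, not the only way to do it; the sample values and the names capped and kept are made up.

```python
import pandas as pd

# Made-up durations; 450 stands in for an implausible value.
df = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# Replace: cap every value above 120 at 120 using a boolean mask.
capped = df.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120

# Remove: keep only the rows that are not above 120.
kept = df[~(df["Duration"] > 120)]

print(capped["Duration"].tolist())
print(kept["Duration"].tolist())
```

The removal step negates the same condition the loop tests, so rows with an empty Duration are kept, just as they are by the loop.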
Duplicates
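To discover duplicates, we can use the duplicated() method, which returns True for every row that repeats an earlier row.
To remove duplicates, use the drop_duplicates() method:
df.drop_duplicates(inplace = True)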
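A small sketch shows both steps, finding duplicates and then removing them. The data is made up so that the third row repeats the first; the names flags and deduped are hypothetical.

```python
import pandas as pd

# Made-up data in which the third row repeats the first.
df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 109, 110],
})

flags = df.duplicated()          # one boolean per row
deduped = df.drop_duplicates()   # new DataFrame without the repeated row

print(flags.tolist())
print(len(deduped))
```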
Plotting
Pandas uses the plot() method to create diagrams.
df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')
Example
import pandas as pd
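A fuller sketch of the scatter plot might look like this. It assumes Matplotlib is installed, since pandas relies on it for plotting. The column names match the plot call above, while the values are made up; the figure is written to an in-memory buffer so the example runs without opening a window.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import pandas as pd

# Made-up values for the two columns named in the plot call above.
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Maxpulse": [130, 175, 148],
})

# plot() returns a Matplotlib Axes object.
ax = df.plot(kind="scatter", x="Duration", y="Maxpulse")

# Render as PNG into a buffer; pass a filename instead to write to disk.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
```

In an interactive session, importing matplotlib.pyplot as plt and calling plt.show() would display the figure instead.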
Read JSON
JSON = Python Dictionary
JSON objects have a format very similar to Python dictionaries.
JSON is plain text, but has the format of an object, and is well known in the world of programming, including Pandas.
Load the JSON file into a DataFrame:
import pandas as pd
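The following sketch loads JSON into a DataFrame and builds the same table from a Python dictionary. The JSON text stands in for a file on disk; its field names and values, and the variable names, are assumptions.

```python
import io
import pandas as pd

# Made-up JSON text standing in for a file on disk.
json_text = '[{"Duration": 60, "Pulse": 110}, {"Duration": 45, "Pulse": 109}]'

# read_json accepts a file path or a file-like object.
from_json = pd.read_json(io.StringIO(json_text), orient="records")

# A Python dictionary with the same content can be passed to DataFrame directly.
data = {"Duration": [60, 45], "Pulse": [110, 109]}
from_dict = pd.DataFrame(data)

print(from_json)
print(from_dict)
```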
