What is pandas?

Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.

Series

Creating a Series

You can create a Series by putting a list or dictionary as the parameter to the method pd.Series() .

The Series is a one-dimensional array holding data of any type.

label

The label is the index of the element in the Series.

using list

1
2
3
4
5
6
7
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

output

1
2
3
4
0    1
1 7
2 2
dtype: int64

This label can be used to access a specified value.

1
print(myvar[0])

With the index argument, you can name your own labels.

1
myvar=pd.Series(a, index=["x","y","z"])

output

1
2
3
4
x    1
y 7
z 2
dtype: int64

using dictionary

Dictionary is a storage to store key/value pairs.

So when you make dictionary as the parameter, the key will become the label of the Series.

DataFrame

DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns.

As an example:

1
2
3
4
5
6
7
8
9
10
11
12
import pandas as pd
data={
"calories":[420, 380, 390],
"duration":[50,40,45]
}
# one dimension array
myvar=pd.Series(data)
# two dimension array
myvar2=pd.DataFrame(data)

print(myvar)
print(myvar2)

output

1
2
3
4
5
6
7
calories    [420, 380, 390]
duration [50, 40, 45]
dtype: object
calories duration
0 420 50
1 380 40
2 390 45

local a row

Pandas use the loc attribute to return one or more specified row(s)

1
print(myvar2.loc[0])

output

1
2
calories    420
duration 50

This example returns a Pandas Series.

local multiple row

1
print(df.loc[[0, 1]])

output

1
2
3
   calories  duration
0 420 50
1 380 40

Read CSV

Load the CSV into a DataFrame:

1
2
3
4
5
import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string())

Data Cleaning

Data cleaning means fixing bad data in your data set.

Bad data could be:

  • Empty cells
  • Data in wrong format
  • Wrong data
  • Duplicates

Empty Cells

  1. One way to deal with empty cells is to remove rows that contain empty cells.

    This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.

    Example

    Return a new Data Frame with no empty cells:

    1
    2
    3
    4
    5
    6
    7
    import pandas as pd

    df = pd.read_csv('data.csv')

    new_df = df.dropna()

    print(new_df.to_string())

    By default, the dropna() method returns a new DataFrame, and will not change the original.

    If you want to change the original DataFrame, use the inplace = True argument:

    Now, the dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows containg NULL values from the original DataFrame.

    Example

    Remove all rows with NULL values:

    1
    2
    3
    4
    5
    import pandas as pd
    df = pd.read_csv('data.csv')

    df.dropna(inplace = True)
    print(df.to_string())
  2. Another way of dealing with empty cells is to insert a new value instead.

    The fillna() method allows us to replace empty cells with a value:

    1
    df.fillna(130, inplace = True)

    To only replace empty values for one column, specify the column name for the DataFrame:

    1
    df["Calories"].fillna(130, inplace = True)

Data of Wrong Format

Pandas has a to_datetime() method for this:

1
df['Date']=pd.to_datetime(df['Date'])

Wrong Data

  1. Replace Value

    Set “Duration” = 45 in row 7:

    1
    df.loc[7, 'Duration'] = 45

    To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries for legal values, and replace any values that are outside of the boundaries.

    Loop through all values in the “Duration” column.

    If the value is higher than 120, set it to 120:

    1
    2
    3
    for x in df.index:
    if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120
  2. Remove Row

    Delete rows where “Duration” is higher than 120:

    1
    2
    3
    for x in df.index:
    if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)

Duplicates

To discover duplicates, we can use the duplicated() method.

1
df.drop_duplicates(inplace = True)

Plotting

Pandas uses the plot() method to create diagrams.

df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')

Example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import pandas as pd
import matplotlib.pyplot as plt

df=pd.read_csv('Amazon.csv')
# original DataFram
print(df.to_string())

# filling the NaN row
df.fillna(4.0, inplace = True)

# Formating the Date
df['评论日期']=pd.to_datetime(df['评论日期'])

# Removing Dupulicates
df.drop_duplicates(inplace = True)

# Show the cLeaned DataFram
print(df.to_string())

# Show the Diagram
df.plot(kind = 'scatter',x = '评论日期', y = '评分')
plt.show()

Read JSON

JSON = Python Dictionary

JSON objects have the same format as Python dictionaries.

JSON is plain text, but has the format of an object, and is well known in the world of programming, including Pandas.

Load the JSON file into a DataFrame:

1
2
3
4
5
import pandas as pd

df = pd.read_json('data.js')

print(df.to_string())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
import pandas as pd

data = {
"Duration":{
"0":60,
"1":60,
"2":60,
"3":45,
"4":45,
"5":60
},
"Pulse":{
"0":110,
"1":117,
"2":103,
"3":109,
"4":117,
"5":102
},
"Maxpulse":{
"0":130,
"1":145,
"2":135,
"3":175,
"4":148,
"5":127
},
"Calories":{
"0":409,
"1":479,
"2":340,
"3":282,
"4":406,
"5":300
}
}
df=pd.DataFrame(data)
print(df.to_string())