What is Pandas Python?
Why use Pandas?
How to Install Pandas?
What is a Pandas DataFrame?
What is a Series?
Create Pandas DataFrame
Pandas Range Data
Inspecting Data
Slice Data
Drop a Column
Concatenation

Why use Pandas?

Data scientists use Pandas in Python for the following advantages:

It handles missing data easily
It uses Series for one-dimensional data structures and DataFrame for multi-dimensional data structures
It provides an efficient way to slice the data
It provides a flexible way to merge, concatenate or reshape the data
It includes powerful tools for working with time series

In a nutshell, Pandas is a useful library for data analysis. It can be used to perform data manipulation and analysis. Pandas provides powerful and easy-to-use data structures, as well as the means to quickly perform operations on these structures.

How to Install Pandas?

Now in this Python Pandas tutorial, we will learn how to install Pandas in Python. To install the Pandas library, please refer to our tutorial How to install TensorFlow. With that setup, Pandas is installed by default. In the rare case that Pandas is not installed, you can install it using:

Anaconda:

conda install -c anaconda pandas

In Jupyter Notebook:

import sys
!conda install --yes --prefix {sys.prefix} pandas
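A quick way to confirm that the installation worked is to import the library and print its version. This is a minimal sketch; the variable name installed_version is only illustrative.

```python
import pandas as pd

# If this import fails with ImportError, pandas is not installed
# in the environment that is running the code.
installed_version = pd.__version__
print("pandas version:", installed_version)
```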

The data frame is well known to statisticians and other data practitioners. Below is a picture of a Pandas data frame:

[Image: a Pandas data frame]

What is a Series?

A Series is a one-dimensional data structure. It can hold any data type, such as integer, float, or string. It is useful when you want to perform a computation or return a one-dimensional array. A Series, by definition, cannot have multiple columns; for that case, please use the data frame structure. A Python Pandas Series has the following parameters:

Data: can be a list, dictionary or scalar value

import pandas as pd
pd.Series([1., 2., 3.])

0    1.0
1    2.0
2    3.0
dtype: float64

You can add labels with the index argument. It names the rows. The length of index must equal the number of values in the Series

pd.Series([1., 2., 3.], index=['a', 'b', 'c'])

Below, you create a Pandas Series with a missing value in the third row. Note that missing values in Pandas are shown as "NaN". You can use numpy to create a missing value artificially: np.nan

import numpy as np
pd.Series([1, 2, np.nan])

Output

0    1.0
1    2.0
2    NaN
dtype: float64
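The two ideas above, a labeled index and a missing value, can be combined in one short sketch. The variable names are illustrative, and the isna() and mean() calls are additions that the tutorial text does not cover.

```python
import numpy as np
import pandas as pd

# A labeled Series with a missing value in the third row.
s = pd.Series([1.0, 2.0, np.nan], index=["a", "b", "c"])

# Label-based access uses the index names.
first_value = s["a"]

# isna() flags missing entries; summing the flags counts them.
missing_count = int(s.isna().sum())

# Reductions such as mean() skip NaN by default,
# so only 1.0 and 2.0 contribute here.
mean_value = s.mean()

print(first_value, missing_count, mean_value)  # 1.0 1 1.5
```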

Create Pandas DataFrame

Now in this Pandas DataFrame tutorial, we will learn how to create a Python Pandas data frame. You can convert a numpy array to a Pandas data frame with pd.DataFrame(). The opposite is also possible: to convert a Pandas data frame to an array, you can use np.array()

# Numpy to pandas
import numpy as np
h = [[1, 2], [3, 4]]
df_h = pd.DataFrame(h)
print('Data Frame:', df_h)

# Pandas to numpy
df_h_n = np.array(df_h)
print('Numpy array:', df_h_n)

Output

Data Frame:    0  1
0  1  2
1  3  4
Numpy array: [[1 2]
 [3 4]]

You can also use a dictionary to create a Pandas dataframe.

dic = {'Name': ["John", "Smith"], 'Age': [30, 40]}
pd.DataFrame(data=dic)
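The conversions described in this section can be checked with a small round trip. This is a sketch under a few assumptions: the column names "x" and "y" are made up for illustration, and to_numpy() is used as an alternative to np.array() for going back to an array.

```python
import numpy as np
import pandas as pd

# Numpy array to data frame, with illustrative column names.
arr = np.array([[1, 2], [3, 4]])
df_named = pd.DataFrame(arr, columns=["x", "y"])

# Data frame back to a numpy array.
back = df_named.to_numpy()

# Dictionary to data frame: keys become column names,
# and each list becomes a column.
people = pd.DataFrame({"Name": ["John", "Smith"], "Age": [30, 40]})

print(np.array_equal(arr, back))  # True
print(list(people.columns))       # ['Name', 'Age']
```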

Pandas Range Data

Pandas has a convenient API to create a range of dates. Let's learn with Python Pandas examples: pd.date_range(date, period, frequency):

The first parameter is the starting date

The second parameter is the number of periods (optional if the end date is specified)

The last parameter is the frequency: day: 'D', month: 'M' and year: 'Y' (pandas 2.2 and later prefer 'ME' and 'YE' for month end and year end)

# Create date
# Days
dates_d = pd.date_range('20300101', periods=6, freq='D')
print('Day:', dates_d)

Output

Day: DatetimeIndex(['2030-01-01', '2030-01-02', '2030-01-03', '2030-01-04',
               '2030-01-05', '2030-01-06'],
              dtype='datetime64[ns]', freq='D')

# Months
dates_m = pd.date_range('20300101', periods=6, freq='M')
print('Month:', dates_m)

Output

Month: DatetimeIndex(['2030-01-31', '2030-02-28', '2030-03-31', '2030-04-30',
               '2030-05-31', '2030-06-30'],
              dtype='datetime64[ns]', freq='M')
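Since the number of periods is optional when an end date is given, the same daily range can be built either way. The sketch below illustrates this; it also uses the 'MS' (month start) alias, which is not covered in the text above and is chosen here only because it behaves the same across pandas versions.

```python
import pandas as pd

# Same six days, specified once by end date and once by number of periods.
dates_by_end = pd.date_range(start="2030-01-01", end="2030-01-06", freq="D")
dates_by_periods = pd.date_range("2030-01-01", periods=6, freq="D")

print(dates_by_end.equals(dates_by_periods))  # True

# 'MS' anchors each date to the first day of a month.
month_starts = pd.date_range("2030-01-01", periods=3, freq="MS")
print(month_starts[1])  # 2030-02-01 00:00:00
```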

Inspecting Data

You can check the head or tail of the dataset with head() or tail(), called on the Pandas data frame as shown in the Pandas example below:

Step 1) Create a random sequence with numpy. The sequence has 4 columns and 6 rows

random = np.random.randn(6,4)

Step 2) Then you create a data frame using pandas. Use dates_m as an index for the data frame. It means each row will be given a “name” or an index, corresponding to a date. Finally, you give a name to the 4 columns with the argument columns

# Create data with date
df = pd.DataFrame(random, index=dates_m, columns=list('ABCD'))

Step 3) Using head function

df.head(3)

Step 4) Using tail function

df.tail(3)

Step 5) A good way to get an overview of the data is to use describe(). It provides the count, mean, std, min, max and percentiles of each column in the dataset.

df.describe()
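The five steps can be run end to end as a single script. This is a sketch with two assumptions that differ from the steps above: the random generator is seeded so the numbers are reproducible, and a daily index is used instead of dates_m so the example does not depend on the monthly frequency alias.

```python
import numpy as np
import pandas as pd

# Step 1: 6 rows and 4 columns of random values (seeded for reproducibility).
rng = np.random.default_rng(seed=0)
values = rng.standard_normal((6, 4))

# Step 2: date index and named columns.
index = pd.date_range("2030-01-01", periods=6, freq="D")
df = pd.DataFrame(values, index=index, columns=list("ABCD"))

# Steps 3 and 4: first and last three rows.
top = df.head(3)
bottom = df.tail(3)

# Step 5: summary statistics, one column per original column.
summary = df.describe()
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
```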

Slice Data

The last point of this Python Pandas tutorial is about how to slice a pandas data frame. You can use the column name to extract data in a particular column as shown in the below Pandas example:

# Slice
# Using name
df['A']

2030-01-31   -0.168655
2030-02-28    0.689585
2030-03-31    0.767534
2030-04-30    0.557299
2030-05-31   -1.547836
2030-06-30    0.511551
Freq: M, Name: A, dtype: float64

To select multiple columns, you need to use double brackets, [[.., ..]]. The outer pair of brackets means you want to select columns; the inner pair holds the list of columns you want to return.

df[['A', 'B']]

You can slice the rows with a colon (:). The code below returns the first three rows

# Using a slice for rows

df[0:3]

The loc indexer is used to select data by label. As usual, the values before the comma stand for the rows and the values after it refer to the columns. You need to use brackets to select more than one column.

# Multi col
df.loc[:, ['A', 'B']]

There is another way to select multiple rows and columns in Pandas: iloc[]. It uses integer positions instead of column names. The code below returns the same data frame as above

df.iloc[:, :2]
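To see that loc and iloc really return the same data, the two selections can be compared directly. The sketch below uses a small frame of consecutive integers (an assumption made so the values are easy to verify) and also illustrates one difference the text does not mention: label slices with loc include the end label, while position slices with iloc exclude the end position.

```python
import numpy as np
import pandas as pd

# Illustrative frame: integers 0..23 in 6 rows and 4 columns.
df = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range("2030-01-01", periods=6, freq="D"),
    columns=list("ABCD"),
)

# Same two columns, selected by label and by position.
by_label = df.loc[:, ["A", "B"]]
by_position = df.iloc[:, :2]
print(by_label.equals(by_position))  # True

# Both selections below return the first three rows of column A.
rows_loc = df.loc["2030-01-01":"2030-01-03", "A"]  # end label included
rows_iloc = df.iloc[0:3, 0]                        # end position excluded
print(rows_loc.tolist())  # [0, 4, 8]
```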

Drop a Column

You can drop columns using df.drop()

df.drop(columns=['A', 'C'])
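One detail worth checking is that drop() returns a new data frame and leaves the original untouched, so the result has to be assigned to a name to be kept. A minimal sketch, using a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# drop() returns a new frame without columns A and C.
trimmed = df.drop(columns=["A", "C"])

print(list(trimmed.columns))  # ['B']
print(list(df.columns))       # ['A', 'B', 'C'] (original is unchanged)
```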

Concatenation

You can concatenate two DataFrames in Pandas with pd.concat(). First of all, you need to create two DataFrames. So far so good, you are already familiar with DataFrame creation

import numpy as np
df1 = pd.DataFrame({'name': ['John', 'Smith', 'Paul'],
                    'Age': ['25', '30', '50']},
                   index=[0, 1, 2])
df2 = pd.DataFrame({'name': ['Adam', 'Smith'],
                    'Age': ['26', '11']},
                   index=[3, 4])

Finally, you concatenate the two DataFrames

df_concat = pd.concat([df1, df2])
df_concat
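The example above sets the index of each frame by hand so that the row labels do not collide. When the frames keep their default index, the labels repeat after concatenation; the ignore_index argument of pd.concat() renumbers them. The sketch below shows both cases with the same data but without the explicit index values.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["John", "Smith", "Paul"], "Age": ["25", "30", "50"]})
df2 = pd.DataFrame({"name": ["Adam", "Smith"], "Age": ["26", "11"]})

# Default behaviour: each frame keeps its own row labels.
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())     # [0, 1, 2, 0, 1]

# ignore_index=True builds a fresh 0..n-1 index.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```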

Drop_duplicates

If a dataset can contain duplicate information, drop_duplicates is an easy way to exclude duplicate rows. You can see that df_concat has a duplicate observation: Smith appears twice in the column name.

df_concat.drop_duplicates('name')
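Which of the two Smith rows survives depends on the keep argument of drop_duplicates, which defaults to keeping the first occurrence. The sketch below rebuilds the same data in one frame (named people here for illustration) and compares both settings.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["John", "Smith", "Paul", "Adam", "Smith"],
    "Age": ["25", "30", "50", "26", "11"],
})

# Default: the first Smith (Age '30') is kept.
keep_first = people.drop_duplicates("name")

# keep="last": the last Smith (Age '11') is kept instead.
keep_last = people.drop_duplicates("name", keep="last")

smith_first = keep_first.loc[keep_first["name"] == "Smith", "Age"].iloc[0]
smith_last = keep_last.loc[keep_last["name"] == "Smith", "Age"].iloc[0]
print(smith_first, smith_last)  # 30 11
```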

Sort values

You can sort values with sort_values

df_concat.sort_values('Age')
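Note that Age is stored as strings in the example, so sort_values compares the values as text. With two-digit ages the result matches the numeric order, but that is not true in general. The sketch below uses made-up ages of different lengths to show the difference, and converts the column with astype(int) before sorting.

```python
import pandas as pd

# Hypothetical ages with different numbers of digits, stored as text.
people = pd.DataFrame({"name": ["John", "Smith", "Paul"], "Age": ["25", "100", "9"]})

# Text sort: compared character by character, so "100" < "25" < "9".
as_text = people.sort_values("Age")
print(as_text["Age"].tolist())    # ['100', '25', '9']

# Numeric sort: convert the column to integers first.
as_number = people.assign(Age=people["Age"].astype(int)).sort_values("Age")
print(as_number["Age"].tolist())  # [9, 25, 100]
```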

Rename: change of index

You can use rename to rename a column in Pandas. In the dictionary passed to columns, each key is the current column name and each value is the new column name.

df_concat.rename(columns={"name": "Surname", "Age": "Age_ppl"})
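Like drop(), rename() returns a new data frame rather than changing the original. The same method can also relabel rows through its index argument. A minimal sketch follows; the row labels "first" and "second" are invented for illustration.

```python
import pandas as pd

people = pd.DataFrame({"name": ["John", "Smith"], "Age": ["25", "30"]})

# Rename columns: keys are current names, values are new names.
renamed = people.rename(columns={"name": "Surname", "Age": "Age_ppl"})

# Rename row labels with the index argument.
relabeled = people.rename(index={0: "first", 1: "second"})

print(list(renamed.columns))     # ['Surname', 'Age_ppl']
print(list(people.columns))      # ['name', 'Age'] (original is unchanged)
print(relabeled.index.tolist())  # ['first', 'second']
```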

Summary

Below is a summary of the most useful methods for data science with Pandas