pandas tips


  © 2024 borui. All rights reserved.

  This content may be freely reproduced, displayed, modified, or distributed with proper attribution to borui and a link to the article: 
  
  borui(2024-02-28 02:25:26 +0000). pandas tips. https://borui/blog/2024-02-28-en-pandas-tips.

@misc{
  borui2024,
  author = {borui},
  title = {pandas tips},
  year = {2024},
  publisher = {borui's blog},
  journal = {borui's blog},
  url={https://borui/blog/2024-02-28-en-pandas-tips}
}

How to create an empty dataframe in pandas

# import pandas library as pd
import pandas as pd
 
# create an Empty DataFrame
# object With column names only
df = pd.DataFrame(columns = ['Name', 'Scores', 'Questions'])
print(df)
'''
OUTPUT:
Empty DataFrame
Columns: [Name, Scores, Questions]
Index: []
'''
# append rows to an empty DataFrame
# (DataFrame.append was removed in pandas 2.0, so assign at the next position instead)
df.loc[len(df)] = ['Anna', 97, 2200]
df.loc[len(df)] = ['Linda', 30, 50]
df.loc[len(df)] = ['Tommy', 17, 220]
print(df)
'''
OUTPUT:
Name Scores	Questions
0	Anna	97	2200
1	Linda	30	50
2	Tommy	17	220
'''

# import pandas library as pd
import pandas as pd
 
# create an Empty DataFrame object With
# column names and indices 
df = pd.DataFrame(columns = ['Name', 'Scores', 'Questions'], 
                   index = ['a', 'b', 'c'])
 
print("Empty DataFrame With NaN values : \n\n", df)
'''
Empty DataFrame With NaN values : 
   Name    Scores Questions
a  NaN      NaN      NaN
b  NaN      NaN      NaN
c  NaN      NaN      NaN
'''
 
# adding rows to an empty 
# dataframe at existing index
df.loc['a'] = ['Anna', 50, 100]
df.loc['b'] = ['Pete', 60, 120]
df.loc['c'] = ['Tommy', 30, 60]
 
print(df)
'''
    Name	Scores  Questions
a	Anna	50	100
b	Pete	60	120
c	Tommy	30	60
'''
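
When many rows need to be added, building the DataFrame from a list in one step is usually simpler than growing it row by row. The following is a minimal sketch, assuming the rows are already available as a list of dicts; the `rows` and `extra` names are hypothetical and the sample values reuse the ones above.

```python
import pandas as pd

# Hypothetical sample rows, reusing the column names from the examples above
rows = [
    {'Name': 'Anna', 'Scores': 97, 'Questions': 2200},
    {'Name': 'Linda', 'Scores': 30, 'Questions': 50},
]

# Build the DataFrame from all rows at once
df = pd.DataFrame(rows, columns=['Name', 'Scores', 'Questions'])

# Add more rows later by concatenating another DataFrame
extra = pd.DataFrame([{'Name': 'Tommy', 'Scores': 17, 'Questions': 220}])
df = pd.concat([df, extra], ignore_index=True)
print(df)
```

ignore_index=True renumbers the result 0, 1, 2 instead of keeping the index of each input frame.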

How to remove table headers from pandas dataframe

dataframe = pd.read_csv("test.csv",sep=",", header=None)

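sep is the character used to separate values in the CSV file. Its default value is ",", so be sure to set it to match the file. header=None tells pandas that the file has no header row: the first line is read as data and the columns are numbered 0, 1, 2, and so on.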
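
A short sketch of both directions, reading a file without a header row and writing one without a header line. The CSV text is hypothetical and is read from an in-memory buffer so the example does not depend on test.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content without a header row
csv_text = "1,2,3\n4,5,6\n"

# header=None: the first line is treated as data, columns are numbered 0, 1, 2
df_no_header = pd.read_csv(io.StringIO(csv_text), sep=",", header=None)
print(df_no_header)

# header=False on to_csv writes the rows without a header line
out = df_no_header.to_csv(index=False, header=False)
print(out)
```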

pandas.DataFrame.to_csv. (n.d.). pandas docs. Retrieved Feb 28, 2024, from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

How to create test train split using pandas dataframe

# Using DataFrame.sample() with the random_state argument.
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)
print(train)

To get a reproducible split, set random_state. The same random_state value always yields the same split, because it seeds the random number generator used for sampling.
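
A minimal sketch that checks this behaviour on a small, hypothetical DataFrame. The `split` helper is not part of pandas; it only wraps the two lines above. It assumes the index is unique, since drop(train.index) removes rows by index label.

```python
import pandas as pd

# Hypothetical data: 10 rows with a default integer index
df = pd.DataFrame({'x': range(10), 'y': range(10, 20)})

def split(frame, frac=0.8, seed=200):
    """Return (train, test); test is every row not sampled into train."""
    train = frame.sample(frac=frac, random_state=seed)
    test = frame.drop(train.index)
    return train, test

train_a, test_a = split(df)
train_b, test_b = split(df)

# Same seed, same split
print(train_a.index.equals(train_b.index))
# Every row lands in exactly one of the two sets
print(len(train_a), len(test_a))
```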

How do I get the row count of a Pandas DataFrame?

For a dataframe df, one can use any of the following:

len(df.index)
df.shape[0]

⚠️ Warning: Dangerous! Beware that df.count() will only return the count of non-NA/NaN rows for each column.
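
The difference is easy to see on a small example. This sketch uses a hypothetical DataFrame with a few missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})

print(len(df.index))   # number of rows
print(df.shape[0])     # number of rows
print(df.count())      # per-column count of non-missing values
```

Both row counts include rows with missing values, while df.count() returns one number per column and skips NaN.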

root (answered Apr 11, 2013), Mateen Ulhaq (edited Oct 6, 2021). For a dataframe df, one can use any of the following: len(df.index), df.shape[0], df[df.columns[0]].count() (== number of non-NaN values in first column). [Answer]. stackoverflow. https://stackoverflow.com/a/15943975

Binning a Column with Python Pandas

Specify Bin Labels

By default, the cut() function returns a categorical variable with labels corresponding to the bin edges. However, you can specify custom labels using the labels parameter:

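import pandas as pd

# sample ages matching the output below
df = pd.DataFrame({'age': [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]})

bins = [0, 20, 40, 60, 80]
labels = ['young', 'middle-aged', 'old', 'very-old']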

df['age_bin'] = pd.cut(df['age'], bins, labels=labels)

print(df)

Output:

   age      age_bin
0   18        young
1   25  middle-aged
2   30  middle-aged
3   35  middle-aged
4   40  middle-aged
5   45          old
6   50          old
7   55          old
8   60          old
9   65     very-old

In this example, we specified custom labels for each bin using the labels parameter. The resulting categorical variable now has more descriptive labels.
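
For comparison, here is a small sketch of what cut() returns without the labels parameter, and how the right parameter changes which edge each bin includes. The `ages` Series is hypothetical sample data.

```python
import pandas as pd

ages = pd.Series([18, 25, 45, 65])
bins = [0, 20, 40, 60, 80]

# Without labels, each value is tagged with the interval it falls into;
# intervals are closed on the right by default: (0, 20], (20, 40], ...
default_bins = pd.cut(ages, bins)
print(default_bins)

# right=False makes the intervals closed on the left instead: [0, 20), [20, 40), ...
left_closed = pd.cut(ages, bins, right=False)
print(left_closed)
```

The choice matters for values that sit exactly on an edge, such as an age of 40.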

Bin by Quantile

Instead of specifying bin edges manually, you can also bin a column by quantile using the qcut() function. The qcut() function takes a continuous variable and a number of quantiles and returns a categorical variable representing the quantile intervals.

df['age_bin'] = pd.qcut(df['age'], q=4, labels=False)

print(df)

Output:

   age  age_bin
0   18        0
1   25        0
2   30        0
3   35        1
4   40        1
5   45        2
6   50        2
7   55        3
8   60        3
9   65        3

In this example, we used the qcut() function to bin the age column into four quantiles. Because labels=False is passed, the result is a column of integer bin indices ranging from 0 to 3 rather than a categorical of intervals.
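
To see where the quantile edges fall, qcut() can also return them through its retbins parameter. A minimal sketch on the same ages (the `codes` and `edges` names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'age': [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]})

# retbins=True also returns the computed quantile edges
codes, edges = pd.qcut(df['age'], q=4, labels=False, retbins=True)
print(edges)
print(codes.tolist())
```

Each value is assigned to the bin whose upper edge is the first one greater than or equal to it, with the minimum included in the first bin.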

@Saturn Cloud. (June 19, 2023). Binning a Column with Python Pandas. saturncloud. [Blog post]. Retrieved February 28, 2024, from https://saturncloud.io/blog/binning-a-column-with-python-pandas/#specify-bin-labels

How to insert empty values into a pandas dataframe

import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({'col1': [1, 2, None], 'col2': [3, None, 5]})

# check for null values
null_df = df.isnull()
print(null_df)
'''
Output:
    col1   col2
0  False  False
1  False   True
2   True  False
'''

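Written with df.to_csv(index=False), the output CSV looks like this. None values are omitted entirely, leaving empty fields, and the integers appear as floats because a numeric column containing None is stored as float64:

col1,col2
1.0,3.0
2.0,
,5.0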
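
A small sketch of inserting an empty value into an existing cell and controlling how missing values are written. It calls to_csv without a path, which returns the CSV text as a string; the 'NULL' placeholder is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, None], 'col2': [3, None, 5]})

# Insert an empty value into an existing cell
df.loc[0, 'col2'] = np.nan

# Missing values become empty fields by default
print(df.to_csv(index=False))

# na_rep substitutes a placeholder string for missing values
print(df.to_csv(index=False, na_rep='NULL'))
```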

How to Check if a Particular Cell in Pandas DataFrame is Null. (June 19, 2023). [Blog post]. saturncloud. Retrieved from https://saturncloud.io/blog/how-to-check-if-a-particular-cell-in-pandas-dataframe-is-null/

How to append pandas dataframe to existing csv file

import os

# if file does not exist write header
if not os.path.isfile(output_filename):
    df.to_csv(output_filename, index=False, mode='a')
else: # else it exists so append without writing the header
    df.to_csv(output_filename, index=False, header=False, mode='a')

📓 Note: if you skip the existence check and simply call df.to_csv('GFG.csv', mode='a', index=False) on every iteration, the header is written each time, like:
dataset_name,algo_name,threshold
cars,no_resample,0.1
dataset_name,algo_name,threshold
cars,no_resample,0.1
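
The existence check can be wrapped in a small helper. This is a sketch under the same approach; `append_to_csv` is a hypothetical name, and the example writes to a temporary directory so it leaves no files behind.

```python
import os
import tempfile
import pandas as pd

def append_to_csv(df, path):
    """Append df to path, writing the header only when the file is new."""
    write_header = not os.path.isfile(path)
    df.to_csv(path, mode='a', index=False, header=write_header)

row = pd.DataFrame([{'dataset_name': 'cars', 'algo_name': 'no_resample', 'threshold': 0.1}])

with tempfile.TemporaryDirectory() as tmp:
    output_filename = os.path.join(tmp, 'results.csv')
    # Two iterations: the header is written only on the first one
    for _ in range(2):
        append_to_csv(row, output_filename)
    with open(output_filename) as f:
        content = f.read()

print(content)
```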

Padraic Cunningham. [@Padraic Cunningham]. (Jun 22, 2015). Not sure there is a way in pandas but checking if the file exists would be a simple approach: import os # if file does not exist write header if not os.path.isfile('filename.csv'): df.to_csv('filename.csv', header='column_names') else: # else it exists so append without writing the header df.to_csv('filename.csv', mode='a', header=False). [Answer]. stackoverflow. https://stackoverflow.com/questions/30991541/pandas-write-csv-append-vs-write/30991707#30991707

How to split training dataframe to X and y

    # load whole dataset from csv
    df = pd.read_csv(CSV_FILENAME)

    # drop the target column to get the features; drop() returns a new
    # DataFrame, so df is unchanged (inplace=True would modify df and return None)
    X = df.drop(columns=[target_column])

    # df[[colname1, colname2, ...]] returns a new DataFrame with the listed
    # columns; df is untouched. Use df[target_column] to get a Series instead.
    y = df[[target_column]]

How to concatenate dataframes X and y back into one

Use df = pd.concat([X, y], axis=1)
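
A minimal round trip, assuming a hypothetical DataFrame in which 'label' stands in for target_column:

```python
import pandas as pd

# Hypothetical data; 'label' stands in for target_column
df = pd.DataFrame({'f1': [1, 2, 3], 'f2': [4, 5, 6], 'label': [0, 1, 0]})
target_column = 'label'

X = df.drop(columns=[target_column])
y = df[[target_column]]

# axis=1 places the frames side by side, aligning rows on the index
restored = pd.concat([X, y], axis=1)
print(restored.equals(df))
```

Because rows are aligned on the index, X and y need to keep matching index labels between the split and the concatenation.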

Pandas DataFrame: How to concatenate with Python examples. (August 15, 2023). [Blog post]. Capital One Tech. Retrieved April 2, 2024, from https://www.capitalone.com/tech/open-source/pandas-dataframe-concat/
Merge, join, concatenate and compare. (n.d.). pandas. Retrieved April 2, 2024, from https://pandas.pydata.org/docs/user_guide/merging.html#concatenating-series-and-dataframe-together
How to add one row in an existing Pandas DataFrame?. (04 Jan, 2024). geeksforgeeks. Retrieved from https://www.geeksforgeeks.org/how-to-add-one-row-in-an-existing-pandas-dataframe/

How to select a subset of a DataFrame?

# I’m interested in the names of the passengers older than 35 years.
In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

In [24]: adult_names.head()
Out[24]: 
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object


# I’m interested in rows 10 till 25 and columns 3 to 5.
In [25]: titanic.iloc[9:25, 2:5]
Out[25]: 
    Pclass                                 Name     Sex
9        2  Nasser, Mrs. Nicholas (Adele Achem)  female
10       3      Sandstrom, Miss. Marguerite Rut  female
11       1             Bonnell, Miss. Elizabeth  female
12       3       Saundercock, Mr. William Henry    male
13       3          Andersson, Mr. Anders Johan    male
..     ...                                  ...     ...
20       2                 Fynney, Mr. Joseph J    male
21       2                Beesley, Mr. Lawrence    male
22       3          McGowan, Miss. Anna "Annie"  female
23       1         Sloper, Mr. William Thompson    male
24       3        Palsson, Miss. Torborg Danira  female

[16 rows x 3 columns]

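💡 Tip: Assigning through loc/iloc on the original dataframe (for example titanic.loc[titanic["Age"] > 35, "Name"] = ...) modifies it in place. A subset stored in a new variable should be treated as a copy: changes to it are not guaranteed to propagate to the original dataframe, so call .copy() when you need an independent one.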
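
A minimal sketch of the two cases that matter in practice, using a hypothetical `people` DataFrame in place of the titanic data:

```python
import pandas as pd

# Hypothetical stand-in for the titanic data
people = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cid'], 'Age': [40, 20, 50]})

# Assigning through .loc on the original frame changes it in place
people.loc[people['Age'] > 35, 'Name'] = 'adult'

# An explicit copy is independent of the original
subset = people.loc[people['Age'] > 35].copy()
subset['Age'] = 0

print(people)
```

After this runs, the Name column of people reflects the assignment, while its Age column is unaffected by the edit made to the copy.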

How do I select a subset of a DataFrame?. (n.d.). pandas. Retrieved April 2, 2024, from https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-select-a-subset-of-a-dataframe
JohnE. (Sep 5, 2018). Case A: Changes to df2 should NOT affect df1 This is trivial, of course. You want two completely independent dataframes so you just explicitly make a copy: df2 = df1.copy() After this anything you do to df2 affects only df2 and not df1 and vice versa. Case B: Changes to df2 should ALSO affect df1. [Answer]. stackoverflow. https://stackoverflow.com/questions/48173980/pandas-knowing-when-an-operation-affects-the-original-dataframe/52182208#52182208

How to Change All the Values of a Column in a Pandas DataFrame

Step 1: Select the Column

To select a column in Pandas, you can use the bracket notation with the column name as the key. For example, if you have a DataFrame df with a column named age, you can select the column like this:

age_column = df['age']

Step 2: Apply a Function to Each Value

Once you have selected the column, you can use the .apply() method to apply a function to each value in the column. The function should take a single argument, which will be the current value of the column. It should return the new value that you want to replace the current value with.

For example, let’s say you want to replace all the values in the age column with their square roots. You could define a function like this:

import numpy as np

def sqrt(x):
    return np.sqrt(x)

Then, you can use the .apply() method to apply this function to each value in the age column:

new_age_column = age_column.apply(sqrt)

This will create a new Series new_age_column with the square root of each value in the age column.

Step 3: Assign the New Values Back to the Column

Finally, you can assign the new values back to the original column. You can do this by using the bracket notation to select the column, and then assigning the new Series to it. For example:

df['age'] = new_age_column

This will replace the age column in the original DataFrame df with the new values.
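
The three steps can also be written as a single statement. The sketch below uses a hypothetical age column with perfect squares so the result is easy to check, and shows the vectorized form as an alternative to .apply().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [16, 25, 36]})

# Steps 1-3 in one statement
df['age'] = df['age'].apply(np.sqrt)
print(df)

# NumPy functions also accept the whole column at once, which avoids
# a Python-level call per value
df2 = pd.DataFrame({'age': [16, 25, 36]})
df2['age'] = np.sqrt(df2['age'])
print(df2)
```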

Example

RULESET = {
    ' United-States': 'us',
}
def chg_to_ruleset(v, ruleset):
    if v in ruleset:
        return ruleset[v]
    else:
        return 'others'

# Series.map applies the function to each value of the column
df['native-country'] = df['native-country'].map(lambda x: chg_to_ruleset(x, ruleset=RULESET))

⚠️ Warning: DataFrame.applymap is deprecated since pandas 2.1.0; use DataFrame.map instead. For a single column, Series.map (used above) works in both older and newer versions.
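
When the replacement rule is a plain dictionary lookup with a default, the helper function can be replaced by passing the dict to Series.map directly. A sketch with hypothetical sample values:

```python
import pandas as pd

RULESET = {' United-States': 'us'}

# Hypothetical sample values (note the leading space, as in the RULESET key)
df = pd.DataFrame({'native-country': [' United-States', ' Mexico', ' United-States']})

# map() with a dict yields NaN for values that are not keys; fillna supplies the default
df['native-country'] = df['native-country'].map(RULESET).fillna('others')
print(df)
```

Values that are already missing also end up as 'others', which matches what chg_to_ruleset returns for anything not in the ruleset.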

How to Change All the Values of a Column in a Pandas DataFrame. (June 19, 2023). [Blog post]. saturncloud. Retrieved from https://saturncloud.io/blog/pandas-how-to-change-all-the-values-of-a-column/
pandas.DataFrame.applymap. (n.d.). pandas. Retrieved April 2, 2024, from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html

Additional Reading

@古明地盆. (18 Mar, 2020). 详解pandas的read_csv方法 (A detailed explanation of the pandas read_csv method). cnblogs. [Blog post]. Retrieved February 28, 2024, from https://www.cnblogs.com/traditional/p/12514914.html