© 2024 borui. All rights reserved.
This content may be freely reproduced, displayed, modified, or distributed with proper attribution to borui and a link to the article:
borui(2024-02-28 02:25:26 +0000). pandas tips. https://borui/blog/2024-02-28-en-pandas-tips.
@misc{
borui2024,
author = {borui},
title = {pandas tips},
year = {2024},
publisher = {borui's blog},
journal = {borui's blog},
url={https://borui/blog/2024-02-28-en-pandas-tips}
}
How to create an empty dataframe in pandas
# import pandas library as pd
import pandas as pd
# create an Empty DataFrame
# object With column names only
df = pd.DataFrame(columns = ['Name', 'Scores', 'Questions'])
print(df)
'''
OUTPUT:
Empty DataFrame
Columns: [Name, Scores, Questions]
Index: []
'''
# append rows to an empty DataFrame
# (DataFrame.append was removed in pandas 2.0;
# assigning to the next integer label with .loc adds a row instead)
df.loc[len(df)] = ['Anna', 97, 2200]
df.loc[len(df)] = ['Linda', 30, 50]
df.loc[len(df)] = ['Tommy', 17, 220]
print(df)
'''
OUTPUT:
Name Scores Questions
0 Anna 97 2200
1 Linda 30 50
2 Tommy 17 220
'''
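On pandas 2.0 and later, where DataFrame.append is no longer available, one alternative is to collect the rows as dictionaries, build the DataFrame once, and use pd.concat for rows that arrive later. A minimal sketch (the extra "Pete" row is made up for illustration):

```python
import pandas as pd

# Collect rows as plain dictionaries first, then build the DataFrame once.
rows = [
    {"Name": "Anna", "Scores": 97, "Questions": 2200},
    {"Name": "Linda", "Scores": 30, "Questions": 50},
    {"Name": "Tommy", "Scores": 17, "Questions": 220},
]
df = pd.DataFrame(rows, columns=["Name", "Scores", "Questions"])

# Rows that arrive later can be added with pd.concat.
# "Pete" is a hypothetical row used only for this example.
extra = pd.DataFrame([{"Name": "Pete", "Scores": 60, "Questions": 120}])
df = pd.concat([df, extra], ignore_index=True)
print(df)
```

Building from a list is usually cheaper than growing a DataFrame one row at a time, since each row-wise addition may copy the existing data.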
# import pandas library as pd
import pandas as pd
# create an Empty DataFrame object With
# column names and indices
df = pd.DataFrame(columns = ['Name', 'Scores', 'Questions'],
index = ['a', 'b', 'c'])
print("Empty DataFrame With NaN values : \n\n", df)
'''
Empty DataFrame With NaN values :
Name Scores Questions
a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
'''
# adding rows to an empty
# dataframe at existing index
df.loc['a'] = ['Anna', 50, 100]
df.loc['b'] = ['Pete', 60, 120]
df.loc['c'] = ['Tommy', 30, 60]
print(df)
'''
Name Scores Questions
a Anna 50 100
b Pete 60 120
c Tommy 30 60
'''
How to remove table headers from pandas dataframe
dataframe = pd.read_csv("test.csv",sep=",", header=None)
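sep is the character used to separate values in the CSV file. Its default value is ",", so be sure to set it to match your file. header=None tells pandas that the file has no header row: the first line is read as data and the columns are labelled 0, 1, 2, and so on.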
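To see the effect of header=None without needing a file on disk, the sketch below reads the same made-up CSV text twice from an in-memory buffer. The contents of csv_text are an assumption for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: two data rows and no header row.
csv_text = "Anna,97\nLinda,30\n"

# header=None: every line is data; columns are labelled 0, 1, ...
no_header = pd.read_csv(io.StringIO(csv_text), sep=",", header=None)

# Default behaviour: the first line is consumed as the column names.
inferred = pd.read_csv(io.StringIO(csv_text), sep=",")

print(no_header.shape)           # (2, 2)
print(list(no_header.columns))   # [0, 1]
print(inferred.shape)            # (1, 2)
```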
- pandas.DataFrame.to_csv. (n.d.). pandas docs. Retrieved Feb 28, 2024, from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
How to create test train split using pandas dataframe
# Using DataFrame.sample() Method by random_state arg.
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
print(train)
To get the same split every time, set random_state. The same random_state value always produces the same split on the same data, because it seeds the random number generator used for sampling.
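A small self-contained check of this behaviour, using a made-up 10-row DataFrame, might look like this:

```python
import pandas as pd

# Toy DataFrame with 10 rows (made-up data).
df = pd.DataFrame({"x": range(10), "y": [i % 2 for i in range(10)]})

train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

# Sampling again with the same random_state selects the same rows.
train_again = df.sample(frac=0.8, random_state=200)

print(len(train), len(test))                   # 8 2
print(train.index.equals(train_again.index))   # True
```

Note that df.drop(train.index) relies on the index labels being unique; with duplicate labels it would drop more rows than intended.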
How do I get the row count of a Pandas DataFrame?
For a dataframe df, one can use any of the following:
len(df.index)
df.shape[0]
⚠️ Warning: df.count() is not a row count. It returns the number of non-NA/NaN values in each column, so it undercounts whenever data is missing.
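The difference shows up as soon as a column contains missing values. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Three rows, with one missing value in each column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

print(len(df.index))         # 3
print(df.shape[0])           # 3
print(df.count().tolist())   # [2, 2]  (non-NA values per column)
```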
- root(answered Apr 11, 2013). Mateen Ulhaq(edited Oct 6, 2021). For a dataframe df, one can use any of the following: len(df.index) df.shape[0] df[df.columns[0]].count() (== number of non-NaN values in first column) Performance plot Code to reproduce the plot: import numpy as np import pandas as pd import perfplot perfplot.save( "out.png", setup=lambda n: pd.DataFrame(np.arange(n * 3).reshape(n, 3)), n_range=[2**k for k in range(25)], kernels=[ lambda df: len(df.index), lambda df: df.shape[0], lambda df: df[df.columns[0]].count(), ], labels=["len(df.index)", "df.shape[0]", "df[df.columns[0]].count()"], xlabel="Number of rows", ). [Answer]. stackoverflow. https://stackoverflow.com/a/15943975
Binning a Column with Python Pandas
Specify Bin Labels
By default, the cut() function returns a categorical variable with labels corresponding to the bin edges. However, you can specify custom labels using the labels parameter:
bins = [0, 20, 40, 60, 80]
labels = ['young', 'middle-aged', 'old', 'very-old']
df['age_bin'] = pd.cut(df['age'], bins, labels=labels)
print(df)
Output:
age age_bin
0 18 young
1 25 middle-aged
2 30 middle-aged
3 35 middle-aged
4 40 middle-aged
5 45 old
6 50 old
7 55 old
8 60 old
9 65 very-old
In this example, we specified custom labels for each bin using the labels parameter. The resulting categorical variable now has more descriptive labels.
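Values that sit exactly on a bin edge, such as 40 and 60 here, depend on which side of each interval is closed. The sketch below rebuilds the age data from the output above and compares the default (right-closed) bins with right=False; the age_bin_left column name is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]})
bins = [0, 20, 40, 60, 80]
labels = ["young", "middle-aged", "old", "very-old"]

# Default: intervals are closed on the right, i.e. (0, 20], (20, 40], ...
df["age_bin"] = pd.cut(df["age"], bins, labels=labels)

# right=False: intervals become [0, 20), [20, 40), ... so 40 moves to "old"
# and 60 moves to "very-old".
df["age_bin_left"] = pd.cut(df["age"], bins, labels=labels, right=False)
print(df)
```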
Bin by Quantile
Instead of specifying bin edges manually, you can also bin a column by quantile using the qcut() function. The qcut() function takes a continuous variable and a number of quantiles and returns a categorical variable representing the quantile intervals.
df['age_bin'] = pd.qcut(df['age'], q=4, labels=False)
print(df)
Output:
age age_bin
0 18 0
1 25 0
2 30 0
3 35 1
4 40 1
5 45 2
6 50 2
7 55 3
8 60 3
9 65 3
In this example, we used the qcut() function to bin the age column into four quantiles. Because labels=False is passed, the result holds integer bin indices from 0 to 3 instead of interval labels. The quartile edges for this data are 31.25, 42.5, and 53.75, so 30 falls in bin 0 and 55 falls in bin 3.
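To inspect where qcut places the quantile edges, retbins=True returns them alongside the bin indices. A minimal sketch using the same ten ages:

```python
import pandas as pd

ages = pd.Series([18, 25, 30, 35, 40, 45, 50, 55, 60, 65], name="age")

# retbins=True also returns the computed quantile edges.
codes, edges = pd.qcut(ages, q=4, labels=False, retbins=True)

print(edges.tolist())   # [18.0, 31.25, 42.5, 53.75, 65.0]
print(codes.tolist())   # [0, 0, 0, 1, 1, 2, 2, 3, 3, 3]
```

With ten values the quartiles cannot split the data into four equal groups, so the bins hold 3, 2, 2, and 3 values.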
- @Saturn Cloud. (June 19, 2023). Binning a Column with Python Pandas. saturncloud. [Blog post]. Retrieved February 28, 2024, from https://saturncloud.io/blog/binning-a-column-with-python-pandas/#specify-bin-labels
How to insert empty values into a pandas dataframe
import pandas as pd
# create a sample DataFrame
df = pd.DataFrame({'col1': [1, 2, None], 'col2': [3, None, 5]})
# check for null values
null_df = df.isnull()
print(null_df)
'''
Output:
col1 col2
0 False False
1 False True
2 True False
'''
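When this DataFrame is written to CSV with df.to_csv(index=False), missing values become empty fields (None is omitted entirely). Both columns are stored as floats because they contain missing values, so the numbers are written with a decimal point:
col1,col2
1.0,3.0
2.0,
,5.0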
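A short sketch of how missing values are written, and how the na_rep parameter of to_csv can replace the empty field with a marker of your choice (the "NA" marker is an arbitrary choice for this example):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, None], "col2": [3, None, 5]})

# With no path given, to_csv returns the CSV text as a string.
default_csv = df.to_csv(index=False)
print(default_csv)

# na_rep sets the text written for missing values (here the string "NA").
marked_csv = df.to_csv(index=False, na_rep="NA")
print(marked_csv)
```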
- How to Check if a Particular Cell in Pandas DataFrame is Null. (June 19, 2023). [Blog post]. saturncloud. Retrieved from https://saturncloud.io/blog/how-to-check-if-a-particular-cell-in-pandas-dataframe-is-null/
How to append pandas dataframe to existing csv file
import os
# if file does not exist write header
if not os.path.isfile(output_filename):
    df.to_csv(output_filename, index=False, mode='a')
else:  # else it exists so append without writing the header
    df.to_csv(output_filename, index=False, header=False, mode='a')
📓 Note: if you skip the file-existence check and always call
df.to_csv('GFG.csv', mode='a', index=False)
the header is written again on every iteration, like:
dataset_name,algo_name,threshold
cars,no_resample,0.1
dataset_name,algo_name,threshold
cars,no_resample,0.1
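The existence check can be wrapped in a small helper so that every call site appends correctly. This is a sketch only: append_to_csv, the results.csv file name, and the sample row are hypothetical names chosen for this example.

```python
import os
import tempfile

import pandas as pd


def append_to_csv(df, path):
    """Append df to path, writing the header only when the file is new."""
    write_header = not os.path.isfile(path)
    df.to_csv(path, mode="a", index=False, header=write_header)


# Made-up result row, appended twice to a file in a temporary directory.
batch = pd.DataFrame(
    {"dataset_name": ["cars"], "algo_name": ["no_resample"], "threshold": [0.1]}
)
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "results.csv")
    for _ in range(2):
        append_to_csv(batch, out)
    with open(out) as f:
        lines = f.read().splitlines()

print(lines)
# ['dataset_name,algo_name,threshold', 'cars,no_resample,0.1', 'cars,no_resample,0.1']
```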
- Padraic Cunningham. [@Padraic Cunningham]. (Jun 22, 2015). Not sure there is a way in pandas but checking if the file exists would be a simple approach: import os # if file does not exist write header if not os.path.isfile('filename.csv'): df.to_csv('filename.csv', header='column_names') else: # else it exists so append without writing the header df.to_csv('filename.csv', mode='a', header=False). [Answer]. stackoverflow. https://stackoverflow.com/questions/30991541/pandas-write-csv-append-vs-write/30991707#30991707
How to split training dataframe to X and y
# load whole dataset from csv
df = pd.read_csv(CSV_FILENAME)
# drop the target column to get the features;
# drop() returns a new DataFrame, so df is unchanged
# (with inplace=True, df itself would be modified instead)
X = df.drop(columns=[target_column])
# df[[colname1, colname2, ...]] returns a new DataFrame (a copy);
# df is untouched
y = df[[target_column]]
How to concatenate dataframes X and y back into one
Use df = pd.concat([X, y], axis=1)
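A round trip on a small made-up DataFrame shows that splitting and concatenating along axis=1 restores the original, provided the target was the last column. The column names f1, f2, and label are assumptions for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"f1": [1, 2, 3], "f2": [0.1, 0.2, 0.3], "label": [0, 1, 0]})
target_column = "label"

X = df.drop(columns=[target_column])
y = df[[target_column]]

# axis=1 places the frames side by side, aligning rows on the index.
restored = pd.concat([X, y], axis=1)
print(restored.equals(df))   # True
```

Because concat aligns on the index, X and y should keep the same index; if one of them was reset or shuffled, rows would be matched by label and could produce NaN values.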
-
Pandas DataFrame: How to concatenate with Python examples. (August 15, 2023). [Blog post]. Capital One Tech. Retrieved April 2, 2024, from https://www.capitalone.com/tech/open-source/pandas-dataframe-concat/
-
Merge, join, concatenate and compare. (n.d.). pandas. Retrieved April 2, 2024, from https://pandas.pydata.org/docs/user_guide/merging.html#concatenating-series-and-dataframe-together
-
How to add one row in an existing Pandas DataFrame?. (04 Jan, 2024). geeksforgeeks. Retrieved from https://www.geeksforgeeks.org/how-to-add-one-row-in-an-existing-pandas-dataframe/
How to select a subset of a DataFrame?
# I’m interested in the names of the passengers older than 35 years.
In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
In [24]: adult_names.head()
Out[24]:
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
6 McCarthy, Mr. Timothy J
11 Bonnell, Miss. Elizabeth
13 Andersson, Mr. Anders Johan
15 Hewlett, Mrs. (Mary D Kingcome)
Name: Name, dtype: object
# I’m interested in rows 10 till 25 and columns 3 to 5.
In [25]: titanic.iloc[9:25, 2:5]
Out[25]:
Pclass Name Sex
9 2 Nasser, Mrs. Nicholas (Adele Achem) female
10 3 Sandstrom, Miss. Marguerite Rut female
11 1 Bonnell, Miss. Elizabeth female
12 3 Saundercock, Mr. William Henry male
13 3 Andersson, Mr. Anders Johan male
.. ... ... ...
20 2 Fynney, Mr. Joseph J male
21 2 Beesley, Mr. Lawrence male
22 3 McGowan, Miss. Anna "Annie" female
23 1 Sloper, Mr. William Thompson male
24 3 Palsson, Miss. Torborg Danira female
[16 rows x 3 columns]
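💡 Tip: Assigning through loc/iloc on the DataFrame itself (for example titanic.loc[titanic["Age"] > 35, "Name"] = "x") changes the original DataFrame. A subset stored in a new variable may be a copy, so edits to it are not guaranteed to reach the original; call .copy() when you want an independent DataFrame.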
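The two cases that behave the same way regardless of pandas version are direct assignment through .loc and an explicit .copy(). A minimal sketch with made-up passenger data:

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 38, 54], "Fare": [7.25, 71.28, 51.86]})

# Assigning through .loc on df itself writes into df.
df.loc[df["Age"] > 35, "Fare"] = 0.0
print(df["Fare"].tolist())   # [7.25, 0.0, 0.0]

# An explicit copy is independent: edits to it never reach df.
older = df.loc[df["Age"] > 35].copy()
older["Fare"] = 99.0
print(df["Fare"].tolist())   # [7.25, 0.0, 0.0]
```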
-
How do I select a subset of a DataFrame?. (n.d.). pandas. Retrieved April 2, 2024, from https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-select-a-subset-of-a-dataframe
-
JohnE. (Sep 5, 2018). Case A: Changes to df2 should NOT affect df1 This is trivial, of course. You want two completely independent dataframes so you just explicitly make a copy: df2 = df1.copy() After this anything you do to df2 affects only df2 and not df1 and vice versa. Case B: Changes to df2 should ALSO affect df1. [Answer]. stackoverflow. https://stackoverflow.com/questions/48173980/pandas-knowing-when-an-operation-affects-the-original-dataframe/52182208#52182208
How to Change All the Values of a Column in a Pandas DataFrame
Step 1: Select the Column
To select a column in Pandas, you can use the bracket notation with the column name as the key. For example, if you have a DataFrame df with a column named age, you can select the column like this:
age_column = df['age']
Step 2: Apply a Function to Each Value
Once you have selected the column, you can use the .apply() method to apply a function to each value in the column. The function should take a single argument, which will be the current value of the column. It should return the new value that you want to replace the current value with.
For example, let’s say you want to replace all the values in the age column with their square roots. You could define a function like this:
import numpy as np
def sqrt(x):
    return np.sqrt(x)
Then, you can use the .apply() method to apply this function to each value in the age column:
new_age_column = age_column.apply(sqrt)
This will create a new Series new_age_column with the square root of each value in the age column.
Step 3: Assign the New Values Back to the Column
Finally, you can assign the new values back to the original column. You can do this by using the bracket notation to select the column, and then assigning the new Series to it. For example:
df['age'] = new_age_column
This will replace the age column in the original DataFrame df with the new values.
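Put together, the three steps can be sketched on a small made-up DataFrame as follows. The vectorized line at the end is an alternative to .apply(), not part of the steps above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [4, 9, 16]})

# Step 1: select the column.
age_column = df["age"]

# Step 2: apply a function to the values.
new_age_column = age_column.apply(np.sqrt)

# Step 3: assign the result back to the column.
df["age"] = new_age_column
print(df["age"].tolist())   # [2.0, 3.0, 4.0]

# NumPy functions also accept a whole Series, which skips .apply().
vectorized = np.sqrt(pd.Series([4, 9, 16]))
print(vectorized.tolist())  # [2.0, 3.0, 4.0]
```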
Example
RULESET = {
    ' United-States': 'us',
}
def chg_to_ruleset(v, ruleset):
    if v in ruleset:
        return ruleset[v]
    else:
        return 'others'
y = df[['native-country']]
df[['native-country']] = y.applymap(lambda x: chg_to_ruleset(x, ruleset=RULESET))
⚠️ Warning: applymap is deprecated since version 2.1.0: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
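Since only one column is involved here, Series.map is another option that sidesteps the deprecated method. The sketch below assumes a tiny made-up native-country column; the leading spaces in the values match the RULESET key.

```python
import pandas as pd

RULESET = {" United-States": "us"}

df = pd.DataFrame(
    {"native-country": [" United-States", " Mexico", " United-States"]}
)

# Series.map applies the function to each value of a single column.
# dict.get returns "others" for any value missing from RULESET.
df["native-country"] = df["native-country"].map(
    lambda v: RULESET.get(v, "others")
)
print(df["native-country"].tolist())   # ['us', 'others', 'us']
```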
-
How to Change All the Values of a Column in a Pandas DataFrame. (June 19, 2023). [Blog post]. saturncloud. Retrieved from https://saturncloud.io/blog/pandas-how-to-change-all-the-values-of-a-column/
-
pandas.DataFrame.applymap. (n.d.). pandas. Retrieved April 2, 2024, from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html
Additional Reading
- @古明地盆. (18 Mar, 2020.). 详解pandas的read_csv方法. cnblogs. [Blog post]. Retrieved February 28, 2024, from https://www.cnblogs.com/traditional/p/12514914.html