
Adding a new column in my existing dataframe in pandas


Original Dataframe is

column_one
1 
1
1
45
45
55
55
56

Expected Output
column-new   
i_1
i_1
i_1
i_2
i_2
i_3
i_3
i_4

Based on column_one, I want to add a new column to my dataframe. Rows where the same value repeats consecutively should get the same 'i_' label, and the number should increase by one each time the value changes. Thank you in advance.


Answers

answered 1 week ago yatu #1

You could use pd.factorize. From the docs:

Useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.

So it will encode each new value it encounters as an enumerated type. Afterwards you can simply add the 'i_' prefix to the new_col:

df['new_col'] = (df.col1.factorize()[0] + 1).astype(str)
df['new_col'] = 'i_' + df.new_col

Output

    col1 new_col
0     1     i_1
1     1     i_1
2     1     i_1
3    45     i_2
4    45     i_2
5    55     i_3
6    55     i_3
7    56     i_4
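One thing to be aware of: factorize labels by distinct value, so a value that comes back later reuses its earlier label. If "consecutive" in the question means every new run should get a fresh label, a shift/cumsum approach is one option. This is only a sketch on made-up data, and the names by_value and by_run are purely illustrative, not part of the answer above.

```python
import pandas as pd

# Hypothetical data: the value 1 reappears after a run of 45s.
s = pd.Series([1, 1, 45, 45, 1], name="col1")

# factorize: one code per distinct value, so the late 1 reuses the first label.
by_value = "i_" + pd.Series(s.factorize()[0] + 1, index=s.index).astype(str)

# Run-based alternative (not from the answer): start a new label whenever
# the value differs from the previous row.
run_id = (s != s.shift()).cumsum()
by_run = "i_" + run_id.astype(str)

print(pd.DataFrame({"col1": s, "by_value": by_value, "by_run": by_run}))
```

On the data in the question both approaches give the same result, because no value reappears after a different one.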

answered 1 week ago HadynB #2

Depending on how fast this needs to be, you can look into using categoricals (dtype="category"), as reading their codes should be very fast on large data sets.

If you import your data as a category data type, this will already determine which are the unique values.

df["col1"] = df["col1"].astype('category')

From here, you can prepend the "i_" prefix to the category code, remembering to add 1 so that the numbering begins at 1 rather than 0:

df['newcol1'] = "i_" + (df["col1"].cat.codes + 1).astype(str)

Output

  col1 newcol1
0    1     i_1
1    1     i_1
2    1     i_1
3   45     i_2
4   45     i_2
5   55     i_3
6   55     i_3
7   56     i_4
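Note that category codes follow the sorted order of the categories, while factorize numbers values in order of first appearance. The two agree on the question's data only because it is already sorted. A small sketch on hypothetical unsorted data (the variable names are illustrative) shows the difference:

```python
import pandas as pd

# Hypothetical unsorted data, chosen so the two encodings disagree.
s = pd.Series([45, 1, 45, 56], name="col1")

# Categories are sorted for sortable values, so 1 gets the lowest code
# even though 45 appears first.
cat_codes = (s.astype("category").cat.codes + 1).tolist()

# factorize numbers values in order of first appearance instead.
fact_codes = (s.factorize()[0] + 1).tolist()

print(cat_codes)   # [2, 1, 2, 3]
print(fact_codes)  # [1, 2, 1, 3]
```

If the labels must follow first appearance on unsorted input, factorize is probably the safer choice of the two.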

Timings

The category codes are computed once, when the column is converted, so reading them afterwards does no further work. Timing that lookup against the factorize function for a column of 10,000,000 values between 0 and 1000 shows the category approach to be far faster, because factorize has to encode the column each time it is called.

Note that the conversion to a categorical carries an initial setup overhead (also shown for completeness), so the factorize function would be the better choice if you are only performing this once.

Categoricals: 0 ms
Factorize: 2092 ms
Categoricals Conversion: 3253 ms

Timings Code:

import numpy as np
import pandas as pd
import time

def timing(label, fn):
    # Time a single call to fn and print the elapsed wall-clock time in ms.
    t0 = time.time()
    fn()
    t1 = time.time()
    print('%s: %d ms' % (label, int((t1 - t0) * 1000)))


df = pd.DataFrame(np.random.randint(low=0, high=1000, size=(100000000, 1)), columns=["col1"])

df["col1"] = df["col1"].astype('category')

timing('Categoricals', lambda: (df.col1.cat.codes))

timing('Factorize', lambda: (df.col1.factorize()))
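The results list a conversion timing that the code above does not measure. One way it could be measured is sketched below. The row count n_rows is an assumption, picked much smaller than the benchmark so it runs quickly, and time.perf_counter is swapped in for time.time; absolute numbers will differ from those quoted.

```python
import time

import numpy as np
import pandas as pd

# Assumed size: far smaller than the answer's benchmark so it runs quickly.
n_rows = 1_000_000
df = pd.DataFrame({"col1": np.random.randint(low=0, high=1000, size=n_rows)})

t0 = time.perf_counter()
as_category = df["col1"].astype("category")  # one-off conversion cost
t1 = time.perf_counter()
codes = as_category.cat.codes                # lookup on already-converted data
t2 = time.perf_counter()

print("Categoricals Conversion: %d ms" % ((t1 - t0) * 1000))
print("Categoricals: %d ms" % ((t2 - t1) * 1000))
```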
