If performance matters, you can look into using categoricals (dtype="category"), as they are typically very fast with large data sets.
Converting a column to the category data type determines its unique values once, up front:
df["col1"] = df["col1"].astype('category')
From here, you can prepend the leading "i_" to the category code, adding 1 so that the numbering begins at 1 rather than 0:
df['newcol1'] = "i_" + (df["col1"].cat.codes + 1).astype(str)
Output
   col1 newcol1
0     1     i_1
1     1     i_1
2     1     i_1
3    45     i_2
4    45     i_2
5    55     i_3
6    55     i_3
7    56     i_4
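To make the steps above easy to reproduce, here is a minimal self-contained sketch. The sample values are simply the ones from the output shown above; everything else uses the same calls as the answer.

```python
import pandas as pd

# Sample data matching the output shown above.
df = pd.DataFrame({"col1": [1, 1, 1, 45, 45, 55, 55, 56]})

# Converting to a categorical works out the unique values once.
df["col1"] = df["col1"].astype("category")

# For sortable values the categories are kept in sorted order,
# and each code is the 0-based position of the value in that list.
print(df["col1"].cat.categories.tolist())  # [1, 45, 55, 56]
print(df["col1"].cat.codes.tolist())       # [0, 0, 0, 1, 1, 2, 2, 3]

# Shift the codes to start at 1 and prepend the "i_" label.
df["newcol1"] = "i_" + (df["col1"].cat.codes + 1).astype(str)
print(df)
```

One thing to watch for: missing values get the code -1, so they would come out as "i_0" with this approach.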
Timings
As the code simply reads the stored category codes, timing the category lookup against the factorize function for a column of 100,000,000 values between 0 and 1000 shows the category approach to be far faster. This is because the lookup does no computation: the codes were already worked out when the column was converted.
It should be noted that the conversion itself carries an initial setup overhead (also shown for completeness), so the factorize function would be the better choice if you only need to do this once.
Categoricals: 0 ms
Factorize: 2092 ms
Categoricals Conversion: 3253 ms
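For the one-off case, a factorize-based version might look like the sketch below. The sample values are made up for illustration. Note that `sort=True` is needed to get the same numbering as the categorical codes; by default `pd.factorize` numbers the values in order of first appearance.

```python
import pandas as pd

# Hypothetical sample data, deliberately unsorted.
df = pd.DataFrame({"col1": [55, 1, 45, 1, 56]})

# codes: 0-based integer label per row; uniques: the distinct values.
# sort=True numbers the distinct values in sorted order.
codes, uniques = pd.factorize(df["col1"], sort=True)

# Wrap the NumPy codes in a Series so the string concatenation is element-wise.
df["newcol1"] = "i_" + pd.Series(codes + 1, index=df.index).astype(str)
print(df)
```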
Timings Code:
import numpy as np
import pandas as pd
import time
def timing(label, fn):
    # Run fn once and print the elapsed wall-clock time in milliseconds.
    t0 = time.time()
    fn()
    t1 = time.time()
    print('%s: %d ms' % (label, int((t1 - t0) * 1000)))
df = pd.DataFrame(np.random.randint(low=0, high=1000, size=(100000000, 1)), columns=["col1"])
df["col1"] = df["col1"].astype('category')
timing('Categoricals', lambda: (df.col1.cat.codes))
timing('Factorize', lambda: (df.col1.factorize()))
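The code above does not time the conversion step itself. A variant that measures all three steps could look like the following sketch. It makes a few assumptions that are not in the original: a smaller column (`n = 1_000_000`) so it runs quickly, `time.perf_counter` instead of `time.time`, and factorize applied to the raw integer column rather than the already converted one. Actual timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def timed(label, fn):
    """Run fn once, print the elapsed time in ms, and return its result."""
    t0 = time.perf_counter()
    result = fn()
    print(f"{label}: {(time.perf_counter() - t0) * 1000:.1f} ms")
    return result


n = 1_000_000  # assumed size, smaller than in the original benchmark
rng = np.random.default_rng(0)
raw = pd.Series(rng.integers(0, 1000, size=n), name="col1")

# One-off setup cost of building the categorical.
cat = timed("Categoricals conversion", lambda: raw.astype("category"))

# Reading the codes from the converted column.
cat_codes = timed("Categoricals", lambda: cat.cat.codes)

# Factorizing the raw column; sort=True gives the same numbering as the codes.
fact_codes, uniques = timed("Factorize", lambda: pd.factorize(raw, sort=True))
```

Both routes produce the same labels, so the choice comes down to whether the conversion cost is paid once and reused or not.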