INDEX
Explanations
first name followed by last name
Negative Logits
PT  0.47
8  0.45
3  0.43
City  0.42
Citizens  0.41
4  0.41
ys  0.39
iki  0.39
UM  0.39
ama  0.39
Positive Logits
ாய்ச்சி  0.40
ॉफ्ट  0.40
当然  0.39
ficar  0.39
ovviamente  0.39
admittedly  0.37
桝  0.37
जुर्ग  0.37
翀  0.37
自分で  0.37
Activations Density 0.001%