INDEX
Explanations
references to cultural concepts and identities
Negative Logits
ildo    -0.19
elson   -0.19
uten    -0.17
ovel    -0.15
odes    -0.15
olson   -0.14
omal    -0.14
orsk    -0.14
stadt   -0.14
abouts  -0.14
Positive Logits
ìĿ´ìĸ´  0.15
Shock   0.15
prompt  0.15
PEED    0.14
lsi     0.14
IMIT    0.14
RYPT    0.14
oho     0.14
Zug     0.14
ipay    0.14
Activations Density: 0.008%