INDEX
Explanations
specific names, potentially foreign, with characters like 'Ã' and 'ĸ'
instances of certain characters or names in text
Negative Logits
Token           Logit
Indigo          -0.77
Cassidy         -0.74
Annex           -0.68
itute           -0.67
Lesbian         -0.66
Indianapolis    -0.66
situ            -0.66
OWER            -0.64
Crus            -0.64
arts            -0.63
Positive Logits
Token           Logit
Ãĸ              1.14
nder            0.92
sten            0.89
istani          0.89
yip             0.86
uria            0.84
ön              0.82
Gö              0.81
oÄŁan           0.80
thal            0.79
Activations Density 0.005%