INDEX
Explanations
the presence of the word "La" in various contexts
Negative Logits
auc        -0.19
rest       -0.15
phan       -0.14
h          -0.14
Maxim      -0.14
ês         -0.14
sea        -0.14
bystand    -0.14
attle      -0.14
rias       -0.14
Positive Logits
unched     0.26
uren       0.23
ikip       0.20
undry      0.20
uded       0.19
urence     0.19
zyst       0.19
mgr        0.18
uder       0.18
oshi       0.17
Activations Density 0.021%