INDEX
Explanations
phrases indicating requests or demands from authority figures
Negative Logits
contr            -0.19
677              -0.15
893              -0.15
semiclassical    -0.14
805              -0.14
rella            -0.14
alia             -0.14
CONTR            -0.13
829              -0.13
veys             -0.13
Positive Logits
alink            0.16
anlı             0.15
å³               0.15
PureComponent    0.14
ev               0.14
anoi             0.14
üc               0.14
oho              0.13
ighet            0.13
chie             0.13
Activations Density 0.059%