Explanations
references to power dynamics and control over situations or entities
Negative Logits
urat     -0.16
inç      -0.15
esome    -0.15
enburg   -0.15
вин      -0.14
ůr       -0.14
载       -0.14
URED     -0.14
iets     -0.14
prit     -0.13
Positive Logits
over     0.56
sobre    0.37
_over    0.37
над      0.35
over     0.33
Over     0.33
över     0.33
OVER     0.32
Over     0.31
过       0.29
Activations Density 0.082%