INDEX
Explanations
words that indicate strong actions, such as those related to affirmations, proposals, and findings
Negative Logits
Token      Logit
ToLocal    -0.17
erts       -0.16
inka       -0.15
cake       -0.15
dabei      -0.15
ерта       -0.15
uib        -0.14
ved        -0.14
assin      -0.14
roke       -0.14

Positive Logits
Token      Logit
already    0.23
already    0.23
Already    0.20
even       0.18
Already    0.18
even       0.18
_already   0.18
جار        0.16
elsewhere  0.16
EVEN       0.16
Activations Density 0.005%