INDEX
Explanations
negation terms or phrases
Negative Logits
Token     Logit
myſelf    -1.56
Efq       -1.51
Jefus     -1.49
itſelf    -1.37
ſeveral   -1.35
ſche      -1.33
Reſ       -1.32
pleaſure  -1.31
raiſ      -1.31
purpoſe   -1.28
Positive Logits
Token     Logit
not       1.86
Not       1.40
not       1.39
Not       1.19
NOT       1.18
cannot    1.10
nicht     1.01
NOT       1.01
t         0.96
tidak     0.94
Activations Density: 0.220%