INDEX
Explanations
words that indicate personal opinions or judgments
Negative Logits
appropri    -0.17
_ipc        -0.16
rame        -0.16
нев         -0.15
Appropri    -0.14
ültür       -0.14
ASTER       -0.14
htable      -0.14
ame         -0.14
ehler       -0.14
Positive Logits
ax           0.19
obvious      0.18
observation  0.18
Consult      0.17
reasonable   0.17
observable   0.17
acknow       0.17
Universal    0.16
Observ       0.16
observ       0.16
Activations Density: 0.004%