INDEX
Explanations
expressions of personal opinions or beliefs
Negative Logits
seemingly     -0.27
seem          -0.25
seems         -0.24
nicht         -0.23
seemed        -0.22
không         -0.21
Seems         -0.21
ikke          -0.20
apparently    -0.20
not           -0.20
Positive Logits
fair          0.20
fair          0.19
fairly        0.17
overall       0.17
'].$          0.16
overall       0.16
Fair          0.16
ораль         0.16
mostly        0.15
.safe         0.15
Activations Density 0.205%