Explanations
phrases that express subjective evaluations or opinions
Negative Logits
bergen      -0.15
hypoc       -0.14
oeff        -0.14
ulin        -0.14
eland       -0.14
odpowied    -0.13
iske        -0.13
.twitch     -0.13
rid         -0.13
_um         -0.12
Positive Logits
pure        0.22
tant        0.18
Pure        0.18
Pure        0.18
yet         0.18
sheer       0.17
akin        0.17
PURE        0.17
wish        0.17
Exhibit     0.16
Activations Density: 0.307%