INDEX
Explanations
expressions of belief or opinion
Negative Logits
ughter     -0.18
ughters    -0.17
ν          -0.17
ein        -0.17
vig        -0.16
benh       -0.16
ez         -0.16
udy        -0.15
gew        -0.15
cono       -0.15
Positive Logits
twice          0.29
Twice          0.26
about          0.23
-about         0.22
fully          0.19
differently    0.19
alike          0.19
_about         0.19
lessly         0.19
tank           0.18
Activations Density 0.070%