INDEX
Explanations
words or phrases related to arguments or debates
Negative Logits
TRS      -0.15
chen     -0.15
akter    -0.15
igans    -0.15
zial     -0.15
closure  -0.15
closure  -0.15
chsel    -0.14
ral      -0.14
eh       -0.14
Positive Logits
uably    0.35
entin    0.32
onaut    0.31
entine   0.29
uing     0.28
entina   0.28
yle      0.27
uable    0.27
ued      0.27
inine    0.26
Activations Density 0.007%