Explanations
references to debates or discussions involving opposing viewpoints
Negative Logits
esters    -0.16
igan      -0.15
кав       -0.15
ikut      -0.15
orian     -0.14
ties      -0.14
ucha      -0.14
rack      -0.14
İÅŀ       -0.14
indow     -0.14
Positive Logits
ative        0.25
against      0.22
against      0.21
inine        0.20
arg          0.20
uably        0.20
=args        0.20
UMENT        0.19
atively      0.19
(argument    0.19
Activations Density: 0.023%