Explanations
phrases that indicate conflict or truthfulness in arguments
Negative Logits
Token        Logit
utz          -0.15
ale          -0.14
Stanton      -0.14
.intellij    -0.14
.mvc         -0.13
deduction    -0.13
qu           -0.13
Unt          -0.13
пи           -0.13
Joined       -0.12
Positive Logits
Token        Logit
етель         0.17
emmel         0.16
edo           0.16
oire          0.16
usher         0.16
emm           0.15
àm            0.15
ones          0.15
Entr          0.15
/effects      0.14
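Logit lists like the two above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and reading off the highest- and lowest-scoring vocabulary tokens. The sketch below assumes that procedure. The function name `logit_effects` and the toy matrices are hypothetical, and nothing on this page confirms that these particular numbers were computed exactly this way.

```python
def logit_effects(decoder_row, unembed, vocab, k=10):
    """Project one feature's decoder direction onto the vocabulary.

    decoder_row: list of d_model floats (hypothetical feature direction)
    unembed: d_model x vocab_size matrix, given as a list of rows
    vocab: list of token strings, length vocab_size
    Returns (top_positive, top_negative) as lists of (token, score) pairs.
    """
    d_model = len(decoder_row)
    # Score for token j is the dot product of the decoder row with column j of the unembedding.
    scores = [
        sum(decoder_row[i] * unembed[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    # Sort ascending by score so the most negative tokens come first.
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    top_negative = ranked[:k]        # most negative first
    top_positive = ranked[::-1][:k]  # most positive first
    return top_positive, top_negative


if __name__ == "__main__":
    # Toy 2-dimensional model with a 3-token vocabulary (synthetic values).
    pos, neg = logit_effects(
        [1.0, -0.5],
        [[0.2, -0.1, 0.05], [0.1, 0.3, -0.4]],
        ["a", "b", "c"],
        k=2,
    )
    print("positive:", pos)
    print("negative:", neg)
```

In a real model `unembed` would have tens of thousands of columns, so a vectorized library would normally replace the nested loops; the plain-Python version only shows the arithmetic.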
Activation Density: 0.193%
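Activation density on dashboards of this kind is commonly reported as the percentage of sampled token positions where the feature's activation exceeds zero. A minimal sketch, assuming that definition; the function name `activation_density`, the `threshold` parameter, and the sample below are illustrative assumptions, not details taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of token positions where the feature fires.

    activations: per-token activation values for one feature (assumed input)
    threshold: a position counts as firing if its activation is above this value
    """
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


if __name__ == "__main__":
    # Synthetic sample: 193 firing positions out of 100,000.
    sample = [0.0] * 99_807 + [1.0] * 193
    print(f"{activation_density(sample):.3f}%")
```

With the synthetic sample above, the function reports a density of 0.193%, matching the order of magnitude shown here: the feature is silent on the vast majority of tokens.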