Explanations
phrases indicating neglect or disregard
Negative Logits
V                  -0.66
Roderick           -0.65
fær                -0.64
AssemblyProduct    -0.64
T                  -0.63
vel                -0.61
d                  -0.61
B                  -0.61
AppCompat          -0.61
S                  -0.60
Positive Logits
ignore             1.50
ignored            1.50
ignoring           1.49
ignores            1.42
Ignore             1.42
gnore              1.32
Ignored            1.28
ignore             1.26
ignor              1.25
ignored            1.22
Activations Density 0.109%