INDEX
Explanations
words related to incorrectness or mistakes
phrases that indicate incorrectness or undesirable outcomes
Negative Logits
Token       Logit
doms        -0.80
dom         -0.74
thood       -0.71
lishes      -0.71
anism       -0.71
zeb         -0.68
punk        -0.67
renheit     -0.67
archives    -0.67
rs          -0.66
Positive Logits
Token          Logit
amount         0.97
thing          0.96
balance        0.87
combination    0.86
solution       0.85
kind           0.84
antidote       0.82
way            0.80
piece          0.80
side           0.79
Activations Density: 0.039%