INDEX
Explanations
obvious or straightforward concepts
instances of the word "obvious."
Negative Logits
nan          -0.93
ingers       -0.84
borg         -0.80
psey         -0.80
iership      -0.75
isol         -0.73
aeper        -0.73
rams         -0.72
rigan        -0.72
utherland    -0.68
Positive Logits
iary             0.92
obvious          0.79
holes            0.74
signs            0.71
LY               0.71
flaws            0.70
contradiction    0.70
omission         0.69
hole             0.69
Signs            0.68
Activations Density: 0.017%