INDEX
Explanations
statements that the neuron perceives to be true or accurate
phrases asserting the truthfulness of statements
Negative Logits
Token    Logit
uled     -0.75
adish    -0.72
rador    -0.71
acent    -0.68
aida     -0.68
hens     -0.68
asers    -0.68
ADRA     -0.68
onut     -0.66
Citiz    -0.65
Positive Logits
Token         Logit
believers     0.86
hood          0.84
regardless    0.78
believer      0.76
insofar       0.72
irrespective  0.70
portrayal     0.69
terday        0.68
izable        0.68
everywhere    0.68
Activations Density: 0.023%
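
The page does not say how the negative and positive logit lists above were produced. One common way to build lists of this kind is to project a neuron's output direction through the model's unembedding matrix and rank vocabulary tokens by the resulting score. The sketch below illustrates that idea under stated assumptions: the function name `top_logit_tokens`, the variables `w_out` and `W_U`, and all numeric values are hypothetical toy inputs, not this neuron's real weights or the dashboard's actual method.

```python
import numpy as np

def top_logit_tokens(w_out, W_U, vocab, k=10):
    """Rank tokens by a neuron's direct effect on their logits.

    w_out : (d_model,) hypothetical output direction of the neuron
    W_U   : (d_model, vocab_size) hypothetical unembedding matrix
    vocab : list of token strings, length vocab_size
    """
    effects = w_out @ W_U          # one score per vocabulary token
    order = np.argsort(effects)    # indices sorted from lowest to highest score
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy inputs only: four tokens borrowed from the lists above, made-up weights.
vocab = ["believers", "hood", "uled", "adish"]
W_U = np.array([[1.0, 0.5, -1.0, -0.5],
                [0.0, 1.0,  0.0, -1.0]])
w_out = np.array([0.8, 0.1])

negative, positive = top_logit_tokens(w_out, W_U, vocab, k=2)
print("negative:", negative)
print("positive:", positive)
```

With real weights, `k=10` would yield two ten-item lists in the same shape as the ones on this page, the most negative tokens first in one and the most positive first in the other.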