Explanations
words related to logic and reason
references to rationality
Negative Logits
RAW       -0.79
ammy      -0.78
rael      -0.74
luster    -0.73
chin      -0.73
hold      -0.71
kick      -0.69
HI        -0.68
Nou       -0.68
IG        -0.68
Positive Logits
izations  1.08
ization   0.99
isations  0.98
izes      0.95
tarian    0.95
iation    0.94
istic     0.94
izing     0.91
iated     0.86
ized      0.86
Activations Density: 0.005%