INDEX
Explanations
terms related to rational thinking or reasoning
references to rationality or logical reasoning
Negative Logits
luster    -0.82
ammy      -0.78
RAW       -0.77
Banner    -0.75
hold      -0.73
yang      -0.70
rael      -0.70
HI        -0.69
IG        -0.69
chin      -0.69
Positive Logits
izations     1.11
isations     1.00
ization      1.00
izes         0.98
istic        0.91
izers        0.90
iation       0.90
izing        0.90
istically    0.88
iated        0.85
Activations Density: 0.007%