INDEX
Explanations
phrases or words related to logical reasoning or justification
occurrences and mentions of the concept of "reason."
Negative Logits
avorite      -0.77
Carbuncle    -0.73
Observer     -0.64
semble       -0.64
ibaba        -0.63
ModLoader    -0.62
omez         -0.60
arb          -0.60
eatures      -0.59
itals        -0.58
Positive Logits
abl          1.36
ably         1.00
why          0.96
ality        0.83
boards       0.80
lessly       0.79
WHY          0.78
neum         0.77
ptr          0.77
why          0.76
Activations Density: 0.033%