INDEX
Explanations
adjectives describing behaviors or characteristics
Negative Logits
iverse         -0.78
undown         -0.75
artifacts      -0.71
isites         -0.70
orthy          -0.69
Ranked         -0.67
imester        -0.66
Sphere         -0.66
fields         -0.66
Gutenberg      -0.66
Positive Logits
optimism        1.10
caution         1.06
refusal         1.01
attitude        1.00
honesty         1.00
humility        1.00
demeanor        0.99
prag            0.99
indignation     0.98
arrogance       0.97
Activation Density: 3.839%