INDEX
Explanations
adjectives representing intensity or scale
concepts related to potential outcomes or consequences
Negative Logits
nar        -0.74
antic      -0.70
stice      -0.70
Centauri   -0.69
scope      -0.67
Canadian   -0.66
Å          -0.63
ses        -0.63
ashi       -0.63
anc        -0.63
Positive Logits
WHEN       1.36
whenever   1.08
if         1.08
when       1.04
unless     0.96
BEFORE     0.92
when       0.90
WHERE      0.85
When       0.84
AFTER      0.84
Activations Density 0.177%