INDEX
Explanations
adjectives describing intensity or severity
terms describing mildness or pleasantness
Negative Logits
Token           Logit
aucus           -0.67
Accountability  -0.66
hedral          -0.66
rencies         -0.65
lining          -0.64
aturated        -0.63
ilings          -0.62
funding         -0.60
etus            -0.60
Emir            -0.60
Positive Logits
Token           Logit
hello           0.79
harmless        0.78
annoyance       0.77
surprise        0.76
surprises       0.75
nuisance        0.75
»Ĵ              0.75
surpr           0.73
ew              0.71
prank           0.70
Activations Density: 0.091%