Explanations
words related to intense, negative, or violent actions
words associated with harsh or unforgiving conditions and experiences
Negative Logits
ullivan   -0.86
acular    -0.82
orem      -0.77
ators     -0.73
weeney    -0.73
ational   -0.72
orus      -0.71
istries   -0.71
trl       -0.70
ially     -0.69
Positive Logits
CVE       0.79
winters   0.76
Thro      0.72
unfor     0.71
cious     0.70
ãĥ©       0.69
Clicker   0.67
snowy     0.66
honest    0.64
ãĤ¨       0.63
Activations Density 0.024%
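
Two of the positive-logit tokens, ãĥ© and ãĤ¨, look like encoding debris but are consistent with how GPT-2 style byte-level BPE vocabularies display raw bytes. Assuming this feature comes from a model with such a tokenizer (the listing does not say), a minimal sketch for recovering the underlying text might look like the following. The table construction mirrors the one published with GPT-2's encoder; decode_token is a hypothetical helper name, not part of any dashboard API.

```python
import unicodedata


def bytes_to_unicode():
    """Byte -> printable-character table used by GPT-2 style byte-level BPE."""
    # Bytes that already map to a printable character keep that character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to an unused code point at 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Hypothetical helper: turn a displayed vocabulary entry back into text.

    Assumes the token comes from a GPT-2 style byte-level vocabulary.
    """
    # Recompose accents in case copy-paste split them into combining marks.
    token = unicodedata.normalize("NFC", token)
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # Partial UTF-8 sequences are possible for a single token, so do not raise.
    return raw.decode("utf-8", errors="replace")


print(decode_token("ãĥ©"))  # ラ
print(decode_token("ãĤ¨"))  # エ
```

Under that assumption the two tokens decode to the katakana ラ and エ. Tokens made only of printable ASCII, such as snowy, pass through unchanged.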