INDEX
Explanations
words associated with negative consequences or outcomes
phrases indicating causality or consequences
Negative Logits
Token      Logit
afort      -0.64
Shal       -0.64
iling      -0.62
pload      -0.62
Vaughn     -0.60
terday     -0.59
Sunshine   -0.58
Scotia     -0.58
atu        -0.57
tuber      -0.57
Positive Logits
Token      Logit
gers       0.91
entious    0.88
wcs        0.84
better     0.77
-+         0.76
iments     0.74
ges        0.71
ging       0.70
rush       0.68
GGGG       0.67
Activations Density 0.031%