INDEX
Explanations
positive statements or news
consistently positive phrases, emphasizing the concept of "good news."
Negative Logits
Token        Logit
itsch        -0.77
ĸļ           -0.71
eters        -0.69
TIT          -0.66
hler         -0.63
ustom        -0.62
otte         -0.62
replace      -0.61
ismo         -0.61
framework    -0.60
Positive Logits
Token        Logit
news         1.37
outweigh     1.01
thing        1.01
fortune      0.92
news         0.90
stuff        0.88
ol           0.86
folks        0.85
part         0.83
vib          0.82
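The page does not say how the two logit lists are produced. A common approach, assumed here rather than confirmed by the source, is to project the feature's decoder direction through the model's unembedding matrix and rank tokens by the resulting value. The sketch below illustrates that idea on toy data; `top_logit_tokens`, `decoder_direction`, and `unembedding` are hypothetical names, not part of any specific tool.

```python
def top_logit_tokens(decoder_direction, unembedding, vocab, k=10):
    """Rank tokens by the logit effect of a feature direction.

    decoder_direction: list of d floats (hypothetical feature direction).
    unembedding: list of d rows, each with len(vocab) floats (assumed layout).
    vocab: list of token strings, one per unembedding column.
    Returns (positive, negative): the k most promoted and k most
    suppressed tokens as (token, value) pairs.
    """
    d = len(decoder_direction)
    # Dot product of the direction with each vocabulary column.
    effects = [
        sum(decoder_direction[i] * unembedding[i][j] for i in range(d))
        for j in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return positive, negative


# Toy example with made-up numbers (not taken from the page).
toy_direction = [1.0, 0.5]
toy_unembedding = [
    [1.0, -1.0, 0.0],
    [0.5, 0.0, -0.5],
]
toy_vocab = ["news", "itsch", "thing"]
toy_positive, toy_negative = top_logit_tokens(
    toy_direction, toy_unembedding, toy_vocab, k=1
)
```

Under this reading, a token such as "news" at 1.37 is one the feature pushes the model toward predicting, while tokens in the negative list are pushed away.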
Activations Density 0.040%
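Activation density is usually read as the fraction of sampled tokens on which the feature fires at all. The sketch below assumes that definition (the page does not state it) and uses hypothetical helper names; the threshold of zero is also an assumption.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    activations: list of per-token activation values (hypothetical input).
    Returns a value in [0, 1]; 0.0 for an empty list.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


def format_density(density):
    """Render a density as a percentage with three decimal places."""
    return f"{density * 100:.3f}%"


# Toy data: the feature fires on 4 of 10,000 tokens.
toy_activations = [0.0] * 9996 + [1.0] * 4
toy_density = activation_density(toy_activations)
toy_label = format_density(toy_density)
```

With these toy numbers the label comes out as 0.040%, meaning the feature would be active on roughly 4 in every 10,000 tokens.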