INDEX
Explanations
the word "norm" with varying degrees of activation
references to societal or cultural standards
Negative Logits
Token        Logit
Ashes        -0.70
UGH          -0.64
hani         -0.60
Khe          -0.60
Kush         -0.60
Package      -0.60
pta          -0.60
Bowl         -0.59
Conspiracy   -0.58
Chargers     -0.57
Positive Logits
Token        Logit
ality        1.13
ativity      1.12
als          1.06
atively      0.94
ally         0.91
heastern     0.84
norm         0.82
uses         0.79
alties       0.79
heast        0.79
Activations Density: 0.010%