Explanations
the word "norm" followed by a high activation score
instances of the word "norm" and its variations
Negative Logits
Token     Logit
Ashes     -0.65
UGH       -0.65
hani      -0.64
Shades    -0.61
OTS       -0.60
Tea       -0.60
cig       -0.59
Kush      -0.59
Bowl      -0.57
lder      -0.57
Positive Logits
Token     Logit
ativity   1.02
ality     1.00
norm      0.85
als       0.84
atively   0.77
norm      0.76
ally      0.76
mble      0.74
alties    0.74
essage    0.73
Activations Density: 0.011%