INDEX
Explanations
references to traditional concepts or practices
Negative Logits
ropolis      -0.17
lying        -0.15
bras         -0.15
Tradition    -0.15
tradition    -0.15
Reputation   -0.15
liness       -0.15
hoa          -0.14
aging        -0.14
laus         -0.14
Positive Logits
ists         0.43
ist          0.38
ism          0.29
istic        0.28
ista         0.25
izing        0.25
ISTS         0.25
ise          0.25
ized         0.24
isti         0.24
Activations Density 0.031%