INDEX
Explanations
concepts related to commonality or shared characteristics
Negative Logits
tring       -0.15
anager      -0.15
ut          -0.15
/ph         -0.15
LM          -0.15
door        -0.15
hta         -0.15
uations     -0.14
actory      -0.14
idel        -0.14
Positive Logits
wealth       0.26
ities        0.19
est          0.19
ality        0.18
itized       0.17
emente       0.16
sense        0.16
denominator  0.16
abb          0.16
wy           0.15
Activations Density: 0.031%