INDEX
Explanations
phrases related to different forms of concepts
phrases indicating different kinds of forms or categories
Negative Logits
urers         -0.87
Zup           -0.79
ween          -0.73
doms          -0.70
iets          -0.70
Cosponsors    -0.70
teasp         -0.70
ostics        -0.70
nets          -0.67
omers         -0.67
Positive Logits
harassment        0.84
accommodation     0.81
thood             0.81
activism          0.79
inspiration       0.78
discrimination    0.76
insanity          0.74
dementia          0.74
taxation          0.71
humor             0.71
Activations Density 0.060%