INDEX
Explanations
words related to different groups of people and their roles or characteristics
references to various groups of people or professions
Negative Logits
DonaldTrump   -0.77
BALL          -0.66
forward       -0.65
ield          -0.62
ģ«            -0.61
ray           -0.60
paragraph     -0.58
VIEW          -0.58
UTERS         -0.58
dain          -0.57
Positive Logits
folk          1.20
themselves    0.97
'             0.86
hest          0.84
iest          0.83
']            0.80
layer         0.80
involved      0.74
heet          0.73
hip           0.70
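Logit lists like the two above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix, then keeping the most positive and most negative entries. The sketch below assumes that procedure. The decoder vector, the unembedding matrix, the placeholder tokens (tok_a ... tok_f) and the helper names logit_effects and top_and_bottom are all hypothetical toy values, not the data behind this feature.

```python
# Minimal sketch, ASSUMING logit effects = decoder_direction @ W_U,
# followed by top-k / bottom-k selection. All data below is toy data.

def logit_effects(decoder_direction, unembed):
    """Return one logit effect per vocabulary token.

    decoder_direction: list of d_model floats (hypothetical decoder row).
    unembed: d_model x vocab matrix given as a list of rows (hypothetical W_U).
    """
    vocab_size = len(unembed[0])
    return [
        sum(decoder_direction[i] * unembed[i][j] for i in range(len(decoder_direction)))
        for j in range(vocab_size)
    ]

def top_and_bottom(tokens, effects, k):
    """Return the k most positive and the k most negative (token, effect) pairs."""
    ranked = sorted(zip(tokens, effects), key=lambda pair: pair[1])
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return positive, negative

tokens = ["tok_a", "tok_b", "tok_c", "tok_d", "tok_e", "tok_f"]
decoder_direction = [1.0, -0.5]          # toy d_model = 2
unembed = [
    [1.0, 0.8, 0.6, -0.4, -0.5, -0.6],
    [-0.4, -0.3, -0.4, 0.6, 0.3, 0.34],
]

effects = logit_effects(decoder_direction, unembed)
positive, negative = top_and_bottom(tokens, effects, k=2)
for token, value in positive + negative:
    print(f"{token:<8}{value:+.2f}")
```

Sorting once and reading from both ends keeps the sketch simple; with a real vocabulary of tens of thousands of tokens, a partial selection such as heapq.nlargest / heapq.nsmallest would avoid a full sort.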
Activation Density: 0.231%
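Activation density is the share of token positions on which the feature fires. A minimal sketch, assuming "fires" means an activation strictly above a threshold of 0 and that density is reported as a percentage over every token in the sample. The function name activation_density and the activation values are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of token positions whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)

# Toy example: 1 active position out of 400 tokens.
acts = [0.0] * 399 + [3.1]
print(f"{activation_density(acts):.3f}%")
```

Under this definition, the 0.231% reported above corresponds to the feature firing on roughly one token in 433.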