INDEX
Explanations
names of political figures
NEGATIVE LOGITS
Token    Logit
pter     -0.87
ources   -0.70
hend     -0.70
ship     -0.69
ships    -0.69
lance    -0.68
inus     -0.66
oslav    -0.65
effic    -0.64
occ      -0.64
POSITIVE LOGITS
Token        Logit
adesh        0.61
è£ıè         0.61
lett         0.60
Point        0.55
corpus       0.54
Literature   0.54
warr         0.53
punch        0.53
she          0.52
surv         0.51
Activations Density 0.104%