INDEX
Explanations
mentions of specific people, particularly their names
Negative Logits
ArgsConstructor  -0.66
Guthrie          -0.62
evre             -0.61
SWE              -0.60
OIR              -0.60
łady             -0.59
Wiener           -0.58
rigo             -0.58
Fiore            -0.58
Melo             -0.58
Positive Logits
ub       2.06
UB       1.64
ubs      1.42
ubb      1.17
UB       1.13
ub       1.11
rub      0.98
ubli     0.98
ubber    0.97
uby      0.96
Activations Density: 0.059%