Explanations
the name "Brad" with varying activations
the repeated mention of the name "Brad."
Negative Logits
referen        -0.70
eleph          -0.69
ktop           -0.68
subsistence    -0.67
VALUE          -0.66
derog          -0.66
phis           -0.65
versa          -0.64
Magikarp       -0.63
wiret          -0.63
Positive Logits
shaw            1.20
enton           1.13
Pitt            1.01
ford            0.96
Brad            0.89
iago            0.89
street          0.85
anche           0.83
nan             0.83
bury            0.82
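Lists like the two above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and ranking the resulting per-token scores. The sketch below assumes that procedure; the decoder vector, unembedding matrix, and vocabulary are hypothetical toy values, and real dashboards may apply extra steps (for example, layer normalization) that are omitted here.

```python
def logit_effects(decoder_vec, unembed, vocab):
    """Score each vocabulary token by decoder_vec @ unembed (assumed method).

    decoder_vec: list of d_model floats (hypothetical feature direction)
    unembed: d_model x vocab_size nested list (hypothetical W_U)
    vocab: list of vocab_size token strings
    """
    d_model = len(decoder_vec)
    scores = []
    for j, tok in enumerate(vocab):
        # Dot product of the feature direction with column j of the unembedding.
        s = sum(decoder_vec[i] * unembed[i][j] for i in range(d_model))
        scores.append((tok, s))
    return scores


def top_and_bottom(scores, k):
    """Return up to k most negative and k most positive (token, score) pairs."""
    ranked = sorted(scores, key=lambda p: p[1])
    negative = [p for p in ranked if p[1] < 0][:k]            # most negative first
    positive = [p for p in reversed(ranked) if p[1] > 0][:k]  # most positive first
    return negative, positive


# Toy example with a 2-dimensional residual stream and a 4-token vocabulary.
vocab = ["Brad", "Pitt", "eleph", "shaw"]
decoder = [1.0, 0.5]
unembed = [
    [0.5, 0.75, -0.5, 1.0],
    [0.5, 0.5, -0.5, 0.25],
]
neg, pos = top_and_bottom(logit_effects(decoder, unembed, vocab), k=2)
print("negative:", neg)
print("positive:", pos)
```

Ranking by sign before truncating keeps a token with a small positive score from appearing in the negative list when the vocabulary is tiny.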
Activation density: 0.011%
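Assuming activation density means the fraction of sampled token positions on which the feature's activation is above zero, 0.011% corresponds to roughly 11 firing positions per 100,000 tokens. A minimal sketch of that calculation, where the function names and the zero threshold are illustrative assumptions:

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions where the feature fires above threshold."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


def as_percent(fraction):
    """Format a fraction as a percentage with three decimal places."""
    return f"{fraction * 100:.3f}%"


# Hypothetical sample: 11 firing positions out of 100,000 tokens.
acts = [0.0] * 99_989 + [1.0] * 11
print(as_percent(activation_density(acts)))
```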