Explanations
mentions of the word "Nash" at varying activations
references to a specific name or entity denoted by variations of the token "ash"
Negative Logits
pregn      -0.68
etheless   -0.67
STER       -0.60
worldly    -0.60
sworth     -0.58
ster       -0.58
oplan      -0.57
elig       -0.56
eering     -0.56
cavity     -0.56
Positive Logits
nikov      1.03
IELD       0.97
ield       0.91
anu        0.89
imi        0.87
adow       0.84
rine       0.84
IFT        0.84
ti         0.82
ares       0.82
Activations Density: 0.036%
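
The page does not say how the density figure is computed. A minimal sketch, assuming it means the fraction of sampled tokens on which the feature's activation is above zero, is shown here. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and used only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    fires at all; the source page does not confirm this definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, with the feature firing on 4 of them.
sample = [0.0] * 9996 + [1.2, 0.4, 3.1, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.036% would correspond to the feature firing on roughly 36 of every 100,000 tokens.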