INDEX
Explanations
names of people
the occurrence of the token "ha" in various contexts
Negative Logits
Token       Logit
atories     -0.80
rations     -0.74
papers      -0.69
rats        -0.68
entric      -0.67
tle         -0.67
é»Ĵ         -0.65
lio         -0.64
ocity       -0.63
outgoing    -0.63
Positive Logits
Token       Logit
wn          1.22
user        1.10
pless       1.01
illard      0.97
pta         0.89
emi         0.88
verty       0.87
ppa         0.87
pper        0.86
            0.86
Activations Density 0.025%