INDEX
Explanations
words related to deception and abstraction
Negative Logits
Palma         -0.85
Banten        -0.85
OCCURRED      -0.80
Palin         -0.80
bereits       -0.79
rootReducer   -0.79
ništ          -0.79
Barth         -0.78
substack      -0.78
kháu          -0.78
Positive Logits
Everybody     0.85
kids          0.83
everybody     0.83
Gimme         0.82
Everybody     0.82
Somebody      0.81
somebody      0.80
Somebody      0.78
Nobody        0.76
Nobody        0.72
Activations Density: 0.200%