Explanations
expressions of desire and intention
Negative Logits
sahiptir     -0.82
obtaining    -0.75
terlebih     -0.68
Obtaining    -0.66
pertanto     -0.66
occurring    -0.64
obtains      -0.63
utilizing    -0.61
disposent    -0.61
viewing      -0.61
Positive Logits
tell       0.78
fuck       0.78
figure     0.71
fuckin     0.68
get        0.68
fucking    0.67
fix        0.66
ruin       0.65
freak      0.64
say        0.63
Activations Density: 0.477%