INDEX
Explanations
links or keywords related to websites or forums
the presence of end-of-text markers
Negative Logits
destro       -0.66
behav        -0.64
akespe       -0.63
renheit      -0.63
Vaugh        -0.63
disadvant    -0.62
userc        -0.62
nodd         -0.62
toget        -0.61
colle        -0.61
Positive Logits
Shin          0.74
Advocate      0.66
Rock          0.65
Rise          0.63
Oversight     0.63
Release       0.62
Supporters    0.60
Shutdown      0.60
Theft         0.60
Ahmad         0.60
Activations Density: 0.764%