Explanations
statements that are presented as facts
references to factual statements or claims
Negative Logits
avorite         -0.79
Klux            -0.79
jri             -0.76
isoft           -0.71
artney          -0.69
ModLoader       -0.69
interstitial    -0.69
ctic            -0.68
hod             -0.68
Carbuncle       -0.67
Positive Logits
ually           1.26
orial           1.15
itious          1.05
ional           1.03
oids            0.99
ially           0.94
uality          0.92
ual             0.91
oid             0.88
icity           0.86
Activations Density 0.029%