INDEX
Explanations
references to specific historical figures or groups
references to the comedy group Monty Python
Negative Logits
Painter        -0.74
ratulations    -0.72
redo           -0.71
GREEN          -0.68
drm            -0.67
deduction      -0.63
attribution    -0.60
verb           -0.60
deductions     -0.59
ARP            -0.59
Positive Logits
rules          0.74
atl            0.70
ository        0.69
ouf            0.69
arella         0.66
ethy           0.65
helle          0.64
liam           0.61
ollar          0.61
ague           0.58
Activations Density: 0.165%