INDEX
Explanations
the word "UR" with varying levels of activation
occurrences of the abbreviation "UR"
Negative Logits
Schultz      -0.74
olean        -0.72
Sands        -0.72
Kissinger    -0.70
etts         -0.69
xon          -0.68
makers       -0.67
hed          -0.67
notes        -0.65
Eb           -0.64
Positive Logits
UR           1.10
POSE         1.08
BLE          1.02
GER          0.99
ARCH         0.97
OPE          0.94
confir       0.93
AGE          0.93
pees         0.92
DER          0.91
Activations Density: 0.005%