INDEX
Explanations
phrases related to cultural or societal norms
references to social norms and their variations
Negative Logits
Token       Logit
Lama        -0.68
semble      -0.66
Kush        -0.62
Sunder      -0.59
Newport     -0.58
wrapper     -0.58
istg        -0.58
Tub         -0.57
Lizard      -0.57
Riverside   -0.57
Positive Logits
Token       Logit
ativity     1.31
ality       1.18
ante        0.97
atively     0.92
als         0.85
quo         0.80
itionally   0.79
eers        0.79
prev        0.76
essential   0.75
Activations Density: 0.035%