Overview
Language-Free:
ActivityNet Captions
| Metric | Value (%) |
| --- | --- |
| R@0.1 | 61.35 |
| R@0.3 | 47.61 |
| R@0.5 | 32.59 |
| R@0.7 | 15.42 |
| mIoU | 31.85 |
Charades-STA
| Metric | Value (%) |
| --- | --- |
| R@0.1 | - |
| R@0.3 | 52.95 |
| R@0.5 | 37.24 |
| R@0.7 | 19.33 |
| mIoU | 36.05 |
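For reference, R@m is the fraction of queries whose predicted segment reaches temporal IoU >= m with the ground truth, and mIoU is the mean IoU, both in percent. A minimal sketch of the computation (plain PyTorch; function names are mine):

```python
import torch

def temporal_iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU between paired [N, 2] (start, end) segments."""
    inter = (torch.min(pred[:, 1], gt[:, 1])
             - torch.max(pred[:, 0], gt[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (gt[:, 1] - gt[:, 0]) - inter
    return inter / union.clamp(min=1e-8)

def grounding_metrics(pred, gt, thresholds=(0.1, 0.3, 0.5, 0.7)):
    """R@m in percent for each IoU threshold m, plus mIoU."""
    iou = temporal_iou(pred, gt)
    out = {f"R@{t}": 100.0 * (iou >= t).float().mean().item() for t in thresholds}
    out["mIoU"] = 100.0 * iou.mean().item()
    return out
```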
ActivityNet Captions
train_lf_20
- attention_loss = 1.0
- No ConvBNReLU
- ResNet + LN (no activation after the LN)
- Position Embedding (1D sin-cos PE; see the sketch after this list)
- BS = 512
- Adam: 5e-4
- cosine LR: 5e-4 -> 3e-4
- epochs: 300
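The 1D sin-cos PE here is presumably the fixed sinusoidal embedding from the original Transformer; a minimal sketch (function name is mine):

```python
import math
import torch

def sincos_pe_1d(seq_len: int, dim: int) -> torch.Tensor:
    """Fixed 1D sinusoidal positional embedding, shape [seq_len, dim]."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # [L, 1]
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                   # [dim/2]
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)  # even channels
    pe[:, 1::2] = torch.cos(pos * div)  # odd channels
    return pe
```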
- R@0.1: 63.57 (baseline 61, +2)
- R@0.3: 41.12 (baseline 47, -6)
- R@0.5: 23.64 (baseline 32, -9)
- R@0.7: 6.38 (baseline 15, -9)
- mIoU: 27.37 (baseline 32, -5)
train_lf_25
- attention_loss = 1.0
- No LR decay
- lr = 4e-4
- $ReLU(LN(Conv1d(MSA(x)) + x))$ (see the sketch after this list)
- Add a Non-Local Block
- epochs = 300
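Read literally, this block applies MSA, then a temporal Conv1d, a residual add, LN, and ReLU. A minimal sketch under that reading (kernel size and head count are my assumptions):

```python
import torch
import torch.nn as nn

class MSAConvBlock(nn.Module):
    """One fusion block matching ReLU(LN(Conv1d(MSA(x)) + x))."""
    def __init__(self, dim: int = 512, n_heads: int = 4):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.ln = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, L, D]
        attn, _ = self.msa(x, x, x)
        y = self.conv(attn.transpose(1, 2)).transpose(1, 2)  # Conv1d over time
        return torch.relu(self.ln(y + x))
```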
epoch 176:
- R@0.1: 56.35 (-5)
- R@0.3: 35.54 (-12)
- R@0.5: 16.91 (-16)
- R@0.7: 5.45 (-10)
- mIoU: 23.19 (-9)
train_lf_26
epochs: 300
Building on 25: normalize text_feats (norm(dim=-1)) and add some Gaussian noise during training (as done in Language-Free).
Loss curve: drops first, then rises, then converges.
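A minimal sketch of that preprocessing step (noise_std is my assumption; Language-Free's exact noise scale may differ):

```python
import torch
import torch.nn.functional as F

def prep_text_feats(text_feats: torch.Tensor, noise_std: float = 0.1,
                    training: bool = True) -> torch.Tensor:
    """L2-normalize along the last dim; add Gaussian noise at train time."""
    text_feats = F.normalize(text_feats, dim=-1)
    if training:
        text_feats = text_feats + noise_std * torch.randn_like(text_feats)
    return text_feats
```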
epoch 221:
- R@0.1: 73.86 (baseline 61, +12)
- R@0.3: 42.97 (baseline 47, -4)
- R@0.5: 24.83 (baseline 32, -7)
- R@0.7: 11.94 (baseline 15, -3)
- mIoU: 31.64 (baseline 32, -1)
train_lf_27
Building on 26, train for 500 epochs
and lower the LR from 4e-4 to 3e-4.
Result: no difference from 26; the loss still drops first, then rises, then converges.
train_lf_28
- attention_loss = 1.0
- Conv1d
- Non-Local Layer
- MSA-LN = ${\color{red}ReLU(LN(MSA(x) + x))}$ ==Post-LN==
- Position Embedding (1D sin-cos PE)
- BS = 512
- Adam: 4e-4
- cosine LR: 5e-4 -> 3e-4
- epochs: 500
- Add Gaussian noise
- Post-LN may be better suited to this task
- The Non-Local Layer may be key to the accuracy gains (see the sketch below)
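A minimal sketch of an embedded-Gaussian Non-Local block over the temporal axis, in the spirit of Wang et al.; the exact channel sizes used here are my assumptions:

```python
import torch
import torch.nn as nn

class NonLocalBlock1d(nn.Module):
    """Embedded-Gaussian non-local block with a residual connection."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.theta = nn.Conv1d(dim, dim // 2, 1)  # query projection
        self.phi = nn.Conv1d(dim, dim // 2, 1)    # key projection
        self.g = nn.Conv1d(dim, dim // 2, 1)      # value projection
        self.out = nn.Conv1d(dim // 2, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, D, L]
        q, k, v = self.theta(x), self.phi(x), self.g(x)      # [B, D/2, L]
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # [B, L, L]
        y = (attn @ v.transpose(1, 2)).transpose(1, 2)       # [B, D/2, L]
        return x + self.out(y)                               # residual add
```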
epoch 403:
- R@0.1: 59.17 (-2)
- R@0.3: 38.38 (-9)
- R@0.5: 16.71 (-16)
- R@0.7: 5.46 (-10)
- mIoU: 24.41 (-8)
train_lf_29
- attention_loss = 1.0
- Conv1d
- Non-Local Layer
- MSA-LN = ${\color{red}ReLU(x + MSA(LN(x)))}$ ==Pre-LN==
- Position Embedding (1D sin-cos PE)
- BS = 512
- Adam: 4e-4
- cosine LR: 5e-4 -> 3e-4
- epochs: 500
- Add Gaussian noise
epoch 336:
- R@0.1: 29.16
- R@0.3: 18.5
- R@0.5: 10.05
- R@0.7: 4.85
- mIoU: 13.52
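The two variants differ only in where the LN sits; a minimal sketch of both, transcribing the formulas of train_lf_28 and train_lf_29 directly (head count is my assumption):

```python
import torch
import torch.nn as nn

class MSABlock(nn.Module):
    """Post-LN: ReLU(LN(MSA(x) + x)); Pre-LN: ReLU(x + MSA(LN(x)))."""
    def __init__(self, dim: int = 512, n_heads: int = 4, pre_ln: bool = False):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(dim)
        self.pre_ln = pre_ln

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.pre_ln:
            y = self.ln(x)
            return torch.relu(x + self.msa(y, y, y)[0])   # train_lf_29
        return torch.relu(self.ln(self.msa(x, x, x)[0] + x))  # train_lf_28
```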
train_lf_30
Building on 29, only an R-Drop loss is added (alpha = 0.7, computed only on the final localization result; see the sketch below).
The R-Drop loss may bring some improvement on this task.
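R-Drop forwards each batch twice with dropout active and penalizes the symmetric KL between the two output distributions. A minimal sketch of the extra term as applied to the localization logits (assuming they are normalized with a softmax):

```python
import torch
import torch.nn.functional as F

def r_drop_loss(logits1: torch.Tensor, logits2: torch.Tensor,
                alpha: float = 0.7) -> torch.Tensor:
    """Symmetric KL between two dropout-perturbed forward passes;
    added to the task loss as `task_loss + alpha * kl`."""
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(p1, p2, reduction="batchmean", log_target=True)
                + F.kl_div(p2, p1, reduction="batchmean", log_target=True))
    return alpha * kl
```

In use, the regression and attention losses would be averaged over the two passes and this KL term added on top.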
epoch 414:
- R@0.1: 36.09
- R@0.3: 21.34
- R@0.5: 11.22
- R@0.7: 5.466
- mIoU: 15.24
train_lf_31
Combine Post-LN, the Non-Local Layer, and the R-Drop loss (alpha = 0.7, computed only on the localization result):
- attention_loss = 1.0
- Conv1d
- Non-Local Layer
- MSA-LN = ${\color{red}ReLU(LN(MSA(x) + x))}$ ==Post-LN==
- Position Embedding (1D sin-cos PE)
- BS = 128 ==the lab server does not have enough free GPU memory==
- Adam: 4e-4
- cosine LR: 5e-4 -> 3e-4
- epochs: 500
- Add Gaussian noise
in progress ->
[795887]
train_lf_32
Same as 31, but with BS changed to 512
in progress -> train_lf_bad
[3132347]
BS = 256
in progress ->
[2211727]
[The final version of the code stops here]
train_lf_33
train_lf_33_bad -> LR = 4e-4, weight_decay = 0.01, BS = 512 [shut down]
Switch the optimizer to AdamW (see the sketch below):
- LR: 1e-4
- weight decay: 1e-4
BS = 512
in progress ->
[2180289]
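A minimal sketch of the optimizer swap (the Linear is just a stand-in for the actual model):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 2)  # stand-in for the grounding model
# AdamW applies decoupled weight decay, unlike Adam's L2 regularization
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
```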
train_lf_34
MSA-LN = ${\color{red}ReLU(x + MSA(LN(x)))}$ ==Pre-LN==
[3979994]
in progress -> finished
[1614056]
- CLIP4Caption
  - hidden_dim = 512
  - n_heads = 4
  - n_layers = 3
- CLIP
  - ViT-B/32
  - hidden_dim = 512
- MSA-LN: $x + MSA(LN(x))$ ==Pre-LN==
- x = x + MSA(x) ==ResNet==
- BS = 512
- epoch = 300
- Adam
  - lr = 4e-4
- ActivityNet Captions
- No Warmup
- No Non-Local Block
(train_lf_34.log)
in progress -> bad end
gradient vanishing
train_lf_35
- Script: train_lf.py
- Model: the TrainModel in lf_train_module.py
| Experiment \ Params | bs & epoch | Attention | Norm | Non Local | Status | Result |
| --- | --- | --- | --- | --- | --- | --- |
| tmp_1 | | | | | | |
| tmp_2 | | | | | | |
| tmp_3 | | | | | | |
tmp_1
[4046783]
- bs = 256
- epoch = 300
- Adam
  - LR: 4e-4
  - Warmup: 3 epochs; 1e-8 -> 4e-4
  - Cosine LR scheduler: 4e-4 -> 1e-8 (see the scheduler sketch after this block)
- Video-Aware Cross-Attention
  - Query = Video
  - Key, Value = Text
  - Pre-Norm
  - No FFN
- Self-Attention
  - Query, Key, Value = Cross-Attended Result
  - Pre-Norm
  - No FFN
- Non Local: Yes
- No Gaussian Noise
in progress ->
(train_lf_35_1.log)
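A minimal sketch of the warmup-then-cosine schedule used in tmp_1 (per-epoch stepping; the helper name is mine):

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(optimizer, warmup_epochs=3, total_epochs=300,
                  base_lr=4e-4, floor_lr=1e-8):
    """Linear warmup floor_lr -> base_lr, then cosine decay back to floor_lr."""
    def factor(epoch):
        if epoch < warmup_epochs:
            return (floor_lr + (base_lr - floor_lr) * epoch / warmup_epochs) / base_lr
        t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return (floor_lr + 0.5 * (base_lr - floor_lr) * (1 + math.cos(math.pi * t))) / base_lr
    return LambdaLR(optimizer, lr_lambda=factor)

# usage: build Adam with lr=4e-4, then call scheduler.step() once per epoch
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=4e-4)
scheduler = warmup_cosine(optimizer)
```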
tmp_2
[2692732]
- BS = 256
- Epoch = 300
- Adam
  - lr = 4e-4
  - Warmup: 3 epochs; 1e-8 -> 4e-4
  - No LR scheduler
- Cross Attention (see the fusion sketch after this block)
  - Query: Video
  - Key, Value: Text
  - Post-Norm
  - No FFN
- Self Attention
  - Query, Key, Value: Cross-Attention Result
  - Post-Norm
  - No FFN
- Non Local: Yes
in progress -> bad end ==gradient vanishing==
(train_lf_35_2.log)
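A minimal sketch of the fusion path these tmp_* runs describe, under the Post-Norm / no-FFN reading of tmp_2 (module name and head count are my assumptions):

```python
import torch
import torch.nn as nn

class VideoTextFusion(nn.Module):
    """Cross-attention (Query = video, Key/Value = text), then
    self-attention over the fused sequence; Post-Norm, no FFN."""
    def __init__(self, dim: int = 512, n_heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, video, text):  # video: [B, Lv, D], text: [B, Lt, D]
        x = self.ln1(video + self.cross(video, text, text)[0])  # Post-Norm
        x = self.ln2(x + self.self_attn(x, x, x)[0])            # Post-Norm
        return x
```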
tmp_3
[3939145]
- BS = 256
- Epoch = 300
- Adam
  - lr = 4e-4
  - Warmup: 3 epochs; 1e-8 -> 4e-4
  - Cosine LR scheduler: 4e-4 -> 1e-8
- Cross Attention
  - Query: Video
  - Key, Value: Text
  - Post-Norm
  - No FFN
- Self Attention
  - Query, Key, Value: Cross-Attention Result
  - Post-Norm
  - No FFN
- Non Local: Yes
- Add Gaussian Noise
in progress -> bad end
(train_lf_35_3.log)
tmp_4
[3079209]
- BS = 256
- Epoch = 300
- Adam
  - lr = 4e-4
- No Warmup
- No LR scheduler
- Cross Attention
  - Query: Video
  - Key, Value: Text
  - No Norm
  - No FFN
- Self Attention
  - Query, Key, Value: Cross-Attention Result
  - No Norm
  - No FFN
- Non Local: Yes
- Add Gaussian Noise
in progress -> bad end
(train_lf_35_4.log)
```
Traceback (most recent call last):
  File "/root/code/VC-TVG/train_lf.py", line 592, in <module>
    train_lf_activitynet()
  File "/root/code/VC-TVG/train_lf.py", line 203, in train_lf_activitynet
    sum_loss.backward()
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LogBackward0' returned nan values in its 0th output.
```
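The "LogBackward0 returned nan" error typically means a torch.log somewhere in the loss saw zero, negative, or NaN inputs. A small guard pattern worth trying here (safe_log is a hypothetical helper, not from the repo; eps is my choice):

```python
import torch

# Anomaly detection makes autograd report the forward op that produced
# the NaN, at the cost of slower training; enable only while debugging.
torch.autograd.set_detect_anomaly(True)

def safe_log(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Keep log() inputs strictly positive so 1/x stays finite in backward."""
    return torch.log(x.clamp(min=eps))
```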
tmp_5
[]
Regression Head: (512, 512) + ReLU() + (512, 2) + ReLU()
- BS = 256 -> 512
- Epoch = 300
- Adam
  - lr: changed from 4e-4 to 1e-4 / 2e-4
  - Warmup: 3 epochs; 1e-8 -> 4e-4 / 1e-8 -> 1e-4
  - LR scheduler: 4e-4 -> 1e-8 / 1e-4 -> 1e-8
- AttentionPooling: use a ReLU() activation in place of tanh() (see the sketch after this block)
- Drop use_embedding, i.e. use_embedding=False
- Cross Attention
  - Query: Video
  - Key, Value: Text
  - Post-Norm with an added ReLU() activation
  - No FFN
  - attended + residual
- Self Attention
  - Query, Key, Value: Cross-Attention Result
  - Post-Norm with an added ReLU() activation
  - No FFN
  - attended + residual
- Train
  - transformer_text_feats: 768 -> 512
- Non Local: Yes
- Add Gaussian Noise
not started yet
(train_lf_35_5.log)
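For reference, a minimal sketch of what such an attention pooling with the swapped activation could look like (the layer layout and names are my assumptions, not the repo's actual AttentionPooling):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool a [B, L, D] sequence to [B, D] with a learned score per step."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)
        self.act = nn.ReLU()  # tmp_5 change: previously nn.Tanh()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.score(self.act(self.proj(x))), dim=1)  # [B, L, 1]
        return (w * x).sum(dim=1)
```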
tmp_6
[]
Change the fusion encoder's hidden dimension from 512 to 256.
Regression Head: (256, 256) + ReLU() + (256, 2) + ReLU()
- BS = 256
- Epoch = 300
- Adam
  - lr = 4e-4
  - Warmup: 3 epochs; 1e-8 -> 4e-4
  - LR scheduler: 4e-4 -> 1e-8
- Cross Attention
  - Query: Video
  - Key, Value: Text
  - Post-Norm
  - No FFN
- Self Attention
  - Query, Key, Value: Cross-Attention Result
  - Post-Norm
  - No FFN
- Train
  - transformer_text_feats: 768 -> 512
  - text_mlp: 512 -> 256
  - video_mlp: 256 -> 256
- Non Local: Yes
- Add Gaussian Noise
in progress -> bad end ==gradient vanishing==
(train_lf_35_6.log)
Charades-STA
train_lf_charades_34
[4098718]
- BS = 512
- Epoch = 500
- Adam
  - lr = 4e-4
- Cross Attention
  - Query: Video
  - Key, Value: Text
  - Post-Norm
  - No FFN
- Self Attention
  - Query, Key, Value: Cross-Attention Result
  - Post-Norm
  - No FFN
- Non Local: Yes
(train_lf_charades_34.log)
in progress ->
Document Info
- Author: Bookstall
- Link: https://bookstall.github.io/2023/04/03/VC-TVG/
- Copyright: free to repost, non-commercial, no derivatives, keep attribution (Creative Commons 3.0 license)