Common Linux Commands
Installing the NVIDIA Driver and Configuring CUDA on Ubuntu
Install the NVIDIA driver
# Open a terminal and remove any old driver
$ sudo apt-get purge 'nvidia*'
# Blacklist the built-in nouveau driver
$ sudo vi /etc/modprobe.d/blacklist.conf
# Add the following lines to the file:
blacklist nouveau
options nouveau modeset=0
# Apply the change by rebuilding the initramfs
$ sudo update-initramfs -u
# Then reboot
$ sudo reboot
# Verify that nouveau is disabled; no output means it was disabled successfully
$ lsmod | grep nouveau
# Download the matching driver for your GPU from the NVIDIA website
# Press Ctrl+Alt+F1 through F6 to switch to a virtual console, then log in with your username and password
# Stop the display manager (the graphical interface)
$ sudo systemctl stop gdm # or lightdm, depending on which your system uses; stock Ubuntu uses gdm
# Remove any NVIDIA drivers already on the system
$ sudo apt-get remove 'nvidia-*'
# or
$ sudo apt-get purge 'nvidia*'
# Make the installer executable
$ sudo chmod a+x NVIDIA-Linux-x86_64-xxx.run
# Run the installer
$ sudo ./NVIDIA-Linux-x86_64-xxx.run --no-x-check --no-nouveau-check
# where:
# --no-x-check: skip the check for a running X server during installation
# --no-nouveau-check: skip the nouveau check during installation
# If the installer warns about missing 32-bit libraries, just choose OK and continue
# After installation succeeds, restart the graphical interface from the command line:
$ sudo systemctl start gdm
# Press Ctrl+Alt+F7 to return to the graphical session
# Check whether the driver was installed successfully
$ nvidia-smi
# The driver is now installed.
Configure CUDA
Find the matching CUDA version on the NVIDIA website and install it following the official instructions.
# Tip: the download can be slow; you can fetch the file on Windows with Thunder (Xunlei) first and copy it over
$ wget https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda_11.7.1_515.65.01_linux.run
# Run it
$ sudo sh cuda_11.7.1_515.65.01_linux.run
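Alternatively, the runfile supports a non-interactive mode; a minimal sketch using its documented --silent and --toolkit flags (check --help on your runfile first) installs only the toolkit and skips the driver:
# Non-interactive toolkit-only install; no driver is touched
$ sudo sh cuda_11.7.1_515.65.01_linux.run --silent --toolkit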
During an interactive installation, it is usual to deselect the driver (it is already installed). On success, the installer prints a summary like:
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-11.7/
Samples: Installed in /home/klchang/, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-11.7/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-11.7/lib64, or, add /usr/local/cuda-11.7/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.7/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 515.00 is required for CUDA 11.7 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
After installation, append the following environment variables to the end of ~/.bashrc:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
export PATH=$PATH:/usr/local/cuda/bin
export CUDA_HOME=/usr/local/cuda
Save and exit. In a terminal, apply the changes with source ~/.bashrc.
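As a quick sanity check that PATH now points at the toolkit, nvcc should resolve under /usr/local/cuda:
$ which nvcc # expected: /usr/local/cuda/bin/nvcc
$ nvcc --version # expected: release 11.7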
Test the CUDA Toolkit by building and running the bundled deviceQuery sample to verify the installation:
$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ sudo make
$ ./deviceQuery
If the installation succeeded, the output looks similar to:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce RTX 2070 with Max-Q Design"
CUDA Driver Version / Runtime Version 11.7 / 11.7
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 7982 MBytes (8370061312 bytes)
(36) Multiprocessors, ( 64) CUDA Cores/MP: 2304 CUDA Cores
GPU Max Clock rate: 1125 MHz (1.12 GHz)
Memory Clock rate: 5501 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.7, CUDA Runtime Version = 11.7, NumDevs = 1
Result = PASS
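Optionally, bandwidthTest from the same samples tree gives a second check (path assumed analogous to deviceQuery's):
$ cd /usr/local/cuda/samples/1_Utilities/bandwidthTest
$ sudo make
$ ./bandwidthTest # should also end with Result = PASS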
Download and install cuDNN
Download cudnn-linux-x86_64-8.6.0.163_cuda11-archive.tar.xz from the NVIDIA site: https://developer.nvidia.com/rdp/cudnn-download.
Extract the archive and copy the relevant files into the CUDA Toolkit directory, as shown below:
# Unzip the cuDNN package.
$ tar -xvf cudnn-linux-x86_64-8.x.x.x_cudaX.Y-archive.tar.xz
# Copy the following files into the CUDA toolkit directory.
$ sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include
$ sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
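To confirm the copy worked, the installed cuDNN version can be read back from the header (cuDNN 8.x ships cudnn_version.h):
# Prints CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL
$ grep -A 2 '#define CUDNN_MAJOR' /usr/local/cuda/include/cudnn_version.h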
Create a new user on Ubuntu and grant administrator rights:
# First tighten the default permissions of new users' home directories so other users cannot enter them
$ sudo vim /etc/adduser.conf
# Find the following lines
# If DIR_MODE is set, directories will be created with the specified
# mode. Otherwise the default mode 0755 will be used.
DIR_MODE=0755
# Change 0755 to 0750
DIR_MODE=0750
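If you prefer a one-liner to editing the file by hand, sed can make the same change (assuming the stock DIR_MODE=0755 line is present):
$ sudo sed -i 's/^DIR_MODE=0755/DIR_MODE=0750/' /etc/adduser.conf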
# Create the new user
$ sudo adduser username
# Enter a password and other details as prompted
# If necessary, make the user an administrator (only grant this to trusted users with Linux experience)
$ sudo adduser username sudo # i.e., add username to the sudo group
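To verify that the membership took effect (the user may need to log in again first):
$ groups username # should list sudo among the groups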
A quick tmux tutorial
Background: when training neural networks, I used to put jobs in the background with nohup, so closing the terminal window would not interrupt training, and progress could be checked in the nohup.out file in the current directory. The nohup command looks like this:
# nohup keeps the command running after the terminal hangs up; & runs it in the background
$ nohup python tools/train.py configs/yolox/yolox_s_8x8_300e_VOC.py &
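The training output then lands in nohup.out, which can be followed live:
$ tail -f nohup.out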
Later, when training a model on multiple GPUs in parallel, this approach failed with the following error:
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4156332 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4156333 closing signal SIGHUP
Traceback (most recent call last):
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(cmd_args)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
time.sleep(monitor_interval)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4156314 got signal: 1
A blog post (nohup训练pytorch模型时的报错以及tmux的简单使用 - gy77 - 博客园) says this is a nohup-related bug and suggests replacing nohup with tmux.
So let's use tmux:
$ sudo apt-get install tmux # install
$ tmux # enter a tmux window
$ exit # leave the tmux window, or use the shortcut [ Ctrl+d ]
$ tmux new -s ${session-name} # create a session with the given session name
# [ Ctrl+b ] is the tmux prefix key; after pressing it, press the key for the command you want
# [ Ctrl+b ] [ d ] # detach the session from the window; typing tmux detach at a free prompt does the same
$ tmux ls # list all sessions, or use tmux list-sessions
$ tmux attach -t ${session-name} # attach the terminal window to a session by name
$ tmux kill-session -t ${session-name} # kill a session by name
$ tmux switch -t ${session-name} # switch to a session by name
$ tmux rename-session -t 0 ${session-name} # rename session 0 to the given name
Note: in particular, when you leave a session while a program is running in it, there is no free prompt to type tmux detach into, so you must use the shortcut [ Ctrl+b ] [ d ]: press Ctrl+b first, then press d.
A simple tmux workflow
[terminal]: tmux new -s train_model # create a session named train_model
[tmux]: conda activate env_name # inside the tmux session, activate the conda environment we need
[tmux]: python train.py # inside the tmux session, start training the model
[tmux]: [ Ctrl+b ] [ d ] # detach the session from the window
[terminal]: tmux ls # check the session we just created
[terminal]: watch -n 1 -c gpustat --color # monitor GPU usage
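gpustat is a third-party tool; if it is not installed yet, it can be fetched with pip first:
$ pip install gpustat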
First problem with tmux: the mouse wheel does not scroll in a tmux window
Solution:
Press the prefix Ctrl+b, then press : to enter command mode,
and enter the following command:
set -g mouse on # to make this permanent, write this line into ~/.tmux.conf
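For the permanent setting, the line can be appended and the config reloaded:
$ echo 'set -g mouse on' >> ~/.tmux.conf
$ tmux source-file ~/.tmux.conf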
Second problem with tmux: copy and paste do not work in a tmux window
Solution:
Step 1. Hold Shift and select the text with the left mouse button
Step 2. Press Ctrl+Shift+C to copy
Step 3. Press Ctrl+Shift+V to paste
mmdetection's dist_train.sh reports an "Address already in use" error
The cause is two mmdetection jobs running on the same machine and colliding on the same port: I already had one dist_train task running, and launching a second one triggered the error.
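You can first check which process is holding the default port 29500 (assuming ss from iproute2 is available):
$ ss -tlnp | grep 29500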
Solution:
#!/usr/bin/env bash
CONFIG=$1
GPUS=$2
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
#Change this default port to one that is not in use,
#e.g., if one dist_train task already occupies 29500, change it to 29501
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch \
--nnodes=$NNODES \
--node_rank=$NODE_RANK \
--master_addr=$MASTER_ADDR \
--nproc_per_node=$GPUS \
--master_port=$PORT \
$(dirname "$0")/train.py \
$CONFIG \
--seed 0 \
--launcher pytorch ${@:3}
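Since the script reads PORT from the environment, you can also override it at launch time instead of editing the file (the config path and GPU count here are just examples):
$ PORT=29501 ./tools/dist_train.sh configs/yolox/yolox_s_8x8_300e_VOC.py 2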
Install and use Zsh
1. Install
$ sudo apt-get install zsh
2. Install oh-my-zsh
$ wget https://ghproxy.com/https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh
$ sh install.sh
3. Set the theme
$ vim ~/.zshrc
# Find the theme setting and change ZSH_THEME="robbyrussell" to ZSH_THEME="bira"
4. Configure autosuggestions
$ git clone https://ghproxy.com/https://github.com/zsh-users/zsh-autosuggestions $ZSH_CUSTOM/plugins/zsh-autosuggestions
# Edit the .zshrc file: find the plugins=(git) line (add it if missing) and change it to the following
plugins=(git zsh-autosuggestions)
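Reload the configuration so the plugin takes effect:
$ source ~/.zshrc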
5. Set zsh as the default shell
$ chsh -s /bin/zsh # to switch back to the default: chsh -s /bin/bash
6. Configure tmux windows to use zsh as well
Press the prefix Ctrl+b, then press : to enter command mode,
and enter the following command, then press Enter:
set -g default-command /bin/zsh
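To make this permanent as well, the same line can go into ~/.tmux.conf:
$ echo 'set -g default-command /bin/zsh' >> ~/.tmux.conf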