Common Linux Commands

2023-02-22 · 13 min read

Common commands


Installing the Nvidia Driver and Configuring CUDA on Ubuntu

Installing the Nvidia driver

# Open a terminal and remove any old driver
$ sudo apt-get purge nvidia*

# Disable the built-in nouveau driver
$ sudo vi /etc/modprobe.d/blacklist.conf

# Add the following lines (these are file contents, not commands):
blacklist nouveau
options nouveau modeset=0

# Apply the change by regenerating the initramfs
$ sudo update-initramfs -u

# Then reboot
$ reboot

# Verify that nouveau is disabled; no output means it is
$ lsmod | grep nouveau
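The same check can be scripted, e.g. for a provisioning script (a small sketch; it only says anything useful after the reboot above):

```shell
# Scripted form of the check: report whether the nouveau module is loaded.
if lsmod 2>/dev/null | grep -q '^nouveau'; then
    echo "nouveau is still loaded"
else
    echo "nouveau is disabled"
fi
```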

# Download the matching driver for your card from the Nvidia website
# Press one of Ctrl+Alt+F1..F6 to switch to a text console; at the login: prompt enter your account name, then your password.

# Stop the graphical session
$ sudo systemctl stop gdm # or lightdm, depending on your system; stock Ubuntu uses gdm

# Remove any Nvidia driver packages already on the system
$ sudo apt-get remove nvidia-*

# or
$ sudo apt-get purge nvidia*

# Make the installer executable
$ sudo chmod a+x NVIDIA-Linux-x86_64-xxx.run

# Run the .run file
$ sudo ./NVIDIA-Linux-x86_64-xxx.run -no-x-check -no-nouveau-check

# Where:
# -no-x-check: skip the check for a running X server during installation
# -no-nouveau-check: skip the nouveau check during installation


# During installation you may see a prompt about missing 32-bit libraries; just choose OK.

# After a successful install, restart the graphical session from the command line:
$ sudo systemctl start gdm

# Press Ctrl+Alt+F7 to return to the graphical session

# Check that the driver installed correctly
$ nvidia-smi

# The driver installation is now complete.

Configuring CUDA

Find the matching CUDA release on the official site and follow its installation instructions:

# Tip: if wget is slow, download the file on Windows with Thunder (迅雷) first and copy it over
$ wget https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda_11.7.1_515.65.01_linux.run

# Run the installer
$ sudo sh cuda_11.7.1_515.65.01_linux.run

During installation, choose not to install the driver (it is already installed). On success the installer prints:

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-11.7/
Samples:  Installed in /home/klchang/, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-11.7/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.7/lib64, or, add /usr/local/cuda-11.7/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.7/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least .00 is required for CUDA 11.7 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run --silent --driver

Logfile is /var/log/cuda-installer.log

When the installer finishes, append the environment variables to the end of ~/.bashrc; the lines to add are:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
export PATH=$PATH:/usr/local/cuda/bin
export CUDA_HOME=/usr/local/cuda

Save and exit.

In the terminal, run source ~/.bashrc to activate the new variables.
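The edit can also be done non-interactively; a sketch that appends the three lines in one step (with CUDA_HOME set to the plain toolkit path, which is what most build systems expect):

```shell
# Append the CUDA environment variables to ~/.bashrc in one step;
# run `source ~/.bashrc` afterwards to load them into the current shell.
cat >> ~/.bashrc <<'EOF'
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
export PATH=$PATH:/usr/local/cuda/bin
export CUDA_HOME=/usr/local/cuda
EOF
```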

Test the CUDA Toolkit by compiling and running one of the bundled samples to verify the installation:

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ sudo make
$ ./deviceQuery

If the installation succeeded, the output looks similar to:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce RTX 2070 with Max-Q Design"
  CUDA Driver Version / Runtime Version          11.7 / 11.7
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 7982 MBytes (8370061312 bytes)
  (36) Multiprocessors, ( 64) CUDA Cores/MP:     2304 CUDA Cores
  GPU Max Clock rate:                            1125 MHz (1.12 GHz)
  Memory Clock rate:                             5501 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.7, CUDA Runtime Version = 11.7, NumDevs = 1
Result = PASS

Downloading and installing cuDNN

Download cudnn-linux-x86_64-8.6.0.163_cuda11-archive.tar.xz from the official NVIDIA site at https://developer.nvidia.com/rdp/cudnn-download.

Extract the archive and copy the files into the CUDA toolkit directory, as follows:

# Unzip the cuDNN package.
$ tar -xvf cudnn-linux-x86_64-8.x.x.x_cudaX.Y-archive.tar.xz

# Copy the following files into the CUDA toolkit directory.
$ sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include 
$ sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64 
$ sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
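To check which cuDNN version ended up installed, the version defines can be read back from the copied headers (cuDNN 8 keeps them in cudnn_version.h rather than cudnn.h) — a quick sketch:

```shell
# Read back the installed cuDNN version from the copied header, if present.
HDR=/usr/local/cuda/include/cudnn_version.h
if [ -f "$HDR" ]; then
    grep -E '#define CUDNN_(MAJOR|MINOR|PATCHLEVEL) ' "$HDR"
else
    echo "cudnn_version.h not found under /usr/local/cuda/include"
fi
```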

Creating a user on Ubuntu and granting administrator rights:

# First change the permission mode for new home directories so that other users cannot enter them
$ sudo vim /etc/adduser.conf

# Find the following lines
# If DIR_MODE is set, directories will be created with the specified
# mode. Otherwise the default mode 0755 will be used.
DIR_MODE=0755
# and change 0755 to 0750
DIR_MODE=0750


# Create the user
$ sudo adduser username
# Enter the password and other details when prompted

# If necessary, make the user an administrator (only trusted users with Linux experience should get this)
$ sudo adduser username sudo # i.e. add username to the sudo group

A quick Tmux tutorial

Background: when training neural networks I used to push the job into the background with nohup, so that closing the terminal window would not interrupt training, and progress could still be followed in the nohup.out file in the current directory. The nohup command looks like this:

# nohup keeps the command running after a hangup; the trailing & runs it in the background
$ nohup python tools/train.py configs/yolox/yolox_s_8x8_300e_VOC.py &
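For reference, the same pattern with an explicit log file instead of the default nohup.out; here `sh -c 'echo ...; sleep 1'` is a stand-in for the real training command:

```shell
# Run a stand-in job under nohup, logging to train.log instead of nohup.out.
nohup sh -c 'echo training started; sleep 1' > train.log 2>&1 &
TRAIN_PID=$!
echo "started as PID $TRAIN_PID"
wait "$TRAIN_PID"   # only so this example finishes; a real job is left running
cat train.log
```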

Later, this approach started failing when training with multiple GPUs in parallel, with the following error:

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4156332 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4156333 closing signal SIGHUP
Traceback (most recent call last):
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(cmd_args)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
time.sleep(monitor_interval)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4156314 got signal: 1

In a blog post (nohup训练pytorch模型时的报错以及tmux的简单使用 - gy77 - 博客园), the author suggests this is a nohup bug, and that tmux can be used instead.

So let's use tmux:

$ sudo apt-get install tmux   # install tmux
$ tmux                        # open a tmux window
$ exit                        # leave the tmux window (or press [ Ctrl+d ])
$ tmux new -s ${session-name} # create a session with the given session name
# [ Ctrl+b ] is the tmux prefix key; press it first, then the key for the command you want
# [ Ctrl+b ] [ d ]                         # detach the session from the window (equivalently, run: tmux detach)
$ tmux ls                                  # list all sessions (same as tmux list-sessions)
$ tmux attach -t ${session-name}           # attach the terminal window to the named session
$ tmux kill-session -t ${session-name}     # kill the named session
$ tmux switch -t ${session-name}           # switch to the named session
$ tmux rename-session -t 0 ${session-name} # rename session 0 to the given name

Note in particular: when leaving a session while a program is still running in it, you cannot type the tmux detach command, so you must use the shortcut [ Ctrl+b ] [ d ]: press Ctrl+b first, then press d.

A simple tmux workflow

[terminal]: tmux new -s train_model       # create a session named train_model
[tmux]: conda activate env_name           # inside the session, activate the conda environment we need
[tmux]: python train.py                   # inside the session, start training the model
[tmux]: [ Ctrl+b ] [ d ]                  # detach the session from the window
[terminal]: tmux ls                       # check the session we just created
[terminal]: watch -n 1 -c gpustat --color # monitor GPU usage

First tmux problem: the mouse wheel does not scroll inside a tmux window.

Solution:

Press the prefix Ctrl+b, then a colon (:) to enter command mode, and enter:

set -g mouse on # put this line in ~/.tmux.conf to make it permanent
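Making it permanent can be done in one step (the single `mouse` option assumes tmux ≥ 2.1; older versions used separate mode-mouse options):

```shell
# Persist the option, and reload the config if a tmux server is running.
echo 'set -g mouse on' >> ~/.tmux.conf
tmux source-file ~/.tmux.conf 2>/dev/null || true
```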

Second tmux problem: copy and paste do not work inside a tmux window.

Solution:

Step 1. Hold Shift and select the text with the left mouse button.

Step 2. Press Ctrl+Shift+C to copy.

Step 3. Press Ctrl+V to paste.

mmdetection's dist_train.sh fails with an "Address already in use" error

The cause is two mmdetection jobs running on the same machine and colliding on the rendezvous port: one dist_train task was already running, and launching a second one triggered the error.

Solution:

#!/usr/bin/env bash

CONFIG=$1
GPUS=$2
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
# Change this default port to one that is not in use;
# e.g. with one dist_train task already occupying 29500, use 29501
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/train.py \
    $CONFIG \
    --seed 0 \
    --launcher pytorch ${@:3}
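Note that the `PORT=${PORT:-29500}` line also means the port can be overridden per run from the environment, without editing the script at all. A quick demonstration of that default-expansion idiom:

```shell
# ${VAR:-default} expands to $VAR when it is set, otherwise to the default.
unset PORT
echo "no override:   ${PORT:-29500}"
PORT=29501
echo "with override: ${PORT:-29500}"
```

So a second job can be started as, e.g., `PORT=29501 bash tools/dist_train.sh <config> <gpus>`, leaving the script untouched.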

Installing and using Zsh

1. Install

$ sudo apt-get install zsh

2. Install oh-my-zsh

$ wget https://ghproxy.com/https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh
$ sh install.sh

3. Set a theme

$ vim ~/.zshrc
# Find the theme setting
# and change ZSH_THEME="robbyrussell" to ZSH_THEME="bira"
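The same edit can be made with sed instead of vim; a sketch shown on a scratch copy so nothing is overwritten by accident (point it at ~/.zshrc once it looks right):

```shell
# Create a scratch file with the default theme line, then rewrite it.
printf 'ZSH_THEME="robbyrussell"\n' > /tmp/zshrc.test
sed -i 's/^ZSH_THEME=.*/ZSH_THEME="bira"/' /tmp/zshrc.test
cat /tmp/zshrc.test
```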

4. Configure autosuggestions

$ git clone https://ghproxy.com/https://github.com/zsh-users/zsh-autosuggestions $ZSH_CUSTOM/plugins/zsh-autosuggestions

# Edit the .zshrc file: find the plugins=(git) line (add it if it is missing) and change it to:

plugins=(git zsh-autosuggestions)

5. Make zsh the default shell

$ chsh -s /bin/zsh # to switch back to the default: chsh -s /bin/bash

6. Use zsh in tmux windows too

Press the prefix Ctrl+b, then a colon (:) to enter command mode, enter the following and press Enter:

set -g default-command /bin/zsh