Using Grep and Regular Expressions to Search for Text Patterns in Linux

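Introduction

The grep command is one of the most useful commands in a Linux terminal environment. The name grep stands for "global regular expression print". This means that you can use grep to check whether the input it receives matches a specified pattern. This seemingly trivial program is extremely powerful; its ability to filter input based on complex rules makes it a popular link in many command chains.

In this tutorial, you will explore the grep command's options, and then you'll dive into using regular expressions to do more advanced searching.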

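Prerequisites

To follow along with this guide, you will need access to a computer running a Linux-based operating system. This can either be a virtual private server which you've connected to with SSH or your local machine. Note that this tutorial was validated using a Linux server running Ubuntu 20.04, but the examples given should work on a computer running any version of any Linux distribution.

If you plan to use a remote server to follow this guide, we encourage you to first complete our Initial Server Setup guide. Doing so will set you up with a secure server environment, including a non-root user with sudo privileges and a firewall configured with UFW, which you can use to build your Linux skills.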

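Basic Usage

In this tutorial, you'll use grep to search the GNU General Public License version 3 for various words and phrases.

If you're on an Ubuntu system, you can find the file in the /usr/share/common-licenses folder. Copy it into your current directory (for example, your home directory):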

cp /usr/share/common-licenses/GPL-3 .

If you're on another system, use the curl command to download a copy:

curl -o GPL-3 https://www.gnu.org/licenses/gpl-3.0.txt

You'll also use the BSD license file in this tutorial. On Ubuntu, you can copy it into your current directory with the following command:

cp /usr/share/common-licenses/BSD .

If you're on another system, create the file with the following command:

cat << 'EOF' > BSD
Copyright (c) The Regents of the University of California.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.
3. Neither the name of the University nor the names of its contributors
   may be used to endorse or promote products derived from this software
   without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
EOF

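Now that you have the files, you can start working with grep.

In its most basic form, you use grep to match literal patterns within a text file. This means that if you pass grep a word to search for, it will print out every line in the file containing that word.

Execute the following command to use grep to search for every line that contains the word GNU: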

grep "GNU" GPL-3

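The first argument, GNU, is the pattern you're searching for, while the second argument, GPL-3, is the input file you wish to search.

The resulting output will be every line containing the pattern text:

Output
                    GNU GENERAL PUBLIC LICENSE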
  The GNU General Public License is a free, copyleft license for
the GNU General Public License is intended to guarantee your freedom to
GNU General Public License for most of our software; it applies also to
  Developers that use the GNU GPL protect your rights with two steps:
  "This License" refers to version 3 of the GNU General Public License.
  13. Use with the GNU Affero General Public License.
under version 3 of the GNU Affero General Public License into a single
...
...

On some systems, the pattern you searched for will be highlighted in the output.
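As a minimal, self-contained sketch of the same idea, the following builds a small throwaway file and searches it for a literal pattern. The file contents and the variable names sample and matches are made up for illustration and are not part of the GPL-3 file:

```shell
# Create a temporary sample file (contents are illustrative only).
sample=$(mktemp)
printf '%s\n' 'The GNU project' 'a gnu is an animal' 'nothing here' > "$sample"

# Print every line that contains the literal string GNU (case-sensitive).
matches=$(grep "GNU" "$sample")
echo "$matches"

rm -f "$sample"
```

Only the first line is printed, because the search is case-sensitive and the second line contains gnu in lower case.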

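Common Options

By default, grep searches the input file for the exact specified pattern and returns the lines it finds. You can make this behavior more useful by adding some optional flags to grep.

If you want grep to ignore the case of your search pattern and match both upper- and lower-case variations, you can specify the -i or --ignore-case option.

Use the following command to search for each instance of the word license (with upper, lower, or mixed case) in the same file as before: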

grep -i "license" GPL-3

The results contain LICENSE, license, and License:

Output
                    GNU GENERAL PUBLIC LICENSE
 of this license document, but changing it is not allowed.
  The GNU General Public License is a free, copyleft license for
  The licenses for most software and other practical works are designed
the GNU General Public License is intended to guarantee your freedom to
GNU General Public License for most of our software; it applies also to
price.  Our General Public Licenses are designed to make sure that you
(1) assert copyright on the software, and (2) offer you this License
  "This License" refers to version 3 of the GNU General Public License.
  "The Program" refers to any copyrightable work licensed under this
...
...

If there were an instance with LiCeNsE, it would have been returned as well.
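The effect of -i can be checked on a small made-up file. This sketch also uses the -c (count) option, which this tutorial does not otherwise cover, purely to make the comparison compact; the sample lines and variable names are assumptions for illustration:

```shell
# Sample file with the same word in three different casings.
sample=$(mktemp)
printf '%s\n' 'LICENSE' 'license' 'LiCeNsE' 'no match' > "$sample"

# -c prints the number of matching lines instead of the lines themselves.
sensitive=$(grep -c "license" "$sample")     # exact case only
insensitive=$(grep -ci "license" "$sample")  # any mix of upper and lower case
echo "case-sensitive: $sensitive, case-insensitive: $insensitive"

rm -f "$sample"
```

The case-sensitive search counts one line, while the case-insensitive search counts three, including the mixed-case LiCeNsE.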

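If you want to find all lines that do not contain a specified pattern, you can use the -v or --invert-match option.

Use the following command to search for every line in the BSD license that does not contain the word the: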

grep -v "the" BSD

You'll receive this output:

Output
All rights reserved.

Redistribution and use in source and binary forms, with or without
are met:
   may be used to endorse or promote products derived from this software
   without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
...
...

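Since you did not specify the ignore-case option, the last two lines shown, which contain only the upper-case THE, were returned as not containing the word the.

It is often useful to know the line number on which a match occurs. You can do this with the -n or --line-number option. Re-run the previous example with this flag added: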

grep -vn "the" BSD

This will return the following text:

Output
2:All rights reserved.
3:
4:Redistribution and use in source and binary forms, with or without
6:are met:
13:   may be used to endorse or promote products derived from this software
14:   without specific prior written permission.
15:
16:THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
17:ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
...
...

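Now you can refer to the line number if you want to change every line that does not contain the. This is especially handy when working with source code.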
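Here is a small sketch that combines -v and -n on a three-line sample file. The file contents and variable names are hypothetical and chosen only to show the behavior:

```shell
# Three-line sample: only the first line contains lower-case "the".
sample=$(mktemp)
printf '%s\n' 'read the manual' 'All rights reserved.' 'THE END' > "$sample"

# -v keeps lines that do NOT match; -n prefixes each with its line number.
inverted=$(grep -vn "the" "$sample")
echo "$inverted"

rm -f "$sample"
```

Lines 2 and 3 are printed with their numbers. Line 3 is kept because THE in upper case does not match the lower-case pattern.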

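Regular Expressions

In the introduction, you learned that grep stands for "global regular expression print". A regular expression is a text string that describes a particular search pattern.

Different applications and programming languages implement regular expressions slightly differently. In this tutorial you will explore only a small subset of the way that grep describes its patterns.

Literal Matches

In the previous examples in this tutorial, when you searched for the words GNU and the, you were actually searching for basic regular expressions that matched the exact strings of characters GNU and the. Patterns that exactly specify the characters to be matched are called "literals" because they match the pattern literally, character for character.

It is helpful to think of these as matching a string of characters rather than matching a word. This will become a more important distinction as you learn more complex patterns.

All alphabetical and numerical characters (as well as certain other characters) are matched literally unless modified by other expression mechanisms.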
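The difference between matching a string and matching a word can be seen in a short sketch. It uses the -c (count) and -w (whole-word) options, which are supported by GNU grep but are not covered elsewhere in this tutorial; the sample text and variable names are assumptions:

```shell
# "another" contains the string "the" but is not the word "the".
sample=$(mktemp)
printf '%s\n' 'the license' 'another license' 'no article here' > "$sample"

as_string=$(grep -c "the" "$sample")   # counts lines containing the string
as_word=$(grep -cw "the" "$sample")    # counts lines containing the whole word
echo "string matches: $as_string, whole-word matches: $as_word"

rm -f "$sample"
```

The plain pattern matches two lines, because the appears inside another, while the whole-word search matches only one.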

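Anchor Matches

Anchors are special characters that specify where in the line a match must occur to be valid.

For instance, using anchors, you can specify that you only want to know about the lines that match GNU at the very beginning of the line. To do this, you use the ^ anchor before the literal string.

Run the following command to search the GPL-3 file and find lines where GNU occurs at the very beginning of a line: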

grep "^GNU" GPL-3

This command will return the following two lines:

Output
GNU General Public License for most of our software; it applies also to
GNU General Public License, you may choose any version ever published

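Similarly, you use the $ anchor at the end of a pattern to indicate that the match is only valid if it occurs at the very end of a line.

This command will match every line ending with and in the GPL-3 file: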

grep "and$" GPL-3

You'll receive this output:

Output
that there is no warranty for this free software.  For both users' and
  The precise terms and conditions for copying, distribution and
  License.  Each licensee is addressed as "you".  "Licensees" and
receive it, in any medium, provided that you conspicuously and
    alternative is allowed only occasionally and noncommercially, and
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
provisionally, unless and until the copyright holder explicitly and
receives a license from the original licensors, to run, modify and
make, use, sell, offer for sale, import and otherwise run, modify and
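Both anchors can be tried on a small sample file. This is only a sketch: the four sample lines and the variable names are invented for illustration, and the patterns are single-quoted so the shell never tries to interpret the $ character:

```shell
# Each string appears once at an edge of a line and once elsewhere.
sample=$(mktemp)
printf '%s\n' 'GNU is first' 'then GNU' 'users and' 'and users' > "$sample"

starts=$(grep '^GNU' "$sample")   # GNU only at the start of a line
ends=$(grep 'and$' "$sample")     # and only at the end of a line
echo "$starts"
echo "$ends"

rm -f "$sample"
```

Each search returns exactly one line, even though GNU and and both appear on two lines of the sample.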

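Matching Any Character

The period character (.) is used in regular expressions to mean that any single character can exist at the specified location.

For example, to match anything in the GPL-3 file that has two characters followed by the string cept, you would use the following pattern: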

grep "..cept" GPL-3

This command returns the following output:

Output
use, which is precisely where it is most unacceptable.  Therefore, we
infringement under applicable copyright law, except executing it on a
tells the user that there is no warranty for the work (except to the
License by making exceptions from one or more of its conditions.
form of a separately written license, or stated as exceptions;
  You may not propagate or modify a covered work except as expressly
  9. Acceptance Not Required for Having Copies.
...
...

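This output has instances of both accept and except, as well as variations of the two words. The pattern would also have matched z2cept if that string had been present.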
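To see the z2cept case in practice, here is a sketch on a made-up list of strings (the sample contents and variable names are assumptions, not taken from GPL-3):

```shell
# One string per line; "cept" alone has no characters before it.
sample=$(mktemp)
printf '%s\n' 'accept' 'except' 'z2cept' 'cept' > "$sample"

# Each period must be filled by exactly one character, of any kind.
dotted=$(grep "..cept" "$sample")
echo "$dotted"

rm -f "$sample"
```

The first three lines match, including z2cept, while the bare cept does not, because each period requires a character to be present.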

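Bracket Expressions

By placing a group of characters within brackets ([ and ]), you can specify that the character at that position can be any one character found within the bracket group.

For example, to find the lines that contain too or two, you would specify those variations succinctly by using the following pattern: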

grep "t[wo]o" GPL-3

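The output shows that both variations exist in the file, along with words such as network and tools that contain them:

Output
your programs, too.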
freedoms that you received.  You must make sure that they, too, receive
  Developers that use the GNU GPL protect your rights with two steps:
a computer network, with no transfer of a copy, is not conveying.
System Libraries, or general-purpose tools or generally available free
    Corresponding Source from a network server at no charge.
...
...

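Bracket notation gives you some interesting options. You can have the pattern match anything except the characters within a bracket by beginning the list of characters within the brackets with a ^ character.

This example is like the pattern .ode, but will not match the pattern code: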

grep "[^c]ode" GPL-3

Here's the output you'll receive:

Output
  1. Source Code.
    model, to give anyone who possesses the object code either (1) a
the only significant mode of use of the product.
notice like this when it starts in an interactive mode:

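Notice that in the second line returned, there is, in fact, the word code. This is not a failure of the regular expression or grep. Rather, this line was returned because earlier in the line, the pattern mode, found within the word model, was matched. The line was returned because there was an instance that matched the pattern.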
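One way to confirm which part of a line triggered the match is the -o (only-matching) option of GNU grep, which is not covered elsewhere in this tutorial. The following sketch applies it to a shortened copy of the line discussed above; the variable names and the second test string are assumptions:

```shell
# Shortened copy of the GPL-3 line that contains both "model" and "code".
line='model, to give anyone who possesses the object code'

# -o prints only the text that matched, one match per line.
matched=$(printf '%s\n' "$line" | grep -o "[^c]ode")
echo "$matched"

# A line whose only "ode" is preceded by c does not match at all.
# grep exits non-zero when nothing matches, hence the "|| true".
only_code=$(printf 'code only\n' | grep -c "[^c]ode" || true)
echo "$only_code"
```

The first command prints mode, showing that the match comes from model rather than from code. The second prints a count of 0.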

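Another helpful feature of brackets is that you can specify a range of characters instead of individually typing every available character.

This means that if you want to find every line that begins with a capital letter, you can use the following pattern: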

grep "^[A-Z]" GPL-3

Here's the output this expression returns:

Output
GNU General Public License for most of our software; it applies also to
States should not allow patents to restrict development and use of
License.  Each licensee is addressed as "you".  "Licensees" and
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
System Libraries, or general-purpose tools or generally available free
Source.
User Product is transferred to the recipient in perpetuity or for a
...
...

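Because the meaning of a range such as A-Z depends on the collation order of the current locale, it is often more reliable to use POSIX character classes instead of character ranges like the one you just used.

Discussing every POSIX character class is beyond the scope of this guide, but an example that accomplishes the same thing as the previous example uses the [:upper:] character class within a bracket selector: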

grep "^[[:upper:]]" GPL-3

The output will be the same as before.
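The two forms can be compared side by side on a small sample. In this sketch the range version is run with LC_ALL=C, an assumption made here so that A-Z means exactly the 26 upper-case ASCII letters regardless of the system locale; the sample lines and variable names are also illustrative:

```shell
# Lines starting with an upper-case letter, a lower-case letter, and a digit.
sample=$(mktemp)
printf '%s\n' 'Source.' 'source.' '9 lives' > "$sample"

by_range=$(LC_ALL=C grep "^[A-Z]" "$sample")   # range, pinned to the C locale
by_class=$(grep "^[[:upper:]]" "$sample")      # POSIX class, locale-aware
echo "$by_range"
echo "$by_class"

rm -f "$sample"
```

Both commands return only the line that starts with an upper-case letter.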

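Repeat Pattern Zero or More Times

Finally, one of the most commonly used meta-characters is the asterisk, or *, which means "repeat the previous character or expression zero or more times".

To find each line in the GPL-3 file that contains an opening and closing parenthesis with only letters and single spaces in between, use the following expression: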

grep "([A-Za-z ]*)" GPL-3

You'll get the following output:

Output
 Copyright (C) 2007 Free Software Foundation, Inc.
distribution (with or without modification), making available to the
than the work as a whole, that (a) is included in the normal form of
Component, and (b) serves only to enable use of the work with that
(if any) on which the executable work runs, or a compiler used to
    (including a physical distribution medium), accompanied by the
    (including a physical distribution medium), accompanied by a
    place (gratis or for a charge), and offer equivalent access to the
...
...
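Because * allows zero repetitions, the same pattern also matches an empty pair of parentheses. The following sketch shows this on invented sample lines (contents and variable names are assumptions):

```shell
# Parentheses holding a letter, nothing, a letter plus digit, and letters plus space.
sample=$(mktemp)
printf '%s\n' '(C) 2007' 'empty () pair' '(a1) mixed' 'f(x y)' > "$sample"

# In basic regular expressions, unescaped ( and ) are literal characters.
starred=$(grep "([A-Za-z ]*)" "$sample")
echo "$starred"

rm -f "$sample"
```

Three lines are returned. The line containing (a1) is skipped, because the digit is not in the bracket expression.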

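So far you've used periods, asterisks, and other characters in your expressions, but sometimes you need to search for those characters specifically.

Escaping Meta-Characters

There are times when you'll need to search for a literal period or a literal opening bracket, especially when working with source code or configuration files. Because these characters have special meaning in regular expressions, you need to "escape" them to tell grep that you do not wish to use their special meaning in this case.

You escape characters by placing the backslash character (\) in front of the character that would normally have a special meaning.

For instance, to find any line that begins with a capital letter and ends with a period, use the following expression, which escapes the ending period so that it represents a literal period instead of the usual "any character" meaning: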

grep "^[A-Z].*\.$" GPL-3

This is the output you'll see:

Output
Source.
License by making exceptions from one or more of its conditions.
License would be to refrain entirely from conveying the Program.
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
SUCH DAMAGES.
Also add information on how to contact you by electronic and paper mail.
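The effect of escaping is easiest to see by running the same search with and without the backslash. This sketch uses made-up version strings and the -c (count) option, neither of which comes from the tutorial's files:

```shell
# Three similar strings; only the first contains a literal period.
sample=$(mktemp)
printf '%s\n' 'version 3.0' 'version 310' 'version 30' > "$sample"

unescaped=$(grep -c '3.0' "$sample")   # . matches any single character
escaped=$(grep -c '3\.0' "$sample")    # \. matches only a literal period
echo "unescaped: $unescaped, escaped: $escaped"

rm -f "$sample"
```

The unescaped pattern counts two lines, since 310 also fits, while the escaped pattern counts only the line containing 3.0.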

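Now let's look at other regular expression options.

Extended Regular Expressions

The grep command supports a more extensive regular expression language by using the -E flag or by calling the egrep command instead of grep. Recent releases of GNU grep print a warning that egrep is obsolescent, so grep -E is the safer choice in new scripts.

These options open up the capabilities of "extended regular expressions". Extended regular expressions include all of the basic meta-characters, along with additional meta-characters to express more complex matches.

Grouping

One of the most useful abilities that extended regular expressions open up is the ability to group expressions together to manipulate or reference as one unit.

To group expressions together, wrap them in parentheses. If you would like to use parentheses without using extended regular expressions, you can escape them with a backslash to enable this functionality. This means that the following three expressions are functionally equivalent: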

grep "\(grouping\)" file.txt
grep -E "(grouping)" file.txt
egrep "(grouping)" file.txt
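As a rough check of this equivalence, the sketch below counts matches for the escaped basic form and the extended form, and contrasts them with unescaped parentheses in a basic expression, which are treated as literal characters. The sample file and variable names are assumptions, and egrep is left out because newer GNU grep releases warn about it:

```shell
# One line with the bare word, one with the word in literal parentheses.
sample=$(mktemp)
printf '%s\n' 'a grouping example' 'a (grouping) example' > "$sample"

bre=$(grep -c '\(grouping\)' "$sample")     # basic syntax, escaped group
ere=$(grep -cE '(grouping)' "$sample")      # extended syntax, plain group
literal=$(grep -c '(grouping)' "$sample")   # basic syntax, literal parentheses
echo "bre: $bre, ere: $ere, literal: $literal"

rm -f "$sample"
```

The two grouped forms each match both lines, while the unescaped basic form matches only the line that actually contains parentheses.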

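Alternation

Similar to how bracket expressions can specify different possible choices for single character matches, alternation allows you to specify alternative matches for strings or expression sets.

To indicate alternation, use the pipe character |. These are often used within parenthetical grouping to specify that one of two or more possibilities should be considered a match.

The following will find either GPL or General Public License in the text: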

grep -E "(GPL|General Public License)" GPL-3

The output looks like this:

Output
  The GNU General Public License is a free, copyleft license for
the GNU General Public License is intended to guarantee your freedom to
GNU General Public License for most of our software; it applies also to
price.  Our General Public Licenses are designed to make sure that you
  Developers that use the GNU GPL protect your rights with two steps:
  For the developers' and authors' protection, the GPL clearly explains
authors' sake, the GPL requires that modified versions be marked as
have designed this version of the GPL to prohibit the practice for those
...
...

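Alternation can select between more than two choices by adding additional choices within the selection group, separated by additional pipe (|) characters.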
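A three-way alternation might look like the following sketch. The third alternative, GNU, and the sample lines are additions made here for illustration:

```shell
# Four sample lines; the last one matches none of the alternatives.
sample=$(mktemp)
printf '%s\n' 'the GPL' 'General Public License' 'GNU project' 'MIT license' > "$sample"

# Any one of the three alternatives is enough for a line to match.
three_way=$(grep -E "(GPL|General Public License|GNU)" "$sample")
echo "$three_way"

rm -f "$sample"
```

The first three lines are returned, one for each alternative.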

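Quantifiers

Like the * meta-character, which matches the previous character or character set zero or more times, there are other meta-characters available in extended regular expressions that specify the number of occurrences.

To match a character zero or one times, you can use the ? character. This makes the character or character set that came before it optional, in essence.

The following matches copyright and right by putting copy in an optional group: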

grep -E "(copy)?right" GPL-3

You'll receive this output:

Output
 Copyright (C) 2007 Free Software Foundation, Inc.
  To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights.  Therefore, you have
know their rights.
  Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
  "Copyright" also means copyright-like laws that apply to other kinds of
...

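The + character matches an expression one or more times. This is almost like the * meta-character, but with the + character, the expression must match at least once.

The following expression matches the string free plus one or more characters that are not whitespace characters: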

grep -E "free[^[:space:]]+" GPL-3

You'll see this output:

Output
  The GNU General Public License is a free, copyleft license for
to take away your freedom to share and change the works.  By contrast,
the GNU General Public License is intended to guarantee your freedom to
  When we speak of free software, we are referring to freedom, not
have the freedom to distribute copies of free software (and charge for
you modify it: responsibilities to respect the freedom of others.
freedoms that you received.  You must make sure that they, too, receive
protecting users' freedom to change the software.  The systematic
of the GPL, as needed to protect the freedom of users.
patents cannot be used to render the program non-free.
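The three quantifiers can be compared on one small file. This is a sketch with invented sample lines and variable names, not output from GPL-3:

```shell
# "free" appears alone, with a suffix, and followed by a space.
sample=$(mktemp)
printf '%s\n' 'free' 'freedom' 'free software' 'copyright' 'right' > "$sample"

plus=$(grep -E 'free[^[:space:]]+' "$sample")   # at least one non-space after free
star=$(grep -E 'free[^[:space:]]*' "$sample")   # zero or more non-space after free
opt=$(grep -E '(copy)?right' "$sample")         # copy is optional
printf '%s\n' "+: $plus" "*: $star" "?: $opt"

rm -f "$sample"
```

With +, only freedom matches, because free at the end of a line or before a space has no following non-space character. With *, all three lines containing free match, and the ? pattern returns both copyright and right.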

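Specifying Match Repetition

To specify the number of times that a match is repeated, use the brace characters ({ and }). These characters let you specify an exact number, a range, or an upper or lower bound on the number of times that an expression can match.

Use the following expression to find all of the lines in the GPL-3 file that contain three consecutive vowels: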

grep -E "[AEIOUaeiou]{3}" GPL-3

Each line returned has a word with three consecutive vowels:

Output
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
receive it, in any medium, provided that you conspicuously and
give under the previous paragraph, plus a right to possession of the
covered work so as to satisfy simultaneously your obligations under this

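To match any words that have between 16 and 20 characters, use the following expression. Because the pattern is not anchored to word boundaries, a word with more than 20 letters would also match, since it contains a run of 16 letters: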

grep -E "[[:alpha:]]{16,20}" GPL-3

Here's this command's output:

Output
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
    c) Prohibiting misrepresentation of the origin of that material, or

Only lines that contain a word of that length are displayed.
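If the upper bound of the range matters, one possible approach is to add the -w (whole-word) option of GNU grep, which is not used elsewhere in this tutorial. The sketch below compares the two on made-up lines containing a 16-letter word and a 28-letter word; the sample contents and variable names are assumptions:

```shell
# responsibilities has 16 letters; antidisestablishmentarianism has 28.
sample=$(mktemp)
printf '%s\n' 'short words only' 'certain responsibilities' 'antidisestablishmentarianism' > "$sample"

unanchored=$(grep -cE '[[:alpha:]]{16,20}' "$sample")   # any run of 16 to 20 letters
whole=$(grep -cwE '[[:alpha:]]{16,20}' "$sample")       # the match must be a whole word
echo "unanchored: $unanchored, whole-word: $whole"

rm -f "$sample"
```

Without -w, both long words match, because the 28-letter word contains a run of 16 letters. With -w, only the line with the 16-letter word is counted.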

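Conclusion

grep is useful for finding patterns within files or within the file system hierarchy, so it's worth spending time getting comfortable with its options and syntax.

Regular expressions are even more versatile, and can be used with many popular programs. For instance, many text editors implement regular expressions for searching and replacing text.

Furthermore, most modern programming languages use regular expressions to perform procedures on specific pieces of data. Once you understand regular expressions, you'll be able to transfer that knowledge to many common computer-related tasks, from performing advanced searches in your text editor to validating user input.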