Question
I am building a forum and I want to use forum-style tags to let the users format their posts in a limited fashion. Currently I am using Regex to do this. As per this question: How to use C# regular expressions to emulate forum tags
The problem is that the regex does not distinguish between nested tags. Here is a sample of how I implemented this method:
// Requires: using System.Text.RegularExpressions;
public static string MyExtensionMethod(this string text)
{
    return TransformTags(text);
}

private static string TransformTags(string input)
{
    // Group 1: tag name, group 2: optional value, group 3: content.
    // \1 requires the closing tag to repeat the opening tag's name.
    string regex = @"\[([^=]+)[=\x22']*(\S*?)['\x22]*\](.+?)\[/(\1)\]";
    MatchCollection matches = new Regex(regex).Matches(input);
    for (int i = 0; i < matches.Count; i++)
    {
        var tag = matches[i].Groups[1].Value;
        var optionalValue = matches[i].Groups[2].Value;
        var content = matches[i].Groups[3].Value;
        if (Regex.IsMatch(content, regex))
        {
            content = TransformTags(content);
        }
        content = HandleTags(content, optionalValue, tag);
        input = input.Replace(matches[i].Groups[0].Value, content);
    }
    return input;
}

private static string HandleTags(string content, string optionalValue, string tag)
{
    switch (tag.ToLower())
    {
        case "quote":
            return string.Format("<div class='quote'>{0}</div>", content);
        default:
            return string.Empty;
    }
}
Now, if I submit something like [quote] This user posted [quote] blah [/quote] [/quote] it does not properly detect the nested quote. Instead it pairs the first opening quote tag with the first closing quote tag.
Are there any recommended solutions? Can the regex be modified to grab nested tags? Maybe I shouldn't use regex for this?
Recommended answer
While using "only" regex is probably possible using balancing groups, it's pretty heavy voodoo, and it's intrinsically "fragile". What I propose is using regexes to find the open/close tags (without trying to associate each close with its open), collecting them as you go (a stack works well), and then parsing them "by hand" (with a foreach). This way you get the best of both worlds: the regex does the searching for tags, and your own code does the handling of them (including the wrongly written ones).
// Requires: using System; using System.Collections.Generic;
//           using System.Text; using System.Text.RegularExpressions;
class TagMatch
{
    public string Tag { get; set; }
    public Capture Capture { get; set; }
    public readonly List<string> Substrings = new List<string>();
}

static void Main(string[] args)
{
    // OPEN matches [tag], CLOSE matches [/tag], TEXT matches everything else
    // (including a lone '[' that does not start a valid tag).
    var rx = new Regex(@"(?<OPEN>\[[A-Za-z]+?\])|(?<CLOSE>\[/[A-Za-z]+?\])|(?<TEXT>[^\[]+|\[)");
    var str = "Lorem [AA]ipsum [BB]dolor sit [/BB]amet, [ consectetur ][/AA]adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";
    var matches = rx.Matches(str);

    // The bottom entry of the stack collects the top-level output.
    var recurse = new Stack<TagMatch>();
    recurse.Push(new TagMatch { Tag = String.Empty });

    foreach (Match match in matches)
    {
        var text = match.Groups["TEXT"];
        TagMatch last;
        if (text.Success)
        {
            last = recurse.Peek();
            last.Substrings.Add(text.Value);
            continue;
        }

        var open = match.Groups["OPEN"];
        string tag;
        if (open.Success)
        {
            // Strip the surrounding '[' and ']'.
            tag = open.Value.Substring(1, open.Value.Length - 2);
            recurse.Push(new TagMatch { Tag = tag, Capture = open.Captures[0] });
            continue;
        }

        var close = match.Groups["CLOSE"];
        // Strip the leading "[/" and the trailing ']'.
        tag = close.Value.Substring(2, close.Value.Length - 3);
        last = recurse.Peek();
        if (last.Tag == tag)
        {
            recurse.Pop();
            var lastLast = recurse.Peek();
            lastLast.Substrings.Add("**" + last.Tag + "**");
            lastLast.Substrings.AddRange(last.Substrings);
            lastLast.Substrings.Add("**/" + last.Tag + "**");
        }
        else
        {
            // Close tag that does not match the innermost open tag.
            throw new Exception();
        }
    }

    if (recurse.Count != 1)
    {
        // At least one tag was opened but never closed.
        throw new Exception();
    }

    var sb = new StringBuilder();
    foreach (var str2 in recurse.Pop().Substrings)
    {
        sb.Append(str2);
    }
    var str3 = sb.ToString();
}
This is an example. It's case sensitive (but that is easy to fix). It doesn't handle "unpaired" tags, because there are various ways to handle them. Wherever you find a "throw new Exception" you'll have to add your own handling. Clearly this isn't a "drop in" solution. It's only an example. By this logic, I won't respond to questions like "the compiler tells me I need a namespace" or "the compiler can't find Regex". BUT I will be more than happy to respond to "advanced" questions, like how unpaired tags could be matched, or how you could add support for [AAA=bbb] tags.
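As a rough illustration of those two "advanced" points, here is one possible way to extend the stack approach. It is only a sketch under assumptions: the names ForumTagSketch, Render, Frame and ToHtml are invented for this example, the HTML produced for [quote] and [b] is arbitrary, and the policy chosen for unpaired tags (leave them in the output as literal text) is just one of the various options mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class ForumTagSketch
{
    // OPEN: [tag] or [tag=option]; CLOSE: [/tag]; TEXT: anything else, including a lone '['.
    private static readonly Regex Tokens = new Regex(
        @"\[(?<OPEN>[A-Za-z]+)(=(?<OPTION>[^\[\]\r\n]*))?\]|\[/(?<CLOSE>[A-Za-z]+)\]|(?<TEXT>[^\[]+|\[)");

    private sealed class Frame
    {
        public string Tag = "";      // lower-cased tag name
        public string Option = "";   // text after '=', if any
        public string Raw = "";      // the tag exactly as typed, used when it stays unpaired
        public readonly StringBuilder Content = new StringBuilder();
    }

    public static string Render(string input)
    {
        var stack = new Stack<Frame>();
        stack.Push(new Frame());     // root frame collects the top-level output

        foreach (Match m in Tokens.Matches(input))
        {
            if (m.Groups["TEXT"].Success)
            {
                stack.Peek().Content.Append(WebUtility.HtmlEncode(m.Value));
            }
            else if (m.Groups["OPEN"].Success)
            {
                stack.Push(new Frame
                {
                    Tag = m.Groups["OPEN"].Value.ToLowerInvariant(),
                    Option = m.Groups["OPTION"].Value,
                    Raw = m.Value
                });
            }
            else
            {
                var closing = m.Groups["CLOSE"].Value.ToLowerInvariant();
                if (stack.Count > 1 && stack.Peek().Tag == closing)
                {
                    var frame = stack.Pop();
                    stack.Peek().Content.Append(ToHtml(frame, m.Value));
                }
                else
                {
                    // Assumed policy: an unpaired close tag stays as literal text.
                    stack.Peek().Content.Append(WebUtility.HtmlEncode(m.Value));
                }
            }
        }

        // Assumed policy: tags that were never closed are emitted as literal text.
        while (stack.Count > 1)
        {
            var frame = stack.Pop();
            stack.Peek().Content.Append(WebUtility.HtmlEncode(frame.Raw)).Append(frame.Content.ToString());
        }
        return stack.Pop().Content.ToString();
    }

    private static string ToHtml(Frame f, string rawClose)
    {
        var inner = f.Content.ToString();
        switch (f.Tag)
        {
            case "quote":
                var cite = f.Option.Length > 0 ? "<cite>" + WebUtility.HtmlEncode(f.Option) + "</cite>" : "";
                return "<div class='quote'>" + cite + inner + "</div>";
            case "b":
                return "<strong>" + inner + "</strong>";
            default:
                // Unknown tag: leave it visible rather than dropping the content.
                return WebUtility.HtmlEncode(f.Raw) + inner + WebUtility.HtmlEncode(rawClose);
        }
    }
}
```

Calling ForumTagSketch.Render on the nested quote from the question yields one quote div inside the other, because each close tag is paired with whatever is on top of the stack at that moment rather than with the first open tag in the string.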
(Second major edit)
Bwahahahah! I DID know groupings were the way to do it!
// Some classes
class BaseTagMatch
{
    public Capture Capture;

    public override string ToString()
    {
        return String.Format("{1}: {2} [{0}]", GetType(), Capture.Index, Capture.Value.ToString());
    }
}

class BeginTag : BaseTagMatch
{
    public int Index;
    public Capture Options;
    public EndTag EndTag;
}

class EndTag : BaseTagMatch
{
    public int Index;
    public BeginTag BeginTag;
}

class Text : BaseTagMatch
{
}

class UnmatchedTag : BaseTagMatch
{
}
// The code
var pattern =
    @"(?# line 01) ^" +
    @"(?# line 02) (" +
    // Non [ Text
    @"(?# line 03)   (?>(?<TEXT>[^\[]+))" +
    @"(?# line 04)   |" +
    // Immediately closed tag [a/]
    @"(?# line 05)   (?>\[ (?<TAG> [A-Z]+ ) \x20* =? \x20* (?<TAG_OPTION>( (?<= = \x20*) ( (?! \x20* /\]) [^\[\]\r\n] )* )? ) (?<BEGIN_INNER_TEXT>)(?<END_INNER_TEXT>) \x20* /\] )" +
    @"(?# line 06)   |" +
    // Matched open tag [a]
    @"(?# line 07)   \[ (?<TAG> (?<OPEN> [A-Z]+ ) ) \x20* =? \x20* (?<TAG_OPTION>( (?<= = \x20*) ( (?! \x20* \]) [^\[\]\r\n] )* )? ) \x20* \] (?<BEGIN_INNER_TEXT>)" +
    @"(?# line 08)   |" +
    // Matched close tag [/a]
    @"(?# line 09)   (?>(?<END_INNER_TEXT>) \[/ \k<OPEN> \x20* \] (?<-OPEN>))" +
    @"(?# line 10)   |" +
    // Unmatched open tag [a]
    @"(?# line 11)   (?>(?<UNMATCHED_TAG> \[ [A-Z]+ \x20* =? \x20* ( (?<= = \x20*) ( (?! \x20* \]) [^\[\]\r\n] )* )? \x20* \] ) )" +
    @"(?# line 12)   |" +
    // Unmatched close tag [/a]
    @"(?# line 13)   (?>(?<UNMATCHED_TAG> \[/ [A-Z]+ \x20* \] ) )" +
    @"(?# line 14)   |" +
    // Single [ of Text (unmatched by other patterns)
    @"(?# line 15)   (?>(?<TEXT>\[))" +
    @"(?# line 16) )*" +
    @"(?# line 17) (?(OPEN)(?!))" +
    @"(?# line 18) $";
var rx = new Regex(pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var match = rx.Match("[div=c:max max]asdf[p = 1 ] a [p=2] [b = p/pp /] [q/] \r\n[a]sd [/z] [ [/p]f[/p]asdffds[/DIV] [p][/p]");
////var tags = match.Groups["TAG"].Captures.OfType<Capture>().ToArray();
////var tagoptions = match.Groups["TAG_OPTION"].Captures.OfType<Capture>().ToArray();
////var begininnertext = match.Groups["BEGIN_INNER_TEXT"].Captures.OfType<Capture>().ToArray();
////var endinnertext = match.Groups["END_INNER_TEXT"].Captures.OfType<Capture>().ToArray();
////var text = match.Groups["TEXT"].Captures.OfType<Capture>().ToArray();
////var unmatchedtag = match.Groups["UNMATCHED_TAG"].Captures.OfType<Capture>().ToArray();

// One BeginTag per TAG capture, with the TAG_OPTION capture at the same index.
var tags = match.Groups["TAG"].Captures.OfType<Capture>().Select((p, ix) => new BeginTag { Capture = p, Index = ix, Options = match.Groups["TAG_OPTION"].Captures[ix] }).ToList();

// Pairs the ix-th END_INNER_TEXT capture with the ix-th TAG capture.
Func<Capture, int, EndTag> func = (p, ix) =>
{
    var temp = new EndTag { Capture = p, Index = ix, BeginTag = tags[ix] };
    tags[ix].EndTag = temp;
    return temp;
};

var endTags = match.Groups["END_INNER_TEXT"].Captures.OfType<Capture>().Select((p, ix) => func(p, ix));
var text = match.Groups["TEXT"].Captures.OfType<Capture>().Select((p, ix) => new Text { Capture = p });
var unmatchedTags = match.Groups["UNMATCHED_TAG"].Captures.OfType<Capture>().Select((p, ix) => new UnmatchedTag { Capture = p });

// Here you have all the tags and the inner text neatly ordered and ready to be recomposed in a StringBuilder.
var allTags = tags.Cast<BaseTagMatch>().Union(endTags).Union(text).Union(unmatchedTags).ToList();
allTags.Sort((p, q) => p.Capture.Index - q.Capture.Index);

foreach (var el in allTags)
{
    var type = el.GetType();
    if (type == typeof(BeginTag))
    {
    }
    else if (type == typeof(EndTag))
    {
    }
    else if (type == typeof(UnmatchedTag))
    {
    }
    else
    {
        // Text
    }
}
Case-insensitive tag matching, ignores tags that are not correctly closed, supports immediately closed tags ([BR/]). And someone said it wasn't possible with Regex.... Bwahahahahah!
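The part of the pattern that makes nesting work is the .NET balancing-group construct: a named group pushes a capture for every opener, (?<-OPEN>) pops one for every closer, and the conditional (?(OPEN)(?!)) on line 17 forces a failure if anything is still on the stack at the end. The mechanism is easier to see in isolation, so here is a minimal sketch that only checks whether square brackets are balanced. The class and method names are made up for this illustration, and the pattern is a deliberately stripped-down cousin of the one above, not a replacement for it.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class BalancingSketch
{
    // (?<OPEN>\[)   pushes a capture onto the OPEN stack for every '['.
    // (?<-OPEN>\])  pops one capture for every ']' and fails if the stack is empty.
    // (?(OPEN)(?!)) fails the whole match if any '[' is still unclosed at the end.
    private static readonly Regex Balanced = new Regex(
        @"^(?:[^\[\]]|(?<OPEN>\[)|(?<-OPEN>\]))*(?(OPEN)(?!))$");

    public static bool IsBalanced(string input)
    {
        return Balanced.IsMatch(input);
    }
}
```

With this, IsBalanced("[a[b]c]") is true, while both IsBalanced("[a[b]c") and IsBalanced("a]b") are false.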
TAG, TAG_OPTION, BEGIN_INNER_TEXT and END_INNER_TEXT are matched (they always have the same number of elements). TEXT and UNMATCHED_TAG AREN'T matched! TAG and TAG_OPTION are self-explanatory (both are stripped of useless spaces). The BEGIN_INNER_TEXT and END_INNER_TEXT captures are always empty, but you can use their Index property to see where the tags begin/end. UNMATCHED_TAG contains the tags that have been opened but not closed, or closed but not opened. It doesn't contain tags that are wrong in format (for example [123]).
In the end I take TAG, END_INNER_TEXT (to see where the tags end), TEXT and UNMATCHED_TAG and sort them by index. Then you can take allTags, put it in a foreach and test the type of each element. Easy :-) :-)
As a small note, the Regex options are RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase. The first two are there to make the pattern easier to write and to read; the third one is semantic. It makes [A] match with [/a].
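To see what that third option changes, here is a tiny sketch with a much simpler pattern than the one above: a single tag pair whose closing name is tied to the opening name through a \k<OPEN> backreference. The class name, method name and pattern are invented for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class IgnoreCaseSketch
{
    // Simplified pattern: [NAME] ... [/NAME], where \k<OPEN> must repeat the captured name.
    private const string Pattern = @"^ \[ (?<OPEN> [A-Z]+ ) \] .* \[/ \k<OPEN> \] $";

    public static bool IsPaired(string input, bool ignoreCase)
    {
        var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture;
        if (ignoreCase)
        {
            // With IgnoreCase, both [A-Z] and the backreference compare case-insensitively.
            options |= RegexOptions.IgnoreCase;
        }
        return Regex.IsMatch(input, Pattern, options);
    }
}
```

IsPaired("[A]text[/a]", true) returns true, whereas IsPaired("[A]text[/a]", false) returns false because the backreference then has to repeat the captured "A" exactly.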
Required reading:
http://www.codeproject.com/KB/recipes/Nested_RegEx_explained.aspx
http://www.codeproject.com/KB/recipes/RegEx_Balanced_Grouping.aspx