Tuesday, July 25, 2006

Analyzing the HTTP Protocol with Lex

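Searching www.krugle.com for code that handles the HTTP protocol turns up plenty worth borrowing from other people.
For example, the HTTP handling code in an FTP program called YAFFA is quite interesting.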


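Let's start with something simple: how to read one line from the server.
This function calls a function named httplex and then parses the line according to the token values it returns.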
Lintyp HTTParser::read_line(Response **output)
{
    Lintyp retdat = NOTYPE;

    int val;
    val = httplex();
    while(val != CRLF)
    {
        // cout << "Val: " << val;
        lastval = val;  // remember the last token seen on this line
        if(retdat == NOTYPE)
        {
            // The first token of the line decides which object to create.
            switch(val)
            {
            case STATUS_HTTP:
                *output = new StatusLine();
                retdat = STATUS_LINE;
                break;
            case EH_CNT_LENGTH:
                *output = new ContentLength();
                retdat = CONTENT_LENGTH;
                break;
            case GH_TRFENC:
                *output = new TransferEncoding();
                retdat = TRANSFER_ENCODING;
                break;
            case EH_CNT_TYPE:
                *output = new ContentType();
                retdat = CONTENT_TYPE;
                break;
            case EH_CNT_ENCODING:
                *output = new ContentEncoding();
                retdat = CONTENT_ENCODING;
                break;
            default:
                retdat = ANYTYPE;
                break;
            }
        }
        if(retdat == ANYTYPE)
            val = httplex();  // unrecognized line: skip tokens up to CRLF
        else
            val = CRLF;       // recognized line: leave the loop
    }
    // A CRLF immediately after the previous line's CRLF is the empty line
    // that separates the headers from the message body.
    if(lastval == CRLF && val == CRLF)
    {
        retdat = MESSAGE_BODY;
    }
    lastval = val;

    return retdat;
}
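Clearly, httplex is a scanner function generated by the Lex lexical analysis tool. This is exactly the approach I have long wished for: use Lex and Yacc to implement the parsing and handling of a protocol. That greatly simplifies the tedious protocol analysis inside a program and lets you concentrate on the content.
The states and content are returned as dedicated objects such as ContentLength and ContentType, so the protocol content is no longer raw text but a set of objects.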
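To make the text-to-object idea concrete, here is a minimal sketch in C++. It is not YAFFA's code: the class bodies, the `length` and `media_type` members, and the `make_response` and `trim` helpers are assumptions invented for illustration, and the line is split by hand rather than by a Lex scanner.

```cpp
#include <cstdlib>
#include <memory>
#include <string>

// Hypothetical stand-ins for the Response classes; the real ones are not shown in this post.
struct Response {
    virtual ~Response() = default;
    virtual std::string name() const = 0;
};

struct ContentLength : Response {
    long length;
    explicit ContentLength(long n) : length(n) {}
    std::string name() const override { return "Content-Length"; }
};

struct ContentType : Response {
    std::string media_type;
    explicit ContentType(const std::string& t) : media_type(t) {}
    std::string name() const override { return "Content-Type"; }
};

// Strip leading and trailing spaces, tabs, CR and LF.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Turn one header line into an object; returns nullptr for lines
// this sketch does not recognize.
std::unique_ptr<Response> make_response(const std::string& line) {
    std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) return nullptr;
    std::string field = trim(line.substr(0, colon));
    std::string value = trim(line.substr(colon + 1));
    if (field == "Content-Length") {
        return std::make_unique<ContentLength>(std::strtol(value.c_str(), nullptr, 10));
    }
    if (field == "Content-Type") {
        return std::make_unique<ContentType>(value);
    }
    return nullptr;
}
```

Once a line has become a `ContentLength` object, the rest of the program can read `length` as a number and never touch the header text again.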

Below is the corresponding Lex file. The token values are defined as an enum in httplexer.h, inside the same namespace, HTTPLexer_space.
%{
#include "httplexer.h"
using namespace HTTPLexer_space;
%}
%option yylineno
%option noyywrap
delim [ \t]
cre [\r]
lfe [\n]
crlf {cre}{lfe}
sp [ ]
slash [/]
colon [:]
scolon [;]
equal [=]
coma [,]
ws {delim}+
letter [A-Za-z.\"\\\-\(\)\<\>\*\&\+\~]
digit [0-9]
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
token ({letter}|{digit})*
%%
{ws} {return(SP);}
{slash} {return(SLASH);}
{colon} {return(COLON);}
{scolon} {return(SEMI_COLON);}
{coma} {return(COMA);}
{crlf} {return(CRLF);}
{equal} {return(EQUAL);}
"HTTP" {return(STATUS_HTTP);}
"Cache-Control" {return(GH_CCONT);}
"Connection" {return(GH_CONNECTION);}
"Date" {return(GH_DATE);}
"Pragma" {return(GH_PRAGMA);}
"Trailer" {return(GH_TRAILER);}
"Transfer-Encoding" {return(GH_TRFENC);}
"UPGRADE" {return(GH_UPGRADE);}
"Via" {return(GH_VIA);}
"Warning" {return(GH_WARNING);}
"Accept-Ranges" {return(RH_ARANGES);}
"Age" {return(RH_AGE);}
"ETag" {return(RH_ETAG);}
"Location" {return(RH_LOCATION);}
"Proxy-Authenticate" {return(RH_PRXAUTH);}
"Retry-After" {return(RH_RETRYA);}
"Server" {return(RH_SERVER);}
"Vary" {return(RH_VARY);}
"WWW-Authenticate" {return(RH_WWWAUTH);}
"Allow" {return(EH_ALLOW);}
"Content-Encoding" {return(EH_CNT_ENCODING);}
"Content-Language" {return(EH_CNT_LANGUAGE);}
"Content-Length" {return(EH_CNT_LENGTH);}
"Content-Location" {return(EH_CNT_LOCATION);}
"Content-MD5" {return(EH_CNT_MD5);}
"Content-Range" {return(EH_CNT_RANGE);}
"Content-Type" {return(EH_CNT_TYPE);}
"Expires" {return(EH_EXPIRES);}
"Last-Modified" {return(EH_LASTMOD);}
{number} {return(NUMBER);}
{token} {return(TOKEN);}
%%
void http_set_parser(const char *str, int nr) {
    YY_BUFFER_STATE ybst;
    ybst = yy_scan_bytes(str, nr);
    yy_switch_to_buffer(ybst);
}
void http_reset_parser(void) {
    yy_delete_buffer(YY_CURRENT_BUFFER);
}


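As you can see, for a protocol with such a simple grammar, mature tools and a suitable method make the implementation very straightforward.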
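To see what token stream such a scanner produces for one header line, here is a small hand-written imitation in C++. It is only a sketch under simplifying assumptions, not the code that Lex generates: it knows two header names, treats only plain integers as NUMBER, narrows the {letter} class to alphanumerics plus '-' and '.', and adds a hypothetical UNKNOWN token for everything else. It does follow the same matching principle as Lex: take the longest possible match, and when two rules match the same text, the rule listed first wins (which is why "Content-Length" comes back as a keyword rather than a TOKEN).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A few token kinds borrowed from the Lex file; UNKNOWN is a hypothetical extra.
enum Tok { SP, SLASH, COLON, CRLF, NUMBER, TOKEN, EH_CNT_LENGTH, EH_CNT_TYPE, UNKNOWN };

// Simplified version of the {letter}|{digit} class.
static bool is_word_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '-' || c == '.';
}

// Classify a maximal run of word characters: keyword first, then number, then token.
static Tok classify(const std::string& word) {
    if (word == "Content-Length") return EH_CNT_LENGTH;
    if (word == "Content-Type") return EH_CNT_TYPE;
    for (char c : word) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return TOKEN;
    }
    return NUMBER;  // every character was a digit
}

// Split the input into tokens, always consuming the longest match.
std::vector<Tok> scan(const std::string& in) {
    std::vector<Tok> out;
    std::size_t i = 0;
    while (i < in.size()) {
        char c = in[i];
        if (c == ' ' || c == '\t') {
            // A whole run of blanks is one SP token, like {ws} in the Lex file.
            while (i < in.size() && (in[i] == ' ' || in[i] == '\t')) ++i;
            out.push_back(SP);
        } else if (c == ':') {
            out.push_back(COLON);
            ++i;
        } else if (c == '/') {
            out.push_back(SLASH);
            ++i;
        } else if (c == '\r' && i + 1 < in.size() && in[i + 1] == '\n') {
            out.push_back(CRLF);
            i += 2;
        } else if (is_word_char(c)) {
            std::size_t start = i;
            while (i < in.size() && is_word_char(in[i])) ++i;
            out.push_back(classify(in.substr(start, i - start)));
        } else {
            out.push_back(UNKNOWN);
            ++i;
        }
    }
    return out;
}
```

With this sketch, `scan("Content-Length: 42\r\n")` yields EH_CNT_LENGTH, COLON, SP, NUMBER, CRLF, which is the kind of sequence read_line walks through one value at a time.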

Now let's see how this line-by-line reader is used and how the resulting objects are collected.
// Read a response from the server.
void HTTP::get_response(void)
{
    HTTParser parser(fcin);
    ResponseCoupler *a;
    response_list.clear();
    Lintyp lintp = NOTYPE;
    // Read line by line until the empty line before the message body.
    while(lintp != MESSAGE_BODY)
    {
        a = new ResponseCoupler();
        try {
            lintp = a->type = parser.read_line(&(a->response));
        }
        catch(StandardException *e)
        {
            cout << *e->getErrMsg() << "\n";
            throw new protoerr("HTTP: Failed to read response.", NONFATAL_EXCEPTION);
        }
        response_list.push_back(a);
    }
}
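This function reads everything that comes before the message body, parses it, and stores the results in a queue. Using it really is that simple. If you don't mind the efficiency cost too much, this is a very likeable approach.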
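Once the queue is filled, the rest of the program only has to walk it and pick out the objects it cares about. The sketch below shows one way that could look. It is an assumption-laden illustration, not YAFFA's code: the `length` member, the `find_content_length` helper, and the use of `std::unique_ptr` and by-value list entries (the code above stores raw pointers) are all choices made here for brevity.

```cpp
#include <list>
#include <memory>

// Line types as used above (only the ones this sketch needs).
enum Lintyp { NOTYPE, ANYTYPE, CONTENT_LENGTH, MESSAGE_BODY };

// Hypothetical stand-ins for the classes used by get_response.
struct Response {
    virtual ~Response() = default;
};

struct ContentLength : Response {
    long length;
    explicit ContentLength(long n) : length(n) {}
};

struct ResponseCoupler {
    Lintyp type;                         // what kind of line this was
    std::unique_ptr<Response> response;  // parsed object, may be empty
};

// Return the Content-Length stored in the queue, or -1 if there is none.
long find_content_length(const std::list<ResponseCoupler>& responses) {
    for (const ResponseCoupler& item : responses) {
        if (item.type == CONTENT_LENGTH && item.response) {
            // The type tag tells us which concrete class the object has.
            return static_cast<const ContentLength&>(*item.response).length;
        }
    }
    return -1;
}
```

The type tag stored next to each object plays the same role as `a->type` above: the consumer checks the tag first and only then casts to the concrete class.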