Fastest way to scrape all the pages of a website
I have a C# application that needs to scrape many pages within a specific domain as fast as possible. I have a Parallel.ForEach that iterates over all of the URLs (multi-threaded) and scrapes each one with the code below:
private string ScrapeWebpage(string url, DateTime? updateDate)
{
    //create request (which supports http compression)
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Pipelined = true;
    request.KeepAlive = true;
    request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
    if (updateDate != null)
        request.IfModifiedSince = updateDate.Value;

    //get response and wrap the stream in a decompressor if needed.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        Stream responseStream = response.GetResponseStream();
        if (response.ContentEncoding.ToLower().Contains("gzip"))
            responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
            responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);

        //read html; the using blocks dispose of the response and streams.
        using (StreamReader reader = new StreamReader(responseStream, Encoding.Default))
        {
            return reader.ReadToEnd();
        }
    }
}
As you can see, I have http compression support and set request.KeepAlive and request.Pipelined to true. I'm wondering whether the code I'm using is the fastest way to scrape many pages within the same site, or whether there is a better way to keep a session open across multiple requests. My code creates a new request instance for every page I hit; should I try to use a single request instance for all of the pages? Is having pipelining and keep-alive enabled ideal?
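On the reuse question: an HttpWebRequest is single-use by design, so creating one per page is expected. The keep-alive connections themselves are pooled per host by the ServicePoint layer, so the usual lever for parallel scraping of one domain is the per-host connection limit (which defaults to 2) rather than sharing a request object. A minimal sketch of that tuning, with the host URL and the limit of 16 as illustrative values, not recommendations from the original post:

```csharp
using System;
using System.Net;

class ScraperSetup
{
    static void Main()
    {
        // Raise the default 2-connections-per-host cap so that
        // Parallel.ForEach threads don't queue behind each other.
        ServicePointManager.DefaultConnectionLimit = 16;

        // Optionally tune the ServicePoint for the one domain being scraped
        // (http://example.com/ is a placeholder for the target site).
        ServicePoint sp = ServicePointManager.FindServicePoint(
            new Uri("http://example.com/"));
        sp.ConnectionLimit = 16;     // per-host override
        sp.MaxIdleTime = 30 * 1000;  // keep idle keep-alive sockets for 30s
    }
}
```

With KeepAlive = true, requests made through this pool reuse the same sockets automatically, which gives the multi-request session behavior asked about without any request-object sharing.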