Scraping and Downloading Images from a Web Page with HtmlAgilityPack: Crawler Study Notes

By admin on April 6, 2019

      
Recently an assignment required me to build a CNKI crawler. While researching crawler architectures I came across ScrapySharp, a rough port of Scrapy, the well-known open-source Python crawler framework. Searching online, the only sample I could find was an F# demo, so I wrote this C# version of the code against the site used in that demo.

PS: After digging into it, ScrapySharp turns out to be quite different from Scrapy. It has nothing like Scrapy's full set of eight components; it only covers fetching page content plus HtmlAgilityPack-based parsing extensions, which was a bit of a letdown.

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

namespace ScrapySharpDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // Sample page address (left blank in the original post)
            var url = "";
            var web = new ScrapingBrowser();
            var html = web.DownloadString(new Uri(url));
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            // Collect the image addresses from the page; the real URL sits in the
            // "original" attribute of each <img> inside the post body.
            var urls = doc.DocumentNode.CssSelect("div.bbs-content > img")
                .Select(node => node.GetAttributeValue("original"))
                .ToList();
            // Download the images in parallel
            Parallel.ForEach(urls, SavePic);
        }

        public static void SavePic(string url)
        {
            var web = new ScrapingBrowser();
            // Tianya blocks image requests coming from off-site referrers, so set
            // the Referer header to the current page's address first.
            web.Headers.Add("Referer", "");
            var pic = web.NavigateToPage(new Uri(url)).RawResponse.Body;
            // LastIndexOf keeps the leading '/', so "imgs" + file forms a valid path.
            var file = url.Substring(url.LastIndexOf("/", StringComparison.Ordinal));
            if (!Directory.Exists("imgs"))
                Directory.CreateDirectory("imgs");
            File.WriteAllBytes("imgs" + file, pic);
        }
    }
}
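Since ScrapySharp here amounts to little more than HtmlAgilityPack plus a downloader, the same flow also works with HtmlAgilityPack and WebClient alone. A minimal sketch, assuming the same blank page URL and the same "original" attribute as above:

using System;
using System.IO;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class PlainHapDemo
{
    static void Main()
    {
        var pageUrl = ""; // page address, left blank as in the original

        // HtmlWeb downloads and parses the page in one step.
        var doc = new HtmlWeb().Load(pageUrl);

        // XPath equivalent of the CSS selector "div.bbs-content > img".
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='bbs-content']/img");
        if (nodes == null) return; // SelectNodes yields null when nothing matches

        var urls = nodes.Select(img => img.GetAttributeValue("original", ""))
                        .Where(src => src.Length > 0)
                        .ToList();

        Directory.CreateDirectory("imgs"); // no-op if the folder already exists
        using (var wc = new WebClient())
        {
            foreach (var u in urls)
            {
                // WebClient may reset custom headers between requests, so set the
                // Referer (Tianya rejects hotlinked image requests) every time.
                wc.Headers[HttpRequestHeader.Referer] = pageUrl;
                wc.DownloadFile(u, Path.Combine("imgs", Path.GetFileName(new Uri(u).AbsolutePath)));
            }
        }
    }
}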

      

Today, browsing the Tianya forum, I found a nice thread to scrape (mainly for the, er, special URLs... you know the kind). It took a few minutes to rework the code: it now creates a directory per year/month/day with a subdirectory per article, saves all the images inside, runs from the command line, and takes an extra whole-site parameter...
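A note on the listing that follows: it is shortened (the page URL prefix is elided), and neither the year/month/day folder tree nor the command-line switch mentioned above actually appears in it. Purely as a hypothetical sketch, with flag and helper names of my own invention, those two pieces could look like this:

using System;
using System.IO;

static class CrawlOptions
{
    // Hypothetical flag name; the post only says a "whole site" parameter exists.
    public static bool WholeSite(string[] args) =>
        Array.Exists(args, a => a == "--all");

    // Hypothetical layout: one folder per day, one subfolder per article.
    public static string FolderFor(DateTime day, string articleTitle) =>
        Path.Combine("Images", day.ToString("yyyy"), day.ToString("MM"),
                     day.ToString("dd"), articleTitle);
}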

Scraping and downloading images from the XX site with HtmlAgilityPack (evil version)...

The new version of the code:

#region Using namespace

using System;
using System.IO;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

#endregion

namespace DownloadImages
{
    internal class Program
    {
        private static readonly WebClient Wc = new WebClient();
        private static readonly char[] InvalidFileNameChars =
            {
                '"', '<', '>', '|', '\0',
                '\u0001', '\u0002', '\u0003', '\u0004', '\u0005', '\u0006',
                '\a', '\b', '\t', '\n', '\v', '\f', '\r',
                '\u000e', '\u000f', '\u0010', '\u0011', '\u0012', '\u0013',
                '\u0014', '\u0015', '\u0016', '\u0017', '\u0018', '\u0019',
                '\u001a', '\u001b', '\u001c', '\u001d', '\u001e', '\u001f',
                ':', '*', '?', '\\', '/'
            };
        public static string CleanInvalidFileName(string fileName)
        {
            // Coerce null to an empty string, then strip every invalid character.
            fileName = fileName + "";
            fileName = InvalidFileNameChars.Aggregate(fileName, (current, c) => current.Replace(c + "", ""));

            // A file name starting with '.' would be awkward; spell the dot out.
            if (fileName.Length > 1)
                if (fileName[0] == '.')
                    fileName = "dot" + fileName.TrimStart('.');

            return fileName;
        }
        private static void Main(string[] args)
        {
            Start();
        }


        private static void Start()
        {
            var web = new HtmlWeb();
            // Crawl every (date, page id) combination in the ranges below.
            var startDate = int.Parse(DateTime.Parse("2010-08-18").ToString("yyyyMMdd"));
            var endDate = int.Parse(DateTime.Now.ToString("yyyyMMdd"));
            const int startPageId = 49430;
            const int endPageId = 124621;
            for (int k = startDate; k <= endDate; k++)
            {
                for (int j = startPageId; j <= endPageId; j++)
                {
                    string cnblogs = /* page URL prefix omitted in the post; see the source download */ + k + "/" + j + ".html";
                    HtmlDocument doc = web.Load(cnblogs);
                    // Use the page title as the sub-folder name; fall back to the page id.
                    var titles = doc.DocumentNode.SelectNodes("//title");
                    var titleName = j.ToString();
                    if (titles != null && titles.Count > 0)
                        titleName = titles[0].InnerText;
                    HtmlNode node = doc.GetElementbyId("ks_xp");
                    if (node == null)
                    {
                        continue;
                    }
                    foreach (HtmlNode child in node.SelectNodes("//img"))
                    {
                        if (child.Attributes["src"] == null)
                            continue;

                        string imgurl = child.Attributes["src"].Value;
                        DownLoadImg(imgurl, k + "", CleanInvalidFileName(titleName));
                        Console.WriteLine("Downloading: " + titleName + " " + imgurl);
                    }
                }
            }
            // Tidy up: remove folders that ended up empty.
            CleanEmptyFolders();
        }


        private static void CleanEmptyFolders()
        {
            // Remove every folder under Images that ended up without any files.
            var rootFolders = Environment.CurrentDirectory + "\\Images\\";
            var folders = Directory.GetDirectories(rootFolders, "*.*", SearchOption.AllDirectories);
            foreach (var f in folders)
            {
                if (Directory.GetFiles(f, "*.*", SearchOption.AllDirectories).Length == 0)
                    Directory.Delete(f);
            }
        }


        private static void DownLoadImg(string url, string folderName, string subFolderName)
        {
            // Local file name = last path segment of the image URL, cleaned up.
            var fileName = CleanInvalidFileName(url.Substring(url.LastIndexOf("/") + 1));
            var fileFolder = Environment.CurrentDirectory + "\\Images\\" + folderName + "\\" + subFolderName + "\\";
            if (!Directory.Exists(fileFolder))
                Directory.CreateDirectory(fileFolder);
            fileName = fileFolder + fileName;
            try
            {
                Wc.DownloadFile(url, fileName);
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
        }
    }
}
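One side note on CleanInvalidFileName: the hand-maintained character table above can be replaced by the framework's own list. A minimal sketch, assuming Windows path semantics; this helper is not from the original source:

using System.IO;
using System.Linq;

static class FileNameCleaner
{
    // Same effect as CleanInvalidFileName above, but the character table comes
    // from Path.GetInvalidFileNameChars() instead of being maintained by hand
    // (on Windows it covers the quotes, angle brackets, pipe, control characters,
    // ':', '*', '?', '\\' and '/' listed in the array above).
    public static string Clean(string fileName)
    {
        var invalid = Path.GetInvalidFileNameChars();
        var cleaned = new string((fileName ?? "").Where(c => !invalid.Contains(c)).ToArray());
        return cleaned.Length > 1 && cleaned[0] == '.'
            ? "dot" + cleaned.TrimStart('.')
            : cleaned;
    }
}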

          
Test program and source code download:

/Files/Chinasf/DownloadImages.rar

Regular expressions

The original version:

Namespace: using System.Text.RegularExpressions;

Commonly used classes:

Regex
MatchCollection
Match
Group
GroupCollection

Commonly used methods:

Regex.IsMatch(): returns bool
Regex.Match(): returns Match
Regex.Matches(): returns MatchCollection
Regex.Replace(): returns string
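As a quick aside before the original version's listing, a tiny self-contained demo of those four methods, reusing the img-src pattern from the code below (expected output in the comments):

using System;
using System.Text.RegularExpressions;

class RegexMethodsDemo
{
    static void Main()
    {
        const string html = @"<img src=""a.jpg""><img src=""b.png"">";
        const string pattern = @"<\s?img[^>]+src=""([^""]+)""";

        Console.WriteLine(Regex.IsMatch(html, pattern));               // True
        Console.WriteLine(Regex.Match(html, pattern).Groups[1].Value); // a.jpg
        foreach (Match m in Regex.Matches(html, pattern))              // both matches
            Console.WriteLine(m.Groups[1].Value);                      // a.jpg, b.png
        Console.WriteLine(Regex.Replace(html, pattern, "<img"));       // <img><img>
    }
}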

Scraping images with a regular expression (referenced namespaces: using System.Net; using System.IO; plus System.Collections.Generic for the List):

The approach: (1) fetch the page's full HTML from the web; (2) use a regular expression to match out the concrete addresses of the images you want; (3) download them.

          static void Main(string[] args)
          {
              var wc = new WebClient();
              string html = wc.DownloadString(@"page-address");

              // Match every <img> tag and capture its src attribute. A page can
              // contain many images, so collect the matches in a List.
              MatchCollection mc = Regex.Matches(html, @"<\s?img[^>]+src=""([^""]+)""");
              var pic = new List<string>();
              foreach (Match m in mc) // walk the matches
              {
                  if (m.Success) // keep every string that actually matched
                  {
                      // Group 1 holds the src value, i.e. the image name.
                      pic.Add(m.Groups[1].Value.Trim());
                  }
              }

              string url = @"page-address";
              for (int i = 0; i < pic.Count; i++)
              {
                  // Prepend the page address so each entry becomes a complete image URL.
                  pic[i] = url + "/" + pic[i];
              }

              string address = "target download folder";
              // Create the target folder first if it does not exist yet.
              if (!Directory.Exists(address))
              {
                  Directory.CreateDirectory(address);
              }
              for (int i = 0; i < pic.Count; i++)
              {
                  // Take everything after the last '/' so the local file keeps the
                  // same name it has on the server (Path.GetFileName would also do).
                  string name = Regex.Match(pic[i], @".*/(.+)").Groups[1].Value;
                  wc.DownloadFile(pic[i], Path.Combine(address, name)); // download finished
              }
              Console.ReadKey();
          }

