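Some Notes from Learning Web Scraping

Introduction

For the past month I have been learning the Scrapy crawling framework. The first site I crawled was 伯乐在线 (Jobbole). While crawling it I ran into an error that was hard to understand at first. After digging into it, I worked out what caused it and fixed it. I am writing it down here in the hope that it helps other readers.

Error details

My Python IDE is PyCharm, and the error it reported was TypeError: sequence item 0: expected str instance, bytes found. At the time I was writing the scraped article data into a MySQL database. The database connection succeeded and the data was scraped, but nothing was written to the database. Since the whole point was to store the scraped data, this was frustrating.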
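For readers who have not seen this message before, the snippet below is a minimal illustration of how Python 3 produces it: str.join raises it when the sequence contains bytes instead of str. The post does not show where the MySQL driver hits this, so this only illustrates the message and does not reproduce the driver's internals. The helper name join_as_text and the URL are made up for the example.

```python
def join_as_text(parts):
    """Join parts with a str separator; every element must be a str."""
    return ",".join(parts)


try:
    # A bytes element inside the sequence triggers the TypeError.
    join_as_text([b"http://example.com/cover.jpg"])
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = ""

print(error_message)
```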

Error analysis

First, here is part of my Scrapy spider code:

        article_item["article_url_id"] = get_md5(response.url)
        article_item["article_title"] = article_title
        # This part may be problematic (time parsing)
        try:
            # strptime parses a string against a format; the datetime constructor does not
            article_time = datetime.datetime.strptime(article_time, "%Y/%m/%d").date()
        except Exception:
            # Fall back to today's date if the text cannot be parsed
            article_time = datetime.datetime.now().date()
        article_item["article_time"] = article_time
        article_item["article_content"] = article_content
        article_item["article_url"] = response.url
        article_item["front_image_url"] = [front_image_url]
        article_item["fav_num"] = fav_num
        article_item["raise_num"] = raise_num
        article_item["comment_num"] = comment_num
        article_item["article_tags"] = article_tags

The line article_item["front_image_url"] = [front_image_url] shows that front_image_url is wrapped in a list, so the value stored in the item is a list, not a string.
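The list is most likely there for image downloading: Scrapy's built-in ImagesPipeline iterates over the URLs in the configured item field. The post does not show settings.py, so the following is only a sketch that assumes this pipeline is in use. The pipeline priority and the images directory are made-up values.

```python
# settings.py (sketch; the values below are assumptions, not taken from the post)
import os

ITEM_PIPELINES = {
    # Scrapy's built-in image downloader
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Item field that holds the image URLs. ImagesPipeline iterates over it,
# which is why the spider stores [front_image_url] rather than a bare string.
IMAGES_URLS_FIELD = "front_image_url"

# Directory where downloaded images are saved (hypothetical location).
IMAGES_STORE = os.path.join(os.path.abspath("."), "images")
```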

My code for writing to the database is:

import MySQLdb


class MysqlPipeline(object):
    # Synchronous MySQL writes
    def __init__(self):
        self.conn = MySQLdb.connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            passwd='123456',
            db='article_spider',
            charset="utf8",
            use_unicode=True
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """
            insert into article(article_title, article_time, article_content, article_url,
                article_url_id, front_image_url, front_image_path, fav_num, raise_num, comment_num, article_tags)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (
            item["article_title"], item["article_time"], item["article_content"],
            item["article_url"], item["article_url_id"], item["front_image_url"], item["front_image_path"],
            item["fav_num"], item["raise_num"], item["comment_num"], item["article_tags"]))
        self.conn.commit()
        return item  # pass the item on to any later pipelines

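However, the insert code passes item["front_image_url"] where a single string value is expected. The item holds a list while the column takes a string, and this mismatch made the insert fail. With the cause found, the fix can begin.

Fix

Based on the analysis above, there are two possible fixes. The first is to add another item field that holds the bare string, for example article_item["front_image_url_B"] = front_image_url. That adds extra code, and the list-valued field cannot simply be dropped, because without it the preview images of 伯乐在线 articles cannot be downloaded. So I think the best fix is to change item["front_image_url"] to item["front_image_url"][0] in the database code. That solved the problem.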
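The second fix can be sketched as a small helper that builds the parameter tuple and unwraps the list in one place. This is only an illustration: the function name build_insert_params, the empty-string fallbacks, and the sample values are assumptions, and a plain dict stands in for the Scrapy item.

```python
def build_insert_params(item):
    """Build the parameter tuple for the INSERT statement.

    front_image_url is stored as a list for image downloading,
    so take its first element to get the plain string for MySQL.
    """
    image_urls = item.get("front_image_url") or []
    front_image_url = image_urls[0] if image_urls else ""
    return (
        item["article_title"], item["article_time"], item["article_content"],
        item["article_url"], item["article_url_id"],
        front_image_url,
        item.get("front_image_path", ""),
        item["fav_num"], item["raise_num"], item["comment_num"],
        item["article_tags"],
    )


# A plain dict standing in for the Scrapy item (made-up values).
sample_item = {
    "article_title": "Example title",
    "article_time": "2017-05-01",
    "article_content": "Example content",
    "article_url": "http://example.com/post/1",
    "article_url_id": "0123456789abcdef",
    "front_image_url": ["http://example.com/cover.jpg"],
    "front_image_path": "full/cover.jpg",
    "fav_num": 1,
    "raise_num": 2,
    "comment_num": 3,
    "article_tags": "python,scrapy",
}
params = build_insert_params(sample_item)
```

In process_item this would be called as self.cursor.execute(insert_sql, build_insert_params(item)).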

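Addendum

Although I marked the article-time code above as possibly problematic, I still do not really know how to handle it. The problem is shown in the figure below.

The figure above shows that the write to the database succeeded. The article time is still wrong, though, and needs to be fixed.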
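One thing worth checking is how the time string is parsed. datetime.datetime.strptime parses text against a format, whereas the datetime.datetime constructor takes integer year, month, and day. Below is a minimal sketch, assuming the scraped text looks like 2017/05/01, possibly with surrounding whitespace. The post does not show the actual scraped text, and the function name parse_article_date is hypothetical.

```python
import datetime


def parse_article_date(raw_text, date_format="%Y/%m/%d"):
    """Parse a date string; fall back to today's date if parsing fails."""
    try:
        # strip() removes surrounding whitespace that would break parsing
        return datetime.datetime.strptime(raw_text.strip(), date_format).date()
    except (AttributeError, ValueError):
        # AttributeError: raw_text is not a string (for example None)
        # ValueError: the text does not match date_format
        return datetime.date.today()
```

Catching only these two exceptions, rather than every Exception, keeps unrelated bugs from being silently replaced by today's date.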

Leon
