Python Crawler 4 mins

Writing a Music Crawler from Scratch - 1 - NetEase Cloud Music: Batch-Fetching Download Links for Playlist Songs

22-02-20 / 1768 Words

This article walks through, in detail, how to build a NetEase Cloud Music crawler from scratch. It can download NetEase Cloud Music songs, including paid ones, and is written in Python 3.

Note: this article starts from the front end and works its way to the song data. If you would rather skip the detours, see this article: https://zhuanlan.zhihu.com/p/30246788 and this one: https://www.shangyexinzhi.com/article/details/id-297404

Environment used in this article:

  • PyCharm + Python 3.7.5

Required modules:

  1. fake-useragent
  2. requests
  3. beautifulsoup4 (only used in the first version below)
  4. re (part of the standard library, no installation needed)

Installation:

pip install beautifulsoup4 fake-useragent requests

Crawler workflow:

graph TD; InitializeProgram-->GetLinks; GetLinks-->IterateLinks; IterateLinks-->GetRealDownloadLinks; GetRealDownloadLinks-->DownloadMusic;

Doesn't look too hard, right? (^o^)/~

After poking around in the page source, it turns out that every song in a NetEase playlist page is listed as a plain anchor tag.
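One representative entry, trimmed down to the anchor itself (the ID and title belong to a song from the test playlist used below), looks like this:

<a href="/song?id=39227624">Undertale</a>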

Alright. Let's create a new file, CloudMusicSpider.py, and start coding:

#!/usr/bin/env python
#-*- coding:utf8 -*-
import requests #HTTP requests
from bs4 import BeautifulSoup #HTML parsing
from fake_useragent import UserAgent #random User-Agent strings

def getArtUrl(artlist_url): #resolve the real music IDs from a playlist URL
    outerUrl = "http://music.163.com/song/media/outer/url?id={0:s}.mp3" #outer-link template that resolves to the playable file
    HEADER = { #forge the request headers to avoid being blocked
        'Accept': '*/*',
        'Accept-Encoding': 'gzip,deflate,sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8,gl;q=0.6,zh-TW;q=0.4',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Host': 'music.163.com',
        'Referer': 'https://music.163.com/search/',
        'User-Agent': UserAgent().random
    }
    response = requests.get(url=artlist_url.replace('#/', ''), headers=HEADER) #the web player loads the playlist asynchronously, so strip the "#/" to get the server-rendered HTML
    soup = BeautifulSoup(response.text, "html.parser") #parse with bs4
    aList = soup.select("a") #grab every <a> tag
    Songs = {} #dict of the form {song name: "outer link"}
    for music in aList:
        if music.has_attr("href"): #does it have an href attribute?
            if str(music.attrs["href"]).startswith("/song?id="): #is it a song link?
                song_id = str(music.attrs["href"]).replace("/song?id=", "") #extract the real ID
                try:
                    Songs[music.text] = outerUrl.format(song_id) #store it in the dict
                except Exception:
                    pass
    return Songs #return the dict

Test:

print(getArtUrl('https://music.163.com/#/playlist?id=4874770876'))

The result should look something like this:

{'Undertale': 'http://music.163.com/song/media/outer/url?id=39227624.mp3', 'sans.': 'http://music.163.com/song/media/outer/url?id=39224615.mp3', 'Tem Shop': 'http://music.163.com/song/media/outer/url?id=39224629.mp3', 'Undertale - Snowdin Town (****** Bootleg)': 'http://music.163.com/song/media/outer/url?id=463498197.mp3', "It's Not Like I Like You!!(Instrumental)": 'http://music.163.com/song/media/outer/url?id=1385676402.mp3', '似顔絵広場 (似顔絵チャンネル)': 'http://music.163.com/song/media/outer/url?id=1378269172.mp3', 'ステージ:コミカル (サンドキャニオン)': 'http://music.163.com/song/media/outer/url?id=28546263.mp3', 'I Love You': 'http://music.163.com/song/media/outer/url?id=31460297.mp3', 'Ice Cream': 'http://music.163.com/song/media/outer/url?id=35407675.mp3', 'Dying': 'http://music.163.com/song/media/outer/url?id=506092100.mp3', '${song.name|mark}-${listArtists(song.artists)}': 'http://music.163.com/song/media/outer/url?id=${song.id}.mp3', '${soil(x.name)}': 'http://music.163.com/song/media/outer/url?id=${x.id}.mp3', '{if x.album}{/if}': 'http://music.163.com/song/media/outer/url?id=${x.id}.mp3', '${x.name}': 'http://music.163.com/song/media/outer/url?id=${x.id}.mp3'}

Why are there '${x.id}' and '${song.id}' entries?? I'm not entirely sure, but they look like unrendered client-side template placeholders shipped inside the page; in any case, filtering them out is good enough.
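Judging from the keys in that output, the raw page source must contain unrendered template anchors roughly like this (reconstructed from the parsed results above, not copied verbatim from the page):

<a href="/song?id=${x.id}">${x.name}</a>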

def getArtUrl(artlist_url): #resolve the real music IDs from a playlist URL
    outerUrl = "http://music.163.com/song/media/outer/url?id={0:s}.mp3" #outer-link template
    HEADER = { #forge the request headers to avoid being blocked
        'Accept': '*/*',
        'Accept-Encoding': 'gzip,deflate,sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8,gl;q=0.6,zh-TW;q=0.4',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Host': 'music.163.com',
        'Referer': 'https://music.163.com/search/',
        'User-Agent': UserAgent().random
    }
    response = requests.get(url=artlist_url.replace('#/', ''), headers=HEADER) #the playlist is loaded asynchronously inside an iframe, so strip the "#/"
    soup = BeautifulSoup(response.text, "html.parser") #parse with bs4
    aList = soup.select("a") #grab every <a> tag
    Songs = {} #dict of the form {song name: "outer link"}
    for music in aList:
        if music.has_attr("href"): #does it have an href attribute?
            if str(music.attrs["href"]).startswith("/song?id="): #is it a song link?
                song_id = str(music.attrs["href"]).replace("/song?id=", "") #extract the real ID
                try:
                    if '${x.id}' not in song_id and '${song.id}' not in song_id: #filter out the template placeholders
                        Songs[music.text] = outerUrl.format(song_id) #store it in the dict
                except Exception:
                    pass
    return Songs #return the dict

Result:

{'Undertale': 'http://music.163.com/song/media/outer/url?id=39227624.mp3', 'sans.': 'http://music.163.com/song/media/outer/url?id=39224615.mp3', 'Tem Shop': 'http://music.163.com/song/media/outer/url?id=39224629.mp3', 'Undertale - Snowdin Town (****** Bootleg)': 'http://music.163.com/song/media/outer/url?id=463498197.mp3', "It's Not Like I Like You!!(Instrumental)": 'http://music.163.com/song/media/outer/url?id=1385676402.mp3', '似顔絵広場 (似顔絵チャンネル)': 'http://music.163.com/song/media/outer/url?id=1378269172.mp3', 'ステージ:コミカル (サンドキャニオン)': 'http://music.163.com/song/media/outer/url?id=28546263.mp3', 'I Love You': 'http://music.163.com/song/media/outer/url?id=31460297.mp3', 'Ice Cream': 'http://music.163.com/song/media/outer/url?id=35407675.mp3', 'Dying': 'http://music.163.com/song/media/outer/url?id=506092100.mp3'}

Much better. Still, this approach does not feel quite right: we have no idea how many other placeholder entries like these might turn up. Time to bring out my regular-expression kung fu! (It turns out we no longer need BeautifulSoup, either.)

Imports:

import requests, re #HTTP requests and regular expressions
from fake_useragent import UserAgent #random User-Agent strings

The regular expression:

<a href="/song\?id=(\d+)">(.*?)</a>

\?: the question mark has to be escaped, otherwise it is treated as a regex quantifier.

(\d+): a capture group of one or more digits (the song ID).

(.*?): a non-greedy match of any text (the song title).

With this, every match comes back as a tuple of the form (song ID, song title), as the quick check below shows.
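To sanity-check the pattern before wiring it into the crawler, we can run it against a hand-written snippet (the ID and title are taken from the playlist output above):

import re

sample = '<li><a href="/song?id=39227624">Undertale</a></li>'
print(re.findall(r'<a href="/song\?id=(\d+)">(.*?)</a>', sample))
# [('39227624', 'Undertale')]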

So our final code looks like this:

#!/usr/bin/env python
#-*- coding:utf8 -*-
'''
@Author  :   Ray
@Contact :   ray@raycoder.me
@Software:   PyCharm
@File    :   init.py
@Time    :   2020/2/23 21:19
'''
import requests, re
from fake_useragent import UserAgent

def getArtUrl(artlist_url): #resolve the real music IDs from a playlist URL
    outerUrl = "http://music.163.com/song/media/outer/url?id={0:s}.mp3" #outer-link template
    HEADER = { #forge the request headers to avoid being blocked
        'Accept': '*/*',
        'Accept-Encoding': 'gzip,deflate,sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8,gl;q=0.6,zh-TW;q=0.4',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Host': 'music.163.com',
        'Referer': 'https://music.163.com/search/',
        'User-Agent': UserAgent().random
    }
    response = requests.get(url=artlist_url.replace('#/', ''), headers=HEADER) #the playlist is loaded asynchronously inside an iframe, so strip the "#/"
    songs = {} #dict of song names and outer links
    for i in re.findall(r'<a href="/song\?id=(\d+)">(.*?)</a>', response.text): #iterate over every match
        songs[i[1]] = outerUrl.format(i[0]) #store it in the dict
    return songs

Test:

print(getArtUrl('https://music.163.com/#/playlist?id=4874770876'))

It returns:

{'Prelude': 'http://music.163.com/song/media/outer/url?id=441612739.mp3', 'Marble Soda [Explicit]': 'http://music.163.com/song/media/outer/url?id=400876011.mp3', 'Snowdin Town': 'http://music.163.com/song/media/outer/url?id=39227596.mp3', 'Monkeys Spinning Monkeys': 'http://music.163.com/song/media/outer/url?id=403711640.mp3', 'Overworld Theme (GFM Trap Remix)': 'http://music.163.com/song/media/outer/url?id=467590509.mp3', 'Pizza': 'http://music.163.com/song/media/outer/url?id=1302090422.mp3', 'King Dedede': 'http://music.163.com/song/media/outer/url?id=27518579.mp3', 'Undertale': 'http://music.163.com/song/media/outer/url?id=39227624.mp3', 'sans.': 'http://music.163.com/song/media/outer/url?id=39224615.mp3', 'Tem Shop': 'http://music.163.com/song/media/outer/url?id=39224629.mp3', 'Undertale - Snowdin Town (****** Bootleg)': 'http://music.163.com/song/media/outer/url?id=463498197.mp3', "It's Not Like I Like You!!(Instrumental)": 'http://music.163.com/song/media/outer/url?id=1385676402.mp3', '似顔絵広場 (似顔絵チャンネル)': 'http://music.163.com/song/media/outer/url?id=1378269172.mp3', 'ステージ:コミカル (サンドキャニオン)': 'http://music.163.com/song/media/outer/url?id=28546263.mp3', 'I Love You': 'http://music.163.com/song/media/outer/url?id=31460297.mp3', 'Ice Cream': 'http://music.163.com/song/media/outer/url?id=35407675.mp3', 'Dying': 'http://music.163.com/song/media/outer/url?id=506092100.mp3'}

It's best to wrap this up in a class:

#!/usr/bin/env python
#-*- coding:utf8 -*-
'''
@Author  :   Ray
@Contact :   ray@raycoder.me
@Software:   PyCharm
@File    :   init.py
@Time    :   2020/2/23 21:19
'''
import requests, re
from fake_useragent import UserAgent

class CloudMusicSpider():
    def getArtUrl(self, artlist_url): #resolve the real music IDs from a playlist URL
        outerUrl = "http://music.163.com/song/media/outer/url?id={0:s}.mp3" #outer-link template
        HEADER = { #forge the request headers to avoid being blocked
            'Accept': '*/*',
            'Accept-Encoding': 'gzip,deflate,sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8,gl;q=0.6,zh-TW;q=0.4',
            'Connection': 'keep-alive',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Host': 'music.163.com',
            'Referer': 'https://music.163.com/',
            'User-Agent': UserAgent().random
        }
        response = requests.get(url=artlist_url.replace('#/', ''), headers=HEADER) #the playlist is loaded asynchronously inside an iframe, so strip the "#/"
        songs = {} #dict of song names and outer links
        for i in re.findall(r'<a href="/song\?id=(\d+)">(.*?)</a>', response.text): #iterate over every match
            songs[i[1]] = outerUrl.format(i[0]) #store it in the dict
        return songs

Test:

from CloudMusicSpider import CloudMusicSpider
spider = CloudMusicSpider()
print(spider.getArtUrl('https://music.163.com/#/playlist?id=4874770876'))

Good. We'll keep this packaged as a module for later use.
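As a small preview of the final download step in the workflow (the next article covers it properly), a minimal sketch that streams one of the returned links to disk with requests might look like this; filename sanitizing and error handling are left out:

import requests
from CloudMusicSpider import CloudMusicSpider

spider = CloudMusicSpider()
songs = spider.getArtUrl('https://music.163.com/#/playlist?id=4874770876')

# Grab the first song in the dict; the outer link should redirect to the playable file.
name, link = next(iter(songs.items()))
with requests.get(link, stream=True) as resp:
    with open(name + '.mp3', 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)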

In the next article I'll cover how to download the files with the crawler. I'll also write a separate article about regular expressions later on.