How to Try Out Large-Parameter AI Models on a Free Cloud Platform

in STEEM CN/中文 · 8 months ago

Yesterday I mentioned that open-source large language models are improving at a remarkable pace; models claiming to match the early GPT-3.5 are everywhere. These open-source models let everyone deploy an AI of their own instead of depending on big companies like OpenAI, Microsoft, and Google. One of the most important uses of AI is summarizing documents and analyzing data, and that material often involves personal privacy or trade secrets, so sending it through an API run by a big company carries a real risk of privacy leaks and data breaches. Besides, a technology as powerful as AI is a genuine threat precisely when it is monopolized by a handful of companies. Only if everyone can afford to use AI, and actually gets to use it, can we raise the productivity of society as a whole and build a better future.

Open-source AI models make it possible for everyone to have their own AI. Among them, the most popular are the 7B models, meaning 7 billion parameters, which you can loosely think of as 7 billion neurons. Models of this size take up little storage and are cheap to run. After 4-bit quantization they lose very little quality, yet they can run on an ordinary desktop PC with 4 to 8 GB of VRAM, and they respond quickly; the experience can feel even better than calling OpenAI's API. Today's 7B models are already quite capable and give satisfying answers to many everyday questions, but they still fall short on complex problems, simply because the parameter count is a bit too small. Biologists estimate that the human brain has on the order of a hundred billion neurons, and they speculate that when humans learned to use fire and eat cooked food, fewer resources went to the digestive system and the surplus went into a bigger brain, which is one of the key reasons humans rose to the top of the evolutionary tree.
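To make the claim about quantized 7B models on consumer hardware concrete, here is a minimal sketch of loading such a model in 4-bit with Hugging Face transformers and bitsandbytes. It is only an illustration, not the workflow used in the notebook below (which uses pre-quantized GPTQ weights with exllama); the model id is just an example, and transformers, accelerate, and bitsandbytes are assumed to be installed.

# Minimal sketch: run a ~7B chat model in 4-bit on a single consumer GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "01-ai/Yi-6B-Chat"  # example id; any similar-sized causal LM works the same way

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits at load time
    bnb_4bit_compute_dtype=torch.float16,   # do the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # place the layers on the available GPU
)

inputs = tokenizer("用一句话解释什么是4位量化。", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))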

The same logic applies to AI models. The parameter count of GPT-4, the most capable model today, is still a trade secret, but leaked estimates describe it as a mixture-of-experts architecture built from several expert models, each with a parameter count in the hundreds of billions, which would explain its astonishing abilities. There are plenty of large open-source models too, most of them around 70 billion parameters. That is short of a hundred billion, but it is already within an order of magnitude of the number of neurons in a human brain. And there is at least one open-source model that goes beyond the hundred-billion mark: Falcon 180B, trained by the Technology Innovation Institute in Abu Dhabi, with 180 billion parameters, which is also said to perform very well.

Researchers have also noticed what they call emergence in these large language models: as the number of parameters, or in plainer terms the number of cyber-neurons, grows past a certain point, some of the model's abilities jump sharply. Other studies argue that there is no real emergence and that the effect is an artifact of how the benchmarks were measured, but either way one thing holds: the larger the model, the better it handles complex problems.

The large open-source models mentioned above are far beyond what an ordinary computer can handle, so deploying them personally is very difficult. Between 7B and 70B, however, there are two intermediate sizes to choose from: 13B and 34B. A 13B model is only modestly larger than a 7B one and can run smoothly on a somewhat better-equipped PC, but the gain in capability is also modest.

A 34B model is a much better choice. It can run on some high-end desktop machines, but even without one, anyone with internet access has a way to try a model of this size. Not long ago I found that the cloud compute platform Kaggle offers 30 hours of free GPU time per week, and unlike Google's well-known free Colab platform, each Kaggle instance comes with two NVIDIA T4 GPUs instead of one. Each T4 has roughly 15 GB of usable VRAM (16 GB nominally), so the pair gives you close to 30 GB, which is enough to run a 34B model. All you need is a computer and a Kaggle account.
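As a quick sanity check before loading anything, a couple of lines of PyTorch (not part of the original notebook) will confirm how many GPUs a Kaggle session exposes and how much memory each one has:

# Check the GPUs visible in the session.
import torch

print("CUDA devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")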

I spent a couple of days of spare time tinkering and wrote a notebook that uses the exllama program to run various 34B-scale language models on Kaggle. The one that works best is the Yi model I mentioned before, a Chinese-made model that in various benchmarks matches, and on some metrics even exceeds, GPT-3.5, and in practice it really does perform well. Kaggle's free GPUs do come with restrictions, though: you cannot run the web UIs commonly used with open-source models on the platform and expose them at a public address, and as soon as that is detected the program is stopped. So this notebook simply passes a prompt to the language model and prints the answer: there is no chat mode and no streaming output, which means the wait can be long when the reply is long (a possible streaming loop is sketched after the notebook listing below). I will keep improving it when I have time.


Loading a 34B model with exllama
%cd /kaggle/working/
!apt-get -y install -qq aria2

# text-generation-webui is used here mainly for its download-model.py script and models/ folder
!git clone -b v2.5 https://github.com/camenduru/text-generation-webui
%cd /kaggle/working/text-generation-webui
#!pip install -q -r requirements.txt

# exllama does the actual loading and inference
!git clone https://github.com/turboderp/exllama
%cd exllama
!pip install ninja==1.11.1
!pip install torch==2.0.1

# 4-bit GPTQ build of the Yi-34B chat fine-tune from OrionStar
modelName="TheBloke/OrionStar-Yi-34B-Chat-Llama-GPTQ"
%cd /kaggle/working/text-generation-webui
!python download-model.py {modelName}
# download-model.py stores the weights under models/<org>_<model>
model_name = modelName.replace('/','-')
print(model_name)
!echo "dark_theme: true" > /kaggle/working/settings.yaml
!echo "chat_style: wpp" >> /kaggle/working/settings.yaml
test_str="/kaggle/working/text-generation-webui/models/"+model_name.replace('-', '_', 1)
print(test_str)
import sys
sys.path.append('/kaggle/working/text-generation-webui/exllama/')
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import os, glob

# Directory containing model, tokenizer, generator

model_directory =  "/kaggle/working/text-generation-webui/models/"+model_name.replace('-', '_', 1)

# Locate files we need within that directory

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)

# Create config, model, tokenizer and generator

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file
config.auto_map = [8,10]                                # split layers across the two T4s (approx. GB of VRAM per GPU)
model = ExLlama(config)                                 # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)            # create tokenizer from tokenizer model file

cache = ExLlamaCache(model)                             # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)   # create generator

# Configure generator

generator.disallow_tokens([tokenizer.eos_token_id])     # ban EOS so generation always runs to max_new_tokens

generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.95
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5

# Produce a simple generation

prompt = """<|im_start|>system
你是一位知识渊博的智能助手<|im_end|>
<|im_start|>user
用通俗的语言介绍矢量微积分中梯度、散度、旋度的概念<|im_end|>
<|im_start|>assistant"""
print (prompt, end = "")

output = generator.generate_simple(prompt, max_new_tokens = 1000)

print(output[len(prompt):])

prompt = """<human>: 
Please Complete the given function below according to the docstring: 
请用python写一个数学练习小游戏,出20道不同的数学题,练习乘加运算,如“3X3+4=”。参与运算的数都是小于等于10随机数,当使用者完成答题后给出分数,每题5分,满分100分。
<bot>: """
print (prompt, end = "")

output = generator.generate_simple(prompt, max_new_tokens = 1000)

print(output[len(prompt):])
prompt = """<human>: 
Please Complete the given function below according to the docstring: 
请详细介绍一下python多线程threading的使用方法,尽量通俗并使用简单的例子
<bot>: """
print (prompt, end = "")

output = generator.generate_simple(prompt, max_new_tokens = 3000)

print(output[len(prompt):])
# Produce a simple generation

prompt = """<|im_start|>system
你是一位知识渊博的智能助手<|im_end|>
<|im_start|>user
请介绍一下哈耶克对凯恩斯观点的批判<|im_end|>
<|im_start|>assistant"""
print (prompt, end = "")

output = generator.generate_simple(prompt, max_new_tokens = 1000)

print(output[len(prompt):])
# Produce a simple generation

prompt = """<|im_start|>system
你是一位知识渊博的智能助手<|im_end|>
<|im_start|>user
请详细介绍一下相对论的光速不变性原理的理论和逻辑基础<|im_end|>
<|im_start|>assistant"""
print (prompt, end = "")

output = generator.generate_simple(prompt, max_new_tokens = 1000)

print(output[len(prompt):])
# Produce a simple generation

prompt = """<|im_start|>system
你是一位知识渊博的智能助手<|im_end|>
<|im_start|>user
为什么狭义相对论和薛定谔方程结合起来会推导出反物质<|im_end|>
<|im_start|>assistant"""
print (prompt, end = "")

output = generator.generate_simple(prompt, max_new_tokens = 1000)

print(output[len(prompt):])
"TheBloke/CapyTessBorosYi-34B-200K-DARE-Ties-GPTQ"
"TheBloke/Ziya-Coding-34B-v1.0-GPTQ"



Upvoted! Thank you for supporting witness @jswit.

You're talking tech here? Are you a monkey?

If I already understood this, I wouldn't have come to read it, haha

Thanks for sharing,

Looks impressive, but I honestly didn't understand it

Very few free services are actually any good

Hahaha, copy-pasted?
