Python Multiprocessing with a Distributed Cluster

Question

I am looking for a python package that can do multiprocessing not just across different cores within a single computer, but also with a cluster distributed across multiple machines. There are a lot of different python packages for distributed computing, but most seem to require a change in code to run (for example a prefix indicating that the object is on a remote machine). Specifically, I would like something as close as possible to the multiprocessing pool.map function. So, for example, if on a single machine the script is:

from multiprocessing import Pool
pool = Pool(processes = 8)
resultlist = pool.map(function, arglist)
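For concreteness, here is a runnable version of the snippet above, with a hypothetical `square` function standing in for `function`:

```python
from multiprocessing import Pool

# `square` is a stand-in for whatever function you want to parallelize.
def square(n):
    return n * n

if __name__ == "__main__":
    # Distribute the work across 8 worker processes on this machine.
    with Pool(processes=8) as pool:
        resultlist = pool.map(square, range(10))
    print(resultlist)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```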

Then the pseudocode for a distributed cluster would be:

from distprocess import Connect, Pool, Cluster

pool1 = Pool(processes = 8)
c = Connect(ipaddress)
pool2 = c.Pool(processes = 4)
cluster = Cluster([pool1, pool2])
resultlist = cluster.map(function, arglist)

Answer

If you want a very easy solution, there isn't one.

However, there is a solution that has the multiprocessing interface -- pathos -- which can establish connections to remote servers through a parallel map, and do multiprocessing.

If you want an ssh-tunneled connection, you can do that… or if you are ok with a less secure method, you can do that too.

>>> # establish a ssh tunnel
>>> from pathos.core import connect
>>> tunnel = connect('remote.computer.com', port=1234)
>>> tunnel       
Tunnel('-q -N -L55774:remote.computer.com:1234 remote.computer.com')
>>> tunnel._lport
55774
>>> tunnel._rport
1234
>>> 
>>> # define some function to run in parallel
>>> def sleepy_squared(x):
...   from time import sleep
...   sleep(1.0)
...   return x**2
... 
>>> # inputs for the parallel map
>>> x = range(10)
>>> 
>>> # build a pool of servers and execute the parallel map
>>> from pathos.pp import ParallelPythonPool as Pool
>>> p = Pool(8, servers=('localhost:55774',))
>>> p.servers
('localhost:55774',)
>>> y = p.map(sleepy_squared, x)
>>> y
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Or, instead you could configure for a direct connection (no ssh)

>>> p = Pool(8, servers=('remote.computer.com:5678',))
>>> # use an asynchronous parallel map
>>> res = p.amap(sleepy_squared, x)
>>> res.get()
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

It's all a bit finicky: for the remote server to work, you have to start a server running on remote.computer.com at the specified port beforehand, and you have to make sure that the settings on both your localhost and the remote host will allow either the direct connection or the ssh-tunneled connection. Plus, you need the same version of pathos, and of the pathos fork of pp, running on each host. Also, for ssh, you need ssh-agent running to allow password-less login with ssh.

But then, hopefully it all works… if your function code can be transported over to the remote host with dill.source.importable.
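As a quick sanity check before going distributed, you can ask dill whether your function's code can be extracted for shipping. A minimal sketch, assuming dill is installed and the function's source is accessible to the interpreter:

```python
# Sketch: verify that a function can be serialized for a remote host.
# dill.source.importable returns source code (or an import statement)
# that would recreate the object on the other side; if this raises an
# error, the remote workers likely can't reconstruct the function either.
import dill.source

def sleepy_squared(x):
    return x**2

src = dill.source.importable(sleepy_squared)
print(src)  # source text that recreates sleepy_squared
```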

FYI, pathos is long overdue for a release; basically, there are a few bugs and interface changes that need to be resolved before a new stable release is cut.
