Maximal Clique Mining with Spark GraphX: A Pseudo-Parallel Algorithm

Big Data · Algorithms · Spark
For densely connected graphs, the connected components that come out can be very large, and a serial maximal clique algorithm on such a component still takes a long time. Pruning is used here to cut down the amount of data, but for a genuinely large graph the room for optimization is limited.


#### Background ####

Spark GraphX does not ship a maximal clique mining algorithm.

The maximal clique algorithms available today are serial, based on the Bron–Kerbosch algorithm.
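For reference, here is a minimal Scala sketch of the plain Bron–Kerbosch recursion (no pivoting; the hand-written Java finder at the end of this post uses the pivoted variant). `R` is the clique being grown, `P` the remaining candidates, `X` the vertices already processed; all names here are illustrative, not part of the implementation below:

```scala
// Plain Bron–Kerbosch: report R as a maximal clique when no candidates or
// excluded vertices remain; otherwise branch on each candidate vertex.
def bronKerbosch(r: Set[Int], p: Set[Int], x: Set[Int],
                 neighbors: Map[Int, Set[Int]],
                 report: Set[Int] => Unit): Unit = {
  if (p.isEmpty && x.isEmpty) report(r)
  else {
    var (pp, xx) = (p, x)
    for (v <- p) {
      bronKerbosch(r + v, pp intersect neighbors(v), xx intersect neighbors(v), neighbors, report)
      pp -= v          // v has been fully explored as an extension of R
      xx += v
    }
  }
}
```

The recursion explores one branch at a time, which is why the algorithm is inherently serial.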

#### Approach ####

Spark GraphX does provide a connected components algorithm. Connected components and maximal cliques are both concepts on undirected graphs, and every maximal clique lies entirely within a single connected component.

So the idea is: use Spark GraphX to find the connected components, then run a serial maximal clique algorithm inside each component to extract its maximal cliques (hence "pseudo-parallel"); a minimal sketch follows.
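A sketch of that pipeline, assuming the edges are already loaded as an `RDD[(Long, Long)]` and `serialMaxCliques` is a placeholder for any serial clique finder (the full driver below fills in the details):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Sketch only: gather each connected component's edge list into one task,
// then run a serial maximal clique finder (placeholder) on each component.
def mineCliques(edges: RDD[(Long, Long)],
                serialMaxCliques: List[(Long, Long)] => List[Set[Long]]): RDD[Set[Long]] = {
  val graph = Graph.fromEdges(edges.map { case (s, d) => Edge(s, d, 0) }, 0)
  val cc = graph.connectedComponents().vertices            // (vertexId, componentId)
  cc.join(edges)                                           // (src, (componentId, dst))
    .map { case (src, (component, dst)) => (component, (src, dst)) }
    .groupByKey()                                          // all edges of one component together
    .flatMap { case (_, es) => serialMaxCliques(es.toList) }
}
```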

For densely connected graphs the components found this way can be huge, and the serial maximal clique algorithm still takes very long on them. Pruning is therefore used to shrink the data first: a vertex can only appear in a clique of `count` vertices if it has at least `count - 1` neighbors, so low-degree vertices and their edges are removed iteratively. For a truly large graph, though, the room for optimization is limited.
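The pruning rule itself is easy to show outside Spark. A minimal standalone Scala sketch, assuming the edge set fits in memory (the Spark driver below does the same thing with `reduceByKey` plus a broadcast set of surviving vertices):

```scala
// Drop every vertex whose degree is < count - 1 (it cannot be part of a clique of
// `count` vertices), remove its edges, and repeat until nothing changes.
def prune(edges: Set[(Long, Long)], count: Int): Set[(Long, Long)] = {
  val degree = edges.toSeq
    .flatMap { case (a, b) => Seq(a, b) }
    .groupBy(identity)
    .map { case (v, occurrences) => v -> occurrences.size }
  val keep = degree.filter { case (_, d) => d >= count - 1 }.keySet
  val kept = edges.filter { case (a, b) => keep(a) && keep(b) }
  if (kept == edges) edges else prune(kept, count)
}
```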

A truly parallel maximal clique algorithm is still something to look forward to.

#### Configuration file ####

```properties
graph_data_path=hdfs://localhost/graph_data
out_path=hdfs://localhost/clique
ck_path=hdfs://localhost/checkpoint
# number of pruning iterations
numIter=50
# minimum number of vertices in a maximal clique
count=3
# maximal clique algorithm: 1 = hand-written finder, 2 = jgrapht
algorithm=2
# stop pruning once the vertex count after a round is still >= this percentage of the
# previous round (if 90% of the data survives a round, pruning is no longer paying off)
percent=90
spark.master=local
spark.app.name=graph
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.yarn.executor.memoryOverhead=20480
spark.yarn.driver.memoryOverhead=20480
spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:+UseCompressedOops -XX:+DisableExplicitGC
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+UseCompressedOops -XX:+DisableExplicitGC
spark.driver.maxResultSize=10g
spark.default.parallelism=60
```

When `algorithm=2`, the Bron–Kerbosch implementation from the jgrapht library is used.
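The driver imports `org.jgrapht.alg.BronKerboschCliqueFinder`, which ships in the `jgrapht-core` artifact. A possible sbt dependency line; the version here is an assumption, so pick whichever release still provides that class at this package path (later 1.x releases moved the clique finders):

```scala
// build.sbt (version is an assumption)
libraryDependencies += "org.jgrapht" % "jgrapht-core" % "0.9.2"
```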

#### Sample data ####

{"src":"0","dst":"1"} {"src":"0","dst":"2"} {"src":"0","dst":"3"} {"src":"1","dst":"0"} {"src":"2","dst":"1"} {"src":"3","dst":"5"} {"src":"4","dst":"6"} {"src":"5","dst":"4"} {"src":"6","dst":"5"} {"src":"3","dst":"2"} {"src":"2","dst":"3"} {"src":"6","dst":"4"} {"src":"3","dst":"4"} {"src":"4","dst":"3"} {"src":"2","dst":"6"} {"src":"6","dst":"2"} {"src":"6","dst":"7"} {"src":"7","dst":"6"}

#### Sample graph ####

*(figure: the undirected graph formed by the sample edges above)*

#### Output ####

```
0,1,2
0,2,3
3,4,5
4,5,6
```

#### Code ####

```scala
import java.util
import java.util.Properties

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}
import org.jgrapht.alg.BronKerboschCliqueFinder
import org.jgrapht.graph.{DefaultEdge, SimpleGraph}

import scala.collection.JavaConverters._
import scala.collection.mutable

object ApplicationTitan {
  def main(args: Array[String]) {
    val prop = new Properties()
    prop.load(getClass.getResourceAsStream("/config.properties"))

    val graph_data_path = prop.getProperty("graph_data_path")
    val out_path = prop.getProperty("out_path")
    val ck_path = prop.getProperty("ck_path")
    val count = Integer.parseInt(prop.getProperty("count"))
    val numIter = Integer.parseInt(prop.getProperty("numIter"))
    val algorithm = Integer.parseInt(prop.getProperty("algorithm"))
    val percent = Integer.parseInt(prop.getProperty("percent"))
    val conf = new SparkConf()
    try {
      Runtime.getRuntime.exec("hdfs dfs -rm -r " + out_path)
//      Runtime.getRuntime.exec("cmd.exe /C rd /s /q " + out_path)
    } catch {
      case ex: Exception =>
        ex.printStackTrace(System.out)
    }

    // copy every spark.* property from the config file into the SparkConf
    prop.stringPropertyNames().asScala.foreach(s => {
      if (s.startsWith("spark")) {
        conf.set(s, prop.getProperty(s))
      }
    })
    conf.registerKryoClasses(Array(getClass))
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    sc.setCheckpointDir(ck_path)
    val sqlc = new SQLContext(sc)
    try {
      val e_df = sqlc.read
//        .json(graph_data_path)
        .parquet(graph_data_path)

      // normalize each edge so that the smaller vertex id comes first, then deduplicate
      var e_rdd = e_df
        .mapPartitions(it => {
          it.map({
            case Row(dst: String, src: String) =>
              val src_long = src.toLong
              val dst_long = dst.toLong
              if (src_long < dst_long) (src_long, dst_long) else (dst_long, src_long)
          })
        }).distinct()
      e_rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

      var bc: Broadcast[Set[Long]] = null
      var iter = 0
      var bc_size = 0
      // pruning: keep only vertices with degree >= count - 1, repeat up to numIter times
      while (iter <= numIter) {
        val temp = e_rdd
          .flatMap(x => List((x._1, 1), (x._2, 1)))
          .reduceByKey((x, y) => x + y)
          .filter(x => x._2 >= count - 1)
          .mapPartitions(it => it.map(x => x._1))
        val bc_value = temp.collect().toSet
        bc = sc.broadcast(bc_value)
        e_rdd = e_rdd.filter(x => bc.value.contains(x._1) && bc.value.contains(x._2))
        e_rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
        iter += 1
        if (bc_size != 0 && bc_value.size >= bc_size * percent / 100) {
          println("total iter : " + iter)
          iter = Int.MaxValue
        }
        bc_size = bc_value.size
      }

      // build the graph
      val edge: RDD[Edge[Long]] = e_rdd.mapPartitions(it => it.map(x => Edge(x._1, x._2)))
      val graph = Graph.fromEdges(edge, 0, StorageLevel.MEMORY_AND_DISK_SER, StorageLevel.MEMORY_AND_DISK_SER)

      // connected components
      val cc = graph.connectedComponents().vertices
      cc.persist(StorageLevel.MEMORY_AND_DISK_SER)

      // key each edge by a random digit + component id, aggregate, then strip the digit
      // and aggregate again (two-stage aggregation), so each component's edge list
      // ends up in one record; run the serial clique finder per component
      cc.join(e_rdd)
        .mapPartitions(it => it.map(x => ((math.random * 10).toInt.toString.concat(x._2._1.toString), (x._1, x._2._2))))
        .aggregateByKey(List[(Long, Long)]())((list, v) => list :+ v, (list1, list2) => list1 ::: list2)
        .mapPartitions(it => it.map(x => (x._1.substring(1), x._2)))
        .aggregateByKey(List[(Long, Long)]())((list1, list2) => list1 ::: list2, (list3, list4) => list3 ::: list4)
        .filter(x => x._2.size >= count - 1)
        .flatMap(x => {
          if (algorithm == 1)
            find(x, count)
          else
            find2(x, count)
        })
        .mapPartitions(it => {
          it.map(set => {
            var temp = ""
            set.asScala.foreach(x => temp += x + ",")
            temp.substring(0, temp.length - 1)
          })
        })
//        .coalesce(1)
        .saveAsTextFile(out_path)

    } catch {
      case ex: Exception =>
        ex.printStackTrace(System.out)
    }
    sc.stop()
  }

  // maximal clique algorithm implemented by hand (CliqueFinder below)
  def find(x: (String, List[(Long, Long)]), count: Int): mutable.Set[util.Set[String]] = {
    println(x._1 + "|s|" + x._2.size)
    println("BKCliqueFinder---" + x._1 + "---" + System.currentTimeMillis())
    val neighbors = new util.HashMap[String, util.Set[String]]
    val finder = new CliqueFinder(neighbors, count)
    x._2.foreach(r => {
      val v1 = r._1.toString
      val v2 = r._2.toString
      if (neighbors.containsKey(v1)) {
        neighbors.get(v1).add(v2)
      } else {
        val temp = new util.HashSet[String]()
        temp.add(v2)
        neighbors.put(v1, temp)
      }
      if (neighbors.containsKey(v2)) {
        neighbors.get(v2).add(v1)
      } else {
        val temp = new util.HashSet[String]()
        temp.add(v1)
        neighbors.put(v2, temp)
      }
    })
    println("BKCliqueFinder---" + x._1 + "---" + System.currentTimeMillis())
    finder.findMaxCliques().asScala
  }

  // maximal clique algorithm from jgrapht
  def find2(x: (String, List[(Long, Long)]), count: Int): Set[util.Set[String]] = {
    println(x._1 + "|s|" + x._2.size)
    println("BKCliqueFinder---" + x._1 + "---" + System.currentTimeMillis())
    val to_clique = new SimpleGraph[String, DefaultEdge](classOf[DefaultEdge])
    x._2.foreach(r => {
      val v1 = r._1.toString
      val v2 = r._2.toString
      to_clique.addVertex(v1)
      to_clique.addVertex(v2)
      to_clique.addEdge(v1, v2)
    })
    val finder = new BronKerboschCliqueFinder(to_clique)
    val list = finder.getAllMaximalCliques.asScala
    var result = Set[util.Set[String]]()
    list.foreach(x => {
      if (x.size() >= count)
        result = result + x
    })
    println("BKCliqueFinder---" + x._1 + "---" + System.currentTimeMillis())
    result
  }
}
```

The hand-written maximal clique finder (`CliqueFinder.java`), used when `algorithm=1`:

```java
import java.util.*;

/**
 * @author mopspecial@gmail.com
 * @date 2017/7/31
 */
public class CliqueFinder {
    private Map<String, Set<String>> neighbors;
    private Set<String> nodes;
    private Set<Set<String>> maxCliques = new HashSet<>();
    private Integer minSize;

    public CliqueFinder(Map<String, Set<String>> neighbors, Integer minSize) {
        this.neighbors = neighbors;
        this.nodes = neighbors.keySet();
        this.minSize = minSize;
    }

    private void bk3(Set<String> clique, List<String> candidates, List<String> excluded) {
        if (candidates.isEmpty() && excluded.isEmpty()) {
            if (!clique.isEmpty() && clique.size() >= minSize) {
                maxCliques.add(clique);
            }
            return;
        }

        for (String s : degeneracy_order(candidates)) {
            List<String> new_candidates = new ArrayList<>(candidates);
            new_candidates.retainAll(neighbors.get(s));

            List<String> new_excluded = new ArrayList<>(excluded);
            new_excluded.retainAll(neighbors.get(s));
            Set<String> nextClique = new HashSet<>(clique);
            nextClique.add(s);
            bk2(nextClique, new_candidates, new_excluded);
            candidates.remove(s);
            excluded.add(s);
        }
    }

    private void bk2(Set<String> clique, List<String> candidates, List<String> excluded) {
        if (candidates.isEmpty() && excluded.isEmpty()) {
            if (!clique.isEmpty() && clique.size() >= minSize) {
                maxCliques.add(clique);
            }
            return;
        }
        String pivot = pick_random(candidates);
        if (pivot == null) {
            pivot = pick_random(excluded);
        }
        List<String> tempc = new ArrayList<>(candidates);
        tempc.removeAll(neighbors.get(pivot));

        for (String s : tempc) {
            List<String> new_candidates = new ArrayList<>(candidates);
            new_candidates.retainAll(neighbors.get(s));

            List<String> new_excluded = new ArrayList<>(excluded);
            new_excluded.retainAll(neighbors.get(s));
            Set<String> nextClique = new HashSet<>(clique);
            nextClique.add(s);
            bk2(nextClique, new_candidates, new_excluded);
            candidates.remove(s);
            excluded.add(s);
        }
    }

    private List<String> degeneracy_order(List<String> innerNodes) {
        List<String> result = new ArrayList<>();
        Map<String, Integer> deg = new HashMap<>();
        for (String node : innerNodes) {
            deg.put(node, neighbors.get(node).size());
        }
        while (!deg.isEmpty()) {
            Integer min = Collections.min(deg.values());
            String minKey = null;
            for (String key : deg.keySet()) {
                if (deg.get(key).equals(min)) {
                    minKey = key;
                    break;
                }
            }
            result.add(minKey);
            deg.remove(minKey);
            for (String k : neighbors.get(minKey)) {
                if (deg.containsKey(k)) {
                    deg.put(k, deg.get(k) - 1);
                }
            }
        }
        return result;
    }

    private String pick_random(List<String> random) {
        if (random != null && !random.isEmpty()) {
            return random.get(0);
        } else {
            return null;
        }
    }

    public Set<Set<String>> findMaxCliques() {
        this.bk3(new HashSet<>(), new ArrayList<>(nodes), new ArrayList<>());
        return maxCliques;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> neighbors = new HashMap<>();
        neighbors.put("0", new HashSet<>(Arrays.asList("1", "2", "3")));
        neighbors.put("1", new HashSet<>(Arrays.asList("0", "2")));
        neighbors.put("2", new HashSet<>(Arrays.asList("0", "1", "3", "6")));
        neighbors.put("3", new HashSet<>(Arrays.asList("0", "2", "4", "5")));
        neighbors.put("4", new HashSet<>(Arrays.asList("3", "5", "6")));
        neighbors.put("5", new HashSet<>(Arrays.asList("3", "4", "6")));
        neighbors.put("6", new HashSet<>(Arrays.asList("2", "4", "5")));
        neighbors.put("7", new HashSet<>(Arrays.asList("6")));
        CliqueFinder finder = new CliqueFinder(neighbors, 3);
        finder.bk3(new HashSet<>(), new ArrayList<>(neighbors.keySet()), new ArrayList<>());
        System.out.println(finder.maxCliques);
    }
}
```