Discussion:
How can I merge multiple rows into one row in Spark SQL or Hive SQL?
Jone Zhang
2017-05-15 13:15:30 UTC
Permalink
For example
Data1(has 1 billion records)
user_id1 feature1
user_id1 feature2

Data2(has 1 billion records)
user_id1 feature3

Data3(has 1 billion records)
user_id1 feature4
user_id1 feature5
...
user_id1 feature100

I want to get the following result:
user_id1 feature1 feature2 feature3 feature4 feature5...feature100

Is there a more efficient way than a join?

Thanks!
goun na
2017-05-15 15:50:05 UTC
Permalink
Hi, Jone Zhang

1. Hive UDF
You might need collect_set or collect_list (to eliminate duplication), but
make sure to reduce the cardinality before applying these UDFs, as they can
cause problems when handling 1 billion records. Union datasets 1, 2, and 3 ->
group by user_id1 -> collect_set(feature column) would work (a rough sketch
of both suggestions is below).

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

2. Spark DataFrame Pivot
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
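
A minimal sketch of both ideas in Spark (Scala), just for illustration. It
assumes the three datasets are already loaded as DataFrames named data1,
data2, data3 with columns user_id and feature; all of those names are
placeholders, not anything from your actual tables:

    import org.apache.spark.sql.functions.{col, collect_set, concat_ws, first}

    // 1. Union all sources, then collapse each user's features into one row.
    //    collect_set drops duplicate features; collect_list would keep them.
    val all = data1.union(data2).union(data3)
    val merged = all
      .groupBy("user_id")
      .agg(collect_set("feature").as("features"))
    // Optionally flatten the array into a single space-separated string column.
    val flat = merged.select(col("user_id"),
      concat_ws(" ", col("features")).as("features"))

    // 2. Pivot variant: one output column per distinct feature value. Only
    //    practical when the number of distinct features is small (e.g. ~100).
    val pivoted = all
      .groupBy("user_id")
      .pivot("feature")
      .agg(first("feature"))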

- Goun
Post by Jone Zhang
For example
Data1(has 1 billion records)
user_id1 feature1
user_id1 feature2
Data2(has 1 billion records)
user_id1 feature3
Data3(has 1 billion records)
user_id1 feature4
user_id1 feature5
...
user_id1 feature100
I want to get the following result:
user_id1 feature1 feature2 feature3 feature4 feature5...feature100
Is there a more efficient way than a join?
Thanks!
goun na
2017-05-15 16:06:06 UTC
Permalink
Correction: I had it backwards. collect_list keeps duplicates; collect_set is the one that removes them.
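
As a toy illustration of the difference (assuming a DataFrame df with columns
user_id and feature, names made up for the example):

    import org.apache.spark.sql.functions.{collect_list, collect_set}

    // If user_id1 appears with feature1 twice across the unioned data:
    df.groupBy("user_id").agg(collect_list("feature"))  // [feature1, feature1, feature2]
    df.groupBy("user_id").agg(collect_set("feature"))   // [feature1, feature2]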
Post by goun na
Hi, Jone Zhang
1. Hive UDF
You might need collect_set or collect_list (to eliminate duplication), but
make sure to reduce the cardinality before applying these UDFs, as they can
cause problems when handling 1 billion records. Union datasets 1, 2, and 3 ->
group by user_id1 -> collect_set(feature column) would work.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
2. Spark DataFrame Pivot
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
- Goun
Post by Jone Zhang
For example
Data1(has 1 billion records)
user_id1 feature1
user_id1 feature2
Data2(has 1 billion records)
user_id1 feature3
Data3(has 1 billion records)
user_id1 feature4
user_id1 feature5
...
user_id1 feature100
I want to get the following result:
user_id1 feature1 feature2 feature3 feature4 feature5...feature100
Is there a more efficient way than a join?
Thanks!
Edward Capriolo
2017-05-15 16:35:14 UTC
Permalink
Here is a similar, though not identical, approach to something I did. I had two
data files in different formats, and different columns from each needed to become
different features. I wanted to feed them into Spark's FP-Growth implementation:
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm

This only works because I have a few named features, and they become fields
in the model object AntecedentUnion. This would be a crappy solution for a
large sparse matrix.

Also, my Scala code is rough, so there is probably a better way to do this!


// Assumes a SparkSession with spark.implicits._ in scope, and that the case
// classes TargetingAntecedent, ImpressionAntecedent, and AntecedentUnion are
// defined elsewhere; targ and imp are the two source Datasets.
import org.apache.spark.rdd.RDD

// Group targeting records by tdid and wrap each one as an AntecedentUnion.
val b = targ.as[TargetingAntecedent]
val b1 = b.map(c => (c.tdid, c)).rdd.groupByKey()
val bgen = b1.map(f =>
  (f._1, f._2.map(x =>
    AntecedentUnion("targeting", "", x.targetingdataid, "", ""))))

// Group impression records by tdid, wrap them, and deduplicate via toSet.
val c = imp.as[ImpressionAntecedent]
val c1 = c.map(k => (k.tdid, k)).rdd.groupByKey()
val cgen = c1.map(f =>
  (f._1, f._2.map(x =>
    AntecedentUnion("impression", "", "", x.campaignid, x.adgroupid)).toSet.toIterable))

// (In my actual code the two blocks above are wrapped in helper methods:)
// val bgen = TargetingUtil.targetingAntecedent(sparkSession, sqlContext, targ)
// val cgen = TargetingUtil.impressionAntecedent(sparkSession, sqlContext, imp)

// Join the two keyed RDDs on tdid and concatenate the two feature collections.
val joined = bgen.join(cgen)
val merged = joined.map(f => (f._1, f._2._1 ++ f._2._2))

// Keep just the merged feature collections, one Array[AntecedentUnion] per user.
val fullResults: RDD[Array[AntecedentUnion]] =
  merged.map(f => f._2).map(_.toArray[audacity.AntecedentUnion])


So essentially everything is converted into AntecedentUnion, where the first
field is the type of the tuple and the other fields are either supplied or left
empty. Then merge all of those and run FP-Growth on them. Hope that helps!
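
For context, here is a rough sketch of how the pieces above could fit together.
The field names in AntecedentUnion and the FPGrowth parameters are guesses for
illustration, not the actual model class or settings:

    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.rdd.RDD

    // Hypothetical shape of the union type (the real one lives in the audacity
    // package): a source tag plus the optional id fields.
    case class AntecedentUnion(
      source: String,
      segment: String,
      targetingdataid: String,
      campaignid: String,
      adgroupid: String)

    // fullResults is the RDD[Array[AntecedentUnion]] built in the previous step.
    val fpg = new FPGrowth().setMinSupport(0.01).setNumPartitions(16)
    val model = fpg.run(fullResults)
    model.freqItemsets.collect().foreach { itemset =>
      println(itemset.items.mkString("[", ",", "]") + " : " + itemset.freq)
    }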
Post by goun na
Correction: I had it backwards. collect_list keeps duplicates; collect_set is the one that removes them.
Post by goun na
Hi, Jone Zhang
1. Hive UDF
You might need collect_set or collect_list (to eliminate duplication), but
make sure to reduce the cardinality before applying these UDFs, as they can
cause problems when handling 1 billion records. Union datasets 1, 2, and 3 ->
group by user_id1 -> collect_set(feature column) would work.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
2. Spark DataFrame Pivot
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
- Goun
Post by Jone Zhang
For example
Data1(has 1 billion records)
user_id1 feature1
user_id1 feature2
Data2(has 1 billion records)
user_id1 feature3
Data3(has 1 billion records)
user_id1 feature4
user_id1 feature5
...
user_id1 feature100
I want to get the following result:
user_id1 feature1 feature2 feature3 feature4 feature5...feature100
Is there a more efficient way than a join?
Thanks!