lundi 13 juin 2016

spark scala xml dataframe get the empty XML tag using SELECT query

I am trying to load an XML file into Scala and then check if the XML tag is empty by running a SELECT query.

Here is the code to load XML file using Spark XML

val df = sqlContext.read
                  .format("com.databricks.spark.xml")
                   .option("rowTag", "Session")
                   .load("test1.xml")

And the actual XML file contents are below

<Aggregator>
    <Session>
        <Data>
            <Log>
                <Event EventTime="2015-01-09T02:13:19">
                    <Start />
                </Event>
                <Event EventTime="2015-01-08T20:27:21">
                    <Power State="PowerOn"/>
                </Event>
                <Event EventTime="2015-01-08T23:54:52">
                    <Play>
                        <Video VideoId="4294967295" VideoType="MP4"/>
                    </Play>
                </Event>
                <Event EventTime="2015-12-27T03:59:20">
                    <Stop/>
                </Event>
                <Event EventTime="2015-05-01T09:37:56">
                    <Pause>
                        <File pauseTime="2015-04-15T19:35:00" key="902"/>
                    </Pause>
                </Event>
            </Log>
        </Data>
    </Session>
    <Session>
        <Data>
            <Log>
                <Event EventTime="2015-01-08T23:54:52">
                    <Play>
                        <Video VideoId="4294967295" VideoType="MP4"/>
                    </Play>
                </Event>
                <Event EventTime="2015-12-27T03:59:20">
                    <Stop/>
                </Event>
                <Event EventTime="2015-05-01T09:37:56">
                    <Pause>
                        <File pauseTime="2015-04-15T19:35:00" key="902"/>
                    </Pause>
                </Event>
            </Log>
        </Data>
    </Session>
    <Session>
        <Data>
            <Log/>
        </Data>
    </Session>
</Aggregator>

After loading this XML into dataframe I am trying to compare get all events which are Start by using below SQL query

df.registerTempTable("eventsData")
sqlContext.sql(""" select * from eventsData where Event.Start is null""").show()

However this is not running successfully and not giving the results I want.

Can someone help me to get those records where Event has Start as its child tag??

Aucun commentaire:

Enregistrer un commentaire