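I planned to use the EMBER dataset in our project but ended up not being able to. So here is just a quick note on how to use it!
I worked in Colab with my Google Drive mounted. The EMBER data is large, so the download takes a while.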
# Mount Google Drive so the EMBER dataset can be downloaded into it
from google.colab import drive
drive.mount('/content/gdrive')
Mounted at /content/gdrive
# Move to My Drive
cd /content/gdrive/My\ Drive
/content/gdrive/My Drive
# Create a project folder
mkdir YAK_project
# Move into the new folder
cd YAK_project/
/content/gdrive/My Drive/YAK_project
# Install ember
!pip install git+https://github.com/endgameinc/ember.git
Collecting git+https://github.com/endgameinc/ember.git
Cloning https://github.com/endgameinc/ember.git to /tmp/pip-req-build-d0i0rbg6
Running command git clone -q https://github.com/endgameinc/ember.git /tmp/pip-req-build-d0i0rbg6
Collecting lief>=0.9.0
(rest of the install log omitted)
Successfully built ember
Installing collected packages: lief, ember
Successfully installed ember-0.1.0 lief-0.10.1
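# Download the EMBER dataset (--no-check-certificate is used because the server's certificate had expired, as the log below shows)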
!wget --no-check-certificate https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2
--2020-11-03 00:16:34-- https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2
Resolving pubdata.endgame.com (pubdata.endgame.com)... 64.250.189.21
Connecting to pubdata.endgame.com (pubdata.endgame.com)|64.250.189.21|:443... connected.
WARNING: cannot verify pubdata.endgame.com's certificate, issued by ‘CN=Go Daddy Secure Certificate Authority - G2,OU=http://certs.godadd'
Issued certificate has expired.
HTTP request sent, awaiting response... 200 OK
Length: 1696539273 (1.6G) [application/octet-stream]
Saving to: ‘ember_dataset_2018_2.tar.bz2’
ember_dataset_2018_ 100%[===================>] 1.58G 13.1MB/s in 1m 48s
2020-11-03 00:18:24 (14.9 MB/s) - ‘ember_dataset_2018_2.tar.bz2’ saved [1696539273/1696539273]
# Check that the download succeeded
ls -la
total 1656777
-rw------- 1 root root 1696539273 Jul 30 2019 ember_dataset_2018_2.tar.bz2
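The same check can be done from Python by comparing the file size on disk with the byte count wget printed on its `Length:` line. This is only a minimal sketch: the helper name `download_is_complete` is made up for this note, and a matching size shows that the transfer was not cut short, not that the contents are intact.

```python
import os

# Byte count taken from the "Length:" line of the wget output above.
EXPECTED_SIZE = 1696539273


def download_is_complete(path, expected_size=EXPECTED_SIZE):
    """Return True if the file exists and has exactly the expected byte size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_size
```

In the notebook this would be called as `download_is_complete('ember_dataset_2018_2.tar.bz2')`.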
# Extract the archive
! tar -xvf ember_dataset_2018_2.tar.bz2
ember2018/
ember2018/train_features_1.jsonl
ember2018/train_features_0.jsonl
ember2018/train_features_3.jsonl
ember2018/test_features.jsonl
ember2018/ember_model_2018.txt
ember2018/train_features_5.jsonl
ember2018/train_features_4.jsonl
ember2018/train_features_2.jsonl
# Check that the extraction succeeded
ls -la
total 1656781
drwx------ 2 root root 4096 Jul 29 2019 ember2018/
-rw------- 1 root root 1696539273 Jul 30 2019 ember_dataset_2018_2.tar.bz2
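The archive contents can also be inspected from Python without unpacking anything, using the standard `tarfile` module. A small sketch, where `list_archive_members` is a hypothetical helper name. Note that for a compressed archive of this size, listing still has to read through the whole file, so it is not instant.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the member names of a tar archive without extracting it."""
    # "r:*" lets tarfile detect the compression (bz2 here) on its own.
    with tarfile.open(archive_path, "r:*") as tar:
        return tar.getnames()
```

Calling `list_archive_members('ember_dataset_2018_2.tar.bz2')` should give the same names that `tar -xvf` printed above.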
# Parse a JSON Lines (.jsonl) file
import json
def load_jsonl(input_path) -> list:
    """
    Read a list of objects from a JSON Lines file.
    """
    data = []
    with open(input_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Strip the trailing newline / carriage return before parsing
            data.append(json.loads(line.rstrip('\r\n')))
    print('Loaded {} records from {}'.format(len(data), input_path))
    return data
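Since each feature file is large, loading every record into a list can use a lot of memory. As an alternative (not what I actually ran), a generator can yield one record at a time. This is a minimal sketch and `iter_jsonl` is a name made up for illustration.

```python
import json


def iter_jsonl(input_path):
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with open(input_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)
```

For example, `first = next(iter_jsonl('ember2018/train_features_0.jsonl'))` reads only the first record instead of the whole file.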
import pandas as pd
import numpy as np
# Load the EMBER data
ember_data = load_jsonl('ember2018/train_features_0.jsonl')
Loaded 50000 records from ember2018/train_features_0.jsonl
# Store it as a DataFrame
ember_df = pd.DataFrame(ember_data)
# Check the shape
ember_df.shape
(50000, 14)
# Look at the first record (index 0) of the EMBER dataset
ember_df.iloc[0]
sha256 0abb4fda7d5b13801d63bee53e5e256be43e141faa077a...
md5 63956d6417f8f43357d9a8e79e52257e
appeared 2006-12
label 0
avclass
histogram [45521, 13095, 12167, 12496, 12429, 11709, 118...
byteentropy [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
strings {'numstrings': 14573, 'avlength': 5.9720716393...
general {'size': 3101705, 'vsize': 380928, 'has_debug'...
header {'coff': {'timestamp': 1124149349, 'machine': ...
section {'entry': '.text', 'sections': [{'name': '.tex...
imports {'KERNEL32.dll': ['SetFileTime', 'CompareFileT...
exports []
datadirectories [{'name': 'EXPORT_TABLE', 'size': 0, 'virtual_...
Name: 0, dtype: object
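Each record carries a `label` field. According to the EMBER README, 0 means benign, 1 means malicious, and -1 means unlabeled. A quick way to see how the labels are distributed in a loaded file is to count them. This is a small sketch on plain dicts, and `count_labels` is a hypothetical helper name.

```python
from collections import Counter


def count_labels(records):
    """Count how many records carry each value of the 'label' field."""
    return Counter(record['label'] for record in records)
```

With the list loaded above this would be `count_labels(ember_data)`. Unlabeled records (label -1) would normally be dropped before any supervised training.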